open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Otel can't handle messages from Databricks Diagnostic Tool with Event Hubs #33280

Open dannyamaya opened 2 months ago

dannyamaya commented 2 months ago

Describe the bug Messages from Databricks sent through Event Hubs don't have the Time Grain value; you get this error for every new message.

Steps to reproduce Activate the diagnostic tool for Databricks, connect it to an Event Hub, and then point it at an OTel Collector instance (a minimal repro config is sketched below).
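
For context, a minimal repro sketch of those steps; the connection string is a placeholder, and the debug exporter assumes a recent collector build where it is available:

receivers:
  azureeventhub:
    connection: Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<hub>
    format: azure

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [azureeventhub]
      exporters: [debug]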

What did you expect to see? Messages should arrive without errors; the time grain parameter should be optional with a default value.

What did you see instead?

azureeventhubreceiver@v0.101.0/azureresourcemetrics_unmarshaler.go:104  Unhandled Time Grain    {"kind": "receiver", "name": "azureeventhub", "data_type": "metrics", "timegrain": ""}

What version did you use? Latest OTel Collector.

What config did you use?

extensions:
  health_check:
  zpages:
    endpoint: localhost:55679

receivers:
  otlp:
    protocols:
      grpc:
      http:

  fluentforward:
    endpoint: 0.0.0.0:8006

  prometheus:
    config:
      scrape_configs:
      - job_name: 'otelcol' # Gets mapped to service.name
        scrape_interval: 10s
        static_configs:
        - targets: ['0.0.0.0:8888']

  prometheus/fluentd:
    config:
      scrape_configs:
      - job_name: 'fluentd' # Gets mapped to service.name
        scrape_interval: 10s
        static_configs:
        - targets: ['0.0.0.0:24231']

  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      disk:
      filesystem:
      memory:
      network:
      # System load average metrics https://en.wikipedia.org/wiki/Load_(computing)
      load:
      # Paging/Swap space utilization and I/O metrics
      paging:
      # Aggregated system process count metrics
      processes:
      # System processes metrics, disabled by default
      # process:  

  azureeventhub:
    connection: Endpoint=xxxxxxxx
    offset:
    format:

processors:
  batch: # Batches data when sending
  resourcedetection:
    detectors: [azure, system]
    timeout: 2s
    override: false
  groupbyattrs:
    keys:
    - service.name
    - service.version
    - host.name

  memory_limiter:
    check_interval: 2s
    limit_mib: 256              

exporters:
  splunk_hec/logs:
    token: "xxxxxxxxxxxxxx"
    endpoint: "xxxxxxxxxxxx"
    index: "telemetry_open_telemetry_log_event_nv"
    # max_connections: 20
    disable_compression: false
    timeout: 10s
    tls:
      insecure_skip_verify: true
      ca_file: ""
      cert_file: ""
      key_file: ""

  splunk_hec/traces:
    token: "xxxxxxxxxxxx"
    endpoint: "xxxxxxxxxxx"
    index: "telemetry_open_telemetry_trace_event_nv"
    # max_connections: 20
    disable_compression: false
    timeout: 10s
    tls:
      insecure_skip_verify: true
      ca_file: ""
      cert_file: ""
      key_file: ""

  splunk_hec/metrics:
    token: "xxxxxxxxxxxxxx"
    endpoint: "xxxxxxxxxxxxxx"
    index: "telemetry_open_telemetry_metric_nv"
    # max_connections: 20
    disable_compression: false
    timeout: 10s
    tls:
      insecure_skip_verify: true
      ca_file: ""
      cert_file: ""
      key_file: ""      

service:  
  extensions: []

  pipelines:
    logs:
      receivers: [otlp]
      processors: [resourcedetection, groupbyattrs, memory_limiter, batch]
      exporters: [splunk_hec/logs]
    metrics:
      receivers: [hostmetrics, azureeventhub]
      processors: [resourcedetection, groupbyattrs, memory_limiter, batch]
      exporters: [splunk_hec/metrics]
    traces:
      receivers: [otlp]
      processors: [resourcedetection, groupbyattrs, memory_limiter, batch]
      exporters: [splunk_hec/traces]
  telemetry:
    logs:
      level: debug

Environment Azure App Service running the latest OTel Collector version.

Additional context I have already tried running OTel on plain Linux and on Kubernetes.

github-actions[bot] commented 2 months ago

Pinging code owners for receiver/azureeventhub: @atoulme @cparkins. See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 2 months ago

@cparkins could we ingest the data point without setting the start timestamp?

cparkins commented 2 months ago

@atoulme I think this issue may actually be a type mismatch.

@dannyamaya When specifying the Diagnostic Settings for Databricks, are there options under 'Metrics' or only 'Logs'?

According to the documentation, only Logs are available: https://learn.microsoft.com/en-us/azure/azure-monitor/reference/supported-metrics/metrics-index

Also, when I looked I could only see 'Logs'. If this is truly log data, attaching the Event Hub receiver to a logs pipeline should resolve the issue, as logs do not require a time grain; see the sketch below.
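
For illustration, a minimal sketch of that change, reusing the receiver, processor, and exporter names from the config above (not verified against a live Databricks workspace; the traces pipeline and telemetry section stay unchanged):

service:
  pipelines:
    logs:
      receivers: [otlp, azureeventhub]
      processors: [resourcedetection, groupbyattrs, memory_limiter, batch]
      exporters: [splunk_hec/logs]
    metrics:
      receivers: [hostmetrics]   # azureeventhub removed from the metrics pipeline
      processors: [resourcedetection, groupbyattrs, memory_limiter, batch]
      exporters: [splunk_hec/metrics]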

dannyamaya commented 2 months ago

Yes, you're right. Databricks doesn't support metrics as of the date of this post, so that's probably why OTel can't handle those messages and shows that error. My bad, thanks for clarifying.

cparkins commented 2 months ago

No worries. It's probably not clear from the documentation that the mapping is done by the pipeline data type, but that is how I wrote it to work.