open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

The datadog exporter metrics are associated with two service ids incorrectly in Datadog #20535

Closed hamidp555 closed 11 months ago

hamidp555 commented 1 year ago

Component(s)

exporter/datadog

What happened?

Description

The grouping of service metrics (exported by the datadog exporter) in Datadog seems incorrect.

Steps to Reproduce

1 - Deploy a collector with the datadog exporter configured in a GKE or AKS cluster
2 - Create a notebook in Datadog for one of the metrics coming from the collector
3 - Filter by service_id
4 - You will see one metric associated with TWO service ids

Expected Result

Metric data is filtered by one service_id, and only ONE service_id should be seen for the metric data.

Actual Result

Metric data is filtered by one service_id, and TWO different service_ids are seen for the metric data.

[Screenshot 2023-03-31 at 4 20 04 PM]

If the collector is restarted, one of the service_id values changes randomly.

Collector version

0.71.0

OpenTelemetry Collector configuration

exporters:
  datadog:
    api:
      key: "key"
      site: datadoghq.com
    sending_queue:
      enabled: false
  datadog/logs:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
  datadog/metrics:
    api:
      key: ${DATADOG_API_KEY}
      site: datadoghq.com
extensions:
  health_check: {}
  memory_ballast: {}
processors:
  attributes/metrics:
    actions:
    - action: upsert
      key: host
      value: ${MY_POD_NAME}
    - action: upsert
      key: datacenter_access_type
      value: Private
    - action: upsert
      key: datacenter_type
      value: CustomerCloud
    - action: upsert
      key: helm_chart_version
      value: v1.0.0
    - action: upsert
      key: image_version
      value: 0.71.0-1
    - action: upsert
      key: infrastructure_name
      value: nano-sa-staging-epnuz70bi9i
    - action: upsert
      key: maas_datacenter_id
      value: dttesting-gke-useast1
    - action: upsert
      key: maas_id
      value: staging
    - action: upsert
      key: org_id
      value: dttesting
    - action: upsert
      key: peb_type
      value: cloud
    - action: upsert
      key: service
      value: dt_collector
    - action: upsert
      key: service_class
      value: enterprise-250-standalone
    - action: upsert
      key: service_id
      value: ekziwbvbed5
    - action: upsert
      key: service_name
      value: s1-dt-250-sa
    - action: upsert
      key: service_type
      value: enterprise-standalone
  attributes/traces:
    actions:
    - action: upsert
      key: service.datacenter_id
      value: dttesting-gke-useast1
    - action: upsert
      key: service.org_id
      value: dttesting
    - action: upsert
      key: service.service_id
      value: ekziwbvbed5
    - action: upsert
      key: service.service_name
      value: s1-dt-250-sa
    - action: upsert
      key: service.vpn_name
      value: s1-dt-250-sa
  batch:
    send_batch_max_size: 100
    send_batch_size: 100
    timeout: 10s
  filter:
    metrics:
      include:
        match_type: strict
        metric_names:
        - otelcol_processor_refused_spans
        - otelcol_processor_dropped_spans
        - otelcol_receiver_dt_solacereceiver_dropped_span_messages
        - otelcol_receiver_dt_solacereceiver_need_upgrade
        - otelcol_exporter_jaeger_jaegerexporter_conn_state
        - otelcol_receiver_dt_solacereceiver_received_span_messages
        - otelcol_process_cpu_seconds
        - otelcol_process_uptime
        - otelcol_process_runtime_total_alloc_bytes
        - otelcol_process_runtime_total_sys_memory_bytes
        - otelcol_process_memory_rss
        - otelcol_process_runtime_heap_alloc_bytes
        - otelcol_receiver_dt_solacereceiver_receiver_status
        - otelcol_exporter_sent_spans
        - otelcol_exporter_send_failed_spans
        - otelcol_receiver_dt_solacereceiver_reported_spans
        - otelcol_receiver_dt_solacereceiver_failed_reconnections
        - otelcol_receiver_dt_solacereceiver_recoverable_unmarshalling_errors
        - otelcol_receiver_dt_solacereceiver_fatal_unmarshalling_errors
  memory_limiter:
    check_interval: 5s
    limit_mib: 409
    spike_limit_mib: 128
  metricstransform:
    transforms:
    - action: update
      include: otelcol_processor_refused_spans
      new_name: dt.processor.refused.spans
    - action: update
      include: otelcol_processor_dropped_spans
      new_name: dt.processor.dropped.spans
    - action: update
      include: otelcol_receiver_dt_solacereceiver_dropped_span_messages
      new_name: dt.receiver.dropped.span_messages
    - action: update
      include: otelcol_receiver_dt_solacereceiver_need_upgrade
      new_name: dt.receiver.upgrade.status
    - action: update
      include: otelcol_receiver_dt_solacereceiver_received_span_messages
      new_name: dt.collector.receiver.solacereceiver.received.span.messages
    - action: update
      include: otelcol_receiver_dt_solacereceiver_receiver_status
      new_name: dt.collector.receiver.solacereceiver.status
    - action: update
      include: otelcol_exporter_jaeger_jaegerexporter_conn_state
      new_name: dt.exporter.jaegerexporter.connection.state
    - action: update
      include: otelcol_process_runtime_heap_alloc_bytes
      new_name: dt.collector.process.runtime.heap.alloc.bytes
    - action: update
      include: otelcol_process_cpu_seconds
      new_name: dt.collector.process.cpu.seconds
    - action: update
      include: otelcol_process_uptime
      new_name: dt.collector.operational.process_uptime
    - action: update
      include: otelcol_process_runtime_total_alloc_bytes
      new_name: dt.collector.operational.memory.total
    - action: update
      include: otelcol_process_runtime_total_sys_memory_bytes
      new_name: dt.collector.operational.memory.system_total
    - action: update
      include: otelcol_process_memory_rss
      new_name: dt.collector.operational.physical.memory
    - action: update
      include: otelcol_exporter_sent_spans
      new_name: dt.collector.exporter.sent.spans
    - action: update
      include: otelcol_exporter_send_failed_spans
      new_name: dt.collector.exporter.send.failed.spans
    - action: update
      include: otelcol_receiver_dt_solacereceiver_reported_spans
      new_name: dt.collector.receiver.solacereceiver.reported.spans
    - action: update
      include: otelcol_receiver_dt_solacereceiver_failed_reconnections
      new_name: dt.receiver.solacereceiver.failed.reconnections
    - action: update
      include: otelcol_receiver_dt_solacereceiver_recoverable_unmarshalling_errors
      new_name: dt.receiver.solacereceiver.recoverable.unmarshalling_errors
    - action: update
      include: otelcol_receiver_dt_solacereceiver_fatal_unmarshalling_errors
      new_name: dt.collector.receiver.solacereceiver.fatal.unmarshalling_errors
  resource/traces:
    attributes:
    - action: upsert
      key: service.name
      value: dt-otel
receivers:
  filelog:
    attributes:
      ddsource: dt-otel
      ddtags: datacenter_access_type:Private,datacenter_type:CustomerCloud,helm_chart_version:v1.0.0,image_version:0.71.0-1,infrastructure_name:nano-sa-staging-epnuz70bi9i,maas_datacenter_id:dttesting-gke-useast1,maas_id:staging,org_id:dttesting,peb_type:cloud,service:dt_collector,service_class:enterprise-250-standalone,service_id:ekziwbvbed5,service_name:s1-dt-250-sa,service_type:enterprise-standalone
      hostname: ${MY_POD_NAME}
      service: dt_collector
    include:
    - /data/output-logs
    - /data/error-output-logs
    include_file_name: false
    include_file_path: true
    operators:
    - parse_from: body
      timestamp:
        layout: s.ns
        layout_type: epoch
        parse_from: attributes.ts
      type: json_parser
    poll_interval: 500ms
    start_at: beginning
  prometheus:
    config:
      scrape_configs:
      - job_name: otelcol
        scrape_interval: 10s
        static_configs:
        - targets:
          - 0.0.0.0:8888
  solace:
    auth:
      sasl_plain:
        password: ${OTEL_CLIENT_PASSWORD}
        username: dt-otel
    broker: dt-otel:5671
    queue: queue://#telemetry-tel
    tls:
      insecure_skip_verify: true
service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    logs:
      exporters:
      - datadog/logs
      processors:
      - memory_limiter
      - batch
      receivers:
      - filelog
    metrics:
      exporters:
      - datadog/metrics
      processors:
      - memory_limiter
      - filter
      - metricstransform
      - attributes/metrics
      - batch
      receivers:
      - prometheus
    traces:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - resource/traces
      - attributes/traces
      receivers:
      - solace
  telemetry:
    logs:
      disable_caller: true
      disable_stacktrace: true
      encoding: json
      error_output_paths:
      - stderr
      - /data/error-output-logs
      level: DEBUG
      output_paths:
      - stderr
      - /data/output-logs
    metrics:
      address: 0.0.0.0:8888
      level: detailed

Log output

{
    "level": "debug",
    "ts": 1680293652.3007936,
    "msg": "exporting native Datadog payload",
    "kind": "exporter",
    "data_type": "metrics",
    "name": "datadog/metrics",
    "metric": [
        {
            "metric": "dt.collector.process.cpu.seconds",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 0.040000000000000036
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 1
        },
        {
            "metric": "dt.collector.operational.memory.system_total",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 59356424
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 3
        },
        {
            "metric": "dt.collector.operational.physical.memory",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 149000192
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 3
        },
        {
            "metric": "dt.collector.operational.process_uptime",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 10.00009653699999
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 1
        },
        {
            "metric": "dt.collector.process.runtime.heap.alloc.bytes",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 30544600
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 3
        },
        {
            "metric": "dt.collector.receiver.dtreceiver.status",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 2
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 3
        },
        {
            "metric": "dt.collector.operational.memory.total",
            "points": [
                {
                    "timestamp": 1680293651,
                    "value": 3368056
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "service_instance_id:d09b8558-b7cd-413b-9eba-3468c73ccede",
                "service_name:s1-dt-250-sa",
                "service_version:0.71.0",
                "host:nano-sa-staging-epnuz70bi9i-dt-otel-55c86d7bf6-p6567",
                "datacenter_access_type:Private",
                "datacenter_type:CustomerCloud",
                "helm_chart_version:v1.0.0",
                "image_version:0.71.0-1",
                "infrastructure_name:nano-sa-staging-epnuz70bi9i",
                "maas_datacenter_id:dttesting-gke-useast1",
                "maas_id:staging",
                "org_id:dttesting",
                "peb_type:cloud",
                "service:dt_collector",
                "service_class:enterprise-250-standalone",
                "service_id:ekziwbvbed5",
                "service_type:enterprise-standalone",
                "service:otelcol"
            ],
            "type": 1
        },
        {
            "metric": "otel.datadog_exporter.metrics.running",
            "points": [
                {
                    "timestamp": 1680293652,
                    "value": 1
                }
            ],
            "resources": [
                {
                    "name": "gke-dt-testing-gke-d-prod1k-node-pool-3fde0b3b-5z2m.stellar-arcadia-205014",
                    "type": "host"
                }
            ],
            "tags": [
                "version:0.71.0",
                "command:otelcol-contrib"
            ],
            "type": 3
        }
    ]
}
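
One way to inspect a payload like the one above is to list, for each series, the tag keys that carry more than one value. The following is a minimal Python sketch and not part of the exporter: the function name `duplicate_tag_keys` and the constant `SAMPLE_PAYLOAD` are hypothetical, and the sample is a trimmed copy of the first series in the log.

```python
from collections import defaultdict


def duplicate_tag_keys(payload):
    """Return {metric name: {tag key: sorted values}} for tag keys with more than one value."""
    report = {}
    for series in payload.get("metric", []):
        values = defaultdict(set)
        for tag in series.get("tags", []):
            # Tags in the payload are "key:value" strings; split on the first colon only.
            key, _, value = tag.partition(":")
            values[key].add(value)
        repeated = {k: sorted(v) for k, v in values.items() if len(v) > 1}
        if repeated:
            report[series["metric"]] = repeated
    return report


# Hypothetical constant: a trimmed copy of the first series in the log output.
SAMPLE_PAYLOAD = {
    "metric": [
        {
            "metric": "dt.collector.process.cpu.seconds",
            "tags": [
                "service_name:s1-dt-250-sa",
                "service:dt_collector",
                "service_id:ekziwbvbed5",
                "service:otelcol",
            ],
        }
    ]
}

if __name__ == "__main__":
    print(duplicate_tag_keys(SAMPLE_PAYLOAD))
```

On this trimmed sample, service_id has a single value (ekziwbvbed5), while the service key appears with both dt_collector and otelcol.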

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mx-psi commented 1 year ago

Hi @hamidp555, we don't handle service_id in any special way on the metrics exporter: it is not added based on any other attributes or handled differently from any other custom tag. Your logs only show the ekziwbvbed5 value, so it does not seem like the exporter is adding multiple service_id values on the metrics you shared with us. To further troubleshoot this, I would recommend:

hamidp555 commented 1 year ago

Hi @mx-psi, thank you for your response. We will follow the troubleshooting suggestions. I can see that metric data from some collector with the same resources.name later caused the issue in Datadog. I was wondering how resources: [{"name": "somevalue", "type": "host"}] can be configured for the collector?
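
The observation above (several collectors reporting under the same resources.name) can be checked offline against captured debug payloads. The following is a minimal Python sketch under stated assumptions: the payloads have the shape shown in the log output, the function names `service_ids_by_host` and `shared_hosts` are hypothetical, and the host name `node-a` and the second service_id value are made-up sample data. It only detects the overlap; it does not change how the exporter picks the host name.

```python
from collections import defaultdict


def service_ids_by_host(payloads):
    """Map each host resource name to the sorted service_id values reported under it."""
    hosts = defaultdict(set)
    for payload in payloads:
        for series in payload.get("metric", []):
            # Collect the service_id tag values attached to this series.
            ids = {
                tag.split(":", 1)[1]
                for tag in series.get("tags", [])
                if tag.startswith("service_id:")
            }
            for resource in series.get("resources", []):
                if resource.get("type") == "host":
                    hosts[resource["name"]].update(ids)
    return {name: sorted(ids) for name, ids in hosts.items()}


def shared_hosts(payloads):
    """Keep only host names that carry more than one service_id."""
    return {
        name: ids
        for name, ids in service_ids_by_host(payloads).items()
        if len(ids) > 1
    }


# Made-up sample data: two collectors on the same node, each with its own service_id.
PAYLOAD_A = {"metric": [{
    "metric": "dt.collector.process.cpu.seconds",
    "resources": [{"name": "node-a", "type": "host"}],
    "tags": ["service_id:ekziwbvbed5"],
}]}
PAYLOAD_B = {"metric": [{
    "metric": "dt.collector.process.cpu.seconds",
    "resources": [{"name": "node-a", "type": "host"}],
    "tags": ["service_id:other-service-id"],
}]}

if __name__ == "__main__":
    print(shared_hosts([PAYLOAD_A, PAYLOAD_B]))
```

A non-empty result means that at least two service_id values share one host resource name in the captured payloads.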

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 11 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.