open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Counter Metrics dropping scrapes / Datapoints #22742

Closed: martinrw closed this issue 10 months ago

martinrw commented 1 year ago

Component(s)

exporter/prometheus

What happened?

Describe the bug

We are using the opentelemetry-collector helm chart to deploy. We have the OTel agent running on a number of Java and Python services.

For metrics that are counter-based, such as http_server_duration_milliseconds_count, we are frequently seeing dropped/missing data points. It is set to scrape every 15 seconds, but as you can see from this screenshot we are frequently missing 1 or 2 data points in a row.

[screenshot: graph of the counter metric with intermittent gaps of one or two scrape intervals]

For our other metrics, such as non-counter-based OTel ones, we see a data point every 15 seconds with nothing being dropped; the same goes for metrics coming from kube-state-metrics, etc.

Steps to reproduce

OTel agent installed in the Docker image, pushing metrics to the OTel Collector (see config below), which is scraped by Prometheus. We see the same thing when looking at the metric in either Prometheus or Grafana.

Looking at the scrape config in Prometheus, this is what we end up with:

- job_name: serviceMonitor/opentelemetry-collector/opentelemetry-collector-apps-monitor/0
  honor_labels: true
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  enable_http2: true

What did you expect to see?

I would expect to see a data point every 15 seconds.

What did you see instead?

When looking at the values over time for one of these metrics we occasionally have no data for 15 or 30 seconds (see screenshot above).

Collector version

0.77.0

Environment information

Environment

Kubernetes - Bottlerocket OS 1.13.1 (aws-k8s-1.24)

OpenTelemetry Collector configuration

config:
  exporters:
    prometheus:
      endpoint: "0.0.0.0:9464"
      resource_to_telemetry_conversion:
        enabled: true
      enable_open_metrics: true
      metric_expiration: 3m
      namespace: prometheus
  extensions:
    health_check: {}
    zpages: {}
    pprof: {}
    memory_ballast:
      size_in_percentage: 30
  processors:

    memory_limiter:
      check_interval: 1s
      limit_percentage: 50
      spike_limit_percentage: 20
    batch:
      send_batch_size: 10000
      send_batch_max_size: 11000
      timeout: 2s
    resource:
      attributes:
        - key: telemetry.sdk.name
          action: delete
        - key: telemetry.sdk.version
          action: delete
        - key: telemetry.sdk.language
          action: delete
        - key: telemetry.auto.version
          action: delete
        - key: container.id
          action: delete
        - key: process.command.args
          action: delete
    k8sattributes/default:
  receivers:
    jaeger: null
    prometheus: null
    zipkin: null
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  service:
    telemetry:
      #logs for the collector itself
      logs:
        level: panic
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      traces: null
      logs: null
      metrics:
        exporters:
          - prometheus
        processors:
          - memory_limiter
          - batch
          - resource
          - k8sattributes/default
        receivers:
          - otlp
....
ports:
  jaeger-compact:
    enabled: false
  jaeger-thrift:
    enabled: false
  jaeger-grpc:
    enabled: false
  zipkin:
    enabled: false
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP

  app-metrics:
    enabled: true
    containerPort: 9464
    servicePort: 9464
    hostPort: 9464
    protocol: TCP

  metrics:
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP
...
podMonitor:
  enabled: true
  metricsEndpoints:
    - port: metrics
      # interval: 15s

  extraLabels:
    prometheus: scrape
    release: prometheus

serviceMonitor:
  enabled: true
  metricsEndpoints:
    - port: metrics
      interval: 15s
  prometheusMetricsEndpoints:
    - port: app-metrics
      interval: 15s
      honorLabels: true
      relabelings:
        - action: labeldrop
          regex: (container|endpoint|job|namespace|pod|service)
        - action: replace
          regex: (.*)
          replacement: otel-collector
          targetLabel: instance
  extraLabels:
    prometheus: scrape
    release: prometheus

Log output

No response

Additional context

This is a brand new Prometheus + Grafana + OpenTelemetry set-up, and we've had this problem right from the beginning. We see the missing datapoints both in our Grafana queries and also when querying Prometheus directly.

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

martinrw commented 1 year ago

Not sure if this is related or not, but I tried enabling error logs and I see entries like this:

* collected metric prometheus_http_server_duration_milliseconds label:<name:"framework" value:"spring" > label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/v1/vins/licenseplate/{licensePlateValue}/country/{country}/latest" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"204" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-22-89.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.20.233" > label:<name:"k8s_pod_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:02:56 +0000 UTC" > label:<name:"k8s_pod_uid" value:"f3b6ef2e-31f5-4f4e-871a-ed34acfc247a" > label:<name:"label_team" value:"team-name" > label:<name:"net_host_name" value:"service-name-svc.service-name" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-XX:+UseSerialGC\",\"-Dopentelemetry.environment=development\",\"-javaagent:/opentelemetry/opentelemetry.jar\",\"-jar\",\"/app/service-name-fa541655.jar\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 19.0.2+7" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"19.0.2+7" > label:<name:"service_name" value:"service-name" > histogram:<sample_count:12 sample_sum:3068.996335 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:11 upper_bound:10 > bucket:<cumulative_count:11 upper_bound:25 > bucket:<cumulative_count:11 upper_bound:50 > bucket:<cumulative_count:11 upper_bound:75 > bucket:<cumulative_count:11 upper_bound:100 > bucket:<cumulative_count:11 upper_bound:250 > bucket:<cumulative_count:11 upper_bound:500 > bucket:<cumulative_count:11 upper_bound:750 > bucket:<cumulative_count:11 upper_bound:1000 > bucket:<cumulative_count:11 upper_bound:2500 > bucket:<cumulative_count:12 upper_bound:5000 > bucket:<cumulative_count:12 upper_bound:7500 > bucket:<cumulative_count:12 upper_bound:10000 > >  has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
* collected metric prometheus_jvm_threads_states label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"5b8bfdc684-29xd5" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-15-117.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.11.123" > label:<name:"k8s_pod_name" value:"service-name-5b8bfdc684-29xd5" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:45:54 +0000 UTC" > label:<name:"k8s_pod_uid" value:"bcc28b31-7026-47d0-bf89-e996e8b138a2" > label:<name:"label_team" value:"team-name" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"service-name" > label:<name:"state" value:"terminated" > gauge:<value:0 >  has help "The current number of threads having NEW state" but should have "The current number of threads"
* collected metric prometheus_http_server_duration_milliseconds label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"74f9654b8-nkpdn" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/status" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"200" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-20-192.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.18.204" > label:<name:"k8s_pod_name" value:"service-name-74f9654b8-nkpdn" > label:<name:"k8s_pod_start_time" value:"2023-05-30 04:50:43 +0000 UTC" > label:<name:"k8s_pod_uid" value:"402efd2d-1daf-4672-b3cf-535a8f76d52a" > label:<name:"net_host_name" value:"10.11.18.204" > label:<name:"net_host_port" value:"8080" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-Xms512m\",\"-Xmx512m\",\"-Dnewrelic.environment=qa\",\"-javaagent:/open-telemetry/opentelemetry-javaagent.jar\",\"org.springframework.boot.loader.JarLauncher\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.6+10" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"tom" > histogram:<sample_count:1 sample_sum:701.935696 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:25 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:75 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:250 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:750 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2500 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:7500 > bucket:<cumulative_count:1 upper_bound:10000 > >  has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"

Just pulling out the parts that seem important there:

> has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
> has help "The current number of threads having NEW state" but should have "The current number of threads"

This would suggest a mismatch between the description of the metric being received and the actual spec of that metric?

But it seems strange that this would only affect some data points in a metric and not every one... does this help to narrow down the issue?
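
For reference, the collector config above sets the collector's own telemetry log level to panic, so these messages are hidden by default. A minimal sketch of how that level can be raised to surface them, assuming the same helm values layout as the config posted earlier:

config:
  service:
    telemetry:
      logs:
        # raised from "panic" so that the "collected metric ... has help ..."
        # errors from the prometheus exporter become visible
        level: error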

martinrw commented 1 year ago

I have a workaround for this, which is to use the transform processor to make sure the description is always set correctly. It feels like a bodge, though, and something that shouldn't be necessary, so I'll leave this ticket open:

  transform:
    error_mode: ignore
    metric_statements:
      - context: metric
        statements:
          - set(description, "The duration of the inbound HTTP request") where name == "http.server.duration"
          - set(description, "The current number of threads having NEW state") where name == "jvm.threads.states"
          - set(description, "The number of concurrent HTTP requests that are currently in-flight") where name == "http.server.active_requests"
          - set(description, "") where name == "http.server.requests"
          - set(description, "") where name == "http.server.requests.max"
          - set(description, "Number of log events that were enabled by the effective log level") where name == "logback.events"
github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

bryan-aguilar commented 1 year ago

Can you post a log dump with these error messages? Are they coming from the prometheus exporter? I'm trying to figure out where this is enforced but have not had any luck yet.

If the Prometheus exporter enforces that the descriptions are semantically correct, then I believe the error would be in whatever resource is producing the metric with the bad description.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 10 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.