Closed: martinrw closed this issue 10 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Not sure if this is related or not, but I tried enabling error logs and I see entries like this:
* collected metric prometheus_http_server_duration_milliseconds label:<name:"framework" value:"spring" > label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/v1/vins/licenseplate/{licensePlateValue}/country/{country}/latest" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"204" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-22-89.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.20.233" > label:<name:"k8s_pod_name" value:"service-name-bf77bcc49-9dmh9" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:02:56 +0000 UTC" > label:<name:"k8s_pod_uid" value:"f3b6ef2e-31f5-4f4e-871a-ed34acfc247a" > label:<name:"label_team" value:"team-name" > label:<name:"net_host_name" value:"service-name-svc.service-name" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-XX:+UseSerialGC\",\"-Dopentelemetry.environment=development\",\"-javaagent:/opentelemetry/opentelemetry.jar\",\"-jar\",\"/app/service-name-fa541655.jar\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 19.0.2+7" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"19.0.2+7" > label:<name:"service_name" value:"service-name" > histogram:<sample_count:12 sample_sum:3068.996335 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:11 upper_bound:10 > bucket:<cumulative_count:11 upper_bound:25 > bucket:<cumulative_count:11 upper_bound:50 > bucket:<cumulative_count:11 upper_bound:75 > bucket:<cumulative_count:11 upper_bound:100 > bucket:<cumulative_count:11 upper_bound:250 > bucket:<cumulative_count:11 upper_bound:500 > bucket:<cumulative_count:11 upper_bound:750 > bucket:<cumulative_count:11 upper_bound:1000 > bucket:<cumulative_count:11 upper_bound:2500 > bucket:<cumulative_count:12 upper_bound:5000 > bucket:<cumulative_count:12 upper_bound:7500 > bucket:<cumulative_count:12 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
* collected metric prometheus_jvm_threads_states label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"5b8bfdc684-29xd5" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-15-117.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.11.123" > label:<name:"k8s_pod_name" value:"service-name-5b8bfdc684-29xd5" > label:<name:"k8s_pod_start_time" value:"2023-05-30 12:45:54 +0000 UTC" > label:<name:"k8s_pod_uid" value:"bcc28b31-7026-47d0-bf89-e996e8b138a2" > label:<name:"label_team" value:"team-name" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"service-name" > label:<name:"state" value:"terminated" > gauge:<value:0 > has help "The current number of threads having NEW state" but should have "The current number of threads"
* collected metric prometheus_http_server_duration_milliseconds label:<name:"host_arch" value:"amd64" > label:<name:"host_name" value:"74f9654b8-nkpdn" > label:<name:"http_method" value:"GET" > label:<name:"http_route" value:"/status" > label:<name:"http_scheme" value:"http" > label:<name:"http_status_code" value:"200" > label:<name:"job" value:"service-name" > label:<name:"k8s_deployment_name" value:"service-name" > label:<name:"k8s_namespace_name" value:"service-name" > label:<name:"k8s_node_name" value:"ip-10-11-20-192.eu-central-1.compute.internal" > label:<name:"k8s_pod_ip" value:"10.11.18.204" > label:<name:"k8s_pod_name" value:"service-name-74f9654b8-nkpdn" > label:<name:"k8s_pod_start_time" value:"2023-05-30 04:50:43 +0000 UTC" > label:<name:"k8s_pod_uid" value:"402efd2d-1daf-4672-b3cf-535a8f76d52a" > label:<name:"net_host_name" value:"10.11.18.204" > label:<name:"net_host_port" value:"8080" > label:<name:"net_protocol_name" value:"http" > label:<name:"net_protocol_version" value:"1.1" > label:<name:"os_description" value:"Linux 5.15.90" > label:<name:"os_type" value:"linux" > label:<name:"process_command_args" value:"[\"/opt/java/openjdk/bin/java\",\"-Xms512m\",\"-Xmx512m\",\"-Dnewrelic.environment=qa\",\"-javaagent:/open-telemetry/opentelemetry-javaagent.jar\",\"org.springframework.boot.loader.JarLauncher\"]" > label:<name:"process_executable_path" value:"/opt/java/openjdk/bin/java" > label:<name:"process_pid" value:"1" > label:<name:"process_runtime_description" value:"Eclipse Adoptium OpenJDK 64-Bit Server VM 17.0.6+10" > label:<name:"process_runtime_name" value:"OpenJDK Runtime Environment" > label:<name:"process_runtime_version" value:"17.0.6+10" > label:<name:"service_name" value:"tom" > histogram:<sample_count:1 sample_sum:701.935696 bucket:<cumulative_count:0 upper_bound:0 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:25 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:75 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:250 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:750 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2500 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:7500 > bucket:<cumulative_count:1 upper_bound:10000 > > has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
Just pulling out the parts that seem important there:
> has help "The duration of the inbound HTTP request" but should have "measures the duration of the inbound HTTP request"
> has help "The current number of threads having NEW state" but should have "The current number of threads"
This would suggest a mismatch between the description of the metric being received and the actual spec of that metric?
But it seems strange that this would only happen for some data points in a metric and not every one... does that help to narrow down the issue?
I have a workaround for this, which is to use the transform processor to make sure the description is always set correctly... it feels like a bodge, though, and something that shouldn't be necessary, so I'll leave this ticket open:
transform:
  error_mode: ignore
  metric_statements:
    - context: metric
      statements:
        - set(description, "The duration of the inbound HTTP request") where name == "http.server.duration"
        - set(description, "The current number of threads having NEW state") where name == "jvm.threads.states"
        - set(description, "The number of concurrent HTTP requests that are currently in-flight") where name == "http.server.active_requests"
        - set(description, "") where name == "http.server.requests"
        - set(description, "") where name == "http.server.requests.max"
        - set(description, "Number of log events that were enabled by the effective log level") where name == "logback.events"
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Can you post a log dump with these error messages? Are they coming from the prometheus exporter? I'm trying to figure out where this is enforced but have not had any luck yet.
If the prometheus exporter enforces the descriptions being semantically correct, then I believe the error would be in whatever resource is producing the metric with the bad description.
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
exporter/prometheus
What happened?
Describe the bug: We are using the opentelemetry-collector Helm chart to deploy. We have the OTel agent running on a number of Java and Python services.
For metrics that are counter-based, such as http_server_duration_milliseconds_count, we are frequently seeing dropped/missing data points. It is set to scrape every 15 seconds, but as you can see from these screenshots we are frequently missing 1 or 2 data points in a row.
For our other, non-counter-based OTel metrics we see a data point every 15 seconds with nothing being dropped, and the same goes for metrics coming from kube-state-metrics etc.
Steps to reproduce: OTel agent installed on the Docker image pushing metrics to the OTel collector (see config below), which is scraped by Prometheus. We see the same thing when looking at the metric in both Prometheus and Grafana.
Looking at the scrape config in Prometheus, this is what we end up with:
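In essence it is a standard scrape_config with a 15-second interval pointed at the collector's Prometheus exporter endpoint; a rough sketch (the job name and target address below are placeholders, not our real values):
scrape_configs:
  - job_name: otel-collector              # placeholder job name
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8889"]  # placeholder address of the collector's prometheus exporter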
What did you expect to see? I would expect to see a data point every 15 seconds.
What did you see instead? When looking at the values over time for one of these metrics, we occasionally have no data for 15 or 30 seconds (see screenshot above).
Collector version
0.77.0
Environment information
Environment
Kubernetes - Bottlerocket OS 1.13.1 (aws-k8s-1.24)
OpenTelemetry Collector configuration
Log output
No response
Additional context
This is a brand-new Prometheus + Grafana + OpenTelemetry set-up; we've had this problem right from the beginning. We see the missing data points both in our Grafana queries and also when querying Prometheus directly.