open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Prometheus receiver not handling third-party Prometheus exporters sending untyped metrics #1103

Closed Nicolas-MacBeth closed 4 years ago

Nicolas-MacBeth commented 4 years ago

Describe the bug I am trying to send metrics from databases to the Collector using Prometheus' exporters for third-party services, and the Collector does not seem to be able to handle them. My research has led me to understand that the problem lies in the Prometheus receiver, more specifically in its handling of the Prometheus untyped TYPE.

The biggest problem here is that because of that one "faulty" metric, all the others (several hundred) are discarded and not exported. 2 out of 2 metric exporters that I have tried send this type of metric, so I think it's safe to assume others will too. How would the community like to remedy this on the Collector side? At the very least, I think the behavior should be a warning plus the rest of the metrics being handled/exported. Otherwise, how can we handle these untyped metrics? Here's the Prometheus reference.

In my case specifically, that metric could be substituted with a gauge - which fixed everything - but untyped is used as a default according to the Prometheus reference. I'm unsure if it would be wise to "try to convert untyped to a gauge type by default", since an exporter passing incorrect values with a gauge type could break other things, but it could also be a great solution, since it seems untyped is used when the actual value passed with the labels is unnecessary (for example, the "faulty" metric was the following:

# TYPE pg_static untyped
pg_static{server="localhost:5432",short_version="12.2.0",version="PostgreSQL 12.2 (Debian 12.2-1+build2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 9.2.1-16) 9.2.1 20191030, 64-bit"} 1

here, the value 1 is not needed). Also, even if something passed as untyped "breaks" the receiver, that behavior is no worse than the one we have now (nothing being exported).
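For illustration, here is a minimal sketch of how an exporter could publish the same information as an explicit gauge, which is roughly the substitution that fixed things for me. It assumes the standard client_golang library and is not the actual postgres exporter code; the metric name and labels just mirror the pg_static example above.

// Illustrative sketch (not the postgres exporter's actual code): expose the
// same build-info style metric as an explicit gauge instead of untyped.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Declaring pg_static as a gauge makes the # TYPE line read "gauge"
    // rather than "untyped"; the value (1) only exists to carry the labels.
    pgStatic := prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "pg_static",
        Help: "Version information carried in labels; value is always 1.",
    }, []string{"server", "short_version", "version"})
    prometheus.MustRegister(pgStatic)

    pgStatic.WithLabelValues("localhost:5432", "12.2.0", "PostgreSQL 12.2 ...").Set(1)

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9187", nil))
}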

I'd appreciate some input on how this should be fixed, and depending on the complexity/work involved, would love to try my hand at fixing it. Thanks!

Extra info:

I tried with the MySQL exporter and the PostgreSQL exporter. They both give this log in the Collector, every 10 seconds (scrape period):

2020-06-10T14:53:39.615Z        INFO    internal/metrics_adjuster.go:357        Adjust - skipping unexpected point      
{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "type": "UNSPECIFIED"}

The same metrics with the same exporters appear just fine in the actual Prometheus service, but are dropped in the Collector, and since they're dropped they're not exported to my backend of choice. Once again, the problem seems to be that the metric type (untyped) is not recognized by the Collector, and this line of code runs: https://github.com/open-telemetry/opentelemetry-collector/blob/8aa273184455591cad278c92c7cfcf75ad353d57/receiver/prometheusreceiver/internal/metrics_adjuster.go#L357 (notice the comment saying // this shouldn't run right above it).
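To make the failure mode concrete, here is a simplified illustration (my own sketch, not the actual metrics_adjuster source) of how a type switch with a catch-all default ends up logging and dropping any point whose descriptor type is UNSPECIFIED:

// Simplified illustration of the drop behavior (not the collector's actual
// code): any descriptor type that isn't explicitly handled falls through to
// a default case that logs and discards the point.
package main

import (
    "fmt"

    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// keepPoint reports whether a point of the given descriptor type would be
// kept; unrecognized types are logged and dropped.
func keepPoint(t metricspb.MetricDescriptor_Type) bool {
    switch t {
    case metricspb.MetricDescriptor_GAUGE_DOUBLE,
        metricspb.MetricDescriptor_CUMULATIVE_DOUBLE,
        metricspb.MetricDescriptor_CUMULATIVE_DISTRIBUTION,
        metricspb.MetricDescriptor_SUMMARY:
        return true
    default:
        // Corresponds to the "Adjust - skipping unexpected point" log above.
        fmt.Printf("skipping unexpected point, type=%v\n", t)
        return false
    }
}

func main() {
    // An untyped Prometheus metric currently arrives as UNSPECIFIED and is dropped.
    fmt.Println(keepPoint(metricspb.MetricDescriptor_UNSPECIFIED))  // false
    fmt.Println(keepPoint(metricspb.MetricDescriptor_GAUGE_DOUBLE)) // true
}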

More info: There are other untyped metrics exported by those two exporters, but they all have NaN as their values, so they're treated as dummy metrics (with 0 values, if I understand correctly) by the Collector and dropped, so they don't cause the same problem. However, the moment they carry a real value they would pose the same problem. Those metrics also appear properly when collected by the standalone Prometheus service, with NaN showing as their value in the dashboard. That's another instance of the OTel Prometheus receiver behaving differently from the standalone Prometheus service, but it's less of a problem for now since at least the rest of the metrics are handled/exported.

Steps to reproduce Download and run either of the two aforementioned databases, download and build its exporter and the Collector, and run all three at once.

What did you expect to see? I expected to see my metrics exported and appear in my backend of choice (Stackdriver).

What did you see instead? I saw this warning message logged by the Collector at the latest commit (9af3381):

2020-06-10T14:53:39.615Z        INFO    internal/metrics_adjuster.go:357        Adjust - skipping unexpected point      
{"component_kind": "receiver", "component_type": "prometheus", "component_name": "prometheus", "type": "UNSPECIFIED"}

Log when running the latest release binary (0.3.0):

{"level":"info","ts":1591373779.4030387,"caller":"internal/metrics_adjuster.go:357",
"msg":"Adjust - skipping unexpected point","type":"UNSPECIFIED"}

Also, those metrics were not exported, only dropped.

What version did you use? OpenTelemetry Contrib:

What config did you use?

extensions:
        health_check:
        zpages:
                endpoint: 0.0.0.0:55679
receivers:
        prometheus:
                config:
                        scrape_configs:
                                - job_name: 'postgresql_exporter_collector'
                                  scrape_interval: 10s
                                  static_configs:
                                          - targets: ['localhost:9187']
processors:
        batch:
        queued_retry:
exporters:
        stackdriver:
                project: "placeholder"
                skip_create_metric_descriptor: true
        file:
                path: ./metric.json
service:
        pipelines:
                metrics:
                        receivers: [prometheus]
                        exporters: [stackdriver]
        extensions: [health_check, zpages]

Environment OS: Debian 9 Go version: 1.14.3 linux/amd64

Nicolas-MacBeth commented 4 years ago

Joshua MacDonald in the Gitter chat seems to support defaulting to gauge when there is ambiguity - see https://gitter.im/open-telemetry/community.

I believe the conversion is done in this function: convToOCAMetricType (link).
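For context, here is a rough sketch of the kind of mapping change being proposed. This is my own illustration, not the actual convToOCAMetricType implementation, and the import paths may differ by Prometheus version: the idea is to map Prometheus textparse metric types to OpenCensus descriptor types and treat untyped/unknown as a gauge so its points stop being dropped.

// Sketch of the proposed mapping (not the receiver's actual code): treat
// textparse.MetricTypeUnknown (Prometheus "untyped") as a gauge instead of
// letting it fall through to UNSPECIFIED.
package main

import (
    metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
    "github.com/prometheus/prometheus/pkg/textparse"
)

func toOCMetricType(mt textparse.MetricType) metricspb.MetricDescriptor_Type {
    switch mt {
    case textparse.MetricTypeCounter:
        return metricspb.MetricDescriptor_CUMULATIVE_DOUBLE
    case textparse.MetricTypeGauge:
        return metricspb.MetricDescriptor_GAUGE_DOUBLE
    case textparse.MetricTypeHistogram:
        return metricspb.MetricDescriptor_CUMULATIVE_DISTRIBUTION
    case textparse.MetricTypeSummary:
        return metricspb.MetricDescriptor_SUMMARY
    case textparse.MetricTypeUnknown:
        // Proposed change: untyped metrics become gauges rather than
        // UNSPECIFIED, so their points are no longer skipped.
        return metricspb.MetricDescriptor_GAUGE_DOUBLE
    default:
        // info, stateset, gaugehistogram, etc. are left unhandled for now.
        return metricspb.MetricDescriptor_UNSPECIFIED
    }
}

func main() {
    _ = toOCMetricType(textparse.MetricTypeUnknown) // GAUGE_DOUBLE under the proposal
}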

draffensperger commented 4 years ago

@Nicolas-MacBeth do you think this would be a good first issue for you to tackle, doing that conversion? @tigrannajaryan would you be open to that (I saw you assigned it to @asuresh4)?

asuresh4 commented 4 years ago

@Nicolas-MacBeth - I'd be happy to help with the review if you would like to work on this.

Nicolas-MacBeth commented 4 years ago

@asuresh4 I appreciate that. So to confirm: should I take the approach where all cases of ambiguity are handled as gauges (a gauge default for every metric whose type is unknown/not supported), or only for the untyped ones?

asuresh4 commented 4 years ago

Looking at the commit that introduced the new metric types, it seems to me like metrics of untyped type map to the unknown type. If that's indeed true, I believe we could convert metrics with unknown type to gauge. However, I'm not sure if the same holds for the other types introduced by that commit, especially the info and stateset types.

@bogdandrutu @tigrannajaryan - what do you guys think?

tigrannajaryan commented 4 years ago

This is outside of my area of expertise. @bogdandrutu perhaps you can help.

asuresh4 commented 4 years ago

@Nicolas-MacBeth - I synced with @bogdandrutu on this offline and he's in agreement that we can convert metrics of textparse.MetricTypeUnknown type to gauge and get to the other unsupported types at a later point. Let me know if you would still like to work on the issue, or I could pick it up.

Nicolas-MacBeth commented 4 years ago

Okay, perfect! I'll make a PR soon; unless I'm missing something, it should be a simple fix. Thanks!

Nicolas-MacBeth commented 4 years ago

Hi @asuresh4 , I made a PR if you want to check it out! https://github.com/open-telemetry/opentelemetry-collector/pull/1194