prometheus / graphite_exporter

Server that accepts metrics via the Graphite protocol and exports them as Prometheus metrics

Handle duplicated metrics gracefully #79

Open pnyheim opened 5 years ago

pnyheim commented 5 years ago

Summary:

The exporter remembers metrics by Graphite name. It is possible to get into situations where multiple samples (graphite metrics) result in the same Prometheus metric.

At the moment, this makes the exporter catastrophically fail (return error 500 on /metrics). Instead, it should discard the older sample and use the new sample's value.
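To illustrate (a hypothetical mapping and hypothetical metric paths, written in the exporter's YAML mapping-config style): when a mapping matches a path component with a wildcard but never captures it into a label, two distinct Graphite names collapse onto the same Prometheus series, and the collector then reports that series twice on a scrape.

```yaml
# hypothetical mapping: the first path component (the host) is matched
# by a wildcard but not captured into any label
mappings:
- match: "*.*.response_time"
  name: app_response_time_seconds
  labels:
    service: $2
```

```
# two Graphite plaintext samples ("<path> <value> <timestamp>") that both
# map to app_response_time_seconds{service="api"}
hostA.api.response_time 0.12 1551375000
hostB.api.response_time 0.15 1551375060
```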

Original issue description:


We tried using v0.5.0 in our dev env today, upgrading from 0.4.2. After the upgrade, metrics seem to be dropping out regularly, as can be clearly seen in this Grafana dashboard (screenshot).

So something is definitely wrong here. There are no changes to the environment or config between 0.4.2 and 0.5.0, and we cannot see anything abnormal in the graphite_exporter log files.

This looks like a critical bug to us.

It should be noted that we have seen similar behavior before where we had UDP packet drops, but this was resolved by tuning the UDP receive buffer as explained here: https://nav.uninett.no/doc/4.3/howto/tuning_graphite.html

matthiasr commented 5 years ago

Damn!

When this happens, are there scrape errors? What does the exporter's /metrics endpoint look like when it's working and when not? Does it log anything? What does it log with --log-level=debug?

pnyheim commented 5 years ago

Damn!

Indeed.

When this happens, are there scrape errors?

How can I tell if there are scrape errors? I've only looked in the graphite_exporter log file, but did not see anything out of the ordinary. I'll try and look in the prometheus log files tomorrow.

What does the exporter's /metrics endpoint look like when it's working and when not?

Looks normal AFAICT. Hard to tell if metrics are missing from /metrics though. Will try to look a bit more into it tomorrow as well.

Does it log anything?

Nothing out of the ordinary

What does it log with --log-level=debug?

Did not try with the debug flag. Can do that tomorrow as well.

matthiasr commented 5 years ago

How can I tell if there are scrape errors?

On the Prometheus server, the up metric for this target at this time would go to 0. There is additional information on the Targets page of the Prometheus server – when there is an error it's shown next to the target there.
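For example, this PromQL query (assuming the scrape job for this exporter is named graphite_exporter) is 1 while scrapes succeed and drops to 0 when they fail:

```
up{job="graphite_exporter"}
```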

raypettersen commented 5 years ago

(screenshot)

raypettersen commented 5 years ago

Also noticed this:

2019/02/28 17:31:22 http: multiple response.WriteHeader calls
2019/02/28 17:31:35 http: multiple response.WriteHeader calls

I've never seen these messages before and I don't know what they mean.

brian-brazil commented 5 years ago

To be clear, those are coming from the exporter, not Prometheus?

raypettersen commented 5 years ago

The log entries are from the exporter (if I understood the question correctly). They came after about 3 hours of runtime.

brian-brazil commented 5 years ago

That's probably a client_golang bug then, @beorn7

https://github.com/prometheus/client_golang/blob/2daed26f633ce23fab5efc02fb3f2e803b3526c3/prometheus/promhttp/http.go#L86

I can see this happening if you have a gzipped response, and then there's a problem encoding a metric family.
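A minimal sketch of that failure mode outside client_golang (the handler and names are illustrative, and the exact wording of the log line depends on the Go version): once any of the gzipped body has been written, the 200 status is committed, so reporting an error afterwards calls WriteHeader a second time.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"net/http"
)

// handler sketches how an error discovered mid-encoding leads to
// "http: multiple response.WriteHeader calls": the first body write
// implicitly commits a 200 status, and the later http.Error calls
// WriteHeader again.
func handler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Encoding", "gzip")
	gz := gzip.NewWriter(w)
	defer gz.Close()

	fmt.Fprintln(gz, "# HELP some_metric A metric.") // first write; status is now committed
	gz.Flush()

	// Pretend encoding a later metric family failed; this triggers the
	// second WriteHeader that net/http complains about in the log.
	http.Error(w, "error encoding metric family", http.StatusInternalServerError)
}

func main() {
	http.HandleFunc("/metrics", handler)
	http.ListenAndServe(":8080", nil)
}
```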

raypettersen commented 5 years ago

If there is enough "evidence", then I suggest removing the release for now. Let me know if I can do anything else to provide more details.

brian-brazil commented 5 years ago

The question is what's causing the underlying error. Could you add some extra logging around that bit of code?

pnyheim commented 5 years ago

I was able to poll the /metrics endpoint regularly for a while - and every once in a while it would return this:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "hikari_Wait" { label:<name:"attribute" value:"mean_rate" > label:<name:"datasource" value:"dsname" > label:<name:"foo" value:"bar" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "hikari_ConnectionCreation" { label:<name:"attribute" value:"p98" > label:<name:"datasource" value:"dsname2" > label:<name:"foo" value:"bar" > gauge:<value:25 > } was collected before with the same name and label values

I've seen different metric names and labels reported, and a different number of errors from time to time.

Ideally, this should not be written to the /metrics response, but logged as an error instead.

brian-brazil commented 5 years ago

The error there is expected and correct from an exposition standpoint. The question is what's causing it inside the graphite exporter.

pnyheim commented 5 years ago

Still, some error lines in the logfile when this happens would have been nice.

BTW, I enabled debug-level logging, but could not find anything useful in the log.

matthiasr commented 5 years ago

The error happens at scrape time, so the exporter main doesn't have access to the error. Logging from within the library would be good but I don't know what a good interface for this would be.

The error you posted is input and mapping dependent. Could you share your mapping configuration, and any inputs that could get mapped to these metric names? I suspect we are keeping something in the exporter keyed on different Graphite metric names that then maps down to the same Prometheus time series.


matthiasr commented 5 years ago

This likely happened before but was silently handled (in an undefined way) by the client library. We came from an ancient version, and it has grown a lot more validations since then.


fffonion commented 5 years ago

Sorry, this looks like my carelessness when bumping client_golang. FYI, a similar error happened when running the e2e test, hence this fix https://github.com/prometheus/graphite_exporter/commit/3d093347bac809689ebd31a344a4fbe9a10c9b35. Indeed this is a behaviour change in client_golang. I feel it could be addressed by not creating a new metric every time in https://github.com/prometheus/graphite_exporter/blob/3d093347bac809689ebd31a344a4fbe9a10c9b35/main.go#L192-L201; instead we can cache the metric instance for the same originalName (like in statsd_exporter).

However, I do have a question (I'm not very familiar with the Graphite protocol): if two metrics with the same name and labels arrive between two Prometheus scrapes, the current behaviour is that the exporter only exports the later value (since they are gauges). Is this expected?
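A rough sketch of that caching idea, with assumed types and field names (expiry handling omitted; this is not the exporter's actual code): keep one registered GaugeVec per mapped metric name and set the latest value on it, so a duplicate sample overwrites the previous value instead of colliding at scrape time.

```go
package exporter

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// sample is a hypothetical stand-in for the exporter's parsed, mapped sample.
type sample struct {
	Name   string // mapped Prometheus metric name
	Help   string
	Labels map[string]string // mapped label names and values
	Value  float64
}

// gaugeCache keeps one GaugeVec per metric name instead of emitting a fresh
// const metric for every received sample on every scrape.
type gaugeCache struct {
	mu     sync.Mutex
	reg    prometheus.Registerer
	gauges map[string]*prometheus.GaugeVec
}

func newGaugeCache(reg prometheus.Registerer) *gaugeCache {
	return &gaugeCache{reg: reg, gauges: map[string]*prometheus.GaugeVec{}}
}

// observe records the newest value for a sample; a duplicate simply
// overwrites the previous value for the same label values.
func (c *gaugeCache) observe(s sample) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	gv, ok := c.gauges[s.Name]
	if !ok {
		names := make([]string, 0, len(s.Labels))
		for name := range s.Labels {
			names = append(names, name)
		}
		gv = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: s.Name, Help: s.Help}, names)
		if err := c.reg.Register(gv); err != nil {
			return err
		}
		c.gauges[s.Name] = gv
	}

	g, err := gv.GetMetricWith(s.Labels)
	if err != nil { // e.g. the label names differ from the first sample seen
		return err
	}
	g.Set(s.Value)
	return nil
}
```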

matthiasr commented 5 years ago

Yes, that's the semantics of the exporter and Prometheus. It's a little more complex in real Graphite because one of them could come with an older timestamp but we don't support that kind of backfilling. We only export the latest received sample.


beorn7 commented 5 years ago

That's probably a client_golang bug then […] I can see this happening if you have a gzipped response, and then there's a problem encoding a metric family.

Sorry for the late follow-up. I think @brian-brazil is right here. If the encoding fails somewhere in the middle, we cannot send an HTTP error anymore. I'll think about a way to detect this and handle it better. See https://github.com/prometheus/client_golang/issues/542

rtkgjacobs commented 5 years ago

Also ran into this; reverted back to 0.4.2 to unblock for now. Observed that the graphite-exporter returns HTTP 500s, and the following errors when curling the exporter directly:

HTTP/500 response, with this payload returned:

An error has occurred while serving metrics:

15 error(s) occurred:
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:1 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:1 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:4.3613450171217345e+07 > } was collected before with the same name and label values
* collected metric "foobox_ddns_updater_timer_seconds" { gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "os_network_dispatch_total" { label:<name:"action" value:"dispatch" > label:<name:"host" value:"graphite-i-45993598_ops_infra_prod" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:4.3613450171217345e+07 > } was collected before with the same name and label values
* collected metric "os_network_send_total" { label:<name:"action" value:"send" > label:<name:"host" value:"graphite-i-45993598_ops_infra_prod" > gauge:<value:0 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:6604.048014 > } was collected before with the same name and label values
* collected metric "foo_ddns_updater_timer_seconds" { gauge:<value:0.1 > } was collected before with the same name and label values

Let me know if you need any other logging or other details.

rtkgjacobs commented 5 years ago

Further update from my own testing / findings to date:

I assume we are hitting the client library error "was collected before with the same name and label values" coming from prometheus.MustNewConstMetric() - meaning a new duplicate metric is being created, the Prometheus client library is angry about it, and it drops large sets of our data in 0.4.2 and all data regardless in 0.5.0.

By extending the expiry window I can avoid it, given our data's behavior. Certainly, if logging debug output from the Prometheus library is possible, it would be handy in the future to make this stuff easier to debug, as I have no real way of knowing what is tripping up the exporter client library.

I think @fffonion is onto something with this suggestion above:

Indeed this is a behaviour change in client_golang. I feel it could be addressed by not creating a new metric every time in

graphite_exporter/main.go

Lines 192 to 201 in 3d09334

for _, sample := range samples {
	if ageLimit.After(sample.Timestamp) {
		continue
	}
	ch <- prometheus.MustNewConstMetric(
		prometheus.NewDesc(sample.Name, sample.Help, []string{}, sample.Labels),
		sample.Type,
		sample.Value,
	)
}

, instead we can cache the metric instance for the same originalName (like in statsd_exporter). However, I do have a question (I'm not very familiar with the Graphite protocol): if two metrics with the same name and labels arrive between two Prometheus scrapes, the current behaviour is that the exporter only exports the later value (since they are gauges). Is this expected?

It seems I can get a working pipeline, but only at 2x the duration I set with graphite.sample-expiry. The metrics data feed we're slowly replacing has several field sets of data that are delivered out of order at times. We're fine with more recent data winning, but it seems the ordering is a pinch point for us with the graphite-exporter. Anyone have suggestions? It seems the suggestion above would address this for us in part.

matthiasr commented 5 years ago

I don't think keeping persistent metric instances around is the best way. It fixes the issue incidentally because it forces us to dedup, but we can also dedup and then generate a constmetric.

At least as a direct fix for the issue at hand, what if we add straight-up deduplication to this loop? Basically, remember which (sample.Name,sample.Labels) combinations we have already seen and continue if we encounter one a second time?

That would also allow logging the duplicated samples, without relying on being able to instrument the client library.

As an alternative approach: just above that loop, we copy the samples out of the graphiteCollector - what if, instead of into a slice, we copied them into a map with (sample.Name, sample.Labels) as the key in some form? That way we would get the deduplication "for free". I'm not sure either approach would guarantee "last metric wins", though.
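A rough sketch of that straight-up deduplication, again with assumed types (not the exporter's actual code): key samples by name plus sorted labels, keep the newest sample per key, and only then emit const metrics. Logging the displaced duplicates would slot naturally into the first loop.

```go
package exporter

import (
	"fmt"
	"sort"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// sample is a hypothetical stand-in for the exporter's internal sample type.
type sample struct {
	Name      string
	Help      string
	Labels    map[string]string
	Type      prometheus.ValueType
	Value     float64
	Timestamp time.Time
}

// seriesKey builds a canonical key from the metric name and sorted labels,
// so two samples that would expose the same series get the same key.
func seriesKey(s sample) string {
	names := make([]string, 0, len(s.Labels))
	for name := range s.Labels {
		names = append(names, name)
	}
	sort.Strings(names)
	key := s.Name
	for _, name := range names {
		key += fmt.Sprintf("\x00%s\x00%s", name, s.Labels[name])
	}
	return key
}

// collectDeduped emits at most one metric per series: expired samples are
// skipped as in the existing loop, and the newest timestamp wins otherwise.
func collectDeduped(samples []sample, ageLimit time.Time, ch chan<- prometheus.Metric) {
	newest := map[string]sample{}
	for _, s := range samples {
		if ageLimit.After(s.Timestamp) {
			continue
		}
		k := seriesKey(s)
		if prev, ok := newest[k]; !ok || s.Timestamp.After(prev.Timestamp) {
			newest[k] = s // a displaced duplicate could be logged here
		}
	}
	for _, s := range newest {
		ch <- prometheus.MustNewConstMetric(
			prometheus.NewDesc(s.Name, s.Help, []string{}, s.Labels),
			s.Type,
			s.Value,
		)
	}
}
```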

rtkgjacobs commented 5 years ago

As you perform the deduplication loop, could you compare the age of each entry against what is already staged for the same name before passing it into the Prometheus client exporter? Design-wise, the goal is to ensure that when more than one duplicate is found, the 'newer' entry always ends up on top as what gets exported - a kind of 'bubble up' sort during that decision making. After all records are iterated over, only the 'newest' entry wins and is passed up. This feature could be a flag like --graphite.dedupe if it is experimental or an optional use case (it would also be nice to have DEBUG-level logging show when duplicates were found/displaced).

For us, the plan for production today is sadly to set a very long retention time like --graphite.sample-expiry=240m and then restart the service every 60 minutes or so to garbage-collect older metrics, since we have thousands of systems reporting legacy Graphite data that auto-scale up and down. We can't immediately pivot to poll-based metrics scraping and are building out a migration story from legacy Graphite pushes to Prometheus scraping over time. We will of course be moving more and more onto modern platforms and approaches like Kubernetes and Prometheus, and this project is part of that push.

Hope this helps. Happy to test anything out on my side, let me know thx.

matthiasr commented 5 years ago

I don't understand why a longer expiry helps?

rtkgjacobs commented 5 years ago

In our case we're partly receiving data from field devices, which have sporadic connectivity - think of it as an IoT-type use case. So we're getting a mixture of 'new' and sometimes older Graphite data, all mixed together. The majority of our data is 'fresh', but it appears to be the sporadic elements that pose the problem for us. We really only care about the 'latest' data, but this is a pinch point where Graphite and Prometheus handle the situation differently.

We find that once that mixed set of older/newer data hits, most of it is dropped by the graphite-exporter (presumably the Prometheus exporter library being angry about duplicates?). I can't get any logging from the scrape/exporter layer to confirm or understand where the pinch is, which seems to be a limitation of how the Prometheus client library is being used at present.

From looking at the source, and increasing the expiration time via graphite.sample-expiry, we are able to receive all our data and have it scraped by Prometheus, at the cost of not rolling off old / no-longer-seen elements after x time - hence the 'worst case' garbage collection via a service restart. Yes, ugly, I know :)

At a high level, we're at a point where Graphite isn't capable of scaling, and we're migrating to Prometheus. This is a two-step approach: first replacing Graphite's role as-is by using graphite-exporter to get that data into Prometheus, and then the slower refactoring and implementation of data collection fixtures (push vs. poll etc.), with the legacy Graphite approach perhaps being entirely displaced in time.

Is there anything else I can gather or do to help out?

rtkgjacobs commented 5 years ago

So, just to close out: we did much more investigation on our end and found that several origin data points' Graphite data had been 'duplicated' in the past for prior teams' convenience - those two feeds were being sent into our single graphite-exporter, and when faced with those duplicate points (or delayed arrival of one vs. the other), graphite-exporter was 'doing its job'. The issue above still stands - any duplicates cause an HTTP 500 in version 0.5.0 - so we'll use 0.4.2 until it's fixed. Of course, not having logging output from its Prometheus client exporter made finding the root cause hard, so a +1 if that can ever be exposed in logging.

I think the idea of a --graphite.dedupe-old flag makes sense as a safety valve for the legacy use case where someone wants to firehose legacy data in and not spend the immense effort to ensure it's 100% clean - duplicates have been our biggest pinch in getting this to prod. This is also forcing a better design pattern, where we scrape from the major sources rather than trying to use a single ingress point (which is not very Prometheus-proper, I'll agree) and likely running a single graphite-exporter far too hot.

Thanks again for your consideration and efforts to date!

matthiasr commented 5 years ago

Thank you for the update!

I think deduplication need not be optional; the failure mode when ingesting duplicate data is too bad. I don't see when you would ever want to accept a duplicate (and then fail on scrape).


mkoppanen commented 4 years ago

I am seeing this issue with Apache Storm and the built-in Graphite reporter (https://storm.apache.org/releases/current/javadocs/org/apache/storm/metrics2/reporters/GraphiteStormReporter.html). With quick debugging, it seems that parallelism in Apache Storm causes duplicate metrics to be sent to the Graphite exporter (each has the same key):

* collected metric "storm_xyz_response_time_per_product_stddev" { label:<name:"component_name" value:"xyz_updater" > label:<name:"job" value:"storm-supervisor-001" > label:<name:"topology_name" value:"product-ingestion" > gauge:<value:0 > } was collected before with the same name and label values

I would be happy with either being able to ignore duplicates or to group them with an aggregation operator. At the moment the behaviour is quite strict and causes graphite-exporter to go into a failure state when duplicates are found. (Tested with the docker tags master and latest.)

matthiasr commented 4 years ago

if anyone wants to pick this up, please do.

as for implementation, one way out (that I would prefer long-term) could be to extract the relevant parts of the statsd exporter as we already do for the mapper itself.

Logically, the graphite exporter does pretty much the same as the statsd exporter, except it only accepts gauges and the line format is slightly different. They diverged a long time ago, and in the end we keep fixing the same issues independently in both. Strategically, I would like to factor out all the reusable parts (listener, mapper [done], collector, …) into separate packages within the statsd exporter repo, and then only plug them together with a different line parser here.

matthiasr commented 4 years ago

There are more thoughts on what I have in mind for the split here https://github.com/prometheus/statsd_exporter/issues/234#issuecomment-501200735

matthiasr commented 4 years ago

x-ref: #133 made the Collector unchecked, so it should fail in fewer cases now. #134 is about re-using more statsd exporter components so we don't have to solve all these issues over and over again.