prometheus-community / stackdriver_exporter

Google Stackdriver Prometheus exporter
Apache License 2.0
257 stars 98 forks source link

[Bugfix]: Fix histogram merge #324

Closed wildum closed 5 months ago

wildum commented 6 months ago

The current implementation is not merging the histogram correctly. The sum is calculated as follow:

count * (existingMean + currentMean) / 2

Now let's say that the existing histogram has a count of 1000 and an existing mean of 10000, the incoming histogram has a count of 100 and a mean of 1000.

newCount = 1000 + 100 = 1100 newMean = (10000 + 1000) / 2 = 5500 newSum = 1100 * newMean = 6_050_000

Now the next incoming histogram has a count of 1000 and an existing mean of 100:

newCount = 1100 + 1000 = 2100 newMean = (5500 +100) / 2 = 2800 newSum = 2100 * 2800 = 5_880_000

By definition, the sum of the histogram is supposed to behave like a counter (A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.)

In the example above, the sum decreased (5_880_000 < 6_050_000) which is not something that should happen.

You can see it happening with real values:

image

With this PR, the computation of the sum changes to the following:

sum = existingSum + (currentMean * count)

Taking the previous example, we would have:

previousSum = 10_000_000 newSum = previousSum (1000 100) = 10_100_000

previousSum = 10_100_000 newSum = previousSum (100 1000) = 10_200_000

The sum is increasing and the computation is correct: (the sum of 100 buckets with a mean value of 1000 should be equal to the sum of 1000 buckets with a mean value of 100)

enisoc commented 5 months ago

This looks like it will also fix a separate bug that I came here to report:

We've observed that the exported histogram _count was wrong when delta aggregation is turned on. The _count should be equal to the _bucket{le="+Inf"} value (which was itself correct), but instead the _count value was equal to the sum of all bucket counts.

I think that bug is due to this code:

https://github.com/prometheus-community/stackdriver_exporter/blob/205417faf538f3ebceb49564dbeed8b69a8dad48/delta/histogram.go#L114-L117

This PR replaces that code entirely, and the new code looks correct:

    // Increment totals based on incoming totals
    h.Sum += other.Sum
    h.Count += other.Count