stripe / veneur

A distributed, fault-tolerant pipeline for observability data
MIT License
1.73k stars 174 forks

veneur-prometheus bugfix: Specifying multiple tags/labels with -a causes sporadic incorrect metric emission #1052

Open dmarriner opened 1 year ago

dmarriner commented 1 year ago

Summary

Sort the labels on a given metric before adding them to the cache.

Motivation

This fixes a bug in veneur-prometheus that caused diff calculations to break sporadically when multiple labels were added via the -a flag. Labels were stored as key/value pairs in a map, and because Go randomizes map iteration order, they could be returned in a different order each time. As a result, the translator sometimes emitted the cumulative value instead of the diffed value: the metric appeared to be new when it had in fact already been cached under a different label ordering. For example, the translator might look for "counter-key2:value2-key1:value1" when the metric had previously been cached as "counter-key1:value1-key2:value2".
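A minimal sketch of the idea, assuming the cache key is the metric name followed by "-key:value" pairs as in the example above. The helper name cacheKey is hypothetical and is not taken from the veneur-prometheus source; it only illustrates sorting the label names before building the key.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// cacheKey is a hypothetical helper (not the actual veneur-prometheus code).
// It builds a deterministic key by sorting label names before joining them,
// so the same label set always maps to the same cache entry.
func cacheKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	// Map iteration order is not specified in Go, so fix the order here.
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		b.WriteString("-")
		b.WriteString(k)
		b.WriteString(":")
		b.WriteString(labels[k])
	}
	return b.String()
}

func main() {
	labels := map[string]string{"key2": "value2", "key1": "value1"}
	// Prints the same key on every run, regardless of map iteration order.
	fmt.Println(cacheKey("counter", labels))
}
```

With the keys sorted, a lookup for the same label set produces the same string each time, so a previously cached counter is found and diffed rather than treated as new.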

Test plan

Wrote an integration test.

Rollout/monitoring/revert plan

N/A

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

mimran-stripe commented 1 year ago

Hey @dmarriner!

I'm fine with this. Unless I'm mistaken, the probability that an unsorted map iteration would pass all 100 asserts is very low. Assuming that iteration order is truly random: (1 * 0.5 * 0.5) ^ 100 = 6.22301528e-61
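The arithmetic can be checked with a short Go sketch. It assumes, as the estimate above does, that each of the 100 asserts passes by chance independently with probability 1 * 0.5 * 0.5 = 0.25; the function name allPassProbability is made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// allPassProbability returns p^n: the chance that n independent checks,
// each passing by luck with probability p, all pass.
func allPassProbability(p float64, n int) float64 {
	return math.Pow(p, float64(n))
}

func main() {
	// Per-assert probability taken from the estimate in the comment above.
	p := 1 * 0.5 * 0.5
	fmt.Printf("%.8e\n", allPassProbability(p, 100))
}
```

Since 0.25^100 equals 2^-200, the result is on the order of 1e-61, so a test that passes all 100 asserts is very unlikely to be passing by accident.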

One nit: could you update that PR comment to read:

Summary: Sort the labels on a given metric before adding them to the cache.

Motivation: This fixes a bug in veneur-prometheus that caused diff calculations to break sporadically when multiple labels were added via the -a flag. Labels were stored as key/value pairs in a map, and because Go randomizes map iteration order, they could be returned in a different order each time. As a result, the translator sometimes emitted the cumulative value instead of the diffed value: the metric appeared to be new when it had in fact already been cached under a different label ordering. For example, the translator might look for "counter-key2:value2-key1:value1" when the metric had previously been cached as "counter-key1:value1-key2:value2".

It'll make viewing git blame history a bit easier.

dmarriner commented 1 year ago

Updated the PR comment!