jonnenauha / prometheus_varnish_exporter

Varnish exporter for Prometheus
MIT License

Fix issue 57 - errors on VCL reload because of duplicated counters #70

Closed LorenzoPeri closed 3 years ago

LorenzoPeri commented 3 years ago

Fixes #57

**Issue**

The exporter fails when the VCL is reloaded. This happens because Varnish backend counters (VBE) are repeated for each reload, with the reload timestamp embedded in the counter name. Example: `VBE.boot.default.happy` and `VBE.reload_20210114_155148_19902.default.happy` ... thus generating duplicated metrics.

The `-p vcl_cooldown=1` workaround is not good enough, because old counters are still present in `varnishstat` for some time (granularity is 30s, see https://varnish-cache.org/docs/6.5/reference/varnishd.html#vcl-cooldown).

The result is that on each reload the exporter keeps failing for some time (1m in my tests with `-p vcl_cooldown=1`, 10m without it).

**Solution**

I propose to consider only the counters with the most recent timestamp. I quickly determine the most recent timestamp by checking the `.happy` counters, then filter out the outdated counters. This should work for any Varnish version.
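To make the approach concrete, here is a minimal sketch in Go. It is not the exporter's actual code; the function name and the plain string handling are assumptions, and it only illustrates the logic described above: find the newest `reload_<timestamp>` tag via the `.happy` counters, then drop VBE counters that belong to older VCLs.

```go
package main

import (
	"fmt"
	"strings"
)

// filterLatestVBE is a hypothetical helper, not the exporter's real API.
// It keeps only the VBE counters that belong to the most recent VCL
// (the newest reload_<timestamp> tag, or VBE.boot if there was no reload).
func filterLatestVBE(counters []string) []string {
	// 1. Find the newest reload tag by scanning the ".happy" counters.
	latest := ""
	for _, name := range counters {
		if !strings.HasPrefix(name, "VBE.reload_") || !strings.HasSuffix(name, ".happy") {
			continue
		}
		// "VBE.reload_20210114_155148_19902.default.happy" -> "reload_20210114_155148_19902"
		tag := strings.SplitN(name, ".", 3)[1]
		if tag > latest { // fixed-width timestamps sort lexicographically
			latest = tag
		}
	}

	// 2. Keep non-VBE counters, plus VBE counters from the latest VCL only.
	keep := make([]string, 0, len(counters))
	for _, name := range counters {
		if !strings.HasPrefix(name, "VBE.") {
			keep = append(keep, name)
			continue
		}
		if tag := strings.SplitN(name, ".", 3)[1]; latest == "" || tag == latest {
			keep = append(keep, name)
		}
	}
	return keep
}

func main() {
	fmt.Println(filterLatestVBE([]string{
		"MAIN.uptime",
		"VBE.boot.default.happy",
		"VBE.reload_20210114_155148_19902.default.happy",
	}))
	// Output: [MAIN.uptime VBE.reload_20210114_155148_19902.default.happy]
}
```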

Tests are included.

LorenzoPeri commented 3 years ago

> Question: Did I understand right that the bug is that, over time, the stats contain these `reload_` prefixed entries? They do go away after that cooldown, but while they exist they break the whole metrics scrape and export?

Correct.

Varnish starts with just the `VBE.boot` counters. On each Varnish reload, a new set of `VBE.reload_[timestamp]` counters is added. The old counters will disappear after the cooldown.

For example, if you reload Varnish many times in a row, you can end up with many concurrent sets of `VBE.reload_` counters, but after the cooldown only the most recent set will be present (note that the original `VBE.boot` set will be gone too; it behaves like the others).
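As an illustration (the second reload timestamp below is made up, only the `reload_<YYYYMMDD>_<HHMMSS>_<PID>` pattern is real), this is roughly what the backend counters look like right after two quick reloads:

```go
package example

// Hypothetical varnishstat state right after two quick reloads, before the
// cooldown expires: three concurrent sets of the same backend counter.
var concurrentCounters = []string{
	"VBE.boot.default.happy",
	"VBE.reload_20210114_155148_19902.default.happy",
	"VBE.reload_20210114_160002_20117.default.happy",
}

// After the cooldown only the newest set remains:
//   VBE.reload_20210114_160002_20117.default.happy
```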

The only way to expose all the counters at the same time would probably be to add the timestamp as an additional label on the Prometheus metrics. That would avoid the conflict, but a new time series would be created in Prometheus on every Varnish reload. IMHO that would not be good memory-wise for Prometheus, and it would also make useful metrics harder to extract (for example, how would queries ignore the outdated ones?).

I think it's better to just expose the currently active values.

> But if the names are unique by the timestamp there, why is Prometheus giving an error? Does our code somehow strip the timestamp out of the metric identifier? And then we end up with multiple metrics with the same name/labels, which Prometheus then rejects?

Also correct. The timestamp is not part of the exported metrics (name or labels), so there are naming conflicts.
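As a rough illustration of the conflict (the metric name and the mapping below are invented for the example; they are not the exporter's real output): once the VCL tag is dropped from the counter name, the boot counter and every reload counter collapse into the same metric identity, which the Prometheus client then rejects as a duplicate.

```go
package main

import (
	"fmt"
	"strings"
)

// toMetricIdentity mimics (very roughly) how a VBE counter name could be
// mapped to an exported metric: the VCL tag ("boot", "reload_<ts>") is
// dropped, so only the backend and the field survive as the identity.
func toMetricIdentity(counter string) string {
	parts := strings.SplitN(counter, ".", 4) // VBE.<vcl>.<backend>.<field>
	return fmt.Sprintf(`varnish_backend_%s{backend=%q}`, parts[3], parts[2])
}

func main() {
	a := toMetricIdentity("VBE.boot.default.happy")
	b := toMetricIdentity("VBE.reload_20210114_155148_19902.default.happy")
	fmt.Println(a)      // varnish_backend_happy{backend="default"}
	fmt.Println(b)      // same identity -> duplicate metric, scrape fails
	fmt.Println(a == b) // true
}
```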

> Good code in general. Left some suggestions and I think one bug (though I assume it's possible to have only one unique `reload_` entry; you might know better).

Thanks!

zipkid commented 2 years ago

Is it possible to create a release containing this fix?

jonnenauha commented 2 years ago

@zipkid The latest release has this: https://github.com/jonnenauha/prometheus_varnish_exporter/releases/tag/1.6.1