rabbitmq / rabbitmq-prometheus

A minimalistic Prometheus exporter of core RabbitMQ metrics
Other
147 stars 110 forks source link

Prevent non-zero publisher count in Grafana when aggregating metrics #61

Closed coro closed 3 years ago

coro commented 3 years ago

In the case where there are 0 channels (and as such 0 publishers), the dashboard reports there are actually n publishers in an n-node cluster.

Proposed Changes

This changes the calculation of publishers to be number of channels (which is always known) minus the number of consumers (which is always known). This avoids the issue where there is ambiguity where a channel which is publishing, but has not actively published yet.

Prior to this change, a cluster configured with prometheus.return_per_object_metrics = false will always report a non-zero number of publishers, even if there is no traffic on that cluster. For my example screenshots, I am using a three-node cluster with a random RabbitMQ pod being killed every minute, over the course of 5 minutes. You can see that there are always n publishers reported in Grafana for n nodes: image

After this change, this is correctly reported as 0: image

I confirmed that the metric behaves as usual where there are non-zero publishers using PerfTest with 5 consumers & publishers: image

Types of Changes

What types of changes does your code introduce to this project? Put an x in the boxes that apply

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further Comments

Discovered this with @gerhard while we were working on https://github.com/rabbitmq/tgir/issues/19

gerhard commented 3 years ago

Works as expected, thanks! We need to update the dashboard json to the one generated by Grafana 7 before we can upload this to grafana.com. Do this post-merge.

The failing tests are not related to this since no source has changed. We need to drop OTP v21.3 support, Bump Elixir to >= 1.10 & add OTP v23.1 support. This will happen part of the monorepo migration, no point in fixing these GitHub Actions.