Flickering metrics in Prometheus

Symptom

In Prometheus (and consequently in Grafana), there are intermittent gaps in the captured metrics.

prom_gluster_volume_profile_fop_hits

Configuration

[globals]
gluster-cluster-id = ""
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
port = 8080
metrics-path = "/metrics"
log-file = "stdout"
log-level = "info"

Versions

root@node-1:~# cat /etc/debian_version 
10.0

root@node-1:~# gluster --version
glusterfs 6.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Exporter built from https://github.com/gluster/gluster-prometheus/commit/6c81f2a50569fe34b5e6c9fe3f02c1aba78eb1ac

Analysis

From what I understand, the metric collectors all follow roughly the same pattern:

1. Reset the previously collected metrics
2. Fetch the data
3. Update the metrics

My assumption is that gluster commands take some time, and if Prometheus scrapes the exporter while a collector is in step 2 (fetching the data), the metrics that were just reset are missing from that scrape.
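A minimal sketch of that race, assuming a simplified in-memory store in place of the exporter's real gauge vectors. The store type, collectResetFirst, scrapeDuringFetch, and the READ/WRITE sample values are hypothetical names made up for illustration; they are not taken from the exporter's code. The scrape is triggered from inside the fetch callback so that the timing is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a gauge vector:
// a label -> value map guarded by a mutex.
type store struct {
	mu     sync.Mutex
	values map[string]float64
}

func newStore() *store { return &store{values: map[string]float64{}} }

// Reset drops every series.
func (s *store) Reset() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = map[string]float64{}
}

// Set records the value for one label.
func (s *store) Set(label string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[label] = v
}

// Scrape returns how many series a scraper would see right now.
func (s *store) Scrape() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.values)
}

// collectResetFirst follows the order described above: reset, fetch, update.
func collectResetFirst(s *store, fetch func() map[string]float64) {
	s.Reset()
	data := fetch() // slow step: the store stays empty for its whole duration
	for label, v := range data {
		s.Set(label, v)
	}
}

// scrapeDuringFetch runs one collection and reports the number of series
// seen by a scrape that lands in the middle of the fetch step.
func scrapeDuringFetch(s *store) int {
	seen := -1
	collectResetFirst(s, func() map[string]float64 {
		seen = s.Scrape()
		return map[string]float64{"READ": 10, "WRITE": 4}
	})
	return seen
}

func main() {
	s := newStore()
	s.Set("READ", 9) // values left over from the previous collection
	s.Set("WRITE", 3)
	fmt.Println("series seen mid-fetch:", scrapeDuringFetch(s))
	fmt.Println("series after collection:", s.Scrape())
}
```

In this sketch the scrape that lands mid-fetch sees no series at all, while a scrape after the collection sees both again, which would match the gaps observed in the graphs.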

I've been testing GlusterD and the exporter in small VMs with limited resources, so the fetching time might be longer than on standard hosts.

To highlight the issue, I built a custom mock_counts collector that sleeps for 2 seconds to simulate a long fetch step, and I do see the same kind of flicker in Prometheus.

Proposed solutions

I understand that resetting the metrics is required to avoid keeping stale entries.

I guess the best ways to resolve the issue would be to:

Either reset only the stale metrics
Or populate a new gaugeVec and refresh the whole list at once when the collection is finished
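Both options can be sketched on a simplified in-memory store. This is only an illustration under assumptions: the store type and its UpdateAndPrune, Swap, and Snapshot methods are hypothetical, as are the READ/WRITE/STAT sample labels. How either option would map onto the exporter's actual gauge vectors is not something I have verified.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a gauge vector.
type store struct {
	mu     sync.RWMutex
	values map[string]float64
}

// Snapshot returns a copy of what a scraper would see right now.
func (s *store) Snapshot() map[string]float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]float64, len(s.values))
	for k, v := range s.values {
		out[k] = v
	}
	return out
}

// UpdateAndPrune is option 1: write the fresh values in place, then delete
// only the series that no longer appear in the freshly fetched data.
func (s *store) UpdateAndPrune(fresh map[string]float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range fresh {
		s.values[k] = v
	}
	for k := range s.values {
		if _, ok := fresh[k]; !ok {
			delete(s.values, k) // stale: not reported by the last fetch
		}
	}
}

// Swap is option 2: replace the whole set in a single step once the
// collection is finished. The caller must not modify fresh afterwards.
func (s *store) Swap(fresh map[string]float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = fresh
}

func main() {
	s := &store{values: map[string]float64{"READ": 9, "WRITE": 3, "STAT": 1}}

	// Option 1: STAT disappeared from the fetched data, so only STAT is removed.
	s.UpdateAndPrune(map[string]float64{"READ": 10, "WRITE": 4})
	fmt.Println(s.Snapshot())

	// Option 2: the new set replaces the old one at once.
	s.Swap(map[string]float64{"READ": 11})
	fmt.Println(s.Snapshot())
}
```

Because both methods hold the lock for the whole update, a concurrent Snapshot in this sketch sees either the old set or the new one, never an empty or half-filled one.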

I'm not proficient enough in Go to try and implement either of these.

A simpler mitigation would be to push the reset step as late as possible in the update process:

1. Fetch the data
2. Reset the previously collected metrics
3. Update the metrics
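The reordered steps could look roughly like the sketch below. It again uses a simplified in-memory store rather than the exporter's real types: store, collectFetchFirst, scrapeDuringFetch, and the sample values are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a gauge vector.
type store struct {
	mu     sync.Mutex
	values map[string]float64
}

// Reset drops every series.
func (s *store) Reset() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values = map[string]float64{}
}

// Set records the value for one label.
func (s *store) Set(label string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[label] = v
}

// Scrape returns how many series a scraper would see right now.
func (s *store) Scrape() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.values)
}

// collectFetchFirst applies the mitigation order: fetch, reset, update.
func collectFetchFirst(s *store, fetch func() map[string]float64) {
	data := fetch() // slow step: the previous values are still exposed
	s.Reset()
	// Remaining window: a scrape between Reset and the last Set below
	// can still observe a partial set of series.
	for label, v := range data {
		s.Set(label, v)
	}
}

// scrapeDuringFetch runs one collection and reports the number of series
// seen by a scrape that lands in the middle of the fetch step.
func scrapeDuringFetch(s *store) int {
	seen := -1
	collectFetchFirst(s, func() map[string]float64 {
		seen = s.Scrape()
		return map[string]float64{"READ": 10, "WRITE": 4}
	})
	return seen
}

func main() {
	s := &store{values: map[string]float64{"READ": 9, "WRITE": 3}}
	fmt.Println("series seen mid-fetch:", scrapeDuringFetch(s))
	fmt.Println("series after collection:", s.Scrape())
}
```

With this ordering, a scrape that lands during the slow fetch still sees the previous values. Only the much shorter reset-and-update phase can expose an incomplete set.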

This would not be perfect, as it still leaves a small window during which only part of the metrics is exposed. However, it is far simpler, and I can work on a PR.