Symptom

In Prometheus (and then in Grafana) there are some holes in the captured metrics.

Configuration

Versions

root@node-1:~# cat /etc/debian_version
10.0
root@node-1:~# gluster --version
glusterfs 6.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
Exporter built from https://github.com/gluster/gluster-prometheus/commit/6c81f2a50569fe34b5e6c9fe3f02c1aba78eb1ac

Analysis

From what I understand, the metric collectors are roughly built on the same pattern (sketched below):
1. Reset the previously collected metrics
2. Fetch the data
3. Update the metrics
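As a concrete illustration of that pattern, here is a minimal sketch using the Prometheus Go client; the metric and function names (volumeBrickCount, fetchBrickCounts) are hypothetical, not the exporter's actual identifiers:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric, standing in for one of the exporter's gauge vectors.
var volumeBrickCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "gluster_volume_brick_count",
		Help: "Number of bricks per volume",
	},
	[]string{"volume"},
)

// fetchBrickCounts stands in for shelling out to the gluster CLI.
func fetchBrickCounts() map[string]int {
	return map[string]int{"gv0": 3, "gv1": 2}
}

func collectBrickCounts() {
	// Step 1: reset, so entries for volumes that no longer exist disappear.
	// From here until step 3 finishes, the vector exposes no samples.
	volumeBrickCount.Reset()

	// Step 2: run the (potentially slow) gluster command.
	counts := fetchBrickCounts()

	// Step 3: repopulate the vector with fresh values.
	for vol, n := range counts {
		volumeBrickCount.WithLabelValues(vol).Set(float64(n))
	}
}

func main() {
	prometheus.MustRegister(volumeBrickCount)

	// Refresh in the background; a Prometheus scrape can land at any point,
	// including between step 1 and step 3.
	go func() {
		for {
			collectBrickCounts()
			time.Sleep(15 * time.Second)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```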
My assumption is that gluster commands take some time, and if Prometheus scrapes the exporter while a collector is in step 2, some metrics are missing from the response.
I've been testing GlusterD and the exporter in small VMs with limited resources, so the fetching time might be longer than on standard hosts.
To highlight the issue, I've built a custom mock_counts collector that sleeps for 2 seconds to simulate a long fetch step, and I do see the same kind of flicker in Prometheus.
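The mock is roughly the same loop with the fetch replaced by a sleep, along these lines (again hypothetical names, reusing the imports from the sketch above; this is not the actual mock_counts code):

```go
// Hypothetical mock metric, roughly what the mock_counts test does.
var mockCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "gluster_mock_count", Help: "Mock count"},
	[]string{"name"},
)

// fetchMockCounts pretends to be a slow gluster call.
func fetchMockCounts() map[string]int {
	time.Sleep(2 * time.Second)
	return map[string]int{"mock": 1}
}

func collectMockCounts() {
	mockCount.Reset()           // step 1: the vector is now empty
	counts := fetchMockCounts() // step 2: a 2-second window with no samples
	for k, v := range counts {  // step 3: repopulate
		mockCount.WithLabelValues(k).Set(float64(v))
	}
}
```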
Proposed solutions
I understand that resetting the metrics is required to avoid keeping stale entries.
I guess the best ways to resolve the issue would be to:
- either only reset the stale metrics,
- or populate a new gaugeVec and refresh the whole list at once when the collection is finished (see the sketch after this list).
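For the second option, one possible shape (a sketch only, with hypothetical names, not based on the exporter's current code) is a custom prometheus.Collector that keeps a snapshot of const metrics and swaps it in atomically once a collection finishes:

```go
package collectors

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// snapshotCollector only ever exposes the last complete set of samples,
// so a scrape can never observe a half-built collection.
type snapshotCollector struct {
	desc    *prometheus.Desc
	mu      sync.RWMutex
	samples []prometheus.Metric
}

func newSnapshotCollector() *snapshotCollector {
	return &snapshotCollector{
		desc: prometheus.NewDesc(
			"gluster_volume_brick_count",
			"Number of bricks per volume",
			[]string{"volume"}, nil,
		),
	}
}

func (c *snapshotCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.desc
}

func (c *snapshotCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	for _, m := range c.samples {
		ch <- m
	}
}

// refresh builds the complete sample set first, then swaps it in at once;
// stale entries vanish because the whole slice is replaced.
func (c *snapshotCollector) refresh(counts map[string]int) {
	fresh := make([]prometheus.Metric, 0, len(counts))
	for vol, n := range counts {
		fresh = append(fresh, prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, float64(n), vol,
		))
	}
	c.mu.Lock()
	c.samples = fresh
	c.mu.Unlock()
}
```

Registering this with prometheus.MustRegister and calling refresh from the periodic fetch loop would replace the Reset/Set sequence entirely; the metric is only empty until the very first collection completes.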
I'm not proficient enough in Go to try and implement either of these.
A mitigation could be to push the reset step as late as possible in the update process:
1. Fetch the data
2. Reset the previously collected metrics
3. Update the metrics
This would not be perfect, as there is still a small window with partially exposed metrics. However it's far simpler, and I can work on a PR.
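Roughly, the change would just reorder the body of the collect function from the first sketch above (same hypothetical names):

```go
func collectBrickCounts() {
	// Fetch first, while the previous samples are still exposed.
	counts := fetchBrickCounts()

	// Only reset once fresh data is in hand...
	volumeBrickCount.Reset()

	// ...and repopulate immediately, so the window with an empty vector is
	// the moment between Reset and these Set calls, not the whole duration
	// of the gluster command.
	for vol, n := range counts {
		volumeBrickCount.WithLabelValues(vol).Set(float64(n))
	}
}
```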