gluster / gluster-prometheus

Gluster monitoring using Prometheus
GNU Lesser General Public License v2.1
119 stars 70 forks source link

Flickering metrics in prometheus #158

Closed Neraud closed 5 years ago

Neraud commented 5 years ago

Symptom

In Prometheus (and then in Grafana) there are some holes in the captured metrics.

prom_gluster_volume_profile_fop_hits

Configuration

[globals]
gluster-cluster-id = ""
gluster-mgmt = "glusterd"
glusterd-dir = "/var/lib/glusterd"
gluster-binary-path = "gluster"
port = 8080
metrics-path = "/metrics"
log-file = "stdout"
log-level = "info"

Versions

root@node-1:~# cat /etc/debian_version 
10.0

root@node-1:~# gluster --version
glusterfs 6.1
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

Exporter built from https://github.com/gluster/gluster-prometheus/commit/6c81f2a50569fe34b5e6c9fe3f02c1aba78eb1ac

Analysis

From what I understand, the metric collectors are roughly built on the same pattern :

  1. Reset the previously collected metrics
  2. Fetch the data
  3. Update the metrics

My asumption is that gluster commands take some time, and if Prometheus scraps the exporter while a collector is in step 2, some metrics are missing.

I've been testing GlusterD and the exporter in small VMs with limited resources, so the fetching time might be longer than on standard hosts.

To highlight the issue, I've built a custom mock_counts collector that sleeps for 2 seconds to simulate the long fetch step. And I indeed have the same kind of flicker in Prometheus.

Proposed solutions

I understand that reseting the metrics is required to avoid keeping stale entries.

I guess the best ways to resolve the issue would be to :

I'm not proficient enough in go to try and implement either of these.

A mitigation implementation could be to push the reset step as late as possible in the update process :

  1. Fetch the data
  2. Reset the previously collected metrics
  3. Update the metrics

This would not be perfect, as it still has a small window with partial exposed metrics. However it's far simpler, and I can work on a PR.

aravindavk commented 5 years ago

Either to only reset stale metrics

Makes sense to reset only stale metrics. Let me see if I can identify those metrics.