mehdy / keepalived-exporter

Prometheus Keepalived exporter
GNU General Public License v3.0

keepalived-exporter reading too fast from keepalived-generated dump files #169

Open Etrenak opened 2 days ago

Etrenak commented 2 days ago

Hello,

We have been running keepalived on a few routers of our student association, and a few days ago we decided to use this project (keepalived-exporter) to monitor them. However, we ran into issues with a pair of old routers that report inconsistent metrics. For instance, here is a graph of the interface states on one of our routers, which sometimes "loses" some interfaces:

image

What's interesting is that when we "lose" these interfaces, we actually get erroneous data instead:

image

Note that the vrid is zero, whereas we only have 2 vrrp groups with router IDs 4 and 6.

After some investigation, and after adding debug output to the source code, I'm fairly confident I have identified the root cause. Because read/write operations on the hard drive are slow on these particular routers, the file keepalived.data is not fully written when the collector tries to read it. Sometimes the collector fails to read both VRRP instances, so we get "keepalived.data and keepalived.stats datas are not synced" messages, which is not a problem thanks to the backoff logic. However, it sometimes gets enough information from the file not to trigger this error, and the collector then generates metrics based on incomplete data.
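To make the failure mode concrete, here is a minimal Go sketch of one way a reader could guess whether a dump file is still being written: poll the file and only treat it as complete once its size and modification time stop changing between two consecutive checks. This is not taken from the exporter's source. The names `snapshot`, `unchanged`, `statFile` and `waitUntilStable` are hypothetical, and the path, interval and attempt count in `main` are assumptions to adjust for your setup.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// snapshot holds the file attributes compared between two polls.
type snapshot struct {
	size    int64
	modTime time.Time
}

// unchanged reports whether both snapshots describe the same non-empty file state.
func unchanged(a, b snapshot) bool {
	return a.size > 0 && a.size == b.size && a.modTime.Equal(b.modTime)
}

// statFile captures the current size and modification time of path.
func statFile(path string) (snapshot, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return snapshot{}, err
	}
	return snapshot{size: fi.Size(), modTime: fi.ModTime()}, nil
}

// waitUntilStable polls path every interval and returns nil once two
// consecutive snapshots match, or an error after the given number of attempts.
func waitUntilStable(path string, interval time.Duration, attempts int) error {
	prev, err := statFile(path)
	if err != nil {
		return err
	}
	for i := 0; i < attempts; i++ {
		time.Sleep(interval)
		cur, err := statFile(path)
		if err != nil {
			return err
		}
		if unchanged(prev, cur) {
			return nil
		}
		prev = cur
	}
	return fmt.Errorf("%s still changing after %d checks", path, attempts)
}

func main() {
	// Assumed dump location and timings; adjust for your routers.
	err := waitUntilStable("/tmp/keepalived.data", 100*time.Millisecond, 10)
	if err != nil {
		fmt.Println("not safe to parse yet:", err)
		return
	}
	fmt.Println("file looks complete")
}
```

This is only a heuristic: a writer that stalls for longer than the polling interval would still look stable, so it reduces the window for partial reads rather than closing it.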

I also suspect that in some cases keepalived doesn't even have time to replace the files keepalived.stats and/or keepalived.data at all, given the shape of this graph: image

I am willing to help solve this issue, and the best idea I can think of is to do these two things:

I'd like to have your thoughts on this. Do you think there is another, easier way?

clwluvw commented 2 days ago

I believe that if JSON stats were enabled, we would only deal with one file and the inconsistency would never happen. Not sure why it's not enabled by default.

Etrenak commented 2 days ago

> I believe that if JSON stats were enabled, we would only deal with one file and the inconsistency would never happen. Not sure why it's not enabled by default.

Thank you very much for the tip, I should have looked at the command-line arguments in more detail. I enabled SIGJSON instead of data/stats and it is certainly a great improvement, at least for the second, minor issue: image

I will let you know whether it completely solves the issue once it has run for a few hours.

Maybe SIGJSON is not enabled by default for compatibility with old versions of keepalived? I think it would be a great idea to enable it by default, perhaps only when we detect a recent enough version of keepalived. (Just a guess, I have no idea whether that is the reason.)
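As a rough illustration of the version check suggested above, here is a minimal Go sketch that extracts a version number from a banner string and compares it to a minimum. It is not how the exporter works today. The banner format, the names `version`, `parseVersion` and `atLeast`, and especially the minimum version used in `main` are all assumptions: the threshold is a placeholder, not the release in which keepalived actually gained JSON support.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// version is a simple major.minor.patch triple.
type version struct{ major, minor, patch int }

// versionRe matches a "vX.Y.Z" token; the banner format is an assumption.
var versionRe = regexp.MustCompile(`v(\d+)\.(\d+)\.(\d+)`)

// parseVersion extracts the first vX.Y.Z token from banner.
func parseVersion(banner string) (version, error) {
	m := versionRe.FindStringSubmatch(banner)
	if m == nil {
		return version{}, fmt.Errorf("no version found in %q", banner)
	}
	var n [3]int
	for i := range n {
		v, err := strconv.Atoi(m[i+1])
		if err != nil {
			return version{}, err
		}
		n[i] = v
	}
	return version{n[0], n[1], n[2]}, nil
}

// atLeast reports whether v is greater than or equal to want.
func (v version) atLeast(want version) bool {
	if v.major != want.major {
		return v.major > want.major
	}
	if v.minor != want.minor {
		return v.minor > want.minor
	}
	return v.patch >= want.patch
}

func main() {
	// Placeholder threshold, NOT the real release that introduced JSON dumps.
	minJSON := version{2, 0, 0}

	// Example banner in an assumed format.
	v, err := parseVersion("Keepalived v2.2.4 (08/21,2021)")
	if err != nil {
		fmt.Println("cannot detect version, keeping data/stats mode:", err)
		return
	}
	fmt.Println("use JSON by default:", v.atLeast(minJSON))
}
```

Falling back to the data/stats mode when the version cannot be parsed keeps the current behaviour for installations where detection fails.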