Open Etrenak opened 2 days ago
I believe that if JSON stats were enabled, there would be only one file to deal with and the inconsistency would never happen. Not sure why it's not enabled by default.
Thank you very much for the tip, I should have looked at the command line args in more detail. I enabled SIGJSON instead of data/stats and it certainly is a great improvement, at least for the second, minor issue.
I will let you know whether it completely solves the issue once it has run for a few hours.
Maybe SIGJSON is not enabled by default for compatibility with old versions of keepalived? I think it would be a great idea to enable it by default, maybe only if we detect a recent enough version of keepalived. (Just a guess, I have no idea if that is the reason.)
Hello,
We have been running `keepalived` on a few routers of our student association, and a few days ago we decided to use this project (`keepalived-exporter`) to monitor our routers. However, we came across some issues with a pair of old routers providing inconsistent metrics. For instance, here is a graph displaying the state of the interfaces of one of our routers, and it sometimes "loses" some interfaces:

What's interesting is that when we "lose" these interfaces, we actually get erroneous data instead:
Note that the vrid is zero, whereas we only have 2 vrrp groups with router IDs 4 and 6.
After some investigation, and by adding debug output to the source code, I'm fairly confident I have identified the root cause. Because read/write operations on the hard drive of these particular routers are slow, the file `keepalived.data` is not fully written when the collector tries to read it. Sometimes it fails to read the two VRRP instances, so we get `keepalived.data and keepalived.stats datas are not synced` messages, which is not a problem thanks to the backoff logic. However, it sometimes gets enough information from the file not to trigger this error, and the collector then generates metrics based on incomplete data.

I also suspect that in some cases keepalived doesn't even have time to replace the files `keepalived.stats` and/or `keepalived.data` at all, given the shape of this graph:

![image](https://github.com/mehdy/keepalived-exporter/assets/21969306/dee5dd42-0ea9-41ee-afeb-1daeee0d482f)

I am willing to help solve this issue, and the best idea I can think of is to do these two things:

- Change `keepalived` to add a file lock to `keepalived.data` and `keepalived.stats` while it is still writing to them. -> Solves the first issue
- Add a check in `keepalived-exporter` before reading the file to verify it has been created recently (that is, by the last signal we sent). -> Solves the last graph issue (I will try this today)

I'd like to have your thoughts on this. Do you think there is another, easier way?