mehdy / keepalived-exporter

Prometheus Keepalived exporter
GNU General Public License v3.0
115 stars 37 forks source link

fix: add backoff to retrieve vrrp data and stats to avoid some out of sync errors #117

Closed dmachard closed 11 months ago

dmachard commented 1 year ago

My proposal to fix the issue #115 .

As suggested, I added the backoff mechanisms to read stats and data files. The synced error is steal raised if occurred after max elapsed time.

dmachard commented 1 year ago

I did that because I did more tests and no errors occurred regarding the opening of the file. The FileNotExists error has never raised (in fact no other errors on this phase).

It's why I added a new backoff retry independent of the retry of the openFile function and limit the number of changes.

In addition, the "not synced" error occurred when we have concurrent access to the /metrics API from prometheus and only with old keepalived version.

clwluvw commented 1 year ago

In addition, the "not synced" error occurred when we have concurrent access to the /metrics API from prometheus and only with old keepalived version.

This couldn't be possible - There is a mutex in Collect func: https://github.com/mehdy/keepalived-exporter/blob/master/internal/collector/collector.go#L111-L112

dmachard commented 1 year ago

In addition, the "not synced" error occurred when we have concurrent access to the /metrics API from prometheus and only with old keepalived version.

This couldn't be possible - There is a mutex in Collect func: https://github.com/mehdy/keepalived-exporter/blob/master/internal/collector/collector.go#L111-L112

Issue is not the exporter :) but in the keepalived binary himself regarding this concurrent problem

clwluvw commented 1 year ago

regarding this concurrent problem

I doubt that there is a concurrency problem here. I guess keepalived itself also logs each time it wants to dump data and stats so you can see if there is any concurrent request to keepalived for dumps.

dmachard commented 1 year ago

I don't have time to investigate more, the error occurred only with old keepalived v1.3.9 and working fine with keepalived v2.2. I guess something is different on keepalived to handle the SIG to dump data and stats.

I can just said that with this PR, the synced error message disappears thank to the retry.

dmachard commented 1 year ago

We are running our own binary with this PR and it's works fine with old and recent version of keepalived. Do you think it will be possible to merge this fix?

clwluvw commented 1 year ago

I'd still go with merging two backoffs with one - You only need to remove the OpenFileWithRetry func. It will retry with your written retry at the end.