Prometheus-postgresql-adapter in HA mode starts to write with 2 nodes after database restart

m2dc0d3r commented 5 years ago

Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:26.236Z"}
Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:26.806Z"}
Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:26.806Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","groupId":"10","level":"info","msg":"Instance became a leader","ts":"2019-08-23T12:14:27.883Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.976Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.976Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:28.563Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:28.563Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:28.563Z"}
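
The suspicious part of this log is an "Instance became a leader" entry surrounded by "connection refused" errors. A minimal Python sketch for spotting that pattern might look like the following. It is not part of the adapter: the function names and the `window` parameter are hypothetical, and the sample contains four entries copied from the log above.

```python
import json
import re

# Four entries copied from the log above (syslog prefix + JSON payload).
SAMPLE = """\
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","groupId":"10","level":"info","msg":"Instance became a leader","ts":"2019-08-23T12:14:27.883Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.976Z"}
"""

# The payloads in this log are flat JSON objects (no nested braces),
# so a simple brace-to-brace match is enough to extract them.
JSON_OBJECT = re.compile(r"\{[^{}]*\}")


def parse_entries(text):
    """Extract every JSON payload from syslog-style text, in order."""
    return [json.loads(m.group(0)) for m in JSON_OBJECT.finditer(text)]


def is_refused(entry):
    """True if the entry reports a refused database connection."""
    return "connection refused" in entry.get("err", "")


def elections_during_outage(entries, window=3):
    """Return 'became a leader' entries that have a connection-refused
    error within `window` entries both before and after them."""
    hits = []
    for i, entry in enumerate(entries):
        if entry.get("msg") != "Instance became a leader":
            continue
        before = entries[max(0, i - window):i]
        after = entries[i + 1:i + 1 + window]
        if any(map(is_refused, before)) and any(map(is_refused, after)):
            hits.append(entry)
    return hits


for hit in elections_during_outage(parse_entries(SAMPLE)):
    print("leader elected while DB unreachable at", hit["ts"])
```

The check uses neighbouring entries rather than timestamps to keep the sketch short; a time-based window would be a reasonable alternative.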

m2dc0d3r commented 5 years ago

It should not be possible to hold an election while the database is down.

mfreed commented 5 years ago

For clarification, my understanding from the discussion on the community Slack is the following:

Setup:

User has a 3-node replicated TimescaleDB setup with Patroni and a load balancer (LB) in front of it, so that the LB points to the primary that has been elected via Patroni.
User has 2 Prometheus adaptors running.

Issue:

Prometheus initially works properly, with only one of the adaptors thinking it's the leader, so only one writes to the TimescaleDB primary.
The TimescaleDB primary fails / is killed.
The TimescaleDB cluster fails over properly, electing a former replica as the new primary, and the LB is updated to point to the new primary.
However, the existing Prometheus adaptor that previously was in "leader" mode continues to write (now to the new TimescaleDB primary), and the formerly-backup Prometheus adaptor thinks it has become a new leader.
Result: Both adaptors are writing to the newly elected TimescaleDB primary.

Expected:

Only one of the Prom adaptors should think it's the leader. It's not clear why the second adaptor was able to get the lock and become the leader (or why the first adaptor didn't time out, à la a timed lease).
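
To illustrate the expected behaviour, here is a minimal Python sketch of a timed lease. It is an assumption-laden model, not the adapter's actual election code: `LeaseStore`, `Adapter`, `tick`, and the 10-second TTL are all hypothetical names and values, and the in-memory store stands in for the shared database. The point is that a node cannot acquire the lease while the store is unreachable, and the old leader's lease lapses if it cannot renew.

```python
class LeaseStore:
    """In-memory stand-in for the shared database that holds the lease."""

    def __init__(self):
        self.available = True
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now, ttl):
        """Grant or renew the lease; raise if the store is unreachable."""
        if not self.available:
            raise ConnectionError("connection refused")
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False


class Adapter:
    """Hypothetical adapter instance that competes for the lease."""

    def __init__(self, name, store, ttl=10.0):
        self.name = name
        self.store = store
        self.ttl = ttl
        self.lease_until = 0.0  # local view of when our lease runs out

    def tick(self, now):
        """Try to acquire or renew the lease at time `now`."""
        try:
            if self.store.try_acquire(self.name, now, self.ttl):
                self.lease_until = now + self.ttl
            else:
                self.lease_until = 0.0  # someone else holds a valid lease
        except ConnectionError:
            # Store unreachable: never promote ourselves. An existing
            # lease is not extended, so it simply runs out.
            pass

    def is_leader(self, now):
        return now < self.lease_until


def leaders(now, adapters):
    """Names of the adapters that consider themselves leader at `now`."""
    return [a.name for a in adapters if a.is_leader(now)]


store = LeaseStore()
a, b = Adapter("a", store), Adapter("b", store)

a.tick(0); b.tick(0)          # normal operation
print(leaders(0, [a, b]))

store.available = False       # database goes down
a.tick(12); b.tick(12)        # past the TTL, nobody could renew or acquire
print(leaders(12, [a, b]))

store.available = True        # database is back
b.tick(15); a.tick(15)
print(leaders(15, [a, b]))
```

With this design at most one node considers itself leader at any time, provided the nodes' clocks agree closely enough; a real implementation would also need to account for clock skew and for whether lease state survives a database failover.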

mfreed commented 5 years ago

(Poster believes the issue is related to the second Prometheus adaptor being able to get elected /while/ the database is offline.)

m2dc0d3r commented 5 years ago

Any news?

bboule commented 5 years ago

@m2dc0d3r it looks like we have a solution. We have an open pull request from one of our engineers and are coordinating a release timeline for the fix (possibly a 0.5.1 release). We will keep you posted.

timescale / prometheus-postgresql-adapter

Prometheus-postgresql-adapter in HA mode starts to write with 2 nodes after database restart #89