timescale / prometheus-postgresql-adapter

Use PostgreSQL as a remote storage database for Prometheus
Apache License 2.0

Prometheus-postgresql-adapter in HA mode starts to write with 2 nodes after database restart #89

Open m2dc0d3r opened 5 years ago

m2dc0d3r commented 5 years ago

node1.txt node2.txt

Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:26.236Z"}
Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:26.806Z"}
Aug 23 12:14:26 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:26.806Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:27.403Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","groupId":"10","level":"info","msg":"Instance became a leader","ts":"2019-08-23T12:14:27.883Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.976Z"}
Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.976Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:28.563Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:28.563Z"}
Aug 23 12:14:28 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:28.563Z"}
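
The notable entry in the log above is "Instance became a leader" at 12:14:27.883Z, which sits between two "connection refused" errors. A minimal sketch for spotting that pattern in the attached node logs might look like the following. The function names are hypothetical, the sample lines are copied from the log above, and the script assumes one syslog-prefixed JSON entry per line.

```python
import json

# Four consecutive entries copied from the node1 log above.
SAMPLE = [
    'Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Failed while becoming a leader","ts":"2019-08-23T12:14:27.403Z"}',
    'Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-08-23T12:14:27.403Z"}',
    'Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:31","groupId":"10","level":"info","msg":"Instance became a leader","ts":"2019-08-23T12:14:27.883Z"}',
    'Aug 23 12:14:27 node1 prometheus-postgresql-adapter: {"caller":"log.go:39","err":"error getting DB connection: dial tcp X.X.X.X:5432: connect: connection refused","level":"error","msg":"Error while trying to become a leader","ts":"2019-08-23T12:14:27.976Z"}',
]


def parse_adapter_line(line):
    """Split a syslog-prefixed line into (prefix, decoded JSON payload)."""
    start = line.index("{")  # the JSON payload starts at the first brace
    return line[:start].strip(), json.loads(line[start:])


def leader_events_during_outage(lines):
    """Return timestamps of leader events whose previous election-related
    entry was a 'connection refused' error (hypothetical heuristic)."""
    suspicious = []
    db_refused = False
    for line in lines:
        _, entry = parse_adapter_line(line)
        if entry.get("msg") == "Samples write throughput":
            continue  # throughput lines say nothing about the election
        if entry.get("msg") == "Instance became a leader":
            if db_refused:
                suspicious.append(entry["ts"])
            db_refused = False
        else:
            db_refused = "connection refused" in entry.get("err", "")
    return suspicious


if __name__ == "__main__":
    for ts in leader_events_during_outage(SAMPLE):
        print("leader elected while DB was refusing connections:", ts)
```

Running the same check over both node1.txt and node2.txt would show whether each node logged a leader event while the database was still unreachable.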

m2dc0d3r commented 5 years ago

It should not be possible to hold an election while the database is down.

mfreed commented 5 years ago

For clarification, my understanding from the discussion on the community Slack is the following:

Setup:

Issue:

  1. Prometheus initially works properly, with only one of the adapters thinking it is the leader, so only one writes to the TimescaleDB primary.
  2. The TimescaleDB primary fails / is killed.
  3. The TimescaleDB cluster properly fails over and elects a former replica as the new primary, and the LB is updated to point to the new primary.
  4. However, the existing Prometheus adapter that previously was in "leader" mode continues to write (now to the new TimescaleDB primary), and the formerly-backup Prometheus adapter thinks it has become the new leader.
  5. Result: Both adapters are writing to the newly-elected TimescaleDB primary.

Expected:

mfreed commented 5 years ago

(The poster believes the issue is related to the second Prometheus adapter being able to get elected /while/ the database is offline.)
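
To illustrate the behavior the poster is asking for, here is a minimal sketch of a "fail closed" election loop: an instance that cannot reach the database treats itself as a non-leader and never promotes itself during the outage. This is not the adapter's actual implementation. `FakeLock`, `Elector`, and `DatabaseUnavailable` are hypothetical names, and the sketch assumes leadership is decided by a shared lock held in the database that is lost when the database restarts.

```python
class DatabaseUnavailable(Exception):
    """Raised when the lock backend cannot be reached (hypothetical)."""


class FakeLock:
    """In-memory stand-in for a shared lock stored in the database."""

    def __init__(self):
        self.available = True  # False simulates 'connection refused'
        self.holder = None     # name of the instance holding the lock

    def try_acquire(self, name):
        if not self.available:
            raise DatabaseUnavailable("connection refused")
        if self.holder in (None, name):
            self.holder = name
            return True
        return False


class Elector:
    """Decides leadership on every tick; fails closed on DB errors."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.is_leader = False

    def tick(self):
        try:
            self.is_leader = self.lock.try_acquire(self.name)
        except DatabaseUnavailable:
            # No election result is possible without the database,
            # so step down instead of assuming or keeping leadership.
            self.is_leader = False
        return self.is_leader


if __name__ == "__main__":
    lock = FakeLock()
    node1, node2 = Elector("node1", lock), Elector("node2", lock)
    print("normal:  ", node1.tick(), node2.tick())

    lock.available = False  # database goes down
    lock.holder = None      # assumed: lock state is lost on restart
    print("outage:  ", node1.tick(), node2.tick())

    lock.available = True   # database is back
    print("recovery:", node1.tick(), node2.tick())
```

The design choice in this sketch is that a previous leader also steps down during the outage, so that after recovery only the instance that re-acquires the lock writes.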

m2dc0d3r commented 5 years ago

Any news?

bboule commented 5 years ago

@m2dc0d3r it looks like we have a solution. One of our engineers has an open pull request, and we are coordinating a release timeline for the fix (possibly a 0.5.1 release). We will keep you posted.