I have 3 instances of prometheus-postgresql-adapter (v0.6.0) running in HA mode as sidecars to Prometheus, and TimescaleDB (PostgreSQL 10) running in HA mode using Patroni, with the following configurations (postgres.conf, the prometheus-postgresql-adapter parameters, and the Prometheus spec):
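For reference, a minimal sketch of how one such sidecar is launched with HA leader election enabled. The flag names come from the adapter's README; the host and the concrete values here are placeholders, not my actual settings (the lock id matches "Election id 1" and the timeout matches the "15s" seen in the logs below):

```shell
# Hypothetical invocation -- host and values are placeholders.
# All 3 sidecars share one advisory-lock id, so they compete for a
# single leader slot in PostgreSQL.
prometheus-postgresql-adapter \
  -pg-host=timescaledb.example.internal \
  -leader-election-pg-advisory-lock-id=1 \
  -leader-election-pg-advisory-lock-prometheus-timeout=15s \
  -log-level=debug
```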
When the leader's Prometheus instance takes more than 15s to complete its remote_write, the adapter logs this:
{"caller":"log.go:35","level":"warn","msg":"Prometheus timeout exceeded","timeout":"15s","ts":"2019-10-17T12:06:37.707Z"}
{"caller":"log.go:35","level":"warn","msg":"Scheduled election is paused. Instance is removed from election pool.","ts":"2019-10-17T12:06:37.713Z"}
{"caller":"log.go:27","count":5000,"duration":29.083339934,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:40.305Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:40.306Z"}
{"caller":"log.go:27","count":5000,"duration":29.084319155,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:40.306Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":10000,"ts":"2019-10-17T12:06:40.308Z"}
{"caller":"log.go:27","level":"debug","msg":"Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (Prometheus comes up again)","ts":"2019-10-17T12:06:42.026Z"}
{"caller":"log.go:31","level":"info","msg":"Instance is no longer a leader","ts":"2019-10-17T12:06:42.026Z"}
{"caller":"log.go:27","count":5000,"duration":28.095977723,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:44.005Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:44.005Z"}
{"caller":"log.go:27","count":5000,"duration":30.800004099,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:45.005Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:45.105Z"}
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:48.714Z"}
{"caller":"log.go:27","level":"debug","msg":"Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (Prometheus comes up again)","ts":"2019-10-17T12:06:48.714Z"}
{"caller":"log.go:31","level":"info","msg":"Prometheus seems alive. Resuming scheduled election.","ts":"2019-10-17T12:06:48.729Z"}
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:48.732Z"}
...
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-10-17T12:06:57.505Z"}
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:58.509Z"}
At that time, neither of the other two adapters is able to pick up the lock, and both fail to write to the database.
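The adapters coordinate through a PostgreSQL advisory lock, so this symptom is consistent with the demoted leader's session never releasing the lock. A minimal Python sketch of the try-lock semantics, with a process-local `threading.Lock` standing in for `pg_try_advisory_lock` (names are illustrative, not the adapter's actual code):

```python
import threading

# Stand-in for the shared PostgreSQL advisory lock keyed by the election id.
advisory_lock = threading.Lock()

def try_become_leader(instance: str) -> bool:
    """Non-blocking acquire, like SELECT pg_try_advisory_lock(1)."""
    return advisory_lock.acquire(blocking=False)

# Instance A wins the election.
assert try_become_leader("adapter-a") is True

# A is "removed from the election pool", but if its session still holds
# the lock, B and C cannot take over -- exactly the symptom in the logs.
assert try_become_leader("adapter-b") is False
assert try_become_leader("adapter-c") is False

# Only after the holder releases (or its DB session ends) can another
# instance become the leader.
advisory_lock.release()
assert try_become_leader("adapter-b") is True
```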
Notice that the demoted instance still writes to the database and prints write throughput even after losing its leader status. After a few more writes like this, samples/sec drops to 0.
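The continued "Wrote samples" lines after demotion would be explained if leadership is only checked once per batch, before the write starts: a ~29s INSERT that is already in flight completes (and logs its throughput) even though the instance was demoted mid-write. A hypothetical sketch of that race, not the adapter's actual code:

```python
import threading
import time

is_leader = threading.Event()
is_leader.set()           # instance starts as the elected leader
started = threading.Event()
written_batches = []

def write_batch(batch_id: int) -> None:
    # Leadership is only checked *before* the write starts...
    if not is_leader.is_set():
        return
    started.set()
    time.sleep(0.05)      # stands in for a slow ~29s INSERT into TimescaleDB
    # ...so a batch that began while we were leader still lands, even
    # though we were demoted while it was in flight.
    written_batches.append(batch_id)

t = threading.Thread(target=write_batch, args=(1,))
t.start()
started.wait()            # ensure batch 1 is already in flight
is_leader.clear()         # "Instance is no longer a leader" arrives mid-write
t.join()

assert written_batches == [1]   # the in-flight batch still completed
write_batch(2)                  # new batches are refused after demotion
assert written_batches == [1]   # ...so throughput eventually drops to 0
```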
All 3 adapters are printing the same thing: