timescale / prometheus-postgresql-adapter

Use PostgreSQL as a remote storage database for Prometheus
Apache License 2.0

Leader election not happening if leader resigns due to prometheus timeout #97

Closed: teraflik closed this issue 4 years ago

teraflik commented 4 years ago

I have 3 instances of prometheus-postgresql-adapter (v0.6.0) running in HA mode as sidecars to Prometheus, and TimescaleDB (PostgreSQL 10) running in HA using Patroni, with the following configurations:

postgres.conf

shared_preload_libraries: "pg_prometheus,timescaledb"
work_mem: "10485kB"
maintenance_work_mem: "1GB"
effective_io_concurrency: 200
wal_buffers: "16MB"
max_wal_size: "8GB"
min_wal_size: "4GB"
random_page_cost: 1.1
effective_cache_size: "6GB"
default_statistics_target: 500
autovacuum_naptime: 10
autovacuum_max_workers: 10
checkpoint_completion_target: 0.9
max_connections: 100
max_locks_per_transaction: 128
shared_buffers: "2GB"
synchronous_commit: "off"

prometheus-postgresql-adapter parameters

-leader-election-pg-advisory-lock-id=1
-leader-election-pg-advisory-lock-prometheus-timeout=15s
-pg-host=$(TIMESCALEDB_HOST)
-pg-port=$(TIMESCALEDB_PORT)
-pg-database=$(TIMESCALEDB_NAME)
-pg-user=$(TIMESCALEDB_USER)
-pg-password=$(TIMESCALEDB_PASSWORD)
-pg-prometheus-chunk-interval=24h"

prometheus spec

  remoteWrite:
    - url: http://localhost:9201/write
      queueConfig:
        capacity: 5000
        maxShards: 500
        minShards: 1
        maxSamplesPerSend: 1000
        batchSendDeadline: 5s
  remoteRead:
    - url: http://localhost:9201/read
      readRecent: true

Issue:

When the leader's Prometheus instance takes more than 15s to complete a remote_write request, the adapter logs this:

{"caller":"log.go:35","level":"warn","msg":"Prometheus timeout exceeded","timeout":"15s","ts":"2019-10-17T12:06:37.707Z"}        
{"caller":"log.go:35","level":"warn","msg":"Scheduled election is paused. Instance is removed from election pool.","ts":"2019-10-17T12:06:37.713Z"}
{"caller":"log.go:27","count":5000,"duration":29.083339934,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:40.305Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:40.306Z"}
{"caller":"log.go:27","count":5000,"duration":29.084319155,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:40.306Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":10000,"ts":"2019-10-17T12:06:40.308Z"}
{"caller":"log.go:27","level":"debug","msg":"Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (Prometheus comes up again)","ts":"2019-10-17T12:06:42.026Z"}
{"caller":"log.go:31","level":"info","msg":"Instance is no longer a leader","ts":"2019-10-17T12:06:42.026Z"}
{"caller":"log.go:27","count":5000,"duration":28.095977723,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:44.005Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:44.005Z"}
{"caller":"log.go:27","count":5000,"duration":30.800004099,"level":"debug","msg":"Wrote samples","ts":"2019-10-17T12:06:45.005Z"}
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":5000,"ts":"2019-10-17T12:06:45.105Z"}
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:48.714Z"}
{"caller":"log.go:27","level":"debug","msg":"Scheduled election is paused. Instance can't become a leader until scheduled election is resumed (Prometheus comes up again)","ts":"2019-10-17T12:06:48.714Z"}
{"caller":"log.go:31","level":"info","msg":"Prometheus seems alive. Resuming scheduled election.","ts":"2019-10-17T12:06:48.729Z"}
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:48.732Z"}
...
{"caller":"log.go:31","level":"info","msg":"Samples write throughput","samples/sec":0,"ts":"2019-10-17T12:06:57.505Z"}   
{"caller":"log.go:27","level":"debug","msg":"Election id 1: Instance is not a leader. Can't write data","ts":"2019-10-17T12:06:58.509Z"}

At that point, neither of the other 2 adapters is able to pick up the lock, so they also fail to write to the database.

Notice that the resigned leader still writes to the database and reports write throughput even after losing its leader status. After a few such writes, samples/sec drops to 0.
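
For reference, one way to check whether some session still holds advisory lock id 1 while all adapters report "Instance is not a leader" is to query pg_locks. This is only a hypothetical diagnostic helper (not part of the adapter), reusing the same connection settings as above:

package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	connStr := fmt.Sprintf("host=%s port=%s dbname=%s user=%s password=%s sslmode=disable",
		os.Getenv("TIMESCALEDB_HOST"), os.Getenv("TIMESCALEDB_PORT"),
		os.Getenv("TIMESCALEDB_NAME"), os.Getenv("TIMESCALEDB_USER"),
		os.Getenv("TIMESCALEDB_PASSWORD"))
	db, err := sql.Open("postgres", connStr)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Advisory locks show up in pg_locks with locktype 'advisory'; objid carries
	// the lock id (1 here, matching -leader-election-pg-advisory-lock-id=1).
	rows, err := db.Query(
		"SELECT pid, granted FROM pg_locks WHERE locktype = 'advisory' AND objid = 1")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var pid int
		var granted bool
		if err := rows.Scan(&pid, &granted); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("advisory lock 1: backend pid=%d granted=%v\n", pid, granted)
	}
}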

All 3 adapters are printing the same thing: logs