Status: Open. Opened by SuperQ 11 months ago.
This affects replication monitoring: if only pg_up and pg_replication_lag_seconds are monitored on the Secondary servers and there is a network outage between the Primary and Secondary servers, the Secondary servers fall behind without any alarm being triggered.
It seems more reasonable to monitor replication using data from the Primary server.
SELECT COUNT(*) FROM pg_stat_replication WHERE client_addr='SLAVE_IP' AND state = 'streaming';
If it returns 0, we have an unreachable Secondary server.
SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag)::bigint, 0) AS replay_lag FROM pg_stat_replication WHERE client_addr='SLAVE_IP';
If it returns more than X seconds (X being a chosen threshold), we have a lagging Secondary server.
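The two checks above could be combined into a single probe run against the Primary. The following is a minimal sketch, not code from postgres_exporter: the names ReplicaState, Classify and CheckReplica are hypothetical, the `$1` placeholder style assumes a PostgreSQL driver such as lib/pq or pgx is registered elsewhere, and the replay_lag column assumes PostgreSQL 10 or later.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
)

// ReplicaState is a hypothetical classification used only in this sketch.
type ReplicaState string

const (
	ReplicaOK          ReplicaState = "ok"
	ReplicaUnreachable ReplicaState = "unreachable"
	ReplicaLagging     ReplicaState = "lagging"
)

// Classify applies the two rules from the issue: no streaming row means the
// Secondary is unreachable; a replay lag above the threshold means it is lagging.
func Classify(streamingCount int, replayLagSeconds, maxLagSeconds int64) ReplicaState {
	switch {
	case streamingCount == 0:
		return ReplicaUnreachable
	case replayLagSeconds > maxLagSeconds:
		return ReplicaLagging
	default:
		return ReplicaOK
	}
}

// The queries from the issue, with the Secondary address passed as a parameter.
const (
	streamingQuery = `SELECT COUNT(*) FROM pg_stat_replication
		WHERE client_addr = $1::inet AND state = 'streaming'`
	lagQuery = `SELECT COALESCE(EXTRACT(EPOCH FROM replay_lag)::bigint, 0)
		FROM pg_stat_replication WHERE client_addr = $1::inet`
)

// CheckReplica runs both queries on the Primary (db must point at the Primary)
// and classifies the Secondary identified by replicaAddr.
func CheckReplica(ctx context.Context, db *sql.DB, replicaAddr string, maxLagSeconds int64) (ReplicaState, error) {
	var streaming int
	if err := db.QueryRowContext(ctx, streamingQuery, replicaAddr).Scan(&streaming); err != nil {
		return "", err
	}
	if streaming == 0 {
		return ReplicaUnreachable, nil
	}

	// If several connections share the address, only the first row is read.
	var lag int64
	err := db.QueryRowContext(ctx, lagQuery, replicaAddr).Scan(&lag)
	if errors.Is(err, sql.ErrNoRows) {
		// The Secondary disconnected between the two queries.
		return ReplicaUnreachable, nil
	}
	if err != nil {
		return "", err
	}
	return Classify(streaming, lag, maxLagSeconds), nil
}
```

Keeping the decision in Classify, separate from the database access, makes the threshold logic easy to exercise without a running PostgreSQL instance.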
Proposal
There are existing queries for pg_stat_replication in cmd/postgres_exporter/queries.go. These metrics should be migrated to the collector package.