Open akamensky opened 4 years ago
When I tried to use the recommended query I got a parsing error on the 00:00:00
result. This query worked better for me:
SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE EXTRACT (epoch from (now() - pg_last_xact_replay_timestamp())) END AS lag;
FWIW, not sure about earlier versions but 10/11/12 you can get replication lag from primary node without any math in it. It is directly available there in microseconds value.
@akamensky You mean select * from pg_stat_replication;
on the primary instance?
I have postgresql 12.5 here and implemented it as
pg_replication:
query: "select extract(epoch from max(coalesce(replay_lag, make_interval(0,0,0,0,0,0,0.0)))) as lag_seconds from pg_stat_replication"
master: true
metrics:
- lag_seconds:
usage: "GAUGE"
description: "Replication lag behind primary in seconds, as detected after a roundtrip on the primary"
From postgres documentation https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
The pg_stat_replication view will contain one row per WAL sender process, showing statistics about replication to that sender's connected standby server. Only directly connected standbys are listed; no information is available about downstream standby servers.
The lag times reported in the pg_stat_replication view are measurements of the time taken for recent WAL to be written, flushed and replayed and for the sender to know about it. These times represent the commit delay that was (or would have been) introduced by each synchronous commit level, if the remote server was configured as a synchronous standby. For an asynchronous standby, the replay_lag column approximates the delay before recent transactions became visible to queries.
If the standby server has entirely caught up with the sending server and there is no more WAL activity, the most recently measured lag times will continue to be displayed for a short time and then show NULL.
NaN
Releases after postgresql 9.3 had a change in logic where it would have no WAL updates while master host is idle, which would result in linear lag values increase when master is idle and use old query:
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag
Appropriate query should be checking if we are on the latest know WAL position and if not then use this query, thus correct query is something like below:
SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN make_interval(0,0,0,0,0,0,0.0) ELSE (now() - pg_last_xact_replay_timestamp()) END AS replication_lag;