prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0
2.83k stars 745 forks source link

WAL functions cannot be executed in standby servers #899

Open fmbiete opened 1 year ago

fmbiete commented 1 year ago

WAL functions cannot be executed in standby servers

ERROR:  recovery is in progress
HINT:  WAL control functions cannot be executed during recovery.

In PostgreSQL 16 replication slots persist in the standby servers. This is also the case when using extensions like pg_failover_slots that transfer the slot information to the standby.

This condition will make those queries to return values only in the primary node avoiding the errors.

elpavel commented 1 year ago

What about using this logic when pg_is_in_recovery ?

        {
            semver.MustParseRange(">=9.4.0 <10.0.0"),
            `
-           SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
+           SELECT slot_name, database, active,
+               (case pg_is_in_recovery() when 't' then pg_xlog_location_diff(pg_last_xlog_receive_location(), restart_lsn) else pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) end) as pg_xlog_location_diff
            FROM pg_replication_slots
            `,
        },
        {
            semver.MustParseRange(">=10.0.0"),
            `
-           SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
+           SELECT slot_name, database, active,
+               (case pg_is_in_recovery() when 't' then pg_wal_lsn_diff(pg_last_wal_receive_lsn(), restart_lsn) else pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) end) as pg_wal_lsn_diff
            FROM pg_replication_slots
            `,
        },
fmbiete commented 1 year ago

What about using this logic when pg_is_in_recovery ?

That doesn't throw an error but there are some buts.

The value in the case of being in recovery could be quite off, compared to the normal value.

What do you think?

elpavel commented 11 months ago

I am not sure what is the best solution here, but anything would be better than the original failing query.

mbanck-ntap commented 6 months ago

In any case, the patch proposed here leads to no metrics being sent back from the standby at all, is that what we want?

fmbiete commented 6 months ago

Yes, that query should only be executed in the primary.

nickaein commented 5 months ago

914 have taken a different approach and uses pg_last_wal_receive_lsn() instead of pg_current_wal_lsn() as a workaround.