Can you show your deployment.yaml?
Until the upcoming 0.7.0 release, the exporter doesn't do any caching of query data; it queries the DB on every scrape and returns the results as-is.
This error is a race condition and happens when the exporter connects to the database engine while it is starting up or shutting down (per the log line time="2019-06-26T23:14:31Z" level=warning msg="Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string: pq: the database system is shutting down" source="postgres_exporter.go:1081"). The connection to the engine succeeds, but the statistics collector views aren't available yet, which leads to pg_settings_.* series being exported but not pg_stat_.* ones.
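One way to see this symptom on an affected exporter (a minimal sketch, assuming the default listen port 9187) is to compare how many series of each family the /metrics endpoint exposes:

```sh
# Count pg_settings_* vs pg_stat_* series on the exporter's /metrics endpoint.
# 9187 is the postgres_exporter default port; adjust if overridden.
curl -s http://localhost:9187/metrics | grep -c '^pg_settings_'
curl -s http://localhost:9187/metrics | grep -c '^pg_stat_'   # drops to 0 when the race was hit
```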
I've been semi-consistently hitting this behavior on a 3-node Patroni cluster (spawned from the incubator/patroni chart) with a postgres-exporter container added into the pod, exporting metrics from localhost. Enabling synchronous replication makes the problem more visible, for example by setting bootstrap.dcs.synchronous_mode to true in Patroni's configuration, as sketched below.
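For reference, the relevant Patroni setting looks roughly like this (a sketch of the configuration excerpt only; where exactly it ends up depends on how the chart renders patroni.yml):

```sh
# Illustrative only: append the key mentioned above to a local patroni.yml copy.
cat >> patroni.yml <<'EOF'
bootstrap:
  dcs:
    synchronous_mode: true
EOF
```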
The problem is still present in version 0.9.0:
time="2023-10-31T08:00:14+01:00" level=info msg="Error running query on database \"/var/run/postgresql/:5432\": pg_replication_slots pq: recovery is in progress" source="postgres_exporter.go:1503"
postgres_exporter[2930812]: time="2023-10-31T08:00:15+01:00" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1621"
If the exporter runs against a replica, it is obvious that the database will be in a recovery state (see the check sketched below).
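For context, whether a given endpoint is a replica in recovery can be checked directly (a minimal sketch; the connection string is hypothetical):

```sh
# Returns 't' on a replica that is still replaying WAL, 'f' on the primary.
psql "postgresql://postgres@localhost:5432/postgres" -tAc "SELECT pg_is_in_recovery();"
```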
v0.9.0 is a very old, unsupported release.
We have a Postgres cluster (1 master, 2 replicas, using Zalando's postgres-operator on Kubernetes, but that should be unrelated to the problem). We have postgres_exporter running alongside the cluster, and the exporter had been working smoothly until this morning, when the cluster had a temporary network hiccup (basically every TCP connection got terminated). The cluster repaired and resynced itself automatically, and everything works again except the exporter, which somehow reports wrong/outdated metrics that clearly aren't consistent with what the underlying queries return.
The metakube-resource-usage-collector-* pods are the three PostgreSQL nodes; -1 is currently the master, and -0 and -2 are replicas. The fourth pod is postgres_exporter.
The metakube-resource-usage-collector service/endpoints select the master; the exporter connects via that (see below):
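A quick way to confirm which pod the service currently points at (a sketch, assuming the service lives in the current namespace):

```sh
# The endpoint IP should belong to the current master pod (...-1 at the moment).
kubectl get endpoints metakube-resource-usage-collector -o wide
kubectl get pods -o wide | grep metakube-resource-usage-collector
```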
Start an alpine container for debugging in the exporter container's network and PID namespaces:
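The exact commands aren't reproduced here; one way to do this with Docker on the node hosting the pod looks roughly like the following (the container name filter is hypothetical):

```sh
# Find the exporter container on the node, then attach an alpine shell to its
# network and PID namespaces.
EXPORTER_CID=$(docker ps --filter name=postgres-exporter --format '{{.ID}}' | head -n1)
docker run --rm -it \
  --net "container:${EXPORTER_CID}" \
  --pid "container:${EXPORTER_CID}" \
  alpine sh
```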
Open TCP connections. The one to 10.105.13.113:5432 is the connection to PostgreSQL; the others are incoming metrics scrapes:
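Inside that shell, something like the following lists the established connections (original output omitted):

```sh
# BusyBox netstat in alpine: numeric TCP sockets with owning PIDs.
netstat -tnp | grep ESTABLISHED
```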
The running exporter's config file. The first two metrics in the file already show the problem. The full file is attached to the issue (config.yaml.gz).
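The actual metric definitions are in the attachment; for readers unfamiliar with the format, a postgres_exporter custom-queries file is structured roughly like this (metric and query names here are hypothetical, not the ones from the attached file):

```sh
# Illustrative custom-queries file in postgres_exporter's extend-query format.
cat > config.yaml <<'EOF'
example_rows:
  query: "SELECT count(*) AS total FROM some_table"
  metrics:
    - total:
        usage: "GAUGE"
        description: "Number of rows in some_table (hypothetical example)"
EOF
```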
Querying the metrics returns 0/0:
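A sketch of that check against the running exporter (the metric name follows the hypothetical example above; the real names come from the attached config):

```sh
# Scrape the affected exporter and filter the series defined by the first metric.
curl -s http://localhost:9187/metrics | grep '^example_rows_total'
# -> shows 0 even though the underlying query returns 2
```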
Run another exporter locally with the same config and env against the same PostgreSQL instance. The new exporter reports the correct metrics (2/2 instead of 0/0):
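Roughly what that looks like (a sketch with hypothetical credentials and the image name in use at the time; the point is that the config file and DATA_SOURCE_NAME are identical to those of the running instance):

```sh
# Second exporter on a different host port, pointed at the same database
# with the same custom-queries file.
docker run -d --rm --name pgexp-test -p 9188:9187 \
  -v "$PWD/config.yaml:/config.yaml:ro" \
  -e DATA_SOURCE_NAME="postgresql://postgres:secret@10.105.13.113:5432/postgres?sslmode=disable" \
  -e PG_EXPORTER_EXTEND_QUERY_PATH="/config.yaml" \
  wrouesnel/postgres_exporter
curl -s http://localhost:9188/metrics | grep '^example_rows_total'   # -> correct value (2)
```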
Querying the postgres endpoint directly using the query from the metric definition also yields the correct result:
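For completeness, the equivalent direct check (query and credentials are hypothetical, matching the example above):

```sh
# Run the metric's SQL directly against the same endpoint the exporter uses.
psql "postgresql://postgres:secret@10.105.13.113:5432/postgres" \
  -c "SELECT count(*) AS total FROM some_table;"
```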
Verify that the exporter instance isn't totally static but actually does update several postgres-related metrics:
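For example, by scraping twice and comparing a counter that should advance, such as per-database commit counts (a sketch):

```sh
# pg_stat_database_xact_commit is a standard postgres_exporter metric and
# should keep increasing between scrapes if the exporter is live.
curl -s http://localhost:9187/metrics | grep '^pg_stat_database_xact_commit' | head -n 3
sleep 30
curl -s http://localhost:9187/metrics | grep '^pg_stat_database_xact_commit' | head -n 3
```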
Logs of the exporter. The network reset started at 2019-06-26T23:09:37Z; the recovery has long since completed.
Any ideas what might be going on here? Looks like there is some local query caching or something?
If I manually delete/restart the exporter pod, it reports the correct metrics again; this was verified in another cluster where the same problem occurred at the same time.