prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

exporter reporting wrong/outdated data #283

Closed · multi-io closed this issue 5 years ago

multi-io commented 5 years ago

We have a Postgres cluster (1 master, 2 replicas, managed by Zalando's postgres-operator on Kubernetes, but that should be unrelated to the problem). We have postgres_exporter running alongside the cluster, and the exporter had been working smoothly until this morning, when the cluster had a temporary network hiccup (basically every TCP connection got terminated). The cluster repaired and resynced itself automatically and everything works again, except the exporter, which somehow reports wrong/outdated metrics that are clearly inconsistent with what the underlying queries return.

metakube-resource-usage-collector-* are the three PostgreSQL pods; -1 is currently the master, -0 and -2 are replicas. The fourth pod is postgres_exporter.

$ kubectl -n resource-usage-collector get pod -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE    IP              NODE             NOMINATED NODE   READINESS GATES
metakube-resource-usage-collector-0                               1/1     Running     0          12d    10.244.19.13    worker16-stage   <none>           <none>
metakube-resource-usage-collector-1                               1/1     Running     0          12d    10.244.17.178   worker14-stage   <none>           <none>
metakube-resource-usage-collector-2                               1/1     Running     0          12d    10.244.22.151   worker17-stage   <none>           <none>
postgres-exporter-prometheus-postgres-exporter-7b8b59d9c6-mvbkk   1/1     Running     0          12d    10.244.26.207   worker6-stage    <none>           <none>
...
$

The metakube-resource-usage-collector service/endpoints select the master; the exporter connects via that (see below):

$ kubectl -n resource-usage-collector get service -o wide
NAME                                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE   SELECTOR
metakube-resource-usage-collector                ClusterIP   10.105.13.113    <none>        5432/TCP            75d   <none>
metakube-resource-usage-collector-repl           ClusterIP   10.103.188.227   <none>        5432/TCP            75d   application=spilo,spilo-role=replica,version=metakube-resource-usage-collector
postgres-exporter-prometheus-postgres-exporter   ClusterIP   10.110.101.39    <none>        80/TCP              75d   app=prometheus-postgres-exporter,release=postgres-exporter
...
$
$ kubectl -n resource-usage-collector get endpoints -o wide
NAME                                             ENDPOINTS                                                            AGE
metakube-resource-usage-collector                10.244.17.178:5432                                                   75d
metakube-resource-usage-collector-config         <none>                                                               15h
metakube-resource-usage-collector-repl           10.244.19.13:5432,10.244.22.151:5432                                 75d
postgres-exporter-prometheus-postgres-exporter   10.244.26.207:9187                                                   75d
...
$

Start an Alpine container for debugging in the exporter container's network and PID namespaces:

$ ssh ubuntu@worker6-stage
ubuntu@worker6-stage:~$ docker ps | grep postgres_exporter
1cb72b345eb8        01295a253857                                            "/postgres_exporter …"   12 days ago         Up 12 days                              k8s_prometheus-postgres-exporter_postgres-exporter-prometheus-postgres-exporter-7b8b59d9c6-mvbkk_resource-usage-collector_1d73d492-8f50-11e9-9fd7-fa163e3912c3_0
ubuntu@worker6-stage:~$
ubuntu@worker6-stage:~$ docker run -ti --pid=container:1cb72b345eb8 --net=container:1cb72b345eb8 --cap-add sys_admin --cap-add sys_ptrace alpine sh
/ # apk add curl strace postgresql-client
...
/ # ps axfu
PID   USER     TIME  COMMAND
    1 root     14:59 /postgres_exporter --extend.query-path=/etc/config.yaml
   34 root      0:00 sh
   39 root      0:00 ps axfu
/ #

Open TCP connections. The one to 10.105.13.113:5432 is the connection to PostgreSQL (the service ClusterIP); the others are incoming metrics scrapes:

/ # netstat -tupen
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.244.26.207:59882     10.105.13.113:5432      ESTABLISHED 1/postgres_exporter
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:50106 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.22.5:47802 ESTABLISHED 1/postgres_exporter
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49094 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49268 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.6.250:56948 ESTABLISHED 1/postgres_exporter
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49672 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49072 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:48872 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49288 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:49624 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:48670 TIME_WAIT   -
tcp        0      0 ::ffff:10.244.26.207:9187 ::ffff:10.244.26.1:48904 TIME_WAIT   -
/ #

The running exporter's config file. The first two metrics in the file already show the problem. The full file is attached to the issue: config.yaml.gz

/ # cat /proc/1/root/etc/config.yaml | head -10
pg_replication_count:
  query: "SELECT (SELECT count(*) FROM pg_replication_slots) AS slots, (SELECT count(*) FROM pg_stat_replication) AS replicas"
  metrics:
    - slots:
        usage: "GAUGE"
        description: "Number of replication slots"
    - replicas:
        usage: "GAUGE"
        description: "Number of active replicas"

/ #

Querying the metrics endpoint returns 0/0:

/ # curl -s http://localhost:9187/metrics | grep pg_replication_count
# HELP pg_replication_count_replicas Number of active replicas
# TYPE pg_replication_count_replicas gauge
pg_replication_count_replicas 0
# HELP pg_replication_count_slots Number of replication slots
# TYPE pg_replication_count_slots gauge
pg_replication_count_slots 0
/ #

Run another exporter locally with the same config and environment, against the same PostgreSQL service. The new exporter reports the correct metrics (2/2 instead of 0/0):

/ # cat /proc/1/environ  | tr '\0' '\n' | grep 'DATA_SOURCE'
DATA_SOURCE_URI=metakube-resource-usage-collector.resource-usage-collector.svc.cluster.local:5432/resource_usage?sslmode=require
DATA_SOURCE_USER=metrics
DATA_SOURCE_PASS=xxxxxxxxx
/ # # take these three into the local environment and run the exporter
/ # eval $(cat /proc/1/environ  | tr '\0' '\n' | grep 'DATA_SOURCE' | sed 's/^/export /g')
/ # /proc/1/root/postgres_exporter --web.listen-address=":9188" --extend.query-path=/proc/1/root/etc/config.yaml &
/ # INFO[0000] Established new database connection.          source="postgres_exporter.go:1035"
INFO[0000] Semantic Version Changed: 0.0.0 -> 10.5.0     source="postgres_exporter.go:965"
INFO[0000] Starting Server: :9188                        source="postgres_exporter.go:1178"

/ # curl -s http://localhost:9188/metrics | grep pg_replication_count
# HELP pg_replication_count_replicas Number of active replicas
# TYPE pg_replication_count_replicas gauge
pg_replication_count_replicas 2
# HELP pg_replication_count_slots Number of replication slots
# TYPE pg_replication_count_slots gauge
pg_replication_count_slots 2
/ #

Querying the Postgres endpoint directly, using the query from the metric definition, also yields the correct result:

/ # PGHOST=metakube-resource-usage-collector.resource-usage-collector.svc.cluster.local PGPORT=5432 PGDATABASE=resource_usage PGUSER=$DATA_SOURCE_USER PGPASSWORD=$DATA_SOURCE_PASS psql
psql (11.3, server 10.5 (Ubuntu 10.5-2.pgdg18.04+1))
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

resource_usage=# SELECT (SELECT count(*) FROM pg_replication_slots) AS slots, (SELECT count(*) FROM pg_stat_replication) AS replicas;
 slots | replicas
-------+----------
     2 |        2
(1 row)

resource_usage=# \q

Verify that the exporter instance isn't totally static but actually does update several postgres-related metrics:

/ # curl -s http://localhost:9187/metrics >1; sleep 10; curl -s http://localhost:9187/metrics >2; diff -u 1 2
--- 1
+++ 2
... (go metrics diffs cut out) ...
@@ -96,13 +96,13 @@
 pg_database_size{datname="template1"} 7.709187e+06
 # HELP pg_exporter_last_scrape_duration_seconds Duration of the last scrape of metrics from PostgresSQL.
 # TYPE pg_exporter_last_scrape_duration_seconds gauge
-pg_exporter_last_scrape_duration_seconds 0.082874985
+pg_exporter_last_scrape_duration_seconds 0.090184116
 # HELP pg_exporter_last_scrape_error Whether the last scrape of metrics from PostgreSQL resulted in an error (1 for error, 0 for success).
 # TYPE pg_exporter_last_scrape_error gauge
 pg_exporter_last_scrape_error 0
 # HELP pg_exporter_scrapes_total Total number of times PostgresSQL was scraped for metrics.
 # TYPE pg_exporter_scrapes_total counter
-pg_exporter_scrapes_total 38698
+pg_exporter_scrapes_total 38700
 # HELP pg_exporter_user_queries_load_error Whether the user queries file was loaded and parsed successfully (1 for error, 0 for success).
 # TYPE pg_exporter_user_queries_load_error gauge
 pg_exporter_user_queries_load_error{filename="/etc/config.yaml",hashsum="b43a0d8eb09674732787eee2b2f2ab435e162ca01f250206d40f346c27d94721"} 0
@@ -839,7 +839,7 @@
 pg_stat_bgwriter_checkpoints_req 0
 # HELP pg_stat_bgwriter_checkpoints_timed Number of scheduled checkpoints that have been performed
 # TYPE pg_stat_bgwriter_checkpoints_timed counter
-pg_stat_bgwriter_checkpoints_timed 3507
+pg_stat_bgwriter_checkpoints_timed 3508
...
-pg_stat_database_blks_hit{datid="16560",datname="resource_usage"} 324476
+pg_stat_database_blks_hit{datid="16560",datname="resource_usage"} 324576
...
-pg_stat_database_tup_fetched{datid="16560",datname="resource_usage"} 98321
+pg_stat_database_tup_fetched{datid="16560",datname="resource_usage"} 98325
...
-pg_stat_database_tup_returned{datid="16560",datname="resource_usage"} 4.601664e+06
+pg_stat_database_tup_returned{datid="16560",datname="resource_usage"} 4.604594e+06
...
-pg_stat_database_xact_commit{datid="13018",datname="postgres"} 10592
-pg_stat_database_xact_commit{datid="16560",datname="resource_usage"} 37062
+pg_stat_database_xact_commit{datid="13018",datname="postgres"} 10593
+pg_stat_database_xact_commit{datid="16560",datname="resource_usage"} 37092
...
-process_cpu_seconds_total 901.24
+process_cpu_seconds_total 901.27
...
-process_resident_memory_bytes 1.3504512e+07
+process_resident_memory_bytes 1.284096e+07
...
/ #

Logs of the exporter. The network reset started at 2019-06-26T23:09:37Z; the recovery has long since completed.

$ kubectl -n resource-usage-collector logs postgres-exporter-prometheus-postgres-exporter-7b8b59d9c6-mvbkk
time="2019-06-15T09:29:49Z" level=info msg="Established new database connection." source="postgres_exporter.go:1035"
time="2019-06-15T09:29:49Z" level=info msg="Semantic Version Changed: 0.0.0 -> 10.5.0" source="postgres_exporter.go:965"
time="2019-06-15T09:29:49Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1178"
time="2019-06-26T23:09:37Z" level=info msg="Error running query on database:  pg_replication_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:09:37Z" level=info msg="Error running query on database:  pg_replication_slot_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:10:13Z" level=info msg="Error running query on database:  pg_replication_slot_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:10:13Z" level=info msg="Error running query on database:  pg_replication_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:10:32Z" level=info msg="Error running query on database:  pg_replication_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:10:32Z" level=info msg="Error running query on database:  pg_replication_slot_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:11:08Z" level=info msg="Error running query on database:  pg_replication_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:11:08Z" level=info msg="Error running query on database:  pg_replication_slot_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
time="2019-06-26T23:14:31Z" level=warning msg="Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string: pq: the database system is shutting down" source="postgres_exporter.go:1081"
time="2019-06-26T23:22:24Z" level=info msg="Error running query on database:  pg_replication_slot_lags pq: recovery is in progress\n" source="postgres_exporter.go:933"
$

Any ideas what might be going on here? It looks like there is some local query caching going on, or something similar?

If I manually delete/restart the exporter pod, it will report the correct metrics again -- verified in another cluster where the same problem occurred simultaneously.
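
For reference, the manual workaround is just deleting the pod and letting the Deployment recreate it; a sketch, using the pod name from the listing above:

$ kubectl -n resource-usage-collector delete pod postgres-exporter-prometheus-postgres-exporter-7b8b59d9c6-mvbkk
pod "postgres-exporter-prometheus-postgres-exporter-7b8b59d9c6-mvbkk" deleted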

jonny-wg2 commented 5 years ago

Can you show your deployment.yaml?

wrouesnel commented 5 years ago

Up until the upcoming 0.7.0 release, the exporter doesn't do any caching of query data - it queries the DB on every scrape and returns the result.

zer0def commented 4 years ago

This error is a race condition and happens when the exporter connects to the database engine while it is starting up or shutting down (per the line time="2019-06-26T23:14:31Z" level=warning msg="Proceeding with outdated query maps, as the Postgres version could not be determined: Error scanning version string: pq: the database system is shutting down" source="postgres_exporter.go:1081"). The connection to the engine succeeds, but the statistics collector views aren't available yet, so the pg_settings_.* series get exported while the pg_stat_.* ones do not.

I've been semi-consistently hitting this behavior on a 3-node Patroni cluster (spawned from the incubator/patroni chart) with a postgres-exporter container added into the pod, exporting metrics from localhost. Enabling synchronous replication makes the problem more visible, for example by setting bootstrap.dcs.synchronous_mode to true in Patroni's configuration (see the fragment below).
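
For illustration, that knob sits under the bootstrap DCS settings; a minimal fragment, assuming the stock Patroni YAML layout (only this key is shown, the rest of the file stays unchanged):

bootstrap:
  dcs:
    synchronous_mode: true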

garry-t commented 1 year ago

The problem is still relevant for version 0.9.0:

time="2023-10-31T08:00:14+01:00" level=info msg="Error running query on database \"/var/run/postgresql/:5432\": pg_replication_slots pq: recovery is in progress" source="postgres_exporter.go:1503"

postgres_exporter[2930812]: time="2023-10-31T08:00:15+01:00" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1621"

If the exporter runs against a replica, it is obvious that the database will be in a recovery state (a quick check is sketched below).
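
A quick way to confirm the node is a standby is pg_is_in_recovery(); a minimal psql check, assuming a connection to the same instance the exporter scrapes (prompt and database name are placeholders):

postgres=# SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)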

SuperQ commented 1 year ago

v0.9.0 is a very old, unsupported release.