prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

Metrics endpoint starts timing out at intermittent intervals #1035

Open ayush-rathore-quartic opened 1 month ago

ayush-rathore-quartic commented 1 month ago

What did you do?

Running postgres_exporter as a container in a Kubernetes pod that also hosts the PostgreSQL server at localhost:5432
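Since both containers share the pod's network namespace, both endpoints can be checked from inside the pod, roughly like this (container names are placeholders; this assumes curl and psql are available in the respective images):

# scrape the exporter directly from inside the pod
kubectl exec <pod> -c postgres-exporter -- curl -sS --max-time 10 http://localhost:9187/metrics | head
# confirm the database is reachable over the unix socket used in DATA_SOURCE_NAME
kubectl exec <pod> -c postgres -- psql -h /var/run/postgresql -U postgres -c 'SELECT 1;'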

What did you expect to see?

The /metrics endpoint should have returned the Prometheus metrics at all times

What did you see instead? Under which circumstances?

  1. The /metrics endpoint worked well for about an hour, but after some time the metrics server starts timing out, i.e. there is no response at :9187/metrics (postgres_exporter runs on port 9187 of the pod). No logs about the failure to serve these requests appear in the postgres_exporter logs. A rough sketch of how the timeout shows up follows this list.
  2. The issue is often fixed by restarting the PostgreSQL server, but only for some time.
  3. I can still connect to the PostgreSQL server through psql at the same time.
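How the timeout shows up, roughly (the pod IP and the timeout value are placeholders):

curl -v --max-time 30 "http://<POD IP>:9187/metrics"
# the request hangs with no response body until curl gives up, e.g.:
# curl: (28) Operation timed out after 30000 milliseconds with 0 bytes received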

More Information

The requests at :9187/ and :9187/probe are served fine. When I try probing my PostgreSQL server with the commands below:

curl "<POD IP>:9187/probe?target=127.0.0.1:5432&sslmode=disable" curl "<POD IP>:9187/probe?target=:5432&sslmode=disable" curl "<POD IP>:9187/probe?target=/var/run/postgresql:5432&sslmode=disable"

Output

# HELP pg_exporter_last_scrape_duration_seconds Duration of the last scrape of metrics from PostgreSQL.
# TYPE pg_exporter_last_scrape_duration_seconds gauge
pg_exporter_last_scrape_duration_seconds{cluster_name="mydb",namespace="default"} 1.002118094
# HELP pg_exporter_last_scrape_error Whether the last scrape of metrics from PostgreSQL resulted in an error (1 for error, 0 for success).
# TYPE pg_exporter_last_scrape_error gauge
pg_exporter_last_scrape_error{cluster_name="mydb",namespace="default"} 1

.... 
....
....
pg_up{cluster_name="mydb",namespace="default"} 0

Logs emitted by postgres_exporter every time the above /probe requests are fired to check reachability of the PostgreSQL server:

ts=2024-05-23T20:30:23.947Z caller=probe.go:41 level=info msg="no auth_module specified, using default"
ts=2024-05-23T20:30:23.947Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=localhost:5432
ts=2024-05-23T20:30:23.949Z caller=collector.go:194 level=error target=:5432 msg="collector failed" name=bgwriter duration_seconds=0.001488188 err="pq: SSL is not enabled on the server"
ts=2024-05-23T20:30:23.950Z caller=collector.go:194 level=error target=:5432 msg="collector failed" name=replication_slot duration_seconds=0.002488279 err="pq: SSL is not enabled on the server"
ts=2024-05-23T20:30:23.950Z caller=collector.go:194 level=error target=:5432 msg="collector failed" name=database duration_seconds=0.003197173 err="pq: SSL is not enabled on the server"
ts=2024-05-23T20:30:24.949Z caller=postgres_exporter.go:716 level=error err="Error opening connection to database (postgresql://:5432): pq: SSL is not enabled on the server"
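The "SSL is not enabled on the server" errors suggest these probe connections go out over TCP and request SSL (lib/pq defaults to sslmode=require), even though sslmode=disable is passed in the probe URL, while the server itself has SSL turned off. As a sanity check, the server-side setting can be confirmed from the postgres container, roughly like this (assumes the default postgres superuser):

psql -h 127.0.0.1 -p 5432 -U postgres -c 'SHOW ssl;'
# expected output is "off" when SSL is not configured on the server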

Environment

Linux/Kubernetes

    postgres_exporter, version 0.12.1 (branch: HEAD, revision: 1c063b1b1913db029d449818e9cd1750c2282198)
        build user:       root@a5fc99238ef0
        build date:       20230613-16:18:22
        go version:       go1.20.5
        platform:         linux/amd64
        tags:             netgo static_build
/usr/local/bin/postgres_exporter --log.level=info
    {
      "name": "DATA_SOURCE_NAME",
      "value": "postgresql://postgres@:5432/postgres?host=/var/run/postgresql&sslmode=disable"
    },
    {
      "name": "PG_EXPORTER_EXTEND_QUERY_PATH",
      "value": "/var/opt/postgres-exporter/queries.yaml"
    },
    {
      "name": "PG_EXPORTER_CONSTANT_LABELS",
      "value": "cluster_name=mydb, namespace=default"
    }
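For reproducing outside the pod, the same configuration can be written as a plain shell invocation (a sketch; the values are copied from the env block above):

DATA_SOURCE_NAME='postgresql://postgres@:5432/postgres?host=/var/run/postgresql&sslmode=disable' \
PG_EXPORTER_EXTEND_QUERY_PATH=/var/opt/postgres-exporter/queries.yaml \
PG_EXPORTER_CONSTANT_LABELS='cluster_name=mydb, namespace=default' \
/usr/local/bin/postgres_exporter --log.level=info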
sh-4.4$ psql --version
psql (PostgreSQL) 16.2 (OnGres 16.2-build-6.31)

<No logs are emitted even when the requests at /metrics are timing out>

ayush-rathore-quartic commented 1 month ago

cc @sysadmind