prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

Facing error "metric pg_database_size_bytes was collected before with the same name and label values" #706

Open pankdhnd opened 1 year ago

pankdhnd commented 1 year ago

What did you do? We deployed postgres_exporter version 0.11.1 on Azure using the official community Helm chart. We are facing an issue with the following query from the chart:

```yaml
pg_database:
  query: "SELECT pg_database.datname, pg_database_size(pg_database.datname) as size_bytes FROM pg_database"
  master: true
  cache_seconds: 30
  metrics:
```
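For context, a complete entry in a custom queries file normally also spells out how each column maps to a label or value. A sketch of what such an entry typically looks like is below; the usage and description fields are assumed for illustration, not quoted from the chart:

```yaml
pg_database:
  query: "SELECT pg_database.datname, pg_database_size(pg_database.datname) as size_bytes FROM pg_database"
  master: true
  cache_seconds: 30
  metrics:
    - datname:
        usage: "LABEL"          # database name becomes the datname label
        description: "Name of the database"
    - size_bytes:
        usage: "GAUGE"          # exported as pg_database_size_bytes
        description: "Disk space used by the database"
```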

We keep observing the following error on the /metrics endpoint for the above query:

An error has occurred while serving metrics:

7 error(s) occurred:
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"azure_maintenance" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.401455e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"template1" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.159747e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"postgres" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.589871e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"template0" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.159747e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"azure_sys" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.974895e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"testdb1" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:8.311343e+06 > } was collected before with the same name and label values
* collected metric "pg_database_size_bytes" { label:<name:"datname" value:"example" > label:<name:"server" value:"<IP ADDRESS>:5432" > gauge:<value:3.0814767e+07 > } was collected before with the same name and label values

What did you expect to see? We expect to see regular metrics for pg_database_size_bytes with the database name and size in labels.

What did you see instead? Under which circumstances? We keep seeing the error listed above on the /metrics endpoint

Environment Azure

AlexDCraig commented 1 year ago

I'm getting this too on the same metric after upgrading. Kinda weird.

AlexDCraig commented 1 year ago

After digging a little more, this is because in exporter 0.11 they introduced the pg_database collector (see here: https://github.com/prometheus-community/postgres_exporter/compare/v0.10.1...v0.11.0#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R44).

It doesn't look like you can turn this collector off via CLI args (https://github.com/prometheus-community/postgres_exporter/blob/v0.11.1/collector/collector.go#L41 and https://github.com/prometheus-community/postgres_exporter/blob/v0.11.1/collector/pg_database.go#L24).

There are a couple of options. The workaround is to remove the pg_database metric from your own copy of the queries file. The permanent solution is to either (a) update the postgres_exporter code to allow turning off the pg_database collector, or (b) update the Helm chart to not include the query.

I'll submit a PR for the exporter to allow a user to toggle off the pg_database collector

KES777 commented 1 year ago

Same problem here:

An error has occurred while serving metrics:

10 error(s) occurred:
* collected metric pg_settings_archive_timeout_seconds label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:0 >  has help "Forces a switch to the next WAL file if a new file has not been started within N seconds. [Units converted to seconds.]" but should have "Sets the amount of time to wait before forcing a switch to the next WAL file. [Units converted to seconds.]"
* collected metric pg_settings_checkpoint_warning_seconds label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:30 >  has help "Enables warnings if checkpoint segments are filled more frequently than this. [Units converted to seconds.]" but should have "Sets the maximum time before warning if checkpoints triggered by WAL volume happen too frequently. [Units converted to seconds.]"
* collected metric pg_settings_commit_siblings label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:5 >  has help "Sets the minimum concurrent open transactions before performing commit_delay." but should have "Sets the minimum number of concurrent open transactions required before performing commit_delay."
* collected metric pg_settings_log_parameter_max_length_bytes label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:-1 >  has help "When logging statements, limit logged parameter values to first N bytes. [Units converted to bytes.]" but should have "Sets the maximum length in bytes of data logged for bind parameter values when logging statements. [Units converted to bytes.]"
* collected metric pg_settings_log_parameter_max_length_on_error_bytes label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:0 >  has help "When reporting an error, limit logged parameter values to first N bytes. [Units converted to bytes.]" but should have "Sets the maximum length in bytes of data logged for bind parameter values when logging statements, on error. [Units converted to bytes.]"
* collected metric pg_settings_log_rotation_age_seconds label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:86400 >  has help "Automatic log file rotation will occur after N minutes. [Units converted to seconds.]" but should have "Sets the amount of time to wait before forcing log file rotation. [Units converted to seconds.]"
* collected metric pg_settings_log_rotation_size_bytes label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:1.048576e+07 >  has help "Automatic log file rotation will occur after N kilobytes. [Units converted to bytes.]" but should have "Sets the maximum size a log file can reach before being rotated. [Units converted to bytes.]"
* collected metric pg_settings_post_auth_delay_seconds label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:0 >  has help "Waits N seconds on connection startup after authentication. [Units converted to seconds.]" but should have "Sets the amount of time to wait after authentication on connection startup. [Units converted to seconds.]"
* collected metric pg_settings_pre_auth_delay_seconds label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:0 >  has help "Waits N seconds on connection startup before authentication. [Units converted to seconds.]" but should have "Sets the amount of time to wait before authentication on connection startup. [Units converted to seconds.]"
* collected metric pg_settings_ssl_passphrase_command_supports_reload label:<name:"dbcluster" value:"pgdb" > label:<name:"server" value:"tsdb:5432" > gauge:<value:0 >  has help "Also use ssl_passphrase_command during server reload." but should have "Controls whether ssl_passphrase_command is called during server reload."

The error appears only when DATA_SOURCE_NAME is configured with multiple targets; each server works fine on its own. With multi-target configured and PG_EXPORTER_AUTO_DISCOVER_DATABASES: "false", the error still exists.
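For reference, "multi target" here means a comma-separated DATA_SOURCE_NAME; a minimal sketch with illustrative hostnames and credentials:

```sh
# Two DSNs separated by a comma make the exporter scrape both servers
# from the same /metrics endpoint (hosts and credentials are examples).
export DATA_SOURCE_NAME="postgresql://user:pass@db1:5432/postgres?sslmode=disable,postgresql://user:pass@db2:5432/postgres?sslmode=disable"
postgres_exporter
```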

The image is the :master tag: https://hub.docker.com/layers/prometheuscommunity/postgres-exporter/master/images/sha256-c692c74a12a89dc905cbc3ccbcf77d73a715fa53a2f5506d4d1060a7725c689d?context=explore

KES777 commented 1 year ago

Providing the --disable-default-metrics option resolves the problem. This seems to be because the exporter exports the same metrics from different servers under the same name.
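A sketch of that workaround; the environment-variable spelling is from memory and worth checking against the README:

```sh
# Keep only the metrics defined in a custom queries file by turning off
# the exporter's built-in default metrics; the equivalent environment
# variable should be PG_EXPORTER_DISABLE_DEFAULT_METRICS=true.
postgres_exporter --disable-default-metrics
```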

sysadmind commented 1 year ago

@AlexDHoffer I think there could be a bug in the flag logic. The flag logic comes from node_exporter and it's mostly copy/paste. Each collector in the collector directory is supposed to have a flag to enable/disable, but it may not be working as expected. I'll try and put some more time into figuring out if that is working correctly.

AlexDCraig commented 1 year ago

@sysadmind Yes, I believe it's a bug. I did a little local testing: the disable didn't function, and my admittedly small code changes didn't seem to fix it.

sysadmind commented 1 year ago

OK, I figured out why the --help output doesn't show how to disable default collectors. The problem is outlined here: https://github.com/alecthomas/kingpin/issues/216

The short answer is that for every --collector.collector_name flag there is also a --no-collector.collector_name flag, but it is not shown in the help text. This is unfortunately a limitation of the library that we use for flag parsing. We may be able to revisit this in the future. I'll see about adding some docs to the repo to make this clearer.
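A minimal sketch of what that looks like on the command line; the collector name database is taken from collector/pg_database.go linked above, and whether disabling it is enough to clear the duplicate depends on the release:

```sh
# kingpin generates a hidden --no- negation for every boolean flag, so a
# built-in collector can be switched off even though --help never lists it:
postgres_exporter --no-collector.database

# and switched back on explicitly:
postgres_exporter --collector.database
```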

KES777 commented 1 year ago

@sysadmind Would it be possible to display a default message about those --no* options, then?

heidricha commented 1 year ago

> Providing the --disable-default-metrics option resolves the problem. This seems to be because the exporter exports the same metrics from different servers under the same name.

This doesn't solve the case for me. What could the difference be?

```console
[gitlab-runner@runner8 docker]$ curl --noproxy "*" 172.30.0.5:9187/metrics
An error has occurred while serving metrics:

4 error(s) occurred:
* collected metric "pg_scrape_collector_duration_seconds" { label:<name:"collector" value:"bgwriter" > gauge:<value:0.038808206 > } was collected before with the same name and label values
* collected metric "pg_scrape_collector_success" { label:<name:"collector" value:"bgwriter" > gauge:<value:1 > } was collected before with the same name and label values
* collected metric "pg_scrape_collector_duration_seconds" { label:<name:"collector" value:"database" > gauge:<value:0.044570629 > } was collected before with the same name and label values
* collected metric "pg_scrape_collector_success" { label:<name:"collector" value:"database" > gauge:<value:1 > } was collected before with the same name and label values

[gitlab-runner@runner8 docker]$ docker-compose logs postgres-exporter
Attaching to imaginarium_postgres-exporter_1
postgres-exporter_1  | ts=2022-11-25T09:16:19.813Z caller=main.go:135 level=info msg="Listening on address" address=:9187
postgres-exporter_1  | ts=2022-11-25T09:16:19.815Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
postgres-exporter_1  | ts=2022-11-25T09:16:24.997Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=smv-timescale:9050
postgres-exporter_1  | ts=2022-11-25T09:16:25.017Z caller=postgres_exporter.go:645 level=info msg="Semantic version changed" server=smv-timescale:9050 from=0.0.0 to=14.5.0
postgres-exporter_1  | ts=2022-11-25T09:16:25.063Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=smv-timescale-op:9055
postgres-exporter_1  | ts=2022-11-25T09:16:25.082Z caller=postgres_exporter.go:645 level=info msg="Semantic version changed" server=smv-timescale-op:9055 from=0.0.0 to=14.5.0

[gitlab-runner@runner8 docker]$ cat smv-psql-exporter/start.sh
#!/bin/sh

DATA_SOURCE_NAME="postgresql://postgresadmin:$(cat /run/secrets/postgresadmin)@smv-timescale:9050/smartvision?sslmode=disable,postgresql://postgresadmin:$(cat /run/secrets/postgresadmin)@smv-timescale-op:9055/smartvision-op?sslmode=disable" \
  postgres_exporter "$@"
```

The relevant docker-compose service definition:

```yaml
  postgres-exporter:
    <<: *sec-defaults
    <<: *log-defaults
    restart: on-failure:5
    cpu_count: 1
    cpu_shares: 256
    cpu_percent: 30
    mem_limit: 400mb
    image: ${DOCKER_PROXY}bitnami/postgres-exporter:0.11.1
    read_only: true
    networks:
      default:
        aliases:
          - smv-postgres-exporter
    healthcheck:
      test: [ "CMD", "/bin/true" ]
    secrets:
      - postgresadmin
    volumes:
      - ./smv-psql-exporter/queries.yaml:/queries.yaml:ro
      - ./smv-psql-exporter/start.sh:/start.sh
    entrypoint: /start.sh
    environment:
      PG_EXPORTER_AUTO_DISCOVER_DATABASES: "false"
      PG_EXPORTER_EXCLUDE_DATABASES: "template0,template1"
    command:
      - "--disable-default-metrics"
      - "--extend.query-path=/queries.yaml"
```
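If the remaining duplicates come from the built-in collectors themselves (the bgwriter and database entries in the error output above), one thing to try is adding the hidden negation flags mentioned earlier to the container command. A sketch, assuming those flags exist in this image/version:

```yaml
    command:
      - "--disable-default-metrics"
      - "--extend.query-path=/queries.yaml"
      # Assumed: kingpin's hidden --no- negation flags for the built-in
      # collectors; they may not be present in every release.
      - "--no-collector.database"
      - "--no-collector.bgwriter"
```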
BibbyChung commented 1 year ago

> Providing the --disable-default-metrics option resolves the problem. This seems to be because the exporter exports the same metrics from different servers under the same name.

It doesn't work for me.

rbudiharso commented 1 year ago

I'm also having the same problem with multi-target and autoDiscoverDatabases set to true.

heidricha commented 1 year ago

I just tried this image: bitnami/postgres-exporter:0.12.1-debian-11-r1

and it finally works for me with the same config I used with 0.10.1:

```console
[gitlab-runner@runner8 docker]$ cat pge/start.sh
#!/bin/sh

DATA_SOURCE_NAME="postgresql://postgresadmin:$(cat /run/secrets/postgresadmin)@smv-timescale:9050/smartvision?sslmode=disable,postgresql://postgresadmin:$(cat /run/secrets/postgresadmin)@smv-timescale-op:9055/smartvision-op?sslmode=disable" \
  postgres_exporter "$@"
```

No error messages, and metrics are collected from both databases.

pgacek commented 1 year ago

Hello, I have some strange behaviour with multi-target and a couple of metrics, on exporter version v0.13.2. The scraped PostgreSQL servers are on multiple versions, from 11 to 14.

The exporter runs as a pod on k8s. Prometheus scraping is managed via a ServiceMonitor and service discovery, without static_config.targets.

What I can see is that some of the metrics have a nicely discovered {server="server_name"} label, and some do not. I checked the --disable-default-metrics option and nothing changed.

Most of the metrics without the server label are pg_stat_ metrics; one of the most important metrics also lacks the server label:

```
pg_database_size_bytes{datname="allure-reports"} 2.5514655e+07
pg_database_size_bytes{datname="allure-uaa"} 9.278111e+06
pg_database_size_bytes{datname="apache_airflow"} 2.561749663e+09
pg_database_size_bytes{datname="area86"} 6.35941535e+08
pg_database_size_bytes{datname="checker_content"} 1.76837279e+08
pg_stat_statements_calls_total{datname="universal_checker",queryid="6698132865110312156",user="universal-checker"} 4.670399e+06
pg_stat_statements_calls_total{datname="universal_checker",queryid="7280780680036750623",user="universal-checker"} 1.6597204e+07
pg_stat_statements_calls_total{datname="universal_checker",queryid="833913155023572892",user="universal-checker"} 1.35308487e+08
pg_stat_statements_calls_total{datname="universal_checker",queryid="8446622893467498190",user="universal-checker"} 5.839715e+06
# HELP pg_stat_statements_rows_total Total number of rows retrieved or affected by the statement
# TYPE pg_stat_statements_rows_total counter
pg_stat_statements_rows_total{datname="apache_airflow",queryid="-1648181532424320167",user="apache_airflow"} 3.3807645e+07
pg_stat_statements_rows_total{datname="apache_airflow",queryid="-396213125263694507",user="apache_airflow"} 2.44226478e+08
```

and metrics with the server label:

```
pg_settings_allow_in_place_tablespaces{container="prometheus-postgres-exporter", endpoint="http", instance="192.168.225.236:9187", job="prometheus-postgres-exporter", namespace="monitoring", pod="prometheus-postgres-exporter-fb978497c-5l8w7", server="<server5>:5432", service="prometheus-postgres-exporter"} 0
pg_settings_allow_in_place_tablespaces{container="prometheus-postgres-exporter", endpoint="http", instance="192.168.225.236:9187", job="prometheus-postgres-exporter", namespace="monitoring", pod="prometheus-postgres-exporter-fb978497c-5l8w7", server="<server4>:5432", service="prometheus-postgres-exporter"} 0
pg_settings_allow_in_place_tablespaces{container="prometheus-postgres-exporter", endpoint="http", instance="192.168.225.236:9187", job="prometheus-postgres-exporter", namespace="monitoring", pod="prometheus-postgres-exporter-fb978497c-5l8w7", server="<server3>:5432", service="prometheus-postgres-exporter"} 0
pg_settings_allow_in_place_tablespaces{container="prometheus-postgres-exporter", endpoint="http", instance="192.168.225.236:9187", job="prometheus-postgres-exporter", namespace="monitoring", pod="prometheus-postgres-exporter-fb978497c-5l8w7", server="<server2>:5432", service="prometheus-postgres-exporter"} 0
pg_settings_allow_in_place_tablespaces{container="prometheus-postgres-exporter", endpoint="http", instance="192.168.225.236:9187", job="prometheus-postgres-exporter", namespace="monitoring", pod="prometheus-postgres-exporter-fb978497c-5l8w7", server="<server1>:5432", service="prometheus-postgres-exporter"} 0
```

Any ideas?