prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

autoDiscoverDatabases:true gives error 500 (collected metric "pg_stat_activity_idle_process_seconds" was collected before) #753

Open baznikin opened 1 year ago

baznikin commented 1 year ago

Copied from prometheus-community/helm-charts

This issue was originally opened there but went unattended. Copied here as is. I experience this issue with postgres_exporter 0.10.1 and the latest 0.11.1 as well.

Describe the bug (a clear and concise description of what the bug is)

With config.autoDiscoverDatabases: true I get a 500 error on the /metrics endpoint:

< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 14 Jul 2022 11:37:11 GMT
< Transfer-Encoding: chunked
< 
An error has occurred while serving metrics:

12 error(s) occurred:
* collected metric "pg_stat_activity_idle_process_seconds" { label:<name:"application_name" value:"" > label:<name:"server" value:"develop-postgresql:5432" > histogram:<sample_count:117 sample_sum:9240 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:42 upper_bound:2 > bucket:<cumulative_count:53 upper_bound:5 > bucket:<cumulative_count:64 upper_bound:15 > bucket:<cumulative_count:65 upper_bound:30 > bucket:<cumulative_count:67 upper_bound:60 > bucket:<cumulative_count:67 upper_bound:90 > bucket:<cumulative_count:79 upper_bound:120 > bucket:<cumulative_count:118 upper_bound:300 > > } was collected before with the same name and label values
* collected metric "pg_stat_activity_idle_process_seconds" { label:<name:"application_name" value:"Patroni" > label:<name:"server" value:"develop-postgresql:5432" > histogram:<sample_count:1 sample_sum:1 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:15 > bucket:<cumulative_count:1 upper_bound:30 > bucket:<cumulative_count:1 upper_bound:60 > bucket:<cumulative_count:1 upper_bound:90 > bucket:<cumulative_count:1 upper_bound:120 > bucket:<cumulative_count:1 upper_bound:300 > > } was collected before with the same name and label values

According to https://github.com/prometheus-community/postgres_exporter/issues/429#issuecomment-850865088 I manually added the key master: true to the pg_stat_activity_idle query, and it fixes the error.
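
For reference, the change looks roughly like this in the queries configuration. This is only a sketch: the query body is the chart's built-in one (omitted here), and the metric names and descriptions are inferred from the error output above. The point is the added master: true key, which makes the exporter run the query only against the database named in the DSN instead of once per auto-discovered database:

pg_stat_activity_idle:
  query: "..."            # unchanged built-in query, omitted here
  master: true            # added: run only against the DSN database
  metrics:
    - application_name:
        usage: "LABEL"
        description: "Name of the application"
    - process_seconds:
        usage: "HISTOGRAM"
        description: "Idle time of server processes"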

What's your helm version?

version.BuildInfo{Version:"v3.7.1", GitCommit:"1d11fcb5d3f3bf00dbe6fe31b8412839a96b3dc4", GitTreeState:"clean", GoVersion:"go1.16.9"}

What's your kubectl version?

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Which chart?

prometheus-postgres-exporter

What's the chart version?

3.0.3

What happened?

No response

What you expected to happen?

No response

How to reproduce it?

No response

Enter the changed values of values.yaml?

serviceMonitor:
  enabled: true
config:
  datasource:
    host: develop-postgresql
    user: develop_admin
    password:
    passwordSecret:
      name: develop-admin.develop-postgresql.credentials.postgresql.acid.zalan.do
      key: password
    database: 'template1'
  logLevel: info
  autoDiscoverDatabases: true

Enter the command that you execute that is failing/misfunctioning.

helm install postgres-exporter prometheus-community/prometheus-postgres-exporter --namespace db -f ./postrges_exporter.yaml

Anything else we need to know?

No response

kuzmindb commented 1 year ago

There are a lot of issues about the "was collected before with the same name and label values" error, and I found a way to reproduce it.

I faced this error after adding the following block to my queries.yaml:

pg_invalid_indexes:
  query: "SELECT current_database() datname, count (*) FROM pg_catalog.pg_index i LEFT JOIN pg_catalog.pg_locks l ON (i.indexrelid = l.relation) WHERE NOT (i.indisvalid AND i.indisready) AND l.relation IS NULL"
  metrics:
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - count:
        usage: "GAUGE"
        description: "Count of invalid indexes"

I have 3 non-template databases in my test environment: postgres, mydb1 and mydb2. When I query the exporter I get the following error:

curl -s 'localhost:9187/metrics'|egrep "pg_invalid_indexes_count"
collected metric "pg_invalid_indexes_count" { label:<name:"datname" value:"postgres" > label:<name:"server" value:"127.0.0.1:5432" > gauge:<value:0 > } was collected before with the same name and label values

This query counts invalid indexes in a database and has to be run in every database, so my exporter is running with the --auto-discover-databases flag. I know that this query always returns exactly one row per database, so there should be no duplicates.
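
For context, here is a minimal sketch of the kind of invocation used in this setup. The flag names are real postgres_exporter flags; the DSN credentials and the queries.yaml path are placeholders only:

DATA_SOURCE_NAME="user=usename password=pass database=postgres host=hostfqdn port=5432 sslmode=disable" \
  ./postgres_exporter \
    --auto-discover-databases \
    --extend.query-path=/etc/postgres_exporter/queries.yaml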

To figure out what's happening i added a random() value, like this:

pg_invalid_indexes:
  query: "SELECT random(), current_database() datname, count (*) FROM pg_catalog.pg_index i LEFT JOIN pg_catalog.pg_locks l ON (i.indexrelid = l.relation) WHERE NOT (i.indisvalid AND i.indisready) AND l.relation IS NULL"
  metrics:
    - random:
        usage: "LABEL"
        description: "test"
    - datname:
        usage: "LABEL"
        description: "Name of current database"
    - count:
        usage: "GAUGE"
        description: "Count of invalid indexes"

And got this; the output changes after every restart of the postgres exporter:

curl -s 'localhost:9187/metrics'|egrep "pg_invalid_indexes_count"
pg_invalid_indexes_count{datname="postgres",random="0.4343765986839898",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.8933916408300497",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.9730973867622801",server="127.0.0.1:5432"} 0

restarting exporter

curl -s 'localhost:9187/metrics'|egrep "pg_invalid_indexes_count"
pg_invalid_indexes_count{datname="mydb1",random="0.3768252689858862",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.09343378715585615",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.14668469182350918",server="127.0.0.1:5432"} 0

restarting exporter

curl -s 'localhost:9187/metrics'|egrep "pg_invalid_indexes_count"
pg_invalid_indexes_count{datname="mydb2",random="0.7216720910762717",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.3559019246896078",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.5074779531582507",server="127.0.0.1:5432"} 0

If I keep restarting the exporter, eventually everything works fine:

curl -s 'localhost:9187/metrics'|egrep "pg_invalid_indexes_count"
pg_invalid_indexes_count{datname="mydb1",random="0.4005597441722699",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="mydb2",random="0.35699799578830493",server="127.0.0.1:5432"} 0
pg_invalid_indexes_count{datname="postgres",random="0.24877513869852663",server="127.0.0.1:5432"} 0

As you can see, the exporter queries the same database multiple times. Hope this helps to finally fix this issue. Tested on 0.11.1 and 0.12.0-rc.0.

kuzmindb commented 1 year ago

The problem seems to be with the connstring syntax in the DSN, support for which was added in https://github.com/prometheus-community/postgres_exporter/pull/473. When I change the DSN to the URI form (postgresql://user:pass@host/dbname) everything works just fine. The connstring that led to the accidental duplicate metrics error looks like this: DATA_SOURCE_NAME=user=usename password=pass database=postgres host=hostfqdn port=5432 sslmode=disable
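
To make the difference concrete, here are the two forms side by side, using the same placeholder credentials as above (the URI variant is a sketch of the equivalent DSN, not copied from a tested configuration):

# keyword/connstring form that triggered the duplicate metrics
DATA_SOURCE_NAME="user=usename password=pass database=postgres host=hostfqdn port=5432 sslmode=disable"

# equivalent URI form that works as expected
DATA_SOURCE_NAME="postgresql://usename:pass@hostfqdn:5432/postgres?sslmode=disable"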

SLoeuillet commented 1 year ago

The issue is that if you add external queries to your YAML file, you have to specify "master: true" to have each query executed only once (against one database only).

See https://github.com/prometheus-community/postgres_exporter/issues/786