[Closed] lichuan0620 closed this issue 1 year ago
@lichuan0620 does the cardinality issue occur because of your many custom queries, or because of the default queries?
I don't think that pg_stat_statements is part of the default metrics exported by the exporter. Did you copy the example queries.yaml file and set it to be used with `--extend.query-path` or `PG_EXPORTER_EXTEND_QUERY_PATH`? You could remove it from that file to not export those metrics. It's probably a poor metric to export, and we should remove it from the example.
> You could remove it from that file to not export those metrics.

Yes, and I've done just that.

> we should remove it from the example

That'll do.
I made a filtered version of the `pg_stat_statements` query a while back. It reduced our cardinality a lot and put some harder limits on it:
```yaml
pg_stat_statements:
  query: |
    SELECT
      pg_get_userbyid(userid) as user,
      pg_database.datname,
      pg_stat_statements.queryid,
      pg_stat_statements.calls,
      pg_stat_statements.total_time / 1000.0 as seconds_total,
      pg_stat_statements.rows,
      pg_stat_statements.blk_read_time / 1000.0 as block_read_seconds_total,
      pg_stat_statements.blk_write_time / 1000.0 as block_write_seconds_total
    FROM pg_stat_statements
    JOIN pg_database
      ON pg_database.oid = pg_stat_statements.dbid
    WHERE
      total_time > (
        SELECT percentile_cont(0.1)
        WITHIN GROUP (ORDER BY total_time)
        FROM pg_stat_statements
      )
    ORDER BY seconds_total DESC
    LIMIT 500
  metrics:
    - user:
        usage: "LABEL"
        description: "The user who executed the statement"
    - datname:
        usage: "LABEL"
        description: "The database in which the statement was executed"
    - queryid:
        usage: "LABEL"
        description: "Internal hash code, computed from the statement's parse tree"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - seconds_total:
        usage: "COUNTER"
        description: "Total time spent in the statement, in seconds"
    - rows:
        usage: "COUNTER"
        description: "Total number of rows retrieved or affected by the statement"
    - block_read_seconds_total:
        usage: "COUNTER"
        description: "Total time the statement spent reading blocks, in seconds"
    - block_write_seconds_total:
        usage: "COUNTER"
        description: "Total time the statement spent writing blocks, in seconds"
```
One of the big problems here is that `pg_stat_statements` has no "last seen" column, which means you can't filter out queries that are no longer being made. So, IMO, it's best practice to truncate these statistics regularly. I found that daily was a good option.
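A note on the mechanics: `pg_stat_statements` is exposed as a view, so `TRUNCATE TABLE` won't work on it directly; the extension provides a reset function instead. A minimal sketch of the daily reset described above (this assumes the connecting role is allowed to call the function, e.g. a superuser or a role it has been granted to):

```sql
-- pg_stat_statements is a view over the extension's internal state,
-- so its entries are cleared with the reset function rather than TRUNCATE.
SELECT pg_stat_statements_reset();
```

You could run this once a day from any scheduler, e.g. via `psql -c 'SELECT pg_stat_statements_reset();'` in a cron job.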
I have no idea why someone would put the `queryid` into a metric label. Why is that even needed?! I am getting 25K metrics for every query.
We're experiencing the same issue.
The postgres exporter is exporting 100k separate metrics to Prometheus. It constantly tops the graph of the PromQL query `topk(5, scrape_samples_scraped)`.
It is occurring because `queryid` is applied as a label to the postgres metrics. This causes a cardinality explosion: each unique value of `queryid` occurs only once, which is a practice that confuses metrics with logs, and it results in an additional suite of ~19 separate metric series being stored for every individual query. Accordingly, `queryid` is reported on the Prometheus server's `/tsdb-status` page as one of the top label names by value count, with ~5000 separate values of the `queryid` label.
Note, Grafana Cloud charges $8/month per 1k active metrics, so this could incur a substantial financial cost to users.
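If editing queries.yaml isn't an option, one way to contain this on the Prometheus side is to drop the offending series at scrape time with `metric_relabel_configs`. A sketch, assuming the default exporter port and metric names prefixed `pg_stat_statements_` as in the example queries.yaml (the job name and target are illustrative; adjust the regex to the metric names you actually see):

```yaml
scrape_configs:
  - job_name: postgres-exporter      # illustrative job name
    static_configs:
      - targets: ['localhost:9187']  # default postgres_exporter port
    metric_relabel_configs:
      # Drop every series produced by the pg_stat_statements query,
      # all of which carry the high-cardinality queryid label.
      - source_labels: [__name__]
        regex: 'pg_stat_statements_.*'
        action: drop
```

Note that the exporter still pays the cost of running the query; this only keeps the series out of Prometheus storage.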
What did you do?
deploy postgres-exporter, run lots of queries
What did you expect to see?
everything works, nothing crashes
What did you see instead? Under which circumstances?
The number of time series created by postgres-exporter increased rapidly. Prometheus was OOM killed soon after.
Additional comment
I understand that it's commonly agreed that Prometheus metrics should have reasonable cardinality and avoid ID-like labels such as trace ID or query ID. This best practice has been discussed in this post and in various community issues.
Of course, the user could always disable or drop these metrics, as I already have. But this is still relevant information, and it could confuse those who haven't investigated closely (it did in my team). These data should be organized into more reasonable metrics that either sum over the different queries or put them into histograms.
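To illustrate the "sum over different queries" idea: a Prometheus recording rule can aggregate the `queryid` dimension away while keeping the per-database and per-user breakdown. A sketch, assuming the metric names produced by the example queries.yaml above (`pg_stat_statements_seconds_total`, `pg_stat_statements_calls`); the rule names are illustrative:

```yaml
groups:
  - name: pg_stat_statements_aggregated
    rules:
      # Query-time rate per database and user, with the high-cardinality
      # queryid label summed away.
      - record: datname_user:pg_stat_statements_seconds:rate5m
        expr: sum without (queryid) (rate(pg_stat_statements_seconds_total[5m]))
      # Same aggregation for the call rate.
      - record: datname_user:pg_stat_statements_calls:rate5m
        expr: sum without (queryid) (rate(pg_stat_statements_calls[5m]))
```

Combined with a relabel rule that drops the raw per-query series after the recording rules are evaluated, this would keep the signal without storing thousands of one-off series.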
Environment
System information: insert output of `uname -srm` here

postgres_exporter version: 0.8.0

postgres_exporter flags:

PostgreSQL version: insert PostgreSQL version here
Logs: