prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

Memory leak in Postgres query's process #799

Open mpzrtauio opened 1 year ago

mpzrtauio commented 1 year ago

What did you do? Run PostgreSQL and postgres_exporter on the same VM (Ubuntu) in Azure. We scrape the postgres_exporter endpoint every 15 seconds.

What did you expect to see? Roughly constant, low memory usage for the Postgres query process that belongs to postgres_exporter.

What did you see instead? Under which circumstances?

(screenshots: memory usage graphs showing a sawtooth pattern)

As you can see, there is a sawtooth in memory consumption. We started measuring our Linux processes' memory usage with htop and pg_top and found that the Postgres query process belonging to postgres_exporter consumes more and more memory with every scrape (in our case roughly 1 MB per minute). When we stopped the postgres_exporter service, the issue disappeared. When we only stopped scraping, memory consumption stayed constant. Finally, when we killed the exporter's idle Postgres query process, the occupied memory was freed.

Our workaround is based on this observation: we wrote a bash script that kills the exporter's Postgres query process every 15 minutes, which keeps the memory sawtooth much smaller.
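For reference, a server-side alternative to a kill-based script is to terminate the exporter's idle backends from SQL. This is only a sketch: it assumes the exporter connects as a role named 'monitoring' (matching the pg_stat_statements query later in this thread) and that the exporter simply reconnects afterwards.

-- Sketch: terminate idle backends owned by the (assumed) monitoring role so
-- their accumulated memory is released; run periodically, e.g. from cron.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'monitoring'
  AND state = 'idle'
  AND pid <> pg_backend_pid();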

Our largest PostgreSQL database is around 2 TB. We use the following PostgreSQL extensions:

Environment

Linux 5.15.0-1038-azure x86_64

postgres_exporter, version 0.12.0 (branch: HEAD, revision: f9a1edbbc6d629663aa658378e1d989293e6e488)
  build user: root@463311dfdfb6
  build date: 20230321-23:52:03
  go version: go1.20.2
  platform: linux/amd64
  tags: netgo static_build

We don't use any flags; we use the default settings.

PostgreSQL 14.4 (Ubuntu 14.4-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

keithf4 commented 1 year ago

Do you have JIT enabled in PostgreSQL? We were having similar issues and thought it was the exporter as well, but disabling JIT in PostgreSQL solved it for us:

https://github.com/CrunchyData/pgmonitor/issues/182#issuecomment-706559384

If you don't want to disable it completely, you can do it for just your monitoring user; that is what we did:

ALTER ROLE ccp_monitoring SET jit TO 'off';
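A quick way to double-check that the per-role override is actually in place (sketch only, using the role name from the example above) is to inspect pg_roles:

-- Confirm the role-level setting was recorded; new sessions for this role
-- should then report jit = off (check with SHOW jit;).
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'ccp_monitoring';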
mpzrtauio commented 1 year ago

We have already turned off JIT for monitoring users, and the issue is still present.

keithf4 commented 1 year ago

Were you able to narrow it down to any specific queries? That's what helped us figure out it was JIT in our case. It could also help diagnose whether this is an upstream issue in PG or something in the exporter itself.

mpzrtauio commented 1 year ago

@keithf4 Apologies for the delayed response. We enabled the pg_stat_statements extension and started collecting statistics about the exporter's queries.

SELECT *
FROM pg_stat_statements
WHERE userid::regrole = 'monitoring'::regrole
ORDER BY (shared_blks_hit + shared_blks_dirtied);
userid | dbid | toplevel | queryid | query | plans | total_plan_time | min_plan_time | max_plan_time | mean_plan_time | stddev_plan_time | calls | total_exec_time | min_exec_time | max_exec_time | mean_exec_time | stddev_exec_time | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written
6681618 | 16447 | true | -10637612914751176 | SELECT pg_database_size($1) | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 915420 | 4135248.9797929386 | 0.415105 | 179.16520699999998 | 4.517324266230871 | 8.040072400672269 | 915420 | 1832208 | 0 | 0 | 0
6681618 | 16447 | true | -2954216581073452720 | SELECT pg_database.datname, tmp.state, tmp2.usename, tmp2.application_name, COALESCE(count,$1) as count, COALESCE(max_tx_duration,$2) as max_tx_duration FROM ( VALUES ($3), ($4), ($5), ($6), ($7), ($8) ) AS tmp(state) CROSS JOIN pg_database LEFT JOIN ( SELECT datname, state, usename, application_name, count(*) AS count, MAX(EXTRACT($9 FROM now() - xact_start))::float AS max_tx_duration FROM pg_stat_activity GROUP BY datname,state,usename,application_name) AS tmp2 ON tmp.state = tmp2.state AND pg_database.datname = tmp2.datname | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 26361.279016000426 | 0.103602 | 14.475577 | 0.1441160691242474 | 0.132534524797907 | 5542986 | 548759 | 0 | 0 | 0
6681618 | 16447 | true | -182263461300699970 | SELECT pg_database.datname,tmp.mode,COALESCE(count,$1) as count FROM ( VALUES ($2), ($3), ($4), ($5), ($6), ($7), ($8), ($9), ($10) ) AS tmp(mode) CROSS JOIN pg_database LEFT JOIN (SELECT database, lower(mode) AS mode,count(*) AS count FROM pg_locks WHERE database IS NOT NULL GROUP BY database, lower(mode) ) AS tmp2 ON tmp.mode=tmp2.mode and pg_database.oid = tmp2.database ORDER BY 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 78858.32328600078 | 0.199402 | 63.891402 | 0.43111533256066775 | 1.6316327571492935 | 8231265 | 183430 | 0 | 0 | 0
6681618 | 16447 | true | -5932576113725481334 | SELECT pg_database.datname FROM pg_database | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2100.63370199979 | 0.0055 | 5.0638630000000004 | 0.011473606115225648 | 0.029558547476256736 | 915420 | 183084 | 0 | 0 | 0
6681618 | 16447 | true | -3196603689379148377 | SELECT * FROM pg_stat_database_conflicts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 323691.1460849984 | 1.04641 | 34.539633 | 1.769606685463864 | 2.1520858588312666 | 914585 | 182917 | 0 | 0 | 0
6681618 | 16447 | true | -5444365506547626638 | SELECT * FROM pg_stat_database | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182916 | 379707.6526510065 | 1.118914 | 36.115644 | 2.075858058622502 | 2.3765647629448967 | 1097496 | 182916 | 0 | 0 | 0
6681618 | 16447 | true | 2859277235430319612 | SELECT name, setting, COALESCE(unit, $1), short_desc, vartype FROM pg_settings WHERE vartype IN ($2, $3, $4) AND name != $5 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 305611.66955899115 | 1.01731 | 64.357009 | 1.6707669027974452 | 0.7644764082973766 | 49753424 | 1311 | 49 | 0 | 0
6681618 | 16447 | true | -3033217080036596150 | SELECT checkpoints_timed, checkpoints_req ,checkpoint_write_time ,checkpoint_sync_time ,buffers_checkpoint ,buffers_clean ,maxwritten_clean ,buffers_backend ,buffers_backend_fsync ,buffers_alloc ,stats_reset FROM pg_stat_bgwriter | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2131866.113088992 | 1.337615 | 74.049995 | 11.644196724394355 | 2.2696155408569916 | 183084 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -9113088414397004364 | SELECT *, extract($1 from now() - last_archived_time) AS last_archive_age FROM pg_stat_archiver | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 429206.3244129997 | 1.0390110000000001 | 40.687288 | 2.3464539895855387 | 2.9282936754452984 | 182917 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -2563412158741520104 | SELECT version() | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 1364.2742470001874 | 0.0037 | 2.147126 | 0.007458433316750187 | 0.029891294116721038 | 182917 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -3693365941426935874 | SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) FROM pg_replication_slots | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 1406.4713770002957 | 0.005001 | 11.418737 | 0.0076891233564949915 | 0.03974505046098803 | 0 | 0 | 0 | 0 | 0
6681618 | 16447 | true | 7763026806317272650 | SELECT slot_name, pg_current_wal_lsn() - $1 AS current_wal_lsn, coalesce(confirmed_flush_lsn, $2) - $3, active FROM pg_replication_slots | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2160.2764899999106 | 0.0056 | 9.693518 | 0.011799373456992284 | 0.039906631555195596 | 0 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -5644479190607423912 | SELECT *, (case pg_is_in_recovery() when $1 then $2 else pg_current_wal_lsn() end) AS pg_current_wal_lsn, (case pg_is_in_recovery() when $3 then $4 else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn($5))::float end) AS pg_current_wal_lsn_bytes, (case pg_is_in_recovery() when $6 then $7 else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff FROM pg_stat_replication | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 10725.877506000359 | 0.039701 | 14.47156 | 0.05863794784519792 | 0.10586158018482977 | 0 | 0 | 0 | 0 | 0

I hope this gives you some extra information. We welcome any suggestions for additional measurements.
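One additional measurement worth sketching here (not mentioned elsewhere in the thread, and assuming the exporter connects as the 'monitoring' role): since PostgreSQL 14 provides pg_log_backend_memory_contexts(), the memory contexts of the exporter's backend can be dumped to the server log; repeating the dump over time would show which context is growing.

-- Sketch: log the memory contexts of the exporter's (assumed) backend.
-- Requires superuser privileges in PostgreSQL 14; output goes to the server log.
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE usename = 'monitoring'
  AND state = 'idle';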

garry-t commented 1 year ago

@mpzrtauio do you have this problem on the latest version? I plan to upgrade, but if the latest version has the same issue I'll think twice.

garry-t commented 12 months ago

I upgraded: https://github.com/prometheus-community/postgres_exporter/issues/958 :)

piyushfs2 commented 10 months ago

Did anyone find a resolution for this? I am facing the same issue.