prometheus-community / postgres_exporter

A PostgreSQL metric exporter for Prometheus
Apache License 2.0

Memory leak in Postgres query's process #799

Open mpzrtauio opened 1 year ago

mpzrtauio commented 1 year ago

What did you do? Run PostgreSQL and postgres_exporter on the same VM (Ubuntu) in Azure. We scrape the postgres_exporter endpoint every 15 seconds.

What did you expect to see? Roughly constant, low memory usage for the Postgres query process that belongs to postgres_exporter.

What did you see instead? Under which circumstances?

(screenshots: memory usage graphs showing a sawtooth pattern)

As you can see, there is a sawtooth in memory consumption. We started measuring our Linux processes' memory usage with htop and pg_top and found that the Postgres query process belonging to postgres_exporter consumes more and more memory with every scrape (in our case roughly 1 MB per minute). When we stopped the postgres_exporter service, the issue disappeared. When we only stopped scraping, memory consumption stayed constant. Finally, when we killed the exporter's idle Postgres query process, the occupied memory was freed.

Our workaround is based on this observation: we wrote a bash script that kills the exporter's Postgres query process every 15 minutes, which keeps the memory sawtooth much smaller.
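For reference, a server-side alternative to a kill-based script is to terminate the exporter's idle backends from SQL. This is only a sketch: it assumes the exporter connects as a role named 'monitoring' (matching the pg_stat_statements query later in this thread) and that the exporter simply reconnects afterwards.

-- Sketch: terminate idle backends owned by the (assumed) monitoring role so
-- their accumulated memory is released; run periodically, e.g. from cron.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE usename = 'monitoring'
  AND state = 'idle'
  AND pid <> pg_backend_pid();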

Our largest PostgreSQL database is around 2 TB. We use the following PostgreSQL extensions:

Environment

Linux 5.15.0-1038-azure x86_64

postgres_exporter, version 0.12.0 (branch: HEAD, revision: f9a1edbbc6d629663aa658378e1d989293e6e488)
  build user: root@463311dfdfb6
  build date: 20230321-23:52:03
  go version: go1.20.2
  platform: linux/amd64
  tags: netgo static_build

We don't use any flags; we use the default settings.

PostgreSQL 14.4 (Ubuntu 14.4-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit

keithf4 commented 1 year ago

Do you have JIT enabled in PostgreSQL? We were having similar issues and thought it was the exporter as well, but disabling JIT in PostgreSQL solved it for us:

https://github.com/CrunchyData/pgmonitor/issues/182#issuecomment-706559384

If you don't want to disable it completely, you can do it for just your monitoring user; that is what we did:

ALTER ROLE ccp_monitoring SET jit TO 'off';
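A quick way to double-check that the per-role override is actually in place (sketch only, using the role name from the example above) is to inspect pg_roles:

-- Confirm the role-level setting was recorded; new sessions for this role
-- should then report jit = off (check with SHOW jit;).
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'ccp_monitoring';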
mpzrtauio commented 1 year ago

We have already turned off JIT for monitoring users, and the issue is still present.

keithf4 commented 1 year ago

Were you able to narrow it down to any specific queries? That's what helped us figure out it was JIT in our case. It could also help diagnose whether this is an upstream issue in PG or something in the exporter itself.

mpzrtauio commented 1 year ago

@keithf4 Apologies for the delayed response. We enabled the pg_stat_statements extension and started collecting statistics about the exporter's queries.

SELECT *
FROM pg_stat_statements
WHERE userid::regrole = 'monitoring'::regrole
ORDER BY (shared_blks_hit + shared_blks_dirtied);
userid | dbid | toplevel | queryid | query | plans | total_plan_time | min_plan_time | max_plan_time | mean_plan_time | stddev_plan_time | calls | total_exec_time | min_exec_time | max_exec_time | mean_exec_time | stddev_exec_time | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written
6681618 | 16447 | true | -10637612914751176 | SELECT pg_database_size($1) | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 915420 | 4135248.9797929386 | 0.415105 | 179.16520699999998 | 4.517324266230871 | 8.040072400672269 | 915420 | 1832208 | 0 | 0 | 0
6681618 | 16447 | true | -2954216581073452720 | SELECT pg_database.datname, tmp.state, tmp2.usename, tmp2.application_name, COALESCE(count,$1) as count, COALESCE(max_tx_duration,$2) as max_tx_duration FROM ( VALUES ($3), ($4), ($5), ($6), ($7), ($8) ) AS tmp(state) CROSS JOIN pg_database LEFT JOIN ( SELECT datname, state, usename, application_name, count(*) AS count, MAX(EXTRACT($9 FROM now() - xact_start))::float AS max_tx_duration FROM pg_stat_activity GROUP BY datname,state,usename,application_name) AS tmp2 ON tmp.state = tmp2.state AND pg_database.datname = tmp2.datname | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 26361.279016000426 | 0.103602 | 14.475577 | 0.1441160691242474 | 0.132534524797907 | 5542986 | 548759 | 0 | 0 | 0
6681618 | 16447 | true | -182263461300699970 | SELECT pg_database.datname,tmp.mode,COALESCE(count,$1) as count FROM ( VALUES ($2), ($3), ($4), ($5), ($6), ($7), ($8), ($9), ($10) ) AS tmp(mode) CROSS JOIN pg_database LEFT JOIN (SELECT database, lower(mode) AS mode,count(*) AS count FROM pg_locks WHERE database IS NOT NULL GROUP BY database, lower(mode) ) AS tmp2 ON tmp.mode=tmp2.mode and pg_database.oid = tmp2.database ORDER BY 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 78858.32328600078 | 0.199402 | 63.891402 | 0.43111533256066775 | 1.6316327571492935 | 8231265 | 183430 | 0 | 0 | 0
6681618 | 16447 | true | -5932576113725481334 | SELECT pg_database.datname FROM pg_database | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2100.63370199979 | 0.0055 | 5.0638630000000004 | 0.011473606115225648 | 0.029558547476256736 | 915420 | 183084 | 0 | 0 | 0
6681618 | 16447 | true | -3196603689379148377 | SELECT * FROM pg_stat_database_conflicts | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 323691.1460849984 | 1.04641 | 34.539633 | 1.769606685463864 | 2.1520858588312666 | 914585 | 182917 | 0 | 0 | 0
6681618 | 16447 | true | -5444365506547626638 | SELECT * FROM pg_stat_database | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182916 | 379707.6526510065 | 1.118914 | 36.115644 | 2.075858058622502 | 2.3765647629448967 | 1097496 | 182916 | 0 | 0 | 0
6681618 | 16447 | true | 2859277235430319612 | SELECT name, setting, COALESCE(unit, $1), short_desc, vartype FROM pg_settings WHERE vartype IN ($2, $3, $4) AND name != $5 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 305611.66955899115 | 1.01731 | 64.357009 | 1.6707669027974452 | 0.7644764082973766 | 49753424 | 1311 | 49 | 0 | 0
6681618 | 16447 | true | -3033217080036596150 | SELECT checkpoints_timed, checkpoints_req ,checkpoint_write_time ,checkpoint_sync_time ,buffers_checkpoint ,buffers_clean ,maxwritten_clean ,buffers_backend ,buffers_backend_fsync ,buffers_alloc ,stats_reset FROM pg_stat_bgwriter | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2131866.113088992 | 1.337615 | 74.049995 | 11.644196724394355 | 2.2696155408569916 | 183084 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -9113088414397004364 | SELECT *, extract($1 from now() - last_archived_time) AS last_archive_age FROM pg_stat_archiver | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 429206.3244129997 | 1.0390110000000001 | 40.687288 | 2.3464539895855387 | 2.9282936754452984 | 182917 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -2563412158741520104 | SELECT version() | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 1364.2742470001874 | 0.0037 | 2.147126 | 0.007458433316750187 | 0.029891294116721038 | 182917 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -3693365941426935874 | SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) FROM pg_replication_slots | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 1406.4713770002957 | 0.005001 | 11.418737 | 0.0076891233564949915 | 0.03974505046098803 | 0 | 0 | 0 | 0 | 0
6681618 | 16447 | true | 7763026806317272650 | SELECT slot_name, pg_current_wal_lsn() - $1 AS current_wal_lsn, coalesce(confirmed_flush_lsn, $2) - $3, active FROM pg_replication_slots | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 183084 | 2160.2764899999106 | 0.0056 | 9.693518 | 0.011799373456992284 | 0.039906631555195596 | 0 | 0 | 0 | 0 | 0
6681618 | 16447 | true | -5644479190607423912 | SELECT *, (case pg_is_in_recovery() when $1 then $2 else pg_current_wal_lsn() end) AS pg_current_wal_lsn, (case pg_is_in_recovery() when $3 then $4 else pg_wal_lsn_diff(pg_current_wal_lsn(), pg_lsn($5))::float end) AS pg_current_wal_lsn_bytes, (case pg_is_in_recovery() when $6 then $7 else pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)::float end) AS pg_wal_lsn_diff FROM pg_stat_replication | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 182917 | 10725.877506000359 | 0.039701 | 14.47156 | 0.05863794784519792 | 0.10586158018482977 | 0 | 0 | 0 | 0 | 0

I hope this gives you some extra information. We welcome any suggestions for additional measurements.
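One additional measurement worth sketching here (not mentioned elsewhere in the thread, and assuming the exporter connects as the 'monitoring' role): since PostgreSQL 14 provides pg_log_backend_memory_contexts(), the memory contexts of the exporter's backend can be dumped to the server log; repeating the dump over time would show which context is growing.

-- Sketch: log the memory contexts of the exporter's (assumed) backend.
-- Requires superuser privileges in PostgreSQL 14; output goes to the server log.
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE usename = 'monitoring'
  AND state = 'idle';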

garry-t commented 1 year ago

@mpzrtauio do you have this problem on the latest version? I plan to upgrade, but if the latest version has the same issue I'll think twice.

garry-t commented 12 months ago

I upgraded: https://github.com/prometheus-community/postgres_exporter/issues/958 :)

piyushfs2 commented 10 months ago

Did anyone find a resolution for this? I am facing the same issue.