Closed hrawulwa closed 2 months ago
Hi,
It seems to be because there are just too many things to process, which would eventually need more than 1GB of memory to handle (that's a hard limit in multiple area in postgres).
What is the pg_wait_sampling configuration on that remote server? The most likely explanation is that pg_wait_sampling.profile_pid
is on, which leads to a lot of entries. This is not required for powa, and I would generally discourage it exactly because of this kind of issue. This is even worse nowadays as the pid is not limited to 16bits address space anymore, which basically mean that enabling this option will give you an almost infinite amount of entries.
Yes, pg_wait_sampling.profile_pid is on. Looks like that is the default value, and set to "on" on all the DB servers. But I'm wondering why only these two servers are having the issue. Does pg_wait_sampling consume memory from the shared_buffers configuration, and increasing shared_buffers will resolve the issue? If I set pg_wait_sampling.profile_pid to off, do I need to restart the database? Also, please let me know if any of the performance metrics will not be captured by setting this to off.
Thanks
pg_wait_sampling use dynamic shared memory, so it can grow way more easily compared to other extensions. When profile_pid
is on, it will keep one entry per (pid, query_id, want_event) rather than (query_id, want_event). So if those 2 servers have errors it means that either:
or just a bit of everything.
If I set pg_wait_sampling.profile_pid to off, do I need to restart the database?
no, you can change it in the config file / ALTER SYSTEM and reload. But you will probably have to reset pg_wait_sampling to get rid of the old entries.
Also, please let me know if any of the performance metrics will not be captured by setting this to off.
Not as far as powa is concerned, as it doesn't use that value. In general those metrics are not really useful, as pid can get recycled, and once a connection has been close you can't really do much with its pid anyway.
OK. I will test today on these servers and see if the error goes away. Can you please share how to reset pg_wait_sampling to get rid of the old entries, and whether this step needs to be done on the DB server or repository server. Also, please advise if I need to restart powa-collector and powa-web after this change.
Thanks
Can you please share how to reset pg_wait_sampling to get rid of the old entries
SELECT pg_wait_sampling_reset_profile(); (per https://github.com/postgrespro/pg_wait_sampling)
and whether this step needs to be done on the DB server or repository server
on the repository server
Also, please advise if I need to restart powa-collector and powa-web after this change
no need, they will pick up automatically
I have a remote setup with powa-collector 1.2.0 and powa-web 4.2.1 running. Out of the many DB servers registered, I see wait sampling errors for couple of DB servers. The pg_wait_sampling extension I'm using in the DB servers is 1.1.5
POWA-WEB ERRORS
TGA-OCC2: Error while calling public.powa_wait_sampling_src: invalid memory alloc request size 1126241016 CONTEXT: PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY
SCM OCE: Error while calling public.powa_wait_sampling_src: invalid memory alloc request size 1307690112 CONTEXT: SQL statement "SELECT now(), COALESCE(pgss.dbid, 0) AS dbid, s.event_type, s.event, s.queryid, sum(s.count) as count FROM pg_wait_sampling_profile s -- pg_wait_sampling doesn't offer a per (userid, dbid, queryid) view, -- only per pid, but pid can be reused for different databases or users -- so we cannot deduce db or user from it. However, queryid should be -- unique across differet databases, so we retrieve the dbid this way. -- Note that the same queryid can exists for multiple entries if -- multiple users execute the query, so it's critical to retrieve a -- single row from pg_stat_statements per (dbid, queryid) LEFT JOIN (SELECT DISTINCT s2.dbid, s2.queryid FROM pg_stat_statements(false) s2 ) pgss ON pgss.queryid = s.queryid WHERE s.event_type IS NOT NULL AND s.event IS NOT NULL AND COALESCE(pgss.dbid, 0) NOT IN (SELECT oid FROM powa_databases WHERE dropped IS NOT NULL) GROUP BY pgss.dbid, s.event_type, s.event, s.queryid" PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY
DB SERVER ALERT LOG ERRORS
tgadbp002 errors
024-08-29 05:10:01.646 GMT [3699615]: [4-1] STATEMENT: SELECT snapshot() 2024-08-29 05:10:27.756 GMT [3655854]: [9-1] ERROR: XX000: invalid memory alloc request size 1112214336 2024-08-29 05:10:27.756 GMT [3655854]: [10-1] CONTEXT: PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY 2024-08-29 05:10:27.756 GMT [3655854]: [11-1] LOCATION: palloc, mcxt.c:934 2024-08-29 05:10:27.756 GMT [3655854]: [12-1] STATEMENT: COPY (SELECT 16, * FROM public.powa_wait_sampling_src(0)) TO stdout 2024-08-29 05:10:27.756 GMT [5731]: [257-1] WARNING: 01000: pg_wait_sampling collector: receiver of message queue has been detached 2024-08-29 05:10:27.756 GMT [5731]: [258-1] LOCATION: send_profile, collector.c:275
scmdbp023 errors
2024-08-29 15:18:03.320 GMT [7866]: [181-1] ERROR: XX000: invalid memory alloc request size 1307690112 2024-08-29 15:18:03.320 GMT [7866]: [182-1] CONTEXT: SQL statement "SELECT now(), COALESCE(pgss.dbid, 0) AS dbid, s.event_type, s.event, s.queryid, sum(s.count) as count FROM pg_wait_sampling_profile s -- pg_wait_sampling doesn't offer a per (userid, dbid, queryid) view, -- only per pid, but pid can be reused for different databases or users -- so we cannot deduce db or user from it. However, queryid should be -- unique across differet databases, so we retrieve the dbid this way. -- Note that the same queryid can exists for multiple entries if -- multiple users execute the query, so it's critical to retrieve a -- single row from pg_stat_statements per (dbid, queryid) LEFT JOIN (SELECT DISTINCT s2.dbid, s2.queryid FROM pg_stat_statements(false) s2 ) pgss ON pgss.queryid = s.queryid WHERE s.event_type IS NOT NULL AND s.event IS NOT NULL AND COALESCE(pgss.dbid, 0) NOT IN (SELECT oid FROM powa_databases WHERE dropped IS NOT NULL) GROUP BY pgss.dbid, s.event_type, s.event, s.queryid" PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY 2024-08-29 15:18:03.320 GMT [7866]: [183-1] LOCATION: palloc, mcxt.c:1078 2024-08-29 15:18:03.320 GMT [7866]: [184-1] STATEMENT: COPY (SELECT 5, * FROM public.powa_wait_sampling_src(0)) TO stdout