powa-team / powa-web

PoWA user interface
http://powa.readthedocs.io/
74 stars 31 forks source link

powa_wait_sampling errors in powa-web UI #211

Closed hrawulwa closed 2 months ago

hrawulwa commented 2 months ago

I have a remote setup with powa-collector 1.2.0 and powa-web 4.2.1 running. Out of the many DB servers registered, I see wait sampling errors for couple of DB servers. The pg_wait_sampling extension I'm using in the DB servers is 1.1.5

POWA-WEB ERRORS

TGA-OCC2: Error while calling public.powa_wait_sampling_src: invalid memory alloc request size 1126241016 CONTEXT: PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY

SCM OCE: Error while calling public.powa_wait_sampling_src: invalid memory alloc request size 1307690112 CONTEXT: SQL statement "SELECT now(), COALESCE(pgss.dbid, 0) AS dbid, s.event_type, s.event, s.queryid, sum(s.count) as count FROM pg_wait_sampling_profile s -- pg_wait_sampling doesn't offer a per (userid, dbid, queryid) view, -- only per pid, but pid can be reused for different databases or users -- so we cannot deduce db or user from it. However, queryid should be -- unique across differet databases, so we retrieve the dbid this way. -- Note that the same queryid can exists for multiple entries if -- multiple users execute the query, so it's critical to retrieve a -- single row from pg_stat_statements per (dbid, queryid) LEFT JOIN (SELECT DISTINCT s2.dbid, s2.queryid FROM pg_stat_statements(false) s2 ) pgss ON pgss.queryid = s.queryid WHERE s.event_type IS NOT NULL AND s.event IS NOT NULL AND COALESCE(pgss.dbid, 0) NOT IN (SELECT oid FROM powa_databases WHERE dropped IS NOT NULL) GROUP BY pgss.dbid, s.event_type, s.event, s.queryid" PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY

DB SERVER ALERT LOG ERRORS

tgadbp002 errors

024-08-29 05:10:01.646 GMT [3699615]: [4-1] STATEMENT: SELECT snapshot() 2024-08-29 05:10:27.756 GMT [3655854]: [9-1] ERROR: XX000: invalid memory alloc request size 1112214336 2024-08-29 05:10:27.756 GMT [3655854]: [10-1] CONTEXT: PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY 2024-08-29 05:10:27.756 GMT [3655854]: [11-1] LOCATION: palloc, mcxt.c:934 2024-08-29 05:10:27.756 GMT [3655854]: [12-1] STATEMENT: COPY (SELECT 16, * FROM public.powa_wait_sampling_src(0)) TO stdout 2024-08-29 05:10:27.756 GMT [5731]: [257-1] WARNING: 01000: pg_wait_sampling collector: receiver of message queue has been detached 2024-08-29 05:10:27.756 GMT [5731]: [258-1] LOCATION: send_profile, collector.c:275

scmdbp023 errors

2024-08-29 15:18:03.320 GMT [7866]: [181-1] ERROR: XX000: invalid memory alloc request size 1307690112 2024-08-29 15:18:03.320 GMT [7866]: [182-1] CONTEXT: SQL statement "SELECT now(), COALESCE(pgss.dbid, 0) AS dbid, s.event_type, s.event, s.queryid, sum(s.count) as count FROM pg_wait_sampling_profile s -- pg_wait_sampling doesn't offer a per (userid, dbid, queryid) view, -- only per pid, but pid can be reused for different databases or users -- so we cannot deduce db or user from it. However, queryid should be -- unique across differet databases, so we retrieve the dbid this way. -- Note that the same queryid can exists for multiple entries if -- multiple users execute the query, so it's critical to retrieve a -- single row from pg_stat_statements per (dbid, queryid) LEFT JOIN (SELECT DISTINCT s2.dbid, s2.queryid FROM pg_stat_statements(false) s2 ) pgss ON pgss.queryid = s.queryid WHERE s.event_type IS NOT NULL AND s.event IS NOT NULL AND COALESCE(pgss.dbid, 0) NOT IN (SELECT oid FROM powa_databases WHERE dropped IS NOT NULL) GROUP BY pgss.dbid, s.event_type, s.event, s.queryid" PL/pgSQL function powa_wait_sampling_src(integer) line 4 at RETURN QUERY 2024-08-29 15:18:03.320 GMT [7866]: [183-1] LOCATION: palloc, mcxt.c:1078 2024-08-29 15:18:03.320 GMT [7866]: [184-1] STATEMENT: COPY (SELECT 5, * FROM public.powa_wait_sampling_src(0)) TO stdout

rjuju commented 2 months ago

Hi,

It seems to be because there are just too many things to process, which would eventually need more than 1GB of memory to handle (that's a hard limit in multiple area in postgres).

What is the pg_wait_sampling configuration on that remote server? The most likely explanation is that pg_wait_sampling.profile_pid is on, which leads to a lot of entries. This is not required for powa, and I would generally discourage it exactly because of this kind of issue. This is even worse nowadays as the pid is not limited to 16bits address space anymore, which basically mean that enabling this option will give you an almost infinite amount of entries.

hrawulwa commented 2 months ago

Yes, pg_wait_sampling.profile_pid is on. Looks like that is the default value, and set to "on" on all the DB servers. But I'm wondering why only these two servers are having the issue. Does pg_wait_sampling consume memory from the shared_buffers configuration, and increasing shared_buffers will resolve the issue? If I set pg_wait_sampling.profile_pid to off, do I need to restart the database? Also, please let me know if any of the performance metrics will not be captured by setting this to off.

Thanks

rjuju commented 2 months ago

pg_wait_sampling use dynamic shared memory, so it can grow way more easily compared to other extensions. When profile_pid is on, it will keep one entry per (pid, query_id, want_event) rather than (query_id, want_event). So if those 2 servers have errors it means that either:

or just a bit of everything.

If I set pg_wait_sampling.profile_pid to off, do I need to restart the database?

no, you can change it in the config file / ALTER SYSTEM and reload. But you will probably have to reset pg_wait_sampling to get rid of the old entries.

Also, please let me know if any of the performance metrics will not be captured by setting this to off.

Not as far as powa is concerned, as it doesn't use that value. In general those metrics are not really useful, as pid can get recycled, and once a connection has been close you can't really do much with its pid anyway.

hrawulwa commented 2 months ago

OK. I will test today on these servers and see if the error goes away. Can you please share how to reset pg_wait_sampling to get rid of the old entries, and whether this step needs to be done on the DB server or repository server. Also, please advise if I need to restart powa-collector and powa-web after this change.

Thanks

rjuju commented 2 months ago

Can you please share how to reset pg_wait_sampling to get rid of the old entries

SELECT pg_wait_sampling_reset_profile(); (per https://github.com/postgrespro/pg_wait_sampling)

and whether this step needs to be done on the DB server or repository server

on the repository server

Also, please advise if I need to restart powa-collector and powa-web after this change

no need, they will pick up automatically