powa-team / powa-collector

PoWA Collector daemon for remote snapshot
http://powa.readthedocs.io/
PostgreSQL License

Heavy load with 338 instances #20

Open frost242 opened 3 hours ago

frost242 commented 3 hours ago

Hello,

Just some quick feedback on using powa-collector 1.3.0 with PoWA 5.0. We have 338 PostgreSQL instances monitored by a single PoWA instance. While the collector runs, it generates a huge load on our server: loadavg goes up to 50. I notice a lot of sessions in a waiting state with wait_event_type LWLock and wait_event SubtransSLRU.

[postgres]# select wait_event_type, wait_event, count(*) from pg_stat_activity group by 1, 2;
 wait_event_type |      wait_event      | count
-----------------+----------------------+-------
 [null]          | [null]               |    77
 Timeout         | CheckpointWriteDelay |     1
 Activity        | WalWriterMain        |     1
 Lock            | extend               |     2
 Activity        | LogicalLauncherMain  |     1
 IO              | DataFileRead         |    14
 LWLock          | BufferMapping        |   238
 Activity        | AutovacuumMain       |     1
 IPC             | BufferIo             |     8
 Client          | ClientRead           |     1
 Activity        | BgwriterMain         |     1
 Extension       | Extension            |     1
(12 rows)

Just a few minutes after upgrading our PoWA db to PostgreSQL 17, we see huge activity on the subtransaction SLRU:

[postgres]# select * from pg_stat_slru ;
       name       | blks_zeroed | blks_hit | blks_read | blks_written | blks_exists | flushes | truncates |          stats_reset
------------------+-------------+----------+-----------+--------------+-------------+---------+-----------+-------------------------------
 commit_timestamp |           0 |        0 |         0 |            0 |           0 |      12 |         0 | 2024-11-15 14:48:34.904968+01
 multixact_member |           1 |        2 |         4 |            5 |           0 |      20 |         0 | 2024-11-15 14:48:34.904968+01
 multixact_offset |           1 |        7 |        10 |            8 |          11 |      20 |         0 | 2024-11-15 14:48:34.904968+01
 notify           |           0 |        0 |         0 |            0 |           0 |       0 |         0 | 2024-11-15 14:48:34.904968+01
 serializable     |           0 |        0 |         0 |            0 |           0 |       0 |         0 | 2024-11-15 14:48:34.904968+01
 subtransaction   |          10 | 51057277 |         0 |            8 |           0 |      12 |        12 | 2024-11-15 14:48:34.904968+01
 transaction      |           2 |    14242 |         9 |           11 |           0 |      12 |         0 | 2024-11-15 14:48:34.904968+01
 other            |           0 |        0 |         0 |            0 |           0 |       0 |         0 | 2024-11-15 14:48:34.904968+01
(8 rows)

I'm not sure that playing with subtransaction_buffers will help a lot here, as I only have hits.

Things go better after setting subtransaction_buffers to '1GB': SLRU blks_hit shows up less often. However, the load goes up even more, so I have to investigate this further - probably a mix of an infrastructure issue and something else. I'll let you know. I thought it was a good idea to share this with you.
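
For reference, a minimal sketch of that change on the repository server, assuming PostgreSQL 17 and superuser access (subtransaction_buffers is sized in 8kB blocks but accepts memory units, and it most likely needs a restart rather than a reload, so check pg_settings.context):

ALTER SYSTEM SET subtransaction_buffers = '1GB';
-- Check whether a reload is enough or a restart is required:
SELECT name, setting, unit, context FROM pg_settings WHERE name = 'subtransaction_buffers';
-- Once applied, reset the SLRU counters and watch whether subtransaction hits keep climbing:
SELECT pg_stat_reset_slru('subtransaction');
SELECT name, blks_zeroed, blks_hit, blks_read FROM pg_stat_slru WHERE name = 'subtransaction';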

rjuju commented 2 hours ago

Hi Thomas,

Thanks for the report!

I think this is coming from the new per-db snapshot feature. Can you try a custom patch? Maybe this would be enough to fix the problem:

diff --git a/powa_collector/snapshot.py b/powa_collector/snapshot.py
index d7eb5e2..c0a1b82 100644
--- a/powa_collector/snapshot.py
+++ b/powa_collector/snapshot.py
@@ -113,6 +113,7 @@ def copy_remote_data_to_repo(cls, data_name,
     buf = StringIO()
     try:
         data_src.copy_expert("COPY (%s) TO stdout" % data_src_sql, buf)
+        data_src.execute("RELEASE src")
     except psycopg2.Error as e:
         src_ok = False
         err = "Error retrieving datasource data %s:\n%s" % (data_name, e)
@@ -125,6 +126,7 @@ def copy_remote_data_to_repo(cls, data_name,
         try:
             cls.logger.debug("Calling %s..." % cleanup_sql)
             data_src.execute(cleanup_sql)
+            data_src.execute("RELEASE src")
         except psycopg2.Error as e:
             err = "Error while calling %s:\n%s" % (cleanup_sql, e)
             errors.append(err)
@@ -142,6 +144,7 @@ def copy_remote_data_to_repo(cls, data_name,
     try:
         # For data import the schema is now on the repository server
         data_ins.copy_expert("COPY %s FROM stdin" % target_tbl_name, buf)
+        data_ins.execute("RELEASE data")
     except psycopg2.Error as e:
         err = "Error while inserting data:\n%s" % e
         cls.logger.warning(err)
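
For context: the savepoints these RELEASE statements close are presumably opened earlier in snapshot.py (they are not visible in this diff). A SAVEPOINT opens a subtransaction, and if a long snapshot transaction never releases them, it accumulates one subtransaction per datasource. Once a transaction holds more than 64 open subtransactions its subxid cache overflows, and other backends then have to resolve its XIDs through the subtransaction SLRU, which matches the SubtransSLRU contention reported above. A standalone illustration in plain SQL (not powa-collector code; the COPY is just a stand-in for a datasource query):

BEGIN;
SAVEPOINT src;                       -- opens a subtransaction
COPY (SELECT 1 AS dummy) TO STDOUT;  -- stand-in for the real datasource COPY
RELEASE SAVEPOINT src;               -- the equivalent of what the patch adds
COMMIT;
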
frost242 commented 2 hours ago

I'm trying it now, I'll let you know ASAP.

frost242 commented 2 hours ago

Things are probably better, but now I'm running into memory errors. I'll let you know on Monday, I think.

rjuju commented 2 hours ago

OK, that's at least a first piece of good news, I hope!

frost242 commented 2 hours ago

Also, many sessions are now experiencing LWLock waits on BufferMapping, as well as relation extension locks. I have to check this issue first.

rjuju commented 2 hours ago

This one looks like just a side effect of having that many servers on a fresh install. Once you hit the coalesce and purge, autovacuum should kick in and you shouldn't hit the relation extension locks anymore. For the buffer mapping, there is probably not much to do apart from lowering the snapshot frequency or excluding some databases from the per-db snapshots. Do you usually have one db per instance, or a lot?
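
For anyone tuning this: a hypothetical sketch of lowering the per-server snapshot frequency on the repository side, assuming your powa-archivist version stores it in a frequency column (in seconds) of the powa_servers table; verify the schema of your install before running anything like this.

-- List the registered remote servers and their current snapshot frequency:
SELECT id, hostname, port, frequency FROM powa_servers ORDER BY id;
-- Example only (hypothetical server id): snapshot every 10 minutes for one remote server:
-- UPDATE powa_servers SET frequency = 600 WHERE id = 42;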

frost242 commented 2 hours ago

Thanks for the feedback. I usually have one db per instance. So I'll try to solve this memory issue first.