powa-team / powa-archivist

powa-archivist: the powa PostgreSQL extension
http://powa.readthedocs.io/
PostgreSQL License

powa spins at 100% CPU and eats all the memory #6

Closed: girgen closed this issue 6 years ago

girgen commented 7 years ago

Hi,

This is very similar to #4 actually, but there are now new versions of everything, so I'm opening a new issue.

postgresql-9.6.0, powa-archivist-3.1.0, pg_qualstats-1.0.1, pg_stat_kcache-2.0.3

The query SELECT powa_take_snapshot() runs for hours on end, eating all the memory. Can I stop it with pg_cancel_backend(pid)?

Seems to have happened under high load.

As you can see, the amount of data in powa_all_relations_history_current is quite substantial, about 20% of the database's disk footprint in this case.

First question: can I cancel it without restarting the database? Will powa.frequency then take effect immediately, so that I can stop powa from running in production?

We have had powa running fine for six months after turning off pg_qualstats (not loading it at all). Now, after upgrading to postgres-9.6, I wanted to reactivate it and expected pg_qualstats to work, as it did in our test environments. Apparently there is still a problem somewhere under high load?

Suggestions?

Palle

powa=# SELECT relname, pg_size_pretty(pg_table_size(oid)) FROM pg_class WHERE relkind = 'r' and relname like 'powa%';
                  relname                   | pg_size_pretty 
--------------------------------------------+----------------
 powa_databases                             | 8192 bytes
 powa_last_aggregation                      | 8192 bytes
 powa_last_purge                            | 8192 bytes
 powa_statements_history                    | 302 MB
 powa_statements_history_db                 | 448 kB
 powa_statements_history_current            | 108 MB
 powa_statements_history_current_db         | 80 kB
 powa_user_functions_history_current        | 704 kB
 powa_user_functions_history                | 2880 kB
 powa_all_relations_history_current         | 70 GB
 powa_all_relations_history                 | 251 MB
 powa_functions                             | 16 kB
 powa_kcache_metrics                        | 597 MB
 powa_kcache_metrics_db                     | 312 kB
 powa_kcache_metrics_current                | 58 MB
 powa_kcache_metrics_current_db             | 56 kB
 powa_qualstats_quals_history_current       | 728 kB
 powa_statements                            | 6047 MB
 powa_qualstats_quals_history               | 20 MB
 powa_qualstats_quals                       | 1872 kB
 powa_qualstats_constvalues_history         | 33 MB
 powa_qualstats_constvalues_history_current | 1136 kB
(22 rows)
SELECT * FROM pg_stat_statements WHERE query like '%powa%snapshot%' and queryid != 105189781;
 userid |    dbid    |  queryid   |                 query                 | calls |    total_time    | min_time | max_time  |    mean_time     |   stddev_time    | rows | shared_blks_hit | shared_blks_read | shared_blks_dirtied | shared_blks_written | local_blks_hit | local_blks_read | local_blks_dirtied | local_blks_written | temp_blks_read | temp_blks_written | blk_read_time | blk_write_time 
--------+------------+------------+---------------------------------------+-------+------------------+----------+-----------+------------------+------------------+------+-----------------+------------------+---------------------+---------------------+----------------+-----------------+--------------------+--------------------+----------------+-------------------+---------------+----------------
     10 | 3055152042 |  951689334 | SELECT powa_user_functions_snapshot() |  2358 |       141768.738 |    4.416 |   228.313 | 60.1224503816793 | 7.31830802358817 | 2358 |          311776 |             3467 |                3536 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |         0.331 |              0
     10 | 3055152042 | 1077229976 | SELECT powa_all_relations_snapshot()  |  2357 |       159776.632 |   29.307 |   209.913 | 67.7881340687315 | 25.0953617563583 | 2357 |        13865440 |           343403 |              343444 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |         1.589 |              0
     10 | 3055152042 | 1773443275 | SELECT powa_take_snapshot()           |  2357 |      1989114.956 |  145.235 | 41980.686 |  843.91809758167 | 3306.94629951111 | 2357 |       165743477 |          1313125 |             1555277 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |       1833.96 |              0
     10 | 3055152042 | 2950805420 | SELECT powa_qualstats_snapshot()      |  2357 |        36807.554 |     0.39 |   164.902 | 15.6162723801442 |  12.616370700035 | 2357 |         5479811 |             7373 |               15672 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |         0.935 |              0
     10 | 3055152042 | 2857038487 | SELECT powa_kcache_snapshot()         |  2357 |       173026.297 |    4.027 |   232.473 | 73.4095447602885 | 26.4572671045701 | 2357 |        22303108 |           289694 |              289768 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |         2.084 |              0
     10 | 3055152042 | 3588286645 | SELECT powa_statements_snapshot()     |  2358 | 689089.514000002 |   21.281 |  1544.558 | 292.234738761663 | 82.7828903193194 | 2358 |       113496013 |           542052 |              585903 |                   0 |              0 |               0 |                  0 |                  0 |              0 |                 0 |      1113.779 |              0
(6 rows)
rjuju commented 7 years ago

Ouch, 70 GB!

For your first question, issue #4 has been fixed in 3.1.0, so you can safely use pg_cancel_backend() or pg_terminate_backend(). If you want to disable powa, you can proceed like this:
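Something along these lines should work (a sketch; the pg_stat_activity filter is just one way to find the stuck backend, and setting powa.frequency = -1 in postgresql.conf followed by a reload works just as well, as the log below shows):

-- terminate the backend currently running the snapshot
SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
 WHERE query = 'SELECT powa_take_snapshot()';
-- then deactivate the background worker and reload the configuration
ALTER SYSTEM SET powa.frequency = -1;
SELECT pg_reload_conf();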

I just tried with 3.1.0, simulating a long snapshot (using pg_sleep), and it works as intended:

2016-10-07 12:07:07 CEST 10830 [2-1] ERROR:  canceling statement due to user request
2016-10-07 12:07:07 CEST 10830 [3-1] CONTEXT:  SQL statement "SELECT pg_sleep(123)"
    PL/pgSQL function powa_take_snapshot() line 23 at PERFORM
    SQL statement "SELECT powa_take_snapshot()"
2016-10-07 12:07:07 CEST 10256 [40-1] LOG:  worker process: powa (PID 10830) exited with exit code 1
2016-10-07 12:07:17 CEST 10838 [1-1] LOG:  POWA connected to database powa
2016-10-07 12:07:33 CEST 10256 [41-1] LOG:  received SIGHUP, reloading configuration files
2016-10-07 12:07:33 CEST 10256 [42-1] LOG:  parameter "powa.frequency" changed to "-1"
2016-10-07 12:07:45 CEST 10838 [2-1] ERROR:  canceling statement due to user request
2016-10-07 12:07:45 CEST 10838 [3-1] CONTEXT:  SQL statement "SELECT pg_sleep(123)"
    PL/pgSQL function powa_take_snapshot() line 23 at PERFORM
    SQL statement "SELECT powa_take_snapshot()"
2016-10-07 12:07:45 CEST 10256 [43-1] LOG:  worker process: powa (PID 10838) exited with exit code 1
2016-10-07 12:07:55 CEST 10957 [1-1] LOG:  POWA is deactivated (powa.frequency = -1), exiting
2016-10-07 12:07:55 CEST 10256 [44-1] LOG:  worker process: powa (PID 10957) exited with exit code 1

For the huge size issue, did you dump / restore data to switch to 9.6 or did you use pg_upgrade?

There was a bloat issue with previous versions of powa which has been fixed in 3.1.0. If you used pg_upgrade, you could try a VACUUM FULL on this table.
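For example, for the 70 GB table from the listing above, that would be something like:

VACUUM FULL powa_all_relations_history_current;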

Also, if you have a database containing a huge number of objects, you can disable gathering statistics on objects with this query, run on the powa database:

UPDATE powa_functions SET enabled = false WHERE module = 'powa_stat_all_relations'

and then truncate the powa_all_relations_* tables (see the sketch below). This is likely to solve your issue (I already encountered this huge memory usage problem while testing powa on a database containing something like 300,000 objects, and I just disabled the relation statistics, as it's unlikely that any interesting analysis can be done on such a number of objects).
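For reference, the truncation step for the two powa_all_relations_* tables from the size listing above would be something like:

TRUNCATE powa_all_relations_history, powa_all_relations_history_current;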

girgen commented 7 years ago

We used pg_upgrade to migrate to 9.6, from 9.4.

By objects, do you mean tables? There are roughly 1000+ tables in the database. The powa database was dropped before pg_upgrade, and recreated afterwards.

rjuju commented 7 years ago

"We used pg_upgrade to migrate to 9.6, from 9.4."

Ok.

"By objects, do you mean tables? There are roughly 1000+ tables in the database."

I mean anything that ends up with object statistics (tables, indexes, materialized views...). You can check how many object records will be stored for each database at every snapshot with this query:

SELECT datname, count(*) FROM (SELECT datname, powa_stat_all_rel(oid) FROM pg_database) s GROUP BY datname;

"The powa database was dropped before pg_upgrade, and recreated afterwards."

Ok, so previous bloat is not the issue. Can you share the results of the previous query (you can remove the database names if needed), so I can try to reproduce this problem with a similar number of objects?

girgen commented 7 years ago

Hi,

Sorry for the delay in responding. We had to turn off powa to prevent the servers from crashing, basically, so the urgent issue was resolved. It seems the more load, the likelier it was to crash, which seems natural. As you wrote above, pg_terminate_backend did work nicely, and I've set powa.frequency = -1 so it is not running now. It seems that, in the heat of the action, I dropped the powa database from the server with the biggest problems.

This is from another server, also showing these problems. It was stopped using pg_terminate_backend and powa.frequency = -1 before it did any real harm, although it was already spinning in an infinite loop when I stopped it.

    datname    | count 
---------------+-------
 squirrelmail  |    36
 postgres      |    40
 akerlyckan    |    41
 team1         |    46
 devserv       |    49
 va            |    56
 helptexts     |    57
 scrumbanboard |    73
 graphite2     |    74
 mango         |    75
 grafana       |    90
 template1     |   113
 powa          |   135
 rt            |   178
 teamcity      |   409
 testki        |   610
 testphs       |   643
 teststhlmr01  |   679
 teststhlm     |   701
 devsthlm      |   781
 aprikos       |  1206
 gurka         |  1447
 apa           |  1747
 foo           |  3068
 bar           |  3414
(25 rows)
girgen commented 7 years ago

BTW, is the fix in 1.0.2 of pg_qualstats perhaps related to this? @rjuju

rjuju commented 7 years ago

Unfortunately, I don't think so. The issue fixed in 1.0.2 of pg_qualstats is when you have queries with more WHERE and/or JOIN clauses than pg_qualstats.max. Do you have queries with ~1k WHERE clauses (or IN clauses with that many values, for instance)? Also, if a query were stuck in such a case, it shouldn't be interruptible with pg_terminate_backend.

girgen commented 7 years ago

Yes, we do have quite a substantial number of values in IN clauses: WHERE foobar IN ('a','b',...) with sometimes a thousand values...

girgen commented 7 years ago

And the query is interruptible this time. (The one six months ago in #4 was not interruptible, but it also didn't eat memory, just a single CPU at 100%, so we could live with that...) :)

girgen commented 7 years ago

Hi @rjuju! We naïvely hoped the fix for a large number of IN parameters in pg_qualstats would have helped, so after upgrading pg_qualstats we restarted powa on one database cluster. After a few days, powa_all_relations_history_current suddenly started growing quickly, and after a few hours it was 100 GB. This happened during the night, and the server crashed and rebooted before we reacted. Same problem as above, it seems. I guess it is not even related to pg_qualstats, is it?

I have continuous snapshot backups on a replicated slave; would it help to investigate that table or others as they grew? We could probably play it through using PITR. What should I look for in that case?

rjuju commented 7 years ago

Hello @girgen

Yes, there's apparently an issue with the pgstats retrieval, and it's not related to pg_qualstats. I could reproduce it under the following conditions:

but I still have to wait quite a long time before hitting the bug (while watching my memory consumption), and I haven't yet understood why this is happening.

Disabling the pgstats retrieval should avoid triggering this bug (and given the number of objects you have, disabling it is probably a good idea to avoid too much overhead):

UPDATE powa_functions SET enabled = false WHERE module = 'powa_stat_all_relations'
girgen commented 7 years ago

Hi @rjuju,

It is comforting that you can reproduce it :-). Yes, it did take some days for it to happen.

I've turned powa_stat_all_relations off. What part of the stats is it responsible for?

If there's anything I can do to help, please let me know.

Palle

rjuju commented 7 years ago

It's the PgStat_StatTabEntry statistics (https://github.com/postgres/postgres/blob/master/src/include/pgstat.h#L611), the source for pg_stat_all_indexes, pg_stat_all_tables, pg_statio_all_indexes and pg_statio_all_tables. I hope I'll be able to provide a fix for this soon.

rjuju commented 7 years ago

@girgen I pushed some fixes for this issue (https://github.com/dalibo/powa-archivist/commit/a974d943d49391edc98266893e104b1524cfecee but that looks unrelated, and https://github.com/dalibo/powa-archivist/commit/98d24d3624c7c5421f74472658370a4dd0be84b8).

After setting up an environment with the 2nd patch included, I couldn't reproduce this in 6 hours of snapshots every 5 seconds, while creating a hundred databases and around 400k tables. I'll continue to run this kind of activity to be sure, but I hope the last commit solved this issue. Do you have the possibility to test the latest commit in a non-critical environment?

girgen commented 7 years ago

Hi. We only really see problems on the servers with really high load, that is, in production. I can test it in a non-critical environment and set powa.frequency = '5s', but we really only saw problems in production. If you think the SQL part (98d24d3) is enough, then I can try that in production on Tuesday and monitor the size of powa_all_relations_history_current. The other commit (a974d94) is a bit trickier to test in prod since it needs a restart.
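(For the monitoring I would just repeat the size check from above, something like:)

SELECT pg_size_pretty(pg_table_size('powa_all_relations_history_current'));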

rjuju commented 7 years ago

Yes, https://github.com/dalibo/powa-archivist/commit/98d24d3624c7c5421f74472658370a4dd0be84b8 should be enough (I could reproduce the issue with only https://github.com/dalibo/powa-archivist/commit/a974d943d49391edc98266893e104b1524cfecee applied). I'll try to reproduce the issue all Monday, and will confirm with you if I can't retrigger the problem. I'll also do some pgbench activity, just in case.

rjuju commented 7 years ago

So I ran the same tests as yesterday all day, with some additional pgbench activity, and apart from filling up all my disk space, I didn't trigger any problem.

rjuju commented 6 years ago

Hello @girgen, it's been quite some time since I last worked on this issue. Can I close it, or have you seen this problem again since applying the fix?