Closed — girgen closed this issue 6 years ago
Ouch, 70 GB!
For your first question, issue #4 has been fixed in 3.1.0, so you can safely use pg_cancel_backend() or pg_terminate_backend(). If you want to disable powa, you can proceed like this:
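Concretely, the log below shows the effect of the following two steps (a minimal sketch, run as superuser; the "deactivated" message is taken from the log):

```sql
-- Disable the powa background worker without a restart:
ALTER SYSTEM SET powa.frequency = -1;
SELECT pg_reload_conf();
-- The worker then logs:
-- "POWA is deactivated (powa.frequency = -1), exiting"
```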
I just tried with 3.1.0 and simulating a long snapshot (using pg_sleep), and it works as intended:
2016-10-07 12:07:07 CEST 10830 [2-1] ERROR: canceling statement due to user request
2016-10-07 12:07:07 CEST 10830 [3-1] CONTEXT: SQL statement "SELECT pg_sleep(123)"
PL/pgSQL function powa_take_snapshot() line 23 at PERFORM
SQL statement "SELECT powa_take_snapshot()"
2016-10-07 12:07:07 CEST 10256 [40-1] LOG: worker process: powa (PID 10830) exited with exit code 1
2016-10-07 12:07:17 CEST 10838 [1-1] LOG: POWA connected to database powa
2016-10-07 12:07:33 CEST 10256 [41-1] LOG: received SIGHUP, reloading configuration files
2016-10-07 12:07:33 CEST 10256 [42-1] LOG: parameter "powa.frequency" changed to "-1"
2016-10-07 12:07:45 CEST 10838 [2-1] ERROR: canceling statement due to user request
2016-10-07 12:07:45 CEST 10838 [3-1] CONTEXT: SQL statement "SELECT pg_sleep(123)"
PL/pgSQL function powa_take_snapshot() line 23 at PERFORM
SQL statement "SELECT powa_take_snapshot()"
2016-10-07 12:07:45 CEST 10256 [43-1] LOG: worker process: powa (PID 10838) exited with exit code 1
2016-10-07 12:07:55 CEST 10957 [1-1] LOG: POWA is deactivated (powa.frequency = -1), exiting
2016-10-07 12:07:55 CEST 10256 [44-1] LOG: worker process: powa (PID 10957) exited with exit code 1
For the huge size issue, did you dump/restore your data to switch to 9.6, or did you use pg_upgrade?
There was a bloat issue with previous versions of powa which was fixed in 3.1.0. If you used pg_upgrade, you could try a VACUUM FULL on that table.
Also, if you have a database containing a huge number of objects, you can disable gathering statistics on objects with this query, on the powa database:
UPDATE powa_functions SET enabled = false WHERE module = 'powa_stat_all_relations'
and then truncate the powa_all_relations_* tables. This is likely to solve your issue (I already encountered this huge memory usage problem while testing powa on a database containing something like 300,000 objects, and I just disabled the relation statistics, as it's unlikely that any interesting analysis can be done on such a number of objects).
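Putting the two steps together (a sketch; the second table name is my guess based on the powa_all_relations_* pattern — only powa_all_relations_history_current is named explicitly in this thread):

```sql
-- Stop collecting per-relation statistics...
UPDATE powa_functions SET enabled = false
 WHERE module = 'powa_stat_all_relations';

-- ...then reclaim the space already used:
TRUNCATE powa_all_relations_history_current;
TRUNCATE powa_all_relations_history;  -- assumed name, per the powa_all_relations_* pattern
```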
We used pg_upgrade to migrate to 9.6, from 9.4.
By objects, do you mean tables? There are roughly 1000+ tables in the database. The powa database was dropped before pg_upgrade, and recreated afterwards.
> We used pg_upgrade to migrate to 9.6, from 9.4.

Ok.
> By objects, do you mean tables? There are roughly 1000+ tables in the database.

I mean anything that ends up with object statistics (tables, indexes, materialized views...). You can check how many object records will be stored for each database at every snapshot with this query:
SELECT datname, count(*)
FROM (SELECT datname, powa_stat_all_rel(oid) FROM pg_database) s
GROUP BY datname;
> The powa database was dropped before pg_upgrade, and recreated afterwards.

Ok, so previous bloat is not the issue. Can you share the results of the previous query (you can remove the database names if needed)? I'll try to reproduce the problem with a similar number of objects.
Hi,
Sorry for the delay in responding. We had to turn powa off to prevent the servers from crashing, basically, so the urgent issue was resolved. It seems the more load, the likelier it was to crash, which seems natural. As you wrote above, pg_terminate_backend did work nicely, and I've set powa.frequency = -1, so it is not running now. It seems that, in the heat of the action, I dropped the powa database from the server with the biggest problems.
This is from another server that also shows these problems. It was stopped using pg_terminate_backend and powa.frequency = -1 before it did any real harm, although it was already spinning in an infinite loop when I stopped it.
datname | count
---------------+-------
squirrelmail | 36
postgres | 40
akerlyckan | 41
team1 | 46
devserv | 49
va | 56
helptexts | 57
scrumbanboard | 73
graphite2 | 74
mango | 75
grafana | 90
template1 | 113
powa | 135
rt | 178
teamcity | 409
testki | 610
testphs | 643
teststhlmr01 | 679
teststhlm | 701
devsthlm | 781
aprikos | 1206
gurka | 1447
apa | 1747
foo | 3068
bar | 3414
(25 rows)
BTW is the fix in 1.0.2 of pg_qualstats perhaps related to this? @rjuju
Unfortunately, I don't think so. The issue fixed in pg_qualstats 1.0.2 occurs when you have queries with more WHERE and/or JOIN clauses than pg_qualstats.max. Do you have queries with ~1k WHERE clauses (or IN clauses with that many values, for instance)? Also, if a query were stuck in that case, it shouldn't be interruptible with pg_terminate_backend.
Yes, we do have quite a substantial number of values in IN clauses: WHERE foobar IN ('a','b',...)
with sometimes a thousand values...
And the query is interruptible this time. (The one six months ago in #4 was not interruptible, but it also didn't claim memory, just a single CPU at 100%, so we could live with that...) :)
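For reference (a sketch; foobar and its column are illustrative names, not from the thread), one way to check the configured limit and generate a query of that shape for testing:

```sql
-- Check the per-query limit on tracked quals (requires pg_qualstats loaded):
SHOW pg_qualstats.max;

-- Build a test query with ~1000 values in an IN clause:
SELECT format('SELECT * FROM foobar WHERE col IN (%s)',
              string_agg(quote_literal(i::text), ','))
  FROM generate_series(1, 1000) AS s(i);
```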
Hi @rjuju! We naïvely hoped the fix for large numbers of IN parameters in pg_qualstats would have helped, so after upgrading pg_qualstats we restarted powa on one database cluster. After a few days, powa_all_relations_history_current
suddenly started growing quickly, and after a few hours it was 100 GB. This happened during the night, and the server crashed and rebooted before we reacted. Same problem as above, it seems. I guess it is not even related to pg_qualstats, is it?
I have continuous snapshot backups on a replicated slave; would it help to investigate that table, or others, as they grew? We could probably replay it using PITR. What should I look for in that case?
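One way to watch the table while replaying (a sketch, not from the thread): track its size and row count at each recovery point, e.g. repeatedly from psql with \watch:

```sql
-- Size and row count of the suspect table; run repeatedly (e.g. \watch 5)
SELECT pg_size_pretty(
         pg_total_relation_size('powa_all_relations_history_current')
       ) AS total_size,
       (SELECT count(*) FROM powa_all_relations_history_current) AS row_count;
```

A sudden jump in row_count between two snapshots would narrow down which snapshot triggered the runaway growth.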
Hello @girgen
Yes, there's apparently an issue with the pgstats retrieval, and it's not related to pg_qualstats. I could reproduce it with the following conditions:
but I still have to wait quite a long time before hitting the bug (while watching my memory consumption). I couldn't yet understand why this is happening.
Disabling the pgstats retrieval should avoid triggering this bug (and given the number of objects you have, disabling it is probably a good idea anyway to avoid too much overhead):
UPDATE powa_functions SET enabled = false WHERE module = 'powa_stat_all_relations'
Hi @rjuju,
It is comforting that you can reproduce it :-). Yes, it did take some days to happen.
I've turned powa_stat_all_relations off. Which part of the statistics is it responsible for?
If there's anything I can do to help, please let me know.
Palle
It's the PgStat_StatTabEntry statistics (https://github.com/postgres/postgres/blob/master/src/include/pgstat.h#L611), the source for pg_stat_all_indexes, pg_stat_all_tables, pg_statio_all_indexes and pg_statio_all_tables. I hope I'll be able to provide a fix for this soon.
@girgen I pushed some fixes for this issue (https://github.com/dalibo/powa-archivist/commit/a974d943d49391edc98266893e104b1524cfecee but that looks unrelated, and https://github.com/dalibo/powa-archivist/commit/98d24d3624c7c5421f74472658370a4dd0be84b8).
After setting up an environment with the 2nd patch included, I couldn't reproduce this in 6 hours of snapshots every 5 seconds, while creating a hundred databases and around 400k tables. I'll continue to run this kind of activity to be sure, but I hope the last commit solved this issue. Do you have the possibility to test the latest commit on a non-critical environment?
Hi. We only really see problems on the servers with really high load, that is, in production. I can test it in a non-critical environment with powa.frequency = '5s', but we really only saw problems in production. If you think the SQL part is enough (98d24d3), then I can try that in production on Tuesday and monitor the size of powa_all_relations_history_current. The other commit (a974d94) is a bit more awkward to test in prod since it needs a restart.
Yes, https://github.com/dalibo/powa-archivist/commit/98d24d3624c7c5421f74472658370a4dd0be84b8 should be enough (I could reproduce the issue with only https://github.com/dalibo/powa-archivist/commit/a974d943d49391edc98266893e104b1524cfecee applied). I'll try to reproduce the issue all Monday, and confirm with you if I couldn't retrigger the problem. I'll also run some pgbench activity, just in case.
So I ran the same tests as yesterday all day, with some additional pgbench activity, and apart from filling up all my disk space, I didn't trigger any problem.
Hello @girgen, it's been quite some time since I worked on this issue. Can I close it, or have you hit this problem again since applying the fix?
Hi,
This is actually very similar to #4, but there are new versions of everything, so I'm opening a new issue.
postgresql-9.6.0, powa-archivist-3.1.0, pg_qualstats-1.0.1, pg_stat_kcache-2.0.3
The query SELECT powa_take_snapshot() runs for hours on end, eating all the memory. Can I stop it with pg_cancel_backend(pid)? It seems to have happened under high load.
As you can see, the amount of data in powa_all_relations_history_current is quite substantial, about 20% of the database disk footprint in this case. First question: can I cancel it without restarting the database? Will powa.frequency then take effect immediately, so I can stop powa from running in production?
We have had powa running fine for six months after turning off pg_qualstats (not loading it at all). Now, after upgrading to postgres-9.6, I wanted to reactivate it and expected pg_qualstats to work, as it did in our test environments. Apparently, there is still a problem somewhere with high loads?
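For reference (a sketch, not from the original report), the backend running the snapshot can be located, and then cancelled, like this:

```sql
-- Find the backend executing the snapshot (excluding this query itself):
SELECT pid, now() - query_start AS runtime, query
  FROM pg_stat_activity
 WHERE query LIKE '%powa_take_snapshot%'
   AND pid <> pg_backend_pid();

-- Then, with the pid found above:
-- SELECT pg_cancel_backend(<pid>);
```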
Suggestions?
Palle