grayhemp / pgtoolkit

Tools for PostgreSQL maintenance
Other
184 stars 16 forks source link

pgcompact locks database while pg_dump is running #8

Closed kustodian closed 9 years ago

kustodian commented 9 years ago

A few days ago I had an issue where pgcompact was running a little longer than usual so it continued to run when pg_dump started to dump the database. Because of this my application locked for more than 10 minutes. I'm not sure if pgcompact should be safe against pg_dump, but it would be great if it is. Here is part of the postgres log which caused the lock:

2015-03-09 01:32:44 CET [6226]: [1037-1] ERROR:  canceling statement due to statement timeout
2015-03-09 01:32:44 CET [6226]: [1038-1] STATEMENT:  ALTER TABLE public.stats DROP CONSTRAINT stats_pkey;
2015-03-09 01:32:45 CET [6226]: [1039-1] ERROR:  canceling statement due to statement timeout
2015-03-09 01:32:45 CET [6226]: [1040-1] STATEMENT:  ALTER TABLE public.stats DROP CONSTRAINT stats_pkey;
2015-03-09 01:32:46 CET [6226]: [1041-1] ERROR:  canceling statement due to statement timeout
2015-03-09 01:32:46 CET [6226]: [1042-1] STATEMENT:  ALTER TABLE public.stats DROP CONSTRAINT stats_pkey;
2015-03-09 01:32:47 CET [6226]: [1043-1] ERROR:  canceling statement due to statement timeout
2015-03-09 01:32:47 CET [6226]: [1044-1] STATEMENT:  ALTER TABLE public.stats DROP CONSTRAINT stats_pkey;
2015-03-09 01:32:48 CET [6226]: [1045-1] LOG:  process 6226 still waiting for AccessExclusiveLock on relation 350629 of database 349582 after 1000.808 ms
2015-03-09 01:32:48 CET [6226]: [1046-1] STATEMENT:  DROP INDEX public.pgcompact_index_6192;
2015-03-09 01:46:45 CET [6226]: [1047-1] LOG:  process 6226 acquired AccessExclusiveLock on relation 350629 of database 349582 after 837082.864 ms
2015-03-09 01:46:45 CET [6226]: [1048-1] STATEMENT:  DROP INDEX public.pgcompact_index_6192;

This is on Postgres 9.1 and I'm sorry for not having more information. Also I haven't checked into detail, but this issue could be related to issue #5.

XooR commented 9 years ago

Just to update this issue with more clarifications (issue is written by my collegue). pg_dump took AccessShareLock on "table_a", and pg_compact later tried to lock it with AccessExclusive lock to remove contstraint on primary key. It got canceled three times because of timeout, and after that it blocked for 10 minutes waiting for pg_dump to release its lock and it got stucked. It didn't retry for 10 times and gave up, it waited for 10 minutes, and all other queries that wanted to update this table got stucked waiting for pgcompact to release request for AccessExclusiveLock.

grayhemp commented 9 years ago

@kustodian @XooR thank you for your feedback. It looks like a serious issue, all 3 of them actually, and I'm definitely going to review them in the nearest future. Unfortunately, I'm pretty restricted in terms of time now so I can not promise results earlier than the next week, but I'm going to do my best here. Thanks again.

grayhemp commented 9 years ago

Thank you for reporting that. It is definitely a bug.

The code has a block https://github.com/grayhemp/pgtoolkit/blob/master/lib/PgToolkit/Compactor/Table.pm#L707-L748 that implements a retry loop with 500ms statement timeout for heavy locking statements to soften the locks impact. And this blocking DROP INDEX should actually be inside of it, but it is not https://github.com/grayhemp/pgtoolkit/blob/master/lib/PgToolkit/Compactor/Table.pm#L775.

So I'm putting this issue into my TODO list for the next release.

grayhemp commented 9 years ago

As I can see there are no straight solution for this problem, but it can be worked around as I made in the ac34520 commit.