Two failure cases exist in the garbage collection logic.
In one case, a partition can get into a state Postgres describes as "pending" detachment. The process of detaching a partition is a two transaction process. When the first transaction succeeds, the partition is now "pending". The second transaction needs a more restrictive lock on the partition and can sometimes fail. Users will see errors with a hint to finalize the process, and no subsequent garbage collections will be able to make progress because Postgres only allows a single partition to be pending at a given time.
ERROR: partition "reports_20240915z" already pending detach in partitioned table "public.reports_historical"
HINT: Use ALTER TABLE ... DETACH PARTITION ... FINALIZE to complete the pending detach operation.
In the second case, the partition was fully detached from the parent, but was not successfully dropped from the database. In this case, subsequent garbage collections will continue to work, but PuppetDB will never remove that "stranded" partition.
Steps to Reproduce
I have been unable to reproduce the two bugs using just PuppetDB (due to some extra safeguards there), but manually I can construct operations that reproduce both situations.
The easiest way to reproduce the first case is to create a transaction, issue a select on the partition you will be dropping, but leave the transaction open. Then get another connection, shorten the statement_timeout and issue a detach command. Manually, this looks something like
BEGIN TRANSACTION;
select * from reports_20240914z ;
SET statement_timeout=1000;
ALTER TABLE reports_historical DETACH PARTITION reports_20240914z CONCURRENTLY;
The second case can be reproduced simply by running ALTER TABLE reports_historical DETACH PARTITION reports_20240914z CONCURRENTLY; on a partition manually.
Describe the Bug
Two failure cases exist in the garbage collection logic.
In one case, a partition can get into a state Postgres describes as "pending" detachment. The process of detaching a partition is a two transaction process. When the first transaction succeeds, the partition is now "pending". The second transaction needs a more restrictive lock on the partition and can sometimes fail. Users will see errors with a hint to finalize the process, and no subsequent garbage collections will be able to make progress because Postgres only allows a single partition to be pending at a given time.
In the second case, the partition was fully detached from the parent, but was not successfully dropped from the database. In this case, subsequent garbage collections will continue to work, but PuppetDB will never remove that "stranded" partition.
Steps to Reproduce
I have been unable to reproduce the two bugs using just PuppetDB (due to some extra safeguards there), but manually I can construct operations that reproduce both situations.
The easiest way to reproduce the first case is to create a transaction, issue a select on the partition you will be dropping, but leave the transaction open. Then get another connection, shorten the statement_timeout and issue a detach command. Manually, this looks something like
The second case can be reproduced simply by running
ALTER TABLE reports_historical DETACH PARTITION reports_20240914z CONCURRENTLY;
on a partition manually.