pgpartman / pg_partman

Partition management extension for PostgreSQL

Question: Partitioning existing table using partition_data_proc is taking a long time, and getting slower - is this expected? #476

Closed scooper91 closed 1 year ago

scooper91 commented 2 years ago

Hi,

We're using pg-partman to partition an existing table in an AWS Aurora PostgreSQL DB (on an unused clone of the database). It's a table with approximately 23 billion rows, and we're trying to partition it into batches of around 180 million rows, using the first column in the composite primary key as the partition key (this is not an autoincrementing id, and is not unique).

We followed the offline partitioning guide, using the following to set up the partitioning and move the data:

SELECT partman.create_parent(
    p_parent_table => 'app.partitioned_table',
    p_control => 'some-non-unique-id',
    p_type => 'native',
    p_interval => '1');

CALL partman.partition_data_proc(
    p_parent_table => 'app.partitioned_table',
    p_source_table => 'app.original_table',
    p_interval => '1',
    p_batch => '2147483647');

The first batch of 180 million rows took around 2 hours to complete, but since then, it's been getting slower on average. We're now up to batch 20, which took over 8 hours to complete (for a very similar number of rows)!

We don't appear to have any bottlenecks in the DB - CPU, memory, etc. all look fine. The DB is a temporary clone of the production DB, so nothing else is talking to it. Looking at the metrics, it seems like it reads in small batches - the read IOPS are high for 10 minutes, then drop back down for another 10 minutes. None of the other metrics seem to spike while the read IOPS are low.

The table itself is pretty straightforward - there are 9 columns - 3 of them are timestamps, and the rest are int4 values.

We expected the batches to speed up as the source table emptied, but the table appears to be the same size (in terms of GB) as when it started (it's too large to do a count in any sane amount of time!).

Is this what we should generally expect to happen when partitioning the data? We have another ~100 batches to get through, which is likely to take weeks at the current rate! Is there anything we should be doing to speed it up (either before we started partitioning, or during)? As this is just a test, we can start the partitioning process again if there is anything we should have done beforehand.

Thanks in advance!

keithf4 commented 2 years ago

This does sound a bit like IO saturation or possibly contention in shared_buffers. I know you stated that doesn't appear to be the case, but it might still be good to investigate a bit more. If you just run one batch at a time, do the next batches also slow down, or are you only seeing the slowdown during batches that are run with a single call? I ask because one thing you could do to help diagnose this would be to set track_io_timing to on and then try running the partitioning call with EXPLAIN ANALYZE. You'd only get the plan back after the call finished, but if it's slowing down with subsequent calls, you could then see how the IO timing looks. Or maybe even see if the plan is somehow changing.
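As a minimal sketch of that diagnostic (it assumes you have privileges to set track_io_timing in your session; the column name and value range below are placeholders standing in for whatever a single batch actually scans):

-- Enable per-session IO timing so EXPLAIN (ANALYZE, BUFFERS) reports read/write time
SET track_io_timing = on;

-- Illustrative only: look at the plan and IO timing for the kind of range scan
-- a single batch performs ("some-non-unique-id" and the 20/21 range are placeholders)
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM app.original_table
WHERE "some-non-unique-id" >= 20 AND "some-non-unique-id" < 21;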

All the data being moved around has to pass through shared_buffers, so possibly try increasing that to give it more room to work with? You can use the pg_buffercache extension to try and investigate what the state of shared_buffers looks like and what objects/pages are taking up space there.
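For example, this is the standard pg_buffercache query from the PostgreSQL docs to see which relations currently dominate shared_buffers (assuming the extension is available on your Aurora instance):

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Top 10 relations by number of buffers held in shared_buffers
SELECT n.nspname, c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
  AND b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = current_database()))
JOIN pg_namespace n ON n.oid = c.relnamespace
GROUP BY n.nspname, c.relname
ORDER BY buffers DESC
LIMIT 10;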

Have you also tried reducing the batch sizes? Since you're doing this offline, there's no requirement that the batch size has to match the partitioning interval. Reducing the batch size should reduce the need to pull quite as much data at one time for a single read/write operation.
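As a hedged sketch of what that would look like for an id-partitioned table whose partition interval is, say, 100000 (the parameter names follow your call above, but exact names and behaviour depend on the pg_partman version; with a partition interval of 1 like yours there is admittedly little room to go smaller):

CALL partman.partition_data_proc(
    p_parent_table => 'app.partitioned_table',
    p_source_table => 'app.original_table',
    p_interval => '10000',    -- move 10000 id values per batch instead of a full partition's worth
    p_batch => '10');         -- run 10 batches in this call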

The source table is not going to shrink in actual disk size during these calls, especially if it's all done with a single call. The rows are only marked as deleted, and even if a regular VACUUM runs, that only marks the space as re-usable for future writes. The only time a normal VACUUM returns disk space to the OS is when the trailing pages at the very end of the table are completely empty, and that truncation requires a slightly heavier lock that is likely not able to be granted in this situation.
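You can watch this happening in the cumulative statistics: the on-disk size stays flat while the dead-tuple count climbs. A minimal sketch, assuming the table's stats are recent enough to be meaningful:

-- Disk size vs. live/dead tuple counts for the source table
SELECT schemaname, relname,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'app' AND relname = 'original_table';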

Speaking of VACUUM, you may want to disable autovacuum (at least on the source table) while this is running. If it kicks in at the same time, it could cause some contention. However, if vacuum/analyze doesn't run, that means the statistics, visibility map and free space map are not being updated. One possible solution is to manually VACUUM/ANALYZE between batches. If you run a VACUUM FREEZE on the entire old table beforehand, that should help normal VACUUMs run a lot faster, since it ensures all pages with frozen rows are marked in the visibility map and VACUUM can skip over them. Subsequent VACUUMs do not need the FREEZE clause. If you reduce the batch size, you don't have to vacuum after every batch. It may take some experimentation to figure out what frequency works best.
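A minimal sketch of that routine, using the table names from the example above (how often to run the in-between vacuum is the part that needs experimentation):

-- Once, before starting: freeze the old table so later vacuums can skip frozen pages
VACUUM (FREEZE, VERBOSE) app.original_table;

-- Keep autovacuum from competing with the data movement on this table
ALTER TABLE app.original_table SET (autovacuum_enabled = false);

-- Every N batches: update statistics, visibility map and free space map manually
VACUUM (ANALYZE) app.original_table;

-- When the migration is finished, restore the default autovacuum behaviour
ALTER TABLE app.original_table RESET (autovacuum_enabled);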

If you actually need the disk space back while this is running, you would need to run a VACUUM FULL on the original table as batches are completed. That requires a complete table rewrite, so it isn't realistic to do after each batch either, but it could make sense after a certain number of batches depending on how many rows are done. If there are 100 batches, maybe every 20? That would likely show a marked improvement in read time as the partitioning takes place. This would require downtime for the duration of the VACUUM FULL calls, though.
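For reference, the rewrite itself is a single statement; it takes an ACCESS EXCLUSIVE lock on the table for the whole rewrite, which is where the downtime comes from:

-- Rewrites app.original_table into a new, compacted file and returns the freed space to the OS
VACUUM (FULL, VERBOSE) app.original_table;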

Another question would then be: are you looking to do this with the online partitioning method for production? If so, that will make it a bit more difficult, because you cannot do batch sizes smaller than the partition interval at this time. I do plan on adding a feature for that in the future, but it will basically still be a partially offline method, because the data being moved will be unavailable to the users of the table.

Just a few thoughts so far. If I can think of anything else I'll let you know. But if you're able to try any of these recommendations, hopefully they help.

keithf4 commented 2 years ago

Did you have any further questions on this? Otherwise I will be closing this issue during my next review.

scooper91 commented 2 years ago

Hi!

Sorry for the delayed reply, we've been knee-deep trying to get our large table to perform better! Thanks for the detailed reply - it was really useful.

We tried several things following your reply to try and speed up the partitioning, including disabling vacuuming, reducing the batch size, and doing the partitioning on a much larger DB Instance (the largest AWS provide). Unfortunately, we couldn't get the partitioning to speed up. We were in touch with AWS support throughout, and they could not see any bottlenecks, or reasons why the partitioning should be slow. We saw the same behaviour as before, where the batches slowed down over time.

In the end, we found that doing a pg_dump from the original table and a pg_restore into the partitioned table was much quicker (a couple of days vs potentially several weeks!), as long as the partitions were created up-front. Without creating them up-front, all the rows went into the default partition. We also found that doing an INSERT INTO partitioned_table SELECT * FROM original_table was quick (I think it was even quicker than the dump-and-restore method).
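For anyone following along, a minimal sketch of the single-statement variant; it assumes the child partitions already exist (otherwise everything lands in the default partition), and the default partition name below is an assumption based on pg_partman's usual "_default" suffix:

-- One-shot copy: rows are routed to the matching child partitions as they are inserted
INSERT INTO app.partitioned_table
SELECT * FROM app.original_table;

-- Sanity check afterwards: nothing should have fallen through to the default partition
-- (the "partitioned_table_default" name is an assumption, not confirmed in this thread)
SELECT count(*) FROM app.partitioned_table_default;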

Thanks again!

Sophie

keithf4 commented 2 years ago

Hmm, it is odd that the INSERT INTO ... SELECT ... didn't see the same slow-down as the batch mode. The entire reason for the batch mode was to help speed that sort of operation up. I wouldn't expect it to speed up as you initially thought, since the original table isn't actually shrinking, but I'm not sure why it was getting slower.

Glad you were able to figure out another solution!