Design: REVERT for fast ADD|DROP PARTITION for range partitioned tables

This issue is to explain how I wish to implement fast revert for ADD|DROP PARTITION operations. Some initial context: https://github.com/vitessio/vitess/pull/10315

The problem

Consider the following range partitioned table:

CREATE TABLE tp (
    id INT NOT NULL,
    ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    primary key (id)
)
PARTITION BY RANGE (id) (
    PARTITION p1 VALUES LESS THAN (10),
    PARTITION p2 VALUES LESS THAN (20),
    PARTITION p3 VALUES LESS THAN (30),
    PARTITION p4 VALUES LESS THAN (40),
    PARTITION p5 VALUES LESS THAN (50),
    PARTITION p6 VALUES LESS THAN (60)
);

Assuming the table is very large, and assuming the user wants to routinely rotate it, we can expect periodic ALTER TABLE ... DROP PARTITION and ALTER TABLE ... ADD PARTITION DDLs, which the user expect to operate very quickly. Moreover, in the spirit of Online DDL, we want to be able to revert such statements.

Nothing is trivial here and the problems begin with almost the simplest query.

We cannot use good-old online schema change to add/drop partitions because it's slow

The canonical way for OSC is to create a new empty shadow table, alter that table, then copy over the data from the original table. If the table is large, then the operation is slow. But range partitioning's greatest incentive is that we can operate on smaller tables (partitions), and where operations are proportional to the size of affected partitions, not of the entire table. In particular, ALTER TABLE ... ADD PARTITION is a near instant operation. But OSC can take hours to complete it.

We cannot use good-old online schema change to drop partitions because it's incorrect

This is true for all OSC tools. Consider that we issue a ALTER TABLE tp DROP PARTITION p1 on our sample table above. Alkso, consider the table is populated, and has an id value of 2, which belongs in p1.

The user assumes that dropping p1 also drops the row where id = 2. But in an OSC we build a new table:

CREATE TABLE tp (
    id INT NOT NULL,
    ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    primary key (id)
)
PARTITION BY RANGE (id) (
    PARTITION p2 VALUES LESS THAN (20),
    PARTITION p3 VALUES LESS THAN (30),
    PARTITION p4 VALUES LESS THAN (40),
    PARTITION p5 VALUES LESS THAN (50),
    PARTITION p6 VALUES LESS THAN (60)
);

(p1 is missing) and then copy data over from the original table. The partition p1 now comfortably hosts our row (id = 2 is LESS THAN (20)). We end up with no rows removed!

We therefore need to have a special support for DROP PARTITION statements

But as we'll soon show, we also need a special support for ADD PARTITION. But, before that:

DROP PARTITION is not fast

In MySQL, DROP PARTITION maps to a DROP TABLE. This operation is proportional to the size of the table. Even with recent MySQL 8.0.23 fix to DROP TABLE performance, there is still the issue of removing the table from the filesyste, In MySQL, a DROP PARTITION locks the entire table for writes for the duration of the operation. This can be seconds or minutes. This is too much. We want Online DDL to be non-blocking.

DROP PARTITION is not revertible

Vitess's REVERT not only brings back the previous schema, but also the data. But a DROP PARTITION loses the data. We want to be able to keep hold of it.

Proposal: fast and safe DROP PARTITION implementation

We turn DROP PARTITION into the following multi-step operation. Illustrated by our sample table and query ALTER TABLE tp DROP PARTITION p1:

CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
ALTER TABLE tp DROP PARTITION p1

The above sequence of events is fast and safe. We swap the existing partition with an empty table (fast). We then drop the now-empty partition (fast) and are left with a copy of the data (safe).

_shadow_p1 is a pseudocode name for the table. In reality, this will be a _vt_HOLD_* table and will be garbage collected at a later stage.

DROP PARTITION has no counter-query

We'd like to think that the counter-query to a ALTER TABLE ... DROP PARTITION is a ALTER TABLE ... ADD PARTITION, but that is incorrect in range partitioned tables. You may only ADD PARTITION to the end of the partition range, and so this can only (kind of) work if you dropped the last partition_.

Proposal: fast and safe REVERT for DROP PARTITION (last partition) implementation

Assuming we just dropped the last partition, the REVERT is not just about creating a new partition in its place, but also about populating it. Suggested flow for reverting ALTER TABLE tp DROP PARTITION p6 (last partition inour table):

ALTER TABLE tp ADD PARTITION p6 (PARTITION p6 VALUES LESS THAN (60))
ALTER TABLE tp EXCHANGE PARTITION p6 WITH TABLE _shadow_p6
DROP TABLE _shadow_p6

(this assumes _shadow_p6 was created by a previous DROP PARTITION)

This is a fast and safe operation. It is not atomic: there can be a moment where the user reads empty data from p6 just as it is created and before swapped with _shadow_p6. But this behavior is consistent with not having a partition in the first place.

Important notes:

The table has no memory of p6 and has lost its definition. We must be able to retain the definition PARTITION p6 VALUES LESS THAN (60) somewhere.
Because the operation is not atomic, we need to audit our steps so that if the tablet crashes between ADD PARTITION and EXCHANGE PARTITION, the successor tablet can still complete the opration from that point.

Proposal: fast and safe REVERT for DROP PARTITION (non-last partition) implementation

If we drop p1, we can't ADD it back. Same for p2 though p5. To recreate the partitions we'd have to use a REORGANIZE PARTITION statement. But that is proportional to the partition size and can take a long time.

We now illustrate the steps to REVERT a ALTER TABLE tp DROP PARTITION p1 migration:

CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
CREATE TABLE _shadow_p2 LIKE tp
ALTER TABLE _shadow_p2 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp REORGANIZE PARTITION p2 INTO (PARTITION p1 VALUES LESS THAN (10), PARTITION p2 VALUES LESS THAN (20))
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1

The above is fast, but not atomic/safe, and we will iterate on it. But first, some analysis:

Because REORGANIZE PARTITION is proportional to table size, we don't want to run it on a populated partition.
This is why we exchange p2 with an empty table. REORGANIZE PARTITION is fast on the now empty partition.
We then repopulate p2, and then repopulate p1.

Important notes:

We need to somehow aquire the information that p2 was the next partition after p1. If we want to drop p4 for example, then we need to know that p5 was its subsequent partition.
We need to be able to restore the p1 definition, which was lost from the table.
There's a point in time where p2 is suddenly empty. This is an unexpected and undesired behavior.
The operation is non-atomic. If the tablet fails after EXCHANGE PARTITION p2 WITH TABLE _shadow_p2 then its successor needs to resume operation or revert it, to gain back the data lost from p2.

Making the operation atomic to the user

To prevent the user from suddenly reading empty data from p2, or inserting/updating data in that partition, we have to prevent reads and writes to the table, momentarily. This wil be achieved by using query buffering similarly to how we cut-over a VReplication migration.

Reverting a ADD PARTITION

While a ADD PARTITION migration creates an empty new partition, the user may populate the partition at any time. The REVERT of a ADD PARTITION is not to DROP it. Rather, we want to invoke our Online DDL equivalent of DROP PARTITION, as illustrated above.

Summary

There is a lot of state management to handle: a dropped partition's definition, where the data is stashed, journaling of non-atomic steps, state of query buffering. The solution is complex but achievable.

vitessio / vitess