vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.56k stars 2.09k forks source link

Design: REVERT for fast ADD|DROP PARTITION for range partitioned tables #10317

Open shlomi-noach opened 2 years ago

shlomi-noach commented 2 years ago

This issue is to explain how I wish to implement fast revert for ADD|DROP PARTITION operations. Some initial context: https://github.com/vitessio/vitess/pull/10315

The problem

Consider the following range partitioned table:

CREATE TABLE tp (
    id INT NOT NULL,
    ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    primary key (id)
)
PARTITION BY RANGE (id) (
    PARTITION p1 VALUES LESS THAN (10),
    PARTITION p2 VALUES LESS THAN (20),
    PARTITION p3 VALUES LESS THAN (30),
    PARTITION p4 VALUES LESS THAN (40),
    PARTITION p5 VALUES LESS THAN (50),
    PARTITION p6 VALUES LESS THAN (60)
);

Assuming the table is very large, and assuming the user wants to routinely rotate it, we can expect periodic ALTER TABLE ... DROP PARTITION and ALTER TABLE ... ADD PARTITION DDLs, which the user expect to operate very quickly. Moreover, in the spirit of Online DDL, we want to be able to revert such statements.

Nothing is trivial here and the problems begin with almost the simplest query.

We cannot use good-old online schema change to add/drop partitions because it's slow

The canonical way for OSC is to create a new empty shadow table, alter that table, then copy over the data from the original table. If the table is large, then the operation is slow. But range partitioning's greatest incentive is that we can operate on smaller tables (partitions), and where operations are proportional to the size of affected partitions, not of the entire table. In particular, ALTER TABLE ... ADD PARTITION is a near instant operation. But OSC can take hours to complete it.

We cannot use good-old online schema change to drop partitions because it's incorrect

This is true for all OSC tools. Consider that we issue a ALTER TABLE tp DROP PARTITION p1 on our sample table above. Alkso, consider the table is populated, and has an id value of 2, which belongs in p1.

The user assumes that dropping p1 also drops the row where id = 2. But in an OSC we build a new table:

CREATE TABLE tp (
    id INT NOT NULL,
    ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    primary key (id)
)
PARTITION BY RANGE (id) (
    PARTITION p2 VALUES LESS THAN (20),
    PARTITION p3 VALUES LESS THAN (30),
    PARTITION p4 VALUES LESS THAN (40),
    PARTITION p5 VALUES LESS THAN (50),
    PARTITION p6 VALUES LESS THAN (60)
);

(p1 is missing) and then copy data over from the original table. The partition p1 now comfortably hosts our row (id = 2 is LESS THAN (20)). We end up with no rows removed!

We therefore need to have a special support for DROP PARTITION statements

But as we'll soon show, we also need a special support for ADD PARTITION. But, before that:

DROP PARTITION is not fast

In MySQL, DROP PARTITION maps to a DROP TABLE. This operation is proportional to the size of the table. Even with recent MySQL 8.0.23 fix to DROP TABLE performance, there is still the issue of removing the table from the filesyste, In MySQL, a DROP PARTITION locks the entire table for writes for the duration of the operation. This can be seconds or minutes. This is too much. We want Online DDL to be non-blocking.

DROP PARTITION is not revertible

Vitess's REVERT not only brings back the previous schema, but also the data. But a DROP PARTITION loses the data. We want to be able to keep hold of it.

Proposal: fast and safe DROP PARTITION implementation

We turn DROP PARTITION into the following multi-step operation. Illustrated by our sample table and query ALTER TABLE tp DROP PARTITION p1:

The above sequence of events is fast and safe. We swap the existing partition with an empty table (fast). We then drop the now-empty partition (fast) and are left with a copy of the data (safe).

_shadow_p1 is a pseudocode name for the table. In reality, this will be a _vt_HOLD_* table and will be garbage collected at a later stage.

DROP PARTITION has no counter-query

We'd like to think that the counter-query to a ALTER TABLE ... DROP PARTITION is a ALTER TABLE ... ADD PARTITION, but that is incorrect in range partitioned tables. You may only ADD PARTITION to the end of the partition range, and so this can only (kind of) work if you dropped the last partition_.

Proposal: fast and safe REVERT for DROP PARTITION (last partition) implementation

Assuming we just dropped the last partition, the REVERT is not just about creating a new partition in its place, but also about populating it. Suggested flow for reverting ALTER TABLE tp DROP PARTITION p6 (last partition inour table):

(this assumes _shadow_p6 was created by a previous DROP PARTITION)

This is a fast and safe operation. It is not atomic: there can be a moment where the user reads empty data from p6 just as it is created and before swapped with _shadow_p6. But this behavior is consistent with not having a partition in the first place.

Important notes:

Proposal: fast and safe REVERT for DROP PARTITION (non-last partition) implementation

If we drop p1, we can't ADD it back. Same for p2 though p5. To recreate the partitions we'd have to use a REORGANIZE PARTITION statement. But that is proportional to the partition size and can take a long time.

We now illustrate the steps to REVERT a ALTER TABLE tp DROP PARTITION p1 migration:

The above is fast, but not atomic/safe, and we will iterate on it. But first, some analysis:

Important notes:

Making the operation atomic to the user

To prevent the user from suddenly reading empty data from p2, or inserting/updating data in that partition, we have to prevent reads and writes to the table, momentarily. This wil be achieved by using query buffering similarly to how we cut-over a VReplication migration.

Reverting a ADD PARTITION

While a ADD PARTITION migration creates an empty new partition, the user may populate the partition at any time. The REVERT of a ADD PARTITION is not to DROP it. Rather, we want to invoke our Online DDL equivalent of DROP PARTITION, as illustrated above.

Summary

There is a lot of state management to handle: a dropped partition's definition, where the data is stashed, journaling of non-atomic steps, state of query buffering. The solution is complex but achievable.

shlomi-noach commented 2 years ago

Status: we have implemented fast DROP PARTITION and ADD PARTITION, supported by schemadiff and used by Vitess with a special, yet-undocumented ddl_strategy flag. We will continue to experiment with that until we either document the flag or just activate the new behavior by default without flags.

REVERT for those operations is ON HOLD because of the complexity it takes; it works, but auditing and recovering from a failed/interrupted REVERT introduces some difficult scenarios.