Open shlomi-noach opened 2 years ago
Status: we have implemented fast DROP PARTITION
and ADD PARTITION
, supported by schemadiff
and used by Vitess with a special, yet-undocumented ddl_strategy
flag. We will continue to experiment with that until we either document the flag or just activate the new behavior by default without flags.
REVERT
for those operations is ON HOLD because of the complexity it takes; it works, but auditing and recovering from a failed/interrupted REVERT
introduces some difficult scenarios.
This issue is to explain how I wish to implement fast revert for
ADD|DROP PARTITION
operations. Some initial context: https://github.com/vitessio/vitess/pull/10315The problem
Consider the following range partitioned table:
Assuming the table is very large, and assuming the user wants to routinely rotate it, we can expect periodic
ALTER TABLE ... DROP PARTITION
andALTER TABLE ... ADD PARTITION
DDLs, which the user expect to operate very quickly. Moreover, in the spirit of Online DDL, we want to be able to revert such statements.Nothing is trivial here and the problems begin with almost the simplest query.
We cannot use good-old online schema change to add/drop partitions because it's slow
The canonical way for OSC is to create a new empty shadow table, alter that table, then copy over the data from the original table. If the table is large, then the operation is slow. But range partitioning's greatest incentive is that we can operate on smaller tables (partitions), and where operations are proportional to the size of affected partitions, not of the entire table. In particular,
ALTER TABLE ... ADD PARTITION
is a near instant operation. But OSC can take hours to complete it.We cannot use good-old online schema change to drop partitions because it's incorrect
This is true for all OSC tools. Consider that we issue a
ALTER TABLE tp DROP PARTITION p1
on our sample table above. Alkso, consider the table is populated, and has an id value of2
, which belongs inp1
.The user assumes that dropping
p1
also drops the row whereid = 2
. But in an OSC we build a new table:(
p1
is missing) and then copy data over from the original table. The partitionp1
now comfortably hosts our row (id = 2
isLESS THAN (20)
). We end up with no rows removed!We therefore need to have a special support for DROP PARTITION statements
But as we'll soon show, we also need a special support for
ADD PARTITION
. But, before that:DROP PARTITION is not fast
In MySQL,
DROP PARTITION
maps to aDROP TABLE
. This operation is proportional to the size of the table. Even with recent MySQL8.0.23
fix toDROP TABLE
performance, there is still the issue of removing the table from the filesyste, In MySQL, aDROP PARTITION
locks the entire table for writes for the duration of the operation. This can be seconds or minutes. This is too much. We want Online DDL to be non-blocking.DROP PARTITION is not revertible
Vitess's
REVERT
not only brings back the previous schema, but also the data. But aDROP PARTITION
loses the data. We want to be able to keep hold of it.Proposal: fast and safe DROP PARTITION implementation
We turn
DROP PARTITION
into the following multi-step operation. Illustrated by our sample table and queryALTER TABLE tp DROP PARTITION p1
:CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
ALTER TABLE tp DROP PARTITION p1
The above sequence of events is fast and safe. We swap the existing partition with an empty table (fast). We then drop the now-empty partition (fast) and are left with a copy of the data (safe).
_shadow_p1
is a pseudocode name for the table. In reality, this will be a_vt_HOLD_*
table and will be garbage collected at a later stage.DROP PARTITION has no counter-query
We'd like to think that the counter-query to a
ALTER TABLE ... DROP PARTITION
is aALTER TABLE ... ADD PARTITION
, but that is incorrect in range partitioned tables. You may onlyADD PARTITION
to the end of the partition range, and so this can only (kind of) work if you dropped the last partition_.Proposal: fast and safe REVERT for DROP PARTITION (last partition) implementation
Assuming we just dropped the last partition, the
REVERT
is not just about creating a new partition in its place, but also about populating it. Suggested flow for revertingALTER TABLE tp DROP PARTITION p6
(last partition inour table):ALTER TABLE tp ADD PARTITION p6 (PARTITION p6 VALUES LESS THAN (60))
ALTER TABLE tp EXCHANGE PARTITION p6 WITH TABLE _shadow_p6
DROP TABLE _shadow_p6
(this assumes
_shadow_p6
was created by a previousDROP PARTITION
)This is a fast and safe operation. It is not atomic: there can be a moment where the user reads empty data from
p6
just as it is created and before swapped with_shadow_p6
. But this behavior is consistent with not having a partition in the first place.Important notes:
p6
and has lost its definition. We must be able to retain the definitionPARTITION p6 VALUES LESS THAN (60)
somewhere.ADD PARTITION
andEXCHANGE PARTITION
, the successor tablet can still complete the opration from that point.Proposal: fast and safe REVERT for DROP PARTITION (non-last partition) implementation
If we drop
p1
, we can'tADD
it back. Same forp2
thoughp5
. To recreate the partitions we'd have to use aREORGANIZE PARTITION
statement. But that is proportional to the partition size and can take a long time.We now illustrate the steps to
REVERT
aALTER TABLE tp DROP PARTITION p1
migration:CREATE TABLE _shadow_p1 LIKE tp
ALTER TABLE _shadow_p1 REMOVE PARTITIONING
CREATE TABLE _shadow_p2 LIKE tp
ALTER TABLE _shadow_p2 REMOVE PARTITIONING
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp REORGANIZE PARTITION p2 INTO (PARTITION p1 VALUES LESS THAN (10), PARTITION p2 VALUES LESS THAN (20))
ALTER TABLE tp EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
ALTER TABLE tp EXCHANGE PARTITION p1 WITH TABLE _shadow_p1
The above is fast, but not atomic/safe, and we will iterate on it. But first, some analysis:
REORGANIZE PARTITION
is proportional to table size, we don't want to run it on a populated partition.p2
with an empty table.REORGANIZE PARTITION
is fast on the now empty partition.p2
, and then repopulatep1
.Important notes:
p2
was the next partition afterp1
. If we want to dropp4
for example, then we need to know thatp5
was its subsequent partition.p1
definition, which was lost from the table.p2
is suddenly empty. This is an unexpected and undesired behavior.EXCHANGE PARTITION p2 WITH TABLE _shadow_p2
then its successor needs to resume operation or revert it, to gain back the data lost fromp2
.Making the operation atomic to the user
To prevent the user from suddenly reading empty data from
p2
, or inserting/updating data in that partition, we have to prevent reads and writes to the table, momentarily. This wil be achieved by using query buffering similarly to how we cut-over aVReplication
migration.Reverting a ADD PARTITION
While a
ADD PARTITION
migration creates an empty new partition, the user may populate the partition at any time. TheREVERT
of aADD PARTITION
is not toDROP
it. Rather, we want to invoke our Online DDL equivalent ofDROP PARTITION
, as illustrated above.Summary
There is a lot of state management to handle: a dropped partition's definition, where the data is stashed, journaling of non-atomic steps, state of query buffering. The solution is complex but achievable.