trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.36k stars 2.98k forks source link

How should truncate and drop partition be implemented for Hive ACID tables? #5035

Open djsstarburst opened 4 years ago

djsstarburst commented 4 years ago

PR #5026 adds support for row-by-row delete for Hive ACID tables. How should table truncate and drop partition be implemented for ACID tables? Here are options:

The argument for the first behavior is that it is familiar and fast. The argument for the second is that it preserves all the history. One possible outcome is that different customers prefer different behaviors, and we decide to "make it a mode" via a session property.

findepi commented 4 years ago

What does Hive 3 do?

djsstarburst commented 4 years ago

What does Hive 3 do?

5049 documents what Hive ACID does. For all DELETE FROM table WHERE ... requests, Hive ACID does row-by-row delete. For ALTER table DROP PARTITION or TRUNCATE table requests, Hive ACID deletes all the files in a non-transactional way.

It's a bit different for Presto (unless we "make it a mode" via a session property) because "metadata delete" causes partitions to be dropped, even though the DELETE request looks superficially like a row-by-row DELETE request.

findepi commented 4 years ago

non-transactional tables

Current, "metadata delete" is appropriate for non-transactional tables.

ORC ACID (full) transactional tables

does Hive's ALTER TABLE .. COMPACT 'MAJOR' unregister partitions when no rows are left there?

insert-only transactional tables

How does Hive do DELETE? Is it allowed in Hive?

djsstarburst commented 4 years ago

we should do row-level delete:

Yes, I agree: for Hive ACID, it seems to me that row-level delete is enough. One thing that convinces me we should not create a special case for "metadata delete" in Hive ACID is that the delete deltas will be tiny: 4 of 5 of the ACID columns will usually run-length-encode to a single value for each chunk deleted, and the 5th - - the rowId column - - should compress very well.

@electrum wonders if some customers will still need metadata delete for Hive ACID tables, and whether we should "make it a mode".

we could still do "metadata delete" when WHERE condition matches whole partitions (is expressed on part keys only). For this, we would still need to do proper locking, so that the difference is not end-user visible.

We could (and the PR now does, mistakenly in my opinion) translate Hive ACID whole partition deletes into metadata deletes rather than row-by-row deletes. However, the Hive ACID metastore treats partition dropping as a "non-transactional" operation. The lock you acquire is of type NO_TXN.

Since the only form of deletion supported by non-ACID Hive is partition dropping, it seems clear we must continue to support "metadata delete" for non-ACID Hive tables.

electrum commented 4 years ago

We discussed this further and it sounds like always doing normal ACID delete for transactional tables is the right behavior. We can add a drop_partition procedure later if needed.