trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Cleanup delta log after creating the checkpoint #16207

Open findinpath opened 1 year ago

findinpath commented 1 year ago

Introduction

https://docs.delta.io/2.2.0/delta-batch.html#data-retention

delta.logRetentionDuration = "interval &lt;interval&gt;": controls how long the history for a table is kept. Each time a checkpoint is written, Delta automatically cleans up log entries older than the retention interval. If you set this config to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel (but will become more expensive as the log size increases). The default is interval 30 days.

The main benefit of this functionality is that the _delta_log directory of Delta Lake tables gets cleaned of dangling transaction log files.
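For context, on the Delta Lake OSS side this retention is an ordinary table property. A minimal sketch of setting it from Spark follows; the table path and interval are made up for illustration, and the session is assumed to have the Delta extensions configured:

```java
import org.apache.spark.sql.SparkSession;

public class SetLogRetention
{
    public static void main(String[] args)
    {
        SparkSession spark = SparkSession.builder()
                .appName("set-log-retention")
                .getOrCreate();

        // Delta Lake OSS reads the retention from a plain table property;
        // the path-based table name and the interval here are illustrative.
        spark.sql("ALTER TABLE delta.`s3://test-bucket/events` " +
                "SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 7 days')");

        spark.stop();
    }
}
```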

Implementation details

The Trino Delta Lake connector should handle this the same way Delta Lake OSS does, by cleaning up the dangling transaction log files after creating a checkpoint file:

https://github.com/delta-io/delta/blob/e0e9b0095dcc5b1c4372474a54c87428340fe899/core/src/main/scala/org/apache/spark/sql/delta/Checkpoints.scala#L359-L365
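A rough Java sketch of where such a hook could sit in the connector; every name below is hypothetical and does not correspond to actual Trino internals:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical outline only; the real checkpoint writing lives in the
// Delta Lake connector and would go through TrinoFileSystem, not local IO.
class CheckpointWithLogCleanupSketch
{
    void writeCheckpoint(long checkpointedVersion, Duration logRetentionDuration)
    {
        // 1. Write the checkpoint file for checkpointedVersion (existing logic).
        // 2. Afterwards, mirror Delta Lake OSS and drop expired log entries.
        cleanUpExpiredLogs(checkpointedVersion, logRetentionDuration);
    }

    private void cleanUpExpiredLogs(long checkpointedVersion, Duration logRetentionDuration)
    {
        Instant cutoff = Instant.now().minus(logRetentionDuration);
        // Delete _delta_log/<version>.json files whose version is below
        // checkpointedVersion and whose modification time is before the cutoff,
        // so reads can always start from a retained checkpoint.
    }
}
```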

In Delta Lake OSS, the actual cleanup truncates the cutoff timestamp to the start of the day. This means that even if the log retention is set to an interval of 0 hours, only transaction log files older than the current day are removed:

https://github.com/delta-io/delta/blob/e0e9b0095dcc5b1c4372474a54c87428340fe899/core/src/main/scala/org/apache/spark/sql/delta/MetadataCleanup.scala#L51-L66
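The truncation itself is easy to reproduce with plain java.time; a minimal sketch, assuming UTC day boundaries like the Delta Lake OSS code linked above:

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class RetentionCutoff
{
    // Mirrors the behavior described above: the cutoff is truncated to the
    // start of the (UTC) day, so with a retention of 0 hours only files
    // last modified before today are eligible for deletion.
    static Instant cutoffTimestamp(Duration logRetentionDuration)
    {
        return Instant.now()
                .minus(logRetentionDuration)
                .truncatedTo(ChronoUnit.DAYS);
    }

    public static void main(String[] args)
    {
        // With zero retention the cutoff is midnight UTC of the current day,
        // so today's transaction log files survive the cleanup.
        System.out.println(cutoffTimestamp(Duration.ZERO));
    }
}
```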

Test hints

If Trino decides to implement the functionality corresponding to delta.logRetentionDuration in the same fashion as Delta Lake OSS, testing is constrained by the fact that the transaction log files have to be at least one day old. This may be solved by using the register_table procedure on Delta Lake tables stored in the testing resources directory, whose log files already carry old timestamps.
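As a sketch of that test idea, assuming a table directory checked into the test resources with old file timestamps and registered through the connector's register_table procedure; the connection URL, schema, table name, and location below are made up (a real test would go through the test framework's query runner rather than JDBC):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RegisterTestTable
{
    public static void main(String[] args)
            throws Exception
    {
        // Requires the trino-jdbc driver on the classpath; coordinates are illustrative.
        try (Connection connection = DriverManager.getConnection(
                "jdbc:trino://localhost:8080/delta/default", "test", null);
                Statement statement = connection.createStatement()) {
            // Register a table whose _delta_log was checked in with old
            // timestamps, so the "at least one day old" cutoff is already met.
            statement.execute("CALL system.register_table(" +
                    "schema_name => 'default', " +
                    "table_name => 'old_log_table', " +
                    "table_location => 's3://test-bucket/old_log_table')");
        }
    }
}
```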

alexjo2144 commented 1 year ago

In Delta Lake OSS, the actual cleanup truncates the cutoff timestamp to the start of the day. This means that even if the log retention is set to an interval of 0 hours, only transaction log files older than the current day are removed.

I don't think we really need to mimic this behavior unless there's a good reason that I'm not seeing. I'd expect that if you set it to zero, you lose all your history.

One thing we should decide is whether we want this to be part of the checkpoint writing process, or whether it should be done as part of the vacuum procedure.

homar commented 1 year ago

I'd go with adding this to vacuum, as that seems more explicit and doesn't perform any unexpected 'cleaning'. This would also be very helpful for CDF: currently, even after vacuum is executed, the log files still exist, and the table_changes function fails with a missing data file instead of a missing delta log for the specific version, which is misleading.
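To make the CDF point concrete, here is roughly the sequence that trips today, sketched over JDBC; the connection URL, schema, table, version, and retention values are all illustrative:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TableChangesAfterVacuum
{
    public static void main(String[] args)
            throws Exception
    {
        try (Connection connection = DriverManager.getConnection(
                "jdbc:trino://localhost:8080/delta/default", "test", null);
                Statement statement = connection.createStatement()) {
            // Vacuum removes old data files but currently leaves the
            // transaction log in place.
            statement.execute("CALL system.vacuum(" +
                    "schema_name => 'default', table_name => 'tbl', retention => '7d')");

            // Reading CDF for an old version then fails on a missing data file
            // instead of reporting that the history itself is gone.
            try (ResultSet changes = statement.executeQuery(
                    "SELECT * FROM TABLE(system.table_changes(" +
                    "schema_name => 'default', table_name => 'tbl', since_version => 0))")) {
                while (changes.next()) {
                    System.out.println(changes.getString("_change_type"));
                }
            }
        }
    }
}
```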