Open findinpath opened 1 year ago
In Delta Lake OSS, the actual cleanup truncates the cutoff date to the day. This means that even if the log retention is set to an interval of 0 hours, only transaction log files older than the current date are removed.
I don't think we really need to mimic this behavior unless there's a good reason to that I'm not seeing. I'd expect if you set it to zero you lose all your history.
One thing we should decide on is whether we want this to be part of the checkpoint writing process, or whether it should be done as part of the vacuum procedure.
I'd go with adding this to vacuum as this seems more explicit and doesn't perform any unexpected 'cleaning'.
This would also be very helpful for CDF. Currently, even after vacuum is executed, the log files still exist, and the `table_changes` function fails with a missing dataFile error instead of a missing delta log error for the specific version, which is not good.
Introduction
https://docs.delta.io/2.2.0/delta-batch.html#data-retention
The main benefit of having this functionality is that the `_delta_log` directory of the Delta Lake tables will get rid of dangling transaction log files.
Implementation details
The Trino Delta Lake connector should handle this similarly to Delta Lake OSS, by cleaning up the dangling transaction log files after creating a checkpoint file:
https://github.com/delta-io/delta/blob/e0e9b0095dcc5b1c4372474a54c87428340fe899/core/src/main/scala/org/apache/spark/sql/delta/Checkpoints.scala#L359-L365
In Delta Lake OSS, the actual cleanup truncates the cutoff date to the day. This means that even if the log retention is set to an interval of 0 hours, only transaction log files older than the current date are removed.
https://github.com/delta-io/delta/blob/e0e9b0095dcc5b1c4372474a54c87428340fe899/core/src/main/scala/org/apache/spark/sql/delta/MetadataCleanup.scala#L51-L66
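To make the effect of the truncation concrete, here is a minimal Python sketch of the cutoff logic as described above. It is not the Delta Lake or Trino implementation: the function names are hypothetical, truncation at midnight UTC and the strict `<` comparison are assumptions, and the checkpoint-related conditions in the linked Scala code are left out.

```python
from datetime import datetime, timedelta, timezone


def truncated_cutoff(now: datetime, retention: timedelta) -> datetime:
    """Subtract the retention interval, then truncate to the start of that day.

    Assumption: truncation happens at midnight UTC.
    """
    cutoff = now - retention
    return cutoff.replace(hour=0, minute=0, second=0, microsecond=0)


def expired_log_files(files: dict, now: datetime, retention: timedelta) -> list:
    """Return the names of log files whose modification time is before the cutoff.

    `files` maps a (hypothetical) file name to its modification time.
    """
    cutoff = truncated_cutoff(now, retention)
    return sorted(name for name, mtime in files.items() if mtime < cutoff)


# Illustrative data only: two log files, one from yesterday and one from today.
now = datetime(2023, 3, 15, 14, 30, tzinfo=timezone.utc)
files = {
    "00000000000000000000.json": datetime(2023, 3, 14, 23, 0, tzinfo=timezone.utc),
    "00000000000000000001.json": datetime(2023, 3, 15, 9, 0, tzinfo=timezone.utc),
}

# With a retention of 0 hours the cutoff is still midnight of the current day,
# so the file written earlier today survives.
expired = expired_log_files(files, now, timedelta(hours=0))
```

Under these assumptions, a retention of 0 hours only expires the file written on the previous day. Without the truncation step, both files would be older than the cutoff.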
Test hints
If Trino decides to implement the functionality corresponding to `delta.logRetentionDuration` in the same fashion as Delta Lake OSS, we have the constraint that the transaction log files have to be at least one day old in testing. This may be solved by using the `register_table` system call applied to Delta Lake tables stored in the testing `resources` directory.