The Open Table Formats occasionally need to delete files as part of routine maintenance. For example, Delta deletes old log files, configured via table property delta.logRetentionDuration.
For the Snowplow Lake Loader, this can mean deleting a very large number of files; bearing in mind this is a streaming loader that commits frequently. And deleting files from cloud storage is relatively slow. I observed that the loader could pause for several minutes on a single commit waiting for files to be deleted.
Here I implement a customized Hadoop FileSystem where delete returns immediately and then operates asynchronously. This means deleting never blocks the loader's main fiber of execution. It is safe to run delete tasks asynchronously because the Open Table Formats do not have a hard requirement that files are deleted immediately.
The Open Table Formats occasionally need to delete files as part of routine maintenance. For example, Delta deletes old log files, configured via table property
delta.logRetentionDuration
.For the Snowplow Lake Loader, this can mean deleting a very large number of files; bearing in mind this is a streaming loader that commits frequently. And deleting files from cloud storage is relatively slow. I observed that the loader could pause for several minutes on a single commit waiting for files to be deleted.
Here I implement a customized Hadoop FileSystem where
delete
returns immediately and then operates asynchronously. This means deleting never blocks the loader's main fiber of execution. It is safe to rundelete
tasks asynchronously because the Open Table Formats do not have a hard requirement that files are deleted immediately.