mrpowers-io / levi

Delta Lake helper methods. No Spark dependency.
MIT License

Helper function to get recently updated partitions #23

Open huynguyent opened 7 months ago

huynguyent commented 7 months ago

Is your feature request related to a problem? Please describe.

Similar to https://github.com/MrPowers/mack/issues/130, but for non-Spark projects.

For streaming systems (or batch systems that run at high frequency) that write data into Delta tables, it's a common problem to end up with lots of small files. In many cases, auto compaction is not practical for various reasons.

One way to solve this is to have a separate process that performs optimization on these Delta tables regularly. However, optimizing the entire table every time, with no constraints, is not a good idea for a few reasons.

Describe the solution you'd like

A helper function to find out which partitions have been updated within a given time period, for example:

def get_updated_partitions(delta_table: DeltaTable, start_time: datetime.datetime, end_time: datetime.datetime, exclude_optimize_operations: bool) -> list[dict[str, str]]

The exclude_optimize_operations flag is necessary because optimization operations are themselves update operations. If they are not excluded, they can cause a feedback loop, since each optimize run will keep showing up in the output.
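For illustration, the transaction log already records the operation name for each commit, so optimize commits can be distinguished from regular writes. A minimal sketch, assuming the deltalake Python package (which levi builds on) and a hypothetical table path:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical path, for illustration only

# DeltaTable.history() returns one dict per commit, including the operation
# name (e.g. "WRITE", "MERGE", "OPTIMIZE") and the commit timestamp.
non_optimize_commits = [
    entry for entry in dt.history()
    if entry.get("operation") != "OPTIMIZE"
]
```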

All the information needed for this feature should be available in the transaction log.
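A rough sketch of one possible approach, assuming the deltalake Python package. It uses DeltaTable.get_add_actions(flatten=True), which exposes per-file modification times and partition values for the current snapshot. Note that this approximates commit times with file modification times and does not yet handle exclude_optimize_operations, which would require reading the per-commit actions from the log:

```python
import datetime
from deltalake import DeltaTable


def get_updated_partitions(
    delta_table: DeltaTable,
    start_time: datetime.datetime,
    end_time: datetime.datetime,
) -> list[dict[str, str]]:
    # One row per data file in the current snapshot; with flatten=True the
    # partition values appear as "partition.<column>" columns.
    actions = delta_table.get_add_actions(flatten=True).to_pydict()
    partition_cols = [c for c in actions if c.startswith("partition.")]

    seen: set[tuple] = set()
    updated: list[dict[str, str]] = []
    for i, mod_time in enumerate(actions["modification_time"]):
        # modification_time converts to a naive datetime; pass naive UTC
        # datetimes for start_time / end_time to keep the comparison consistent.
        if start_time <= mod_time <= end_time:
            key = tuple(
                (c.removeprefix("partition."), str(actions[c][i]))
                for c in partition_cols
            )
            if key not in seen:
                seen.add(key)
                updated.append(dict(key))
    return updated
```

For a table partitioned by date, calling this over the last hour would return something like [{"date": "2024-05-01"}] (hypothetical values), which could then feed a targeted optimize on just those partitions.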

Describe alternatives you've considered

Optimizing the entire table and accepting the overhead.

It's not clear what a good alternative would be once Z-ordering is used, however.

Additional context

N/A

Willingness to contribute

Would you be willing to contribute an implementation of this feature?

MrPowers commented 7 months ago

Thanks @huynguyent. Assigned this one to you! Looking forward to seeing what you can build!