ray-project / deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Apache License 2.0
166 stars 23 forks

refactored compaction_session.py #333

Closed akindu-amazon closed 3 months ago

akindu-amazon commented 4 months ago

Refactored compaction_session.py with more modular functions that are called within _execute_compaction. These functions are:

- _process_merge_results: processes the results of the merge and returns the merged delta
- _merge: produces the merge results
- _run_local_merge: called if hash_bucket_count == 1
- _discover_deltas: returns the uniform deltas to compact
- _hash_bucket: hashes the uniform deltas passed in

These functions make it easier to support multiple rounds for large tables, while tables that could already be compacted are compacted the same way as before (all deltacat pytest tests pass).