ray-project / deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Apache License 2.0

Support limiting delta entries in a compaction round #69

Open raghumdani opened 1 year ago

raghumdani commented 1 year ago

Currently, we only limit the deltas in a compaction round based on the total object store memory available in the cluster. When a single very large delta contains many manifest files, we still need to limit the manifest entries processed per round and re-batch them.
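
To make the request concrete, here is a minimal sketch of re-batching one large delta's manifest entries against a per-round byte budget. `ManifestEntry`, `batch_entries`, `max_bytes_per_round`, and the example URIs are hypothetical names for illustration, not deltacat's API, and the sketch assumes each entry's size is known up front.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical stand-in for one file listed in a delta's manifest."""
    uri: str
    size_bytes: int


def batch_entries(
    entries: Iterable[ManifestEntry],
    max_bytes_per_round: int,
) -> Iterator[List[ManifestEntry]]:
    """Yield consecutive batches whose total size fits the per-round budget.

    An entry larger than the budget is emitted alone, because this sketch
    cannot split work below file-level granularity.
    """
    batch: List[ManifestEntry] = []
    batch_bytes = 0
    for entry in entries:
        # Close the current batch before it would exceed the budget.
        if batch and batch_bytes + entry.size_bytes > max_bytes_per_round:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += entry.size_bytes
    if batch:
        yield batch


# Five 40-byte files with a 100-byte budget (made-up URIs and sizes).
entries = [ManifestEntry(f"s3://bucket/file-{i}.parquet", 40) for i in range(5)]
rounds = list(batch_entries(entries, max_bytes_per_round=100))
```

Each batch would then be compacted in its own round. A real budget would presumably be derived from the available object store memory rather than passed in as a constant.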

pdames commented 1 year ago

From https://github.com/ray-project/deltacat/pull/70:

... the current contract of compaction assumes that each round must be able to compact at least one delta. To work with extremely large deltas we'll need to drive that down to at least file-level granularity (which will drive subsequent changes into the Round Completion File and each round that reads it to determine a starting point). Future improvements would then include driving each round down to record-level granularity to work with files that are too large to complete in a single round.
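
As a rough illustration of what file-level granularity could mean for the starting point of each round, the sketch below tracks a cursor made of a delta stream position plus the index of the next manifest entry to compact. `ResumePoint` and `plan_round` are hypothetical; this is not the actual Round Completion File format, only one shape the persisted state might take.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ResumePoint:
    """Hypothetical file-level cursor a round could persist on completion."""
    stream_position: int   # delta the next round starts in
    next_entry_index: int  # first manifest entry in that delta not yet compacted


def plan_round(
    delta_entry_counts: Dict[int, int],
    resume: Optional[ResumePoint],
    max_entries: int,
) -> Tuple[List[Tuple[int, int]], Optional[ResumePoint]]:
    """Pick up to max_entries (stream_position, entry_index) pairs for one round.

    Returns the selected work and the cursor for the next round, or None
    once every entry has been consumed.
    """
    if max_entries < 1:
        raise ValueError("max_entries must be at least 1")
    work: List[Tuple[int, int]] = []
    for position in sorted(delta_entry_counts):
        if resume and position < resume.stream_position:
            continue  # delta fully compacted in an earlier round
        in_resumed_delta = resume is not None and position == resume.stream_position
        start = resume.next_entry_index if in_resumed_delta else 0
        for index in range(start, delta_entry_counts[position]):
            if len(work) == max_entries:
                # Budget reached: the next round resumes at this entry.
                return work, ResumePoint(position, index)
            work.append((position, index))
    return work, None


# Two deltas with 3 and 2 manifest entries, at most 2 entries per round.
deltas = {1: 3, 2: 2}
cursor = None
rounds = []
while True:
    work, cursor = plan_round(deltas, cursor, max_entries=2)
    rounds.append(work)
    if cursor is None:
        break
```

Here a round may end partway through a delta and the next round picks up at the recorded entry, which relaxes the "at least one delta per round" contract to "at least one file per round". Record-level granularity would need a further offset within the file.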

raghumdani commented 1 year ago

Primary key index building is a prerequisite for running multiple rounds: #63