ray-project / deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Apache License 2.0

Clear object store between rounds #367

Closed yankevn closed 2 weeks ago

yankevn commented 3 weeks ago

Summary

These changes clear the object store at the end of each round during multi-round compaction. This replaces the existing behavior, which calls delete_many() on the list of object refs created during that round.
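The difference between the two cleanup strategies can be sketched as follows. This is a minimal illustration, not deltacat's code: `InMemoryObjectStore` and `run_rounds` are hypothetical names, and the store is a plain dictionary standing in for the real object store.

```python
import uuid
from typing import Any, Dict, List


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store; not deltacat's implementation."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def put(self, obj: Any) -> str:
        ref = uuid.uuid4().hex
        self._objects[ref] = obj
        return ref

    def delete_many(self, refs: List[str]) -> bool:
        # Per-ref deletion: the work grows with the number of refs.
        for ref in refs:
            self._objects.pop(ref, None)
        return True

    def clear(self) -> bool:
        # Drops everything in the store, regardless of which round created it.
        self._objects.clear()
        return True

    def __len__(self) -> int:
        return len(self._objects)


def run_rounds(store: InMemoryObjectStore, rounds: List[List[Any]]) -> List[int]:
    """Run each round, then clear the store instead of deleting refs one by one."""
    sizes_after_each_round = []
    for batch in rounds:
        refs = [store.put(item) for item in batch]  # refs created during this round
        assert len(store) == len(refs)  # the round's work would read these refs here
        # Previous behavior: store.delete_many(refs)
        store.clear()
        sizes_after_each_round.append(len(store))
    return sizes_after_each_round
```

With either strategy the store is empty at the end of each round; the difference is that `clear()` does not need the list of refs and does not iterate over it.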

Rationale

In E2E testing, delete_many() took far too long when deleting a large number of object refs. This made compaction latency unacceptable, so clear() is used instead. For this to work, only one partition may run compaction at a time; otherwise, clearing the shared object store can remove objects that another in-progress compaction still needs.

Changes

Impact

Executing clear() rather than delete_many() should lead to better performance. However, any jobs that run multi-round compaction with multiple partitions compacting in parallel will fail.
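One plausible way the parallel case breaks is sketched below. This is an assumption about the failure mode, not a trace from a real job: `SharedStore` and `simulate_parallel_partitions` are hypothetical, and the keys are made up for illustration.

```python
from typing import Any, Dict


class SharedStore:
    """Hypothetical object store shared by every compacting partition."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def put(self, ref: str, obj: Any) -> str:
        self._objects[ref] = obj
        return ref

    def get(self, ref: str) -> Any:
        return self._objects[ref]  # raises KeyError if the object is gone

    def clear(self) -> None:
        self._objects.clear()


def simulate_parallel_partitions() -> bool:
    """Return True if partition B loses its object when partition A clears."""
    store = SharedStore()
    store.put("partition-a/round-1", "rows for A")
    ref_b = store.put("partition-b/round-1", "rows for B")

    # Partition A finishes its round and clears the shared store.
    store.clear()

    # Partition B is still mid-round and tries to read its object.
    try:
        store.get(ref_b)
    except KeyError:
        return True
    return False
```

In this sketch, partition B's read fails as soon as partition A clears the store, which is why the change assumes one compacting partition at a time.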

Testing

Unit tests were written.

Regression Risk

There is a risk that clear(), like delete_many(), will also take an extremely long time to run. To mitigate this, we will need to perform additional E2E testing on a large table.

The multi-round tests also had to be relaxed, as the FileObjectStore class cannot support clear(). As a result, we cannot check whether the files were actually deleted, although we still check that clear() was called.
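A relaxed check of this kind can be sketched with `unittest.mock`. The `finish_round` hook below is hypothetical and only stands in for whatever code runs at the end of a round; the actual tests in this PR may be structured differently.

```python
from unittest.mock import MagicMock


def finish_round(object_store) -> None:
    """Hypothetical end-of-round hook: clear the store rather than delete refs."""
    object_store.clear()


def test_clear_called_once_per_round() -> None:
    store = MagicMock()  # mock store, so no real files exist to inspect
    num_rounds = 3
    for _ in range(num_rounds):
        finish_round(store)

    # We can only verify the call was made, not that any data was removed.
    assert store.clear.call_count == num_rounds
    store.delete_many.assert_not_called()
```

The trade-off is the one described above: the assertion proves the call happened, but says nothing about whether the underlying storage was actually emptied.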

Checklist