A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Summary
These changes clear the object store at the end of each round during multi-round compaction. This replaces the existing behavior, which calls delete_many() on the list of object refs created during that round.
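For illustration, a minimal sketch of the end-of-round cleanup under this change, assuming a hypothetical object store that exposes the delete_many()/clear() interface described here (the function and parameter names below are illustrative, not DeltaCAT's actual compaction code):

```python
from typing import Any, List


def finish_compaction_round(object_store: Any, round_object_refs: List[Any]) -> None:
    """Illustrative end-of-round cleanup hook (hypothetical, not DeltaCAT's API)."""
    # Previous behavior: delete only the object refs created during this round.
    # object_store.delete_many(round_object_refs)

    # New behavior: wipe the entire object store at the end of the round.
    # This is only safe while a single partition compacts at a time,
    # because the object store is shared.
    object_store.clear()
```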
Rationale
In E2E testing, delete_many() was shown to take far too long when deleting a large number of object refs, which made multi-round compaction latency infeasible, so clear() is being used instead. For this to work, only one partition may be running compaction at a time; otherwise, clearing the shared object store would wipe object refs still in use by other partitions.
Changes
Switch delete_many() to clear()
Move the pull request template
Impact
Executing clear() rather than delete_many() should substantially reduce end-of-round cleanup time, and therefore overall compaction latency. However, any jobs that run multi-round compaction with multiple partitions compacting in parallel will fail.
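To make that failure mode concrete, here is a self-contained toy illustration. The InMemoryObjectStore below is not DeltaCAT's object store implementation; it only mimics the put/get/clear surface needed to show why a shared store cannot be cleared while another partition is mid-compaction.

```python
import uuid
from typing import Any, Dict, List


class InMemoryObjectStore:
    """Toy stand-in for a shared object store (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, Any] = {}

    def put_many(self, objects: List[Any]) -> List[str]:
        refs = [str(uuid.uuid4()) for _ in objects]
        self._objects.update(dict(zip(refs, objects)))
        return refs

    def get_many(self, refs: List[str]) -> List[Any]:
        return [self._objects[r] for r in refs]  # raises KeyError once cleared

    def clear(self) -> None:
        self._objects.clear()


store = InMemoryObjectStore()  # shared by both partitions

refs_a = store.put_many(["partition A, round 1 output"])
refs_b = store.put_many(["partition B, round 1 output"])

# Partition A finishes its round and clears the *shared* store...
store.clear()

# ...so partition B can no longer resolve the refs it still needs.
try:
    store.get_many(refs_b)
except KeyError:
    print("partition B's object refs were wiped by partition A's clear()")
```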
Testing
Unit tests covering the new end-of-round clear() behavior were added.
Regression Risk
There is a risk that clear(), like delete_many(), will also take an extremely long time to run. To mitigate this, additional E2E testing will be performed on a large table.
The multi-round tests also had to be relaxed, because the FileObjectStore class cannot support clear(). As a result, the tests cannot verify that the files were actually deleted; they only verify that clear() was called.
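As a sketch of what such a relaxed check might look like with unittest.mock (the run_multi_round_compaction driver below is a hypothetical stand-in for the real compaction session, not DeltaCAT's actual test code):

```python
from typing import Any
from unittest.mock import MagicMock


def run_multi_round_compaction(object_store: Any, num_rounds: int) -> None:
    """Hypothetical stand-in for the multi-round compaction driver."""
    for round_index in range(num_rounds):
        refs = object_store.put_many([f"round {round_index} output"])
        # ... compaction work consuming `refs` would happen here ...
        object_store.clear()  # end-of-round cleanup under test


def test_clear_called_once_per_round() -> None:
    # FileObjectStore cannot implement clear(), so the relaxed tests only assert
    # that clear() was invoked; they cannot check that backing files were deleted.
    object_store = MagicMock()
    run_multi_round_compaction(object_store=object_store, num_rounds=3)
    assert object_store.clear.call_count == 3


if __name__ == "__main__":
    test_clear_called_once_per_round()
    print("ok")
```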
Checklist
[x] Unit tests covering the changes have been added
[x] If this is a bugfix, regression tests have been added
[ ] E2E testing has been performed