ray-project / deltacat

A portable Pythonic Data Catalog API powered by Ray that brings exabyte-level scalability and fast, ACID-compliant, change-data-capture to your big data workloads.
Apache License 2.0

Upgrade ray and commit rcf before catalog commit #324

Closed raghumdani closed 4 months ago

raghumdani commented 5 months ago

This commit makes four changes:

  1. Upgrades Ray to >=2.20.0, which includes a fix for https://github.com/ray-project/ray/issues/45311.
  2. The new Ray version also lets us enable structured logging for Ray logs.
  3. Removes usage of BlockWritePathProvider, as it is no longer supported in Ray Data as of 2.30.0.
  4. Fixes a unique issue when committing the round completion file (RCF); the scenario is explained below.

Before: RCF read failure scenario

Below are runs in chronological order:

Run 1

  1. Happy case. Nothing goes wrong.

Run 2

  1. Incremental run uses the previous round completion file.
  2. Commits a new partition.
  3. But fails to write the RCF to S3 (this is more likely when S3 throttles requests or the cluster is terminated due to some other issue).

Run 3

  1. Tries to read from the stale RCF, because no new RCF was written in Run 2.
  2. Fails to read because the catalog has already deleted the files.

All subsequent runs fail.
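
The failure sequence above can be reproduced with a small in-memory model. This is only an illustrative sketch, not DeltaCAT code: `InMemoryStore`, `run_commit_first`, `SimulatedFailure`, and `StaleRCFError` are hypothetical names, and the assumption that a commit deletes the previous partition's files is taken from Run 3, step 2.

```python
class SimulatedFailure(Exception):
    """Stands in for S3 throttling or a terminated cluster."""


class StaleRCFError(Exception):
    """Raised when the RCF points at files the catalog already deleted."""


class InMemoryStore:
    """Hypothetical stand-in for the catalog plus the RCF location in S3."""

    def __init__(self):
        self.live_partition = None  # partition currently committed in the catalog
        self.files = set()          # partitions whose files still exist
        self.rcf = None             # partition that the latest RCF points at


def run_commit_first(store, new_partition, rcf_write_fails=False):
    """Old ordering: commit the partition first, write the RCF afterwards."""
    # Read the previous RCF; the files it references must still exist.
    if store.rcf is not None and store.rcf not in store.files:
        raise StaleRCFError(f"RCF points at deleted partition {store.rcf}")
    # Commit the new partition (assumption: the catalog then deletes the old files).
    store.files.add(new_partition)
    if store.live_partition is not None:
        store.files.discard(store.live_partition)
    store.live_partition = new_partition
    # Write the RCF last; a failure here leaves the old RCF in place.
    if rcf_write_fails:
        raise SimulatedFailure("RCF write failed")
    store.rcf = new_partition


store = InMemoryStore()
outcomes = []
# Run 1 succeeds, Run 2 fails on the RCF write, Run 3 reads the stale RCF.
for partition, fails in [("p1", False), ("p2", True), ("p3", False)]:
    try:
        run_commit_first(store, partition, rcf_write_fails=fails)
        outcomes.append("ok")
    except (SimulatedFailure, StaleRCFError) as err:
        outcomes.append(type(err).__name__)
```

In this model the catalog has moved on to `p2` while the RCF still points at `p1`, so every run after Run 2 hits `StaleRCFError`.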

After: RCF read failure scenario

Assume Run 1 goes through.

Run 2

  1. Incremental run uses the previous round completion file.
  2. If the RCF could not be written, the partition is never committed, so the RCF and partition stay consistent. Assume that the RCF was written successfully.
  3. The commit to the catalog fails for some reason (same as explained above).

Run 3

  1. This PR ensures that the RCF is written based on the current partition locator in the destination, so the correct previous RCF is read.
  2. Compaction runs successfully even after an intermittent failure.
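
The reordered flow can be sketched the same way. Again, this is a hypothetical in-memory model rather than the code in this PR: `InMemoryStore`, `run_rcf_first`, and `SimulatedFailure` are illustrative names, and keying RCFs by partition name is an assumed simplification of "written based on the current partition locator".

```python
class SimulatedFailure(Exception):
    """Stands in for S3 throttling or a terminated cluster."""


class InMemoryStore:
    """Hypothetical stand-in for the catalog plus the RCF location in S3."""

    def __init__(self):
        self.live_partition = None  # partition currently committed in the catalog
        self.rcfs = {}              # RCFs keyed by the partition they describe


def run_rcf_first(store, new_partition, commit_fails=False):
    """New ordering: write the RCF first, then commit the partition."""
    # Find the previous RCF via the partition that is actually committed.
    previous_rcf = None
    if store.live_partition is not None:
        previous_rcf = store.rcfs[store.live_partition]
    # Write the RCF before touching the catalog.
    store.rcfs[new_partition] = f"rcf-for-{new_partition}"
    # A failed commit leaves the catalog, and its matching RCF, unchanged.
    if commit_fails:
        raise SimulatedFailure("catalog commit failed")
    store.live_partition = new_partition
    return previous_rcf


store = InMemoryStore()
outcomes = []
# Run 1 succeeds, Run 2 fails on the catalog commit, Run 3 recovers.
for partition, fails in [("p1", False), ("p2", True), ("p3", False)]:
    try:
        outcomes.append(run_rcf_first(store, partition, commit_fails=fails))
    except SimulatedFailure as err:
        outcomes.append(type(err).__name__)
```

Run 3 reads the RCF for `p1`, the partition still committed after the failed Run 2, so compaction proceeds. The RCF written for `p2` is left orphaned in this model but is never read.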

Zyiqin-Miranda commented 4 months ago

Thanks for the fix!