risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7.07k stars 581 forks source link

Delta Lake Sink: Add periodic checkpointing based on table configuration. #16546

Closed erikamundson closed 6 months ago

erikamundson commented 6 months ago

Is your feature request related to a problem? Please describe.

With one write every second, a delta table's log files quickly become cumbersome to read, slowing down spark/trino queries that run on the tables. Periodic checkpointing greatly reduces the number of log files required to read to determine the table state, reducing that overhead on downstream jobs.

Describe the solution you'd like

Recent updates to the delta-rs crate have added automatic checkpointing based on delta table configuration. I think adding these changes into the risingwavelabs/delta-rs fork should meet the requirements.

Describe alternatives you've considered

We could potentially run a separate scheduled job on an interval to checkpoint the tables. I think a periodic optimize would also accomplish the same if we run it from spark.

Additional context

No response

fuyufjh commented 6 months ago

cc. @xxhZs PTAL

erikamundson commented 5 months ago

Hi @xxhZs , I think there might have been a miscommunication here, my apologies if the issue wasn't clear. I was referring to delta lake checkpoints which truncate the json table metadata files into parquet files periodically.

Without these checkpoints calculating the table state can become very slow, especially since Risingwave writes files with such high frequency.