Closed erikamundson closed 6 months ago
cc. @xxhZs PTAL
Hi @xxhZs , I think there might have been a miscommunication here, my apologies if the issue wasn't clear. I was referring to delta lake checkpoints which truncate the json table metadata files into parquet files periodically.
Without these checkpoints calculating the table state can become very slow, especially since Risingwave writes files with such high frequency.
Is your feature request related to a problem? Please describe.
With one write every second, a delta table's log files quickly become cumbersome to read, slowing down spark/trino queries that run on the tables. Periodic checkpointing greatly reduces the number of log files required to read to determine the table state, reducing that overhead on downstream jobs.
Describe the solution you'd like
Recent updates to the delta-rs crate have added automatic checkpointing based on delta table configuration. I think adding these changes into the risingwavelabs/delta-rs fork should meet the requirements.
Describe alternatives you've considered
We could potentially run a separate scheduled job on an interval to checkpoint the tables. I think a periodic optimize would also accomplish the same if we run it from spark.
Additional context
No response