pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Native `delta` reader and writer using delta-kernel-rs #17244

Open ion-elgreco opened 4 months ago

ion-elgreco commented 4 months ago

Description

With the release of delta-kernel-rs, it has become easier to build a native reader/writer for Delta tables. The kernel maintainers also target Polars as a user, which could help if changes are required: https://github.com/delta-incubator/delta-kernel-rs/issues/48

Kernel is currently limited to reads, but that alone would already be beneficial: we could drop the dependency on the Python deltalake package and the pyarrow datasets way of reading these tables. For writing, it would enable Polars to add streaming sink support for Delta tables, since sink_parquet already exists. Native support would also make Polars a good replacement for Spark + Delta setups.

With kernel it's also easier to keep up to date with newer protocol versions and to support features such as column mapping (essentially, columns that have been functionally renamed) and deletion vectors.
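To make those two features concrete, here is a minimal conceptual sketch in plain Python. It is not how delta-kernel-rs or any real reader implements them: the function names, the `col-1`/`col-2` physical names, and the in-memory row layout are all hypothetical. The only ideas taken from the Delta protocol are that a deletion vector marks row positions in a data file as deleted, and that column mapping separates the physical column names stored in a file from the logical names the table exposes.

```python
from typing import Any

Row = dict[str, Any]


def apply_deletion_vector(rows: list[Row], deleted: set[int]) -> list[Row]:
    """Keep only rows whose position in the file is not marked as deleted."""
    return [row for index, row in enumerate(rows) if index not in deleted]


def apply_column_mapping(rows: list[Row], physical_to_logical: dict[str, str]) -> list[Row]:
    """Rename physical (on-disk) column names to the table's logical names."""
    return [
        {physical_to_logical.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# Hypothetical contents of one data file, keyed by physical column names.
file_rows = [
    {"col-1": 1, "col-2": "a"},
    {"col-1": 2, "col-2": "b"},
    {"col-1": 3, "col-2": "c"},
]
deleted_positions = {1}  # the second row was deleted without rewriting the file
mapping = {"col-1": "id", "col-2": "label"}  # the logical names users see

visible_rows = apply_column_mapping(
    apply_deletion_vector(file_rows, deleted_positions), mapping
)
```

A reader that ignores either step returns wrong data (deleted rows reappear, or columns show up under their physical names), which is why tables using these features cannot be read by a reader that does not support them.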

DuckDB has already built an extension using kernel.

dylan-lee94 commented 2 months ago

Is this feature still on the roadmap? The latest Databricks runtimes have deletion vectors enabled by default, and our admin won't turn them off. Reading these tables via Polars is currently not possible.

A temporary workaround that I'm currently implementing is to read Delta tables with deletion vectors using the DuckDB delta extension, which is based on delta-kernel rather than delta-rs.
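A minimal sketch of what such a workaround could look like, assuming the `duckdb` and `polars` Python packages are installed and that the DuckDB `delta` extension can be installed in your environment. The helper names `delta_scan_query` and `read_delta_via_duckdb` are hypothetical and not part of either library; credentials for cloud storage are left out.

```python
def delta_scan_query(table_path: str) -> str:
    """Build a DuckDB query that scans a Delta table (hypothetical helper)."""
    escaped = table_path.replace("'", "''")  # escape single quotes for SQL
    return f"SELECT * FROM delta_scan('{escaped}')"


def read_delta_via_duckdb(table_path: str):
    """Read a Delta table through DuckDB and return a Polars DataFrame."""
    import duckdb  # imported lazily so the query helper works without it

    con = duckdb.connect()
    con.execute("INSTALL delta")  # installs the extension if not yet present
    con.execute("LOAD delta")
    return con.sql(delta_scan_query(table_path)).pl()
```

Usage would be along the lines of `df = read_delta_via_duckdb("path/to/table")`. Note that this materializes the whole result as a DataFrame, so it gives up the lazy scanning that a native Polars reader would provide.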

It would be great to get this natively in Polars.

ion-elgreco commented 2 months ago

@dylan-lee94 I started on this here: https://github.com/ion-elgreco/polars-deltalake/tree/feat/delta_io_plugin

But I won't be able to work on it anymore.