pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.57k stars 1.88k forks

Add `read_delta_changefeed` (change data feed) #16338

Open ion-elgreco opened 4 months ago

ion-elgreco commented 4 months ago

Description

@stinodego With v0.17.3 of the Python deltalake package, a change data feed reader was added. Are you OK with me adding a new method:

pl.read_delta_cdf()

Essentially, it would just be a shortcut for `pl.DataFrame(dt.load_cdf(starting_version=<>, ending_version=<>))`; see https://github.com/delta-io/delta-rs/blob/11ab3f68493d32c620f76c8e33671e626d8f0dde/python/deltalake/table.py#L687-L688
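For illustration, the shortcut could look roughly like the sketch below. This is not an existing Polars function: the name `read_delta_changefeed`, its parameters, and the `storage_options` pass-through are hypothetical. It assumes `DeltaTable.load_cdf` returns a `pyarrow.RecordBatchReader` (as in deltalake 0.17.x) and that the table has change data feed enabled.

```python
def read_delta_changefeed(
    source,
    starting_version=0,
    ending_version=None,
    storage_options=None,
):
    """Hypothetical helper: read a Delta table's change data feed into a Polars DataFrame."""
    # Deferred imports, so polars and deltalake are only required when the
    # helper is actually called.
    import polars as pl
    from deltalake import DeltaTable

    dt = DeltaTable(source, storage_options=storage_options)

    # Assumption: load_cdf returns a pyarrow.RecordBatchReader (deltalake 0.17.x).
    reader = dt.load_cdf(
        starting_version=starting_version,
        ending_version=ending_version,
    )

    # Materialize the stream as a pyarrow.Table, then convert it to Polars.
    return pl.from_arrow(reader.read_all())
```

A call such as `read_delta_changefeed("path/to/table", starting_version=1)` (the path is a placeholder) should then return the changed rows together with the change feed metadata columns, such as `_change_type` and `_commit_version`.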

alexander-beedie commented 4 months ago

Without taking a position on whether it should be there or not (though it doesn't seem unreasonable), how about read_delta_changes (or something similar) as the name? "cdf" doesn't seem very descriptive 🤔

ion-elgreco commented 4 months ago

> Without taking a position on whether it should be there or not (though it doesn't seem unreasonable), how about read_delta_changes (or something similar) as the name? "cdf" doesn't seem very descriptive 🤔

True, you would need to know delta to know it. How about read_delta_changefeed?

alexander-beedie commented 4 months ago

> True, you would need to know delta to know it. How about read_delta_changefeed?

Yup, that would work for me; even clearer ;)

stinodego commented 4 months ago

I'm not a big fan of expanding our API surface for third-party integrations like this. If Delta adds 10 more features, do we add 10 more methods?

Perhaps it's sufficient to add an example to the user guide that shows how to read a changelog into Polars using the deltalake package directly?

ion-elgreco commented 4 months ago

@stinodego I get your point, but this is probably the last thing. I was thinking of adding it to read/scan_delta, but it probably wouldn't fit nicely into one API.

An example could work, yeah!