pola-rs / r-polars

Bring polars to R
https://pola-rs.github.io/r-polars/

support delta lake reader #221

Open dseynaev opened 1 year ago

dseynaev commented 1 year ago

polars seems to support it but it's implemented on the python side: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_delta.html

the underlying delta lake interface lib is written in Rust though: https://docs.rs/deltalake/latest/deltalake/

sorhawell commented 1 year ago

The current py-polars implementation:

deltalake.read_table() -> deltalake_tbl.to_pyarrow() -> polars.from_arrow() -> polars_table

(the outer function and the to_pyarrow step both run in Python)
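
For reference, the chain above roughly corresponds to the Python below. This is a minimal sketch, assuming the `deltalake`, `pyarrow` and `polars` packages are installed; `DeltaTable.to_pyarrow_table()` and `polars.from_arrow()` are the actual library calls, while the function name `read_delta_sketch` is made up for illustration and is not the py-polars source.

```python
def read_delta_sketch(table_uri: str):
    """Read a Delta table into a polars DataFrame by going through PyArrow."""
    # Imported lazily so the optional dependencies are only needed when called.
    import polars as pl
    from deltalake import DeltaTable

    delta_table = DeltaTable(table_uri)           # delta-rs loads the table's transaction log
    arrow_table = delta_table.to_pyarrow_table()  # the data files are read into an Arrow table
    return pl.from_arrow(arrow_table)             # polars wraps the Arrow data as a DataFrame
```

Everything except the log handling inside `DeltaTable` happens in Python, which is why this route cannot be reused from R as-is.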

delta-rs has first-class support for Python.

A potential r-polars pathway via rust api could be:

Read with delta-rs (I'm not sure if this would work out of the box with any cloud storage URI): https://docs.rs/deltalake/0.11.0/deltalake/delta/fn.open_table.html

make a record-batch-reader with delta-rs: https://docs.rs/deltalake/0.11.0/deltalake/table_state/struct.DeltaTableState.html#method.add_actions_table

import from a record batch reader to r-polars via arrow2-rs...

sorhawell commented 1 year ago

delta-rs, issue 908: R bindings for deltalake-rs

... and also issue 537

sorhawell commented 1 year ago

Hi @wjones127, can I ask: do you think it is realistic to make a minimal delta-lake reader for r-polars via the delta-rs Rust API and arrow2? Or is there some filesystem magic from Python which is also needed?

wjones127 commented 1 year ago

I don't think filesystems are a blocker there; you can use the object stores that come with delta-rs.

But, especially if you are using arrow2, there's no ready-to-use scan function in delta-rs that you could plug into, so there's quite a bit of code you would have to write. Currently in the Python package, delta-rs provides the file list and the files' statistics, and then the Python package provides the actual file scanners through PyArrow. Eventually, we'll have the scanner available in delta-rs and then it will be a lot easier to implement the R package, but that will take time.
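
To make that division of labour concrete, here is a small pure-Python sketch of the first half: using per-file min/max statistics to decide which files a scanner needs to open at all. The `FileEntry` type, the `files_to_scan` function and the file names are hypothetical stand-ins for illustration, not the delta-rs or PyArrow API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one data file listed in a table snapshot."""
    path: str
    min_value: int  # smallest value of the filter column in this file
    max_value: int  # largest value of the filter column in this file


def files_to_scan(files: list[FileEntry], lower: int, upper: int) -> list[str]:
    """Return paths of files whose [min, max] range overlaps [lower, upper]."""
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


snapshot = [
    FileEntry("part-0.parquet", min_value=0, max_value=99),
    FileEntry("part-1.parquet", min_value=100, max_value=199),
    FileEntry("part-2.parquet", min_value=200, max_value=299),
]

# Only files that can contain values in [150, 220] are handed to the scanner.
selected = files_to_scan(snapshot, lower=150, upper=220)
```

The second half, actually reading the selected parquet files, is the part that PyArrow supplies today and that an R package would have to supply itself.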

dseynaev commented 1 year ago

@sorhawell @wjones127 @Ploppz and I might have some capacity to investigate/contribute, but we will need some pointers/guidance

would it be helpful to connect over Discord?

sorhawell commented 1 year ago

@dseynaev sure :) what discord channel do you prefer? it could be the r-polars subchannel of polars discord

One stepping stone would be an interface for r-arrow datasets; r-polars would then need a scanner adaptor for that. It will take a week or two for me to write, I think, but it is very parallel to the py-polars/py-arrow interface. Then there would be two good reasons to go ahead with #165