kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0
91 stars 84 forks source link

Use `pl.sink_*` in `LazyPolarsDataset._save` #702

Open AnH0ang opened 4 months ago

AnH0ang commented 4 months ago

Description

When passing a lazy DataFrame to LazyPolarsDataset, it is currently collected into an eager DataFrame before writing it using the appropriate pl.write_* function. This can be skipped by writing the lazy dataframe using pl.sink_*.

Context

In some cases, it may be faster to collect the lazy DataFrame in streaming mode. Additionally, it is not always possible to collect the entire DataFrame (e.g., if the data is too large). Using pl.sink_*, the entire data set does not need to be loaded.

Possible Implementation

In the _save function, the input DataFrame could first be coerced into a lazy DataFrame and then written to disk using pl.sink_*.

Possible Alternatives

AnH0ang commented 4 months ago

I just tried to implement it myself and encountered a problem. LazyPolarsDataset first writes the dataframe into a buffer before writing the buffer to disk with fsspec. However, sink_* does not support writing to a buffer.

https://github.com/kedro-org/kedro-plugins/blob/58ef629a41f6f6b030aeaf7fb1a3443bfc5071f3/kedro-datasets/kedro_datasets/polars/lazy_polars_dataset.py#L207-L236

MatthiasRoels commented 3 months ago

I would hold off such a change for now. As mentioned in the docs:

Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.

On top of that, I heard the Polars team is working on completely rewriting their streaming engine. So I would just stick with the current implementation...

astrojuanlu commented 3 months ago

Indeed, streaming mode is unstable, but lazy non-streaming methods are considered stable, as far as I understand?

About what to do with remote storages, maybe we can offer a sink_* path for local files?

MatthiasRoels commented 3 months ago

Actually, the sink_* methods also use the streaming engine under the hood. Hence these methods should also be considered unstable as is explicitly mentioned in the docs. So I wouldn’t recommend using it either.

Regular methods on the other hand are stable. As a matter of fact, almost all eager methods use the corresponding lazy method under the hood (e.g. .lazy().op().collect()).