Open AnH0ang opened 4 months ago
I just tried to implement it myself and encountered a problem. LazyPolarsDataset
first writes the dataframe into a buffer before writing the buffer to disk with fsspec. However, sink_*
does not support writing to a buffer.
I would hold off such a change for now. As mentioned in the docs:
Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
On top of that, I heard the Polars team is working on completely rewriting their streaming engine. So I would just stick with the current implementation...
Indeed, streaming mode is unstable, but lazy non-streaming methods are considered stable, as far as I understand?
About what to do with remote storages, maybe we can offer a sink_*
path for local files?
Actually, the sink_*
methods also use the streaming engine under the hood. Hence these methods should also be considered unstable as is explicitly mentioned in the docs. So I wouldn’t recommend using it either.
Regular methods on the other hand are stable. As a matter of fact, almost all eager methods use the corresponding lazy method under the hood (e.g. .lazy().op().collect()
).
Description
When passing a lazy DataFrame to
LazyPolarsDataset
, it is currently collected into an eager DataFrame before writing it using the appropriatepl.write_*
function. This can be skipped by writing the lazy dataframe usingpl.sink_*
.Context
In some cases, it may be faster to collect the lazy DataFrame in streaming mode. Additionally, it is not always possible to collect the entire DataFrame (e.g., if the data is too large). Using
pl.sink_*
, the entire data set does not need to be loaded.Possible Implementation
In the
_save
function, the input DataFrame could first be coerced into a lazy DataFrame and then written to disk usingpl.sink_*
.Possible Alternatives