pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Generator output support #18382

Open EpicUsaMan opened 3 weeks ago

EpicUsaMan commented 3 weeks ago

Description

lazy_df = pl.scan_parquet('PATH_TO_SOME_FILE')

lazy_df = lazy_df.select(...some long running logic...).batch(56).next(streaming=True)

So basically, it would be syntactic sugar for something like

lazy_df.head(skip + batch_size).collect(...params...)

But while preserving caches and all the benefits of lazy mode.

This would be very useful for feeding training data to ML/DL models.

deanm0000 commented 3 weeks ago

I think this is a duplicate of https://github.com/pola-rs/polars/issues/12611

EpicUsaMan commented 3 weeks ago

> I think this is a duplicate of #12611

Yep, probably

But the problem is that there is a PR, and it hasn't been approved because it was developed for streaming-v1.

For streaming-v2 there is no spec for it.

BTW, streaming-v2 is in the main branch as far as I can see, so if I compile it myself, will I be able to try it? Or is it not ready to even be tried?

cmdlineluser commented 3 weeks ago

It is available in collect via new_streaming: .collect(new_streaming=True)

But it's not documented, so probably not intended for "public usage".

EpicUsaMan commented 2 weeks ago

> It is available in collect via new_streaming: .collect(new_streaming=True)
>
> But it's not documented, so probably not intended for "public usage".

Great, I will give it a try!