pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cache Common Subgraph Evaluation in Lazy Mode #18292

Open barmanroys opened 2 months ago

barmanroys commented 2 months ago
Dependency DAG (Example)

[Image: example dependency DAG]

Description

I am reading some Parquet files from disk with polars; these are the source of the data. I then do some moderately heavy processing (a few million rows) to generate an intermediate data frame, and from it I generate two results that need to be written back to a database.

Current Behaviour

The source-to-intermediate transformation happens twice, once per result; even the disk read happens twice.

Desired Behaviour

Reuse the evaluation of common subgraphs so that the calculation (and the disk read) is not repeated.

ritchie46 commented 2 months ago

If you call collect twice, we will start from scratch each time. Currently the best thing you can do is persist an intermediate result with cached_df = collect().lazy()