I am reading some parquet files from disk using polars; these are the source of the data. I do some moderately heavy processing (a few million rows) to generate an intermediate data frame, then generate two results from it which need to be written back to a database.
Current Behaviour
The source-to-intermediate transformation happens twice; even the disk read happens twice.
Desired Behaviour
Somehow reuse the evaluation of common subgraphs so that the calculation (and disk read) are not repeated.
If you call collect twice, we will start from scratch. Currently the best thing you can do is persist an intermediate result with cached_df = collect().lazy()
Dependency DAG (Example)