Open jackxxu opened 4 days ago
The cache will turn off branch parallelization to prevent deadlocks. You can turn off CSE if it is slower in your case.
BTW, I would really replace this snippet:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    # Execute the processing function for each file concurrently
    lazy_frames = list(executor.map(process_file, file_paths))
with a pl.collect_all, as right now your threads are contending with our threads. collect_all can use Polars' thread pool.
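A minimal sketch of the suggested shape, assuming process_file boils down to a pl.scan_parquet per file (file_paths is taken from the snippet above):

import polars as pl

# `file_paths` stands in for the list of parquet files from the original code.
lazy_frames = [pl.scan_parquet(path) for path in file_paths]

# collect_all materializes every LazyFrame on Polars' own thread pool,
# so no extra Python-level executor is needed for the heavy lifting.
dfs = pl.collect_all(lazy_frames)
result = pl.concat(dfs)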
@ritchie46:
Turning off CSE indeed did the trick, thank you!
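For context, a sketch of how CSE can be turned off at collect time; the exact keyword names are version-dependent and are an assumption here:

import polars as pl

# `lazy_frames` is the list built by the snippet above.
# In the Polars versions current when this issue was filed, collect() accepted
# these booleans directly; newer releases may expose them differently.
result = pl.concat(lazy_frames).collect(
    comm_subplan_elim=False,
    comm_subexpr_elim=False,
)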
Regarding your suggestion for pl.collect_all: the snippet is meant to create a list of LazyFrames (by stitching a couple of parquets horizontally), and then the list is passed to pl.concat to stitch them vertically into one lazy DataFrame.
The idea of using ThreadPoolExecutor is so that reading the parquet schemas to create the LazyFrames can be done in parallel (for hundreds of them). Also, I thought the input to collect_all is a list of lazy frames, so I suspect collect_all may not help me with creating the LazyFrames in the first place?
Or perhaps I am missing something?
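For concreteness, the shape I have in mind, combining the executor for building the LazyFrames with collect_all for materializing them (process_file and the paths here are placeholders for the real per-file join logic):

from concurrent.futures import ThreadPoolExecutor

import polars as pl

def process_file(path: str) -> pl.LazyFrame:
    # Placeholder for the real logic that joins 2 parquets horizontally.
    return pl.scan_parquet(path)

file_paths = [f"part_{i}.parquet" for i in range(300)]  # placeholder paths

# Build the LazyFrames (schema reads) in parallel with a Python thread pool ...
with ThreadPoolExecutor() as executor:
    lazy_frames = list(executor.map(process_file, file_paths))

# ... and let Polars' thread pool do the heavy lifting of collecting them.
dfs = pl.collect_all(lazy_frames)
result = pl.concat(dfs)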
Checks
Reproducible example
Log output
This is the partial output of the explain() method:
Issue description
I am trying to stitch multiple parquets together, both horizontally (2 parquets) and vertically (hundreds of parquets), into one large DataFrame. I join them horizontally first and then stack the results vertically.
I tried two ways: in the first, every DataFrame is unique; in the second, one of the 2 DataFrames is the same for every layer. As expected, Polars caches the shared DataFrame, as the explain() output shows.
However, the cached version is 40% slower than the uncached version in my experiments. Also, CPU utilization is limited in the cached version, which may indicate some kind of locking.
The code can be found here: https://github.com/jackxxu/polars_merge/tree/main.
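A condensed sketch of the cached variant described above (the paths and the join key are placeholders; the full code is in the linked repo):

import polars as pl

# One parquet shared by every layer, plus hundreds of layer-specific ones.
shared = pl.scan_parquet("shared.parquet")                # placeholder path
layer_paths = [f"layer_{i}.parquet" for i in range(300)]  # placeholder paths

# Horizontal join per layer; because `shared` appears in every branch,
# Polars' common-subplan elimination inserts a CACHE node for it.
layers = [pl.scan_parquet(p).join(shared, on="id") for p in layer_paths]

# Vertical stack, then a single collect.
result = pl.concat(layers).collect()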
Expected behavior
I expected the cached version to be faster, but it turns out to be the opposite.
Also, the CPU usage of the cached version is much lower, which explains why it is slower.
Installed versions