pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Specify thread pool size at collection #4558

Open OneRaynyDay opened 2 years ago

OneRaynyDay commented 2 years ago

Problem Description

Each query has its own performance characteristics, so it's hard to prescribe a single thread pool size for all jobs. Some jobs work wonders with the maximum thread count, while others run out of memory (OOM), since a larger thread pool correlates with higher memory consumption. It would be great if we could tune this on a per-query basis, maybe something like:

lazy_frame.collect(num_threads=100)

Or something along those lines. Since Python has the GIL, I don't think there would be any situation where multiple queries run concurrently and request different numbers of threads. In that case, I think we could just set POLARS_MAX_THREADS to num_threads for the duration of collect() and read that value dynamically. Would this negatively impact performance?
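For context, a rough sketch of one workaround that needs no library changes: run each query in its own Python process with POLARS_MAX_THREADS set in that process's environment. This assumes the variable is read once, when Polars first initializes its thread pool, which is why it is set before the child interpreter starts. The helper name `run_with_thread_limit` and the `QUERY_SCRIPT` contents are hypothetical and only illustrate the idea.

```python
import os
import subprocess
import sys

# Hypothetical query script, shown for illustration only.
# It would run inside the child process, after the environment is set.
QUERY_SCRIPT = """
import polars as pl
lf = pl.LazyFrame({"x": [1, 2, 3]})
print(lf.select(pl.col("x").sum()).collect())
"""


def run_with_thread_limit(script: str, num_threads: int) -> str:
    """Run `script` in a fresh interpreter with POLARS_MAX_THREADS set.

    Assumption: the thread pool size is fixed when Polars starts up,
    so the variable has to be in place before the child imports it.
    """
    # Copy the parent's environment and override only the thread limit.
    env = dict(os.environ, POLARS_MAX_THREADS=str(num_threads))
    result = subprocess.run(
        [sys.executable, "-c", script],
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    return result.stdout
```

Calling `run_with_thread_limit(QUERY_SCRIPT, 4)` would run the query with a limit of four threads, provided Polars is installed. The cost is process startup time, and real results would have to be passed back through a file or similar rather than printed.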

ritchie46 commented 2 years ago

Would this negatively impact performance?

It would. It would mean that we must put our thread pool behind a mutex/rwlock and lock it on every access. Given that we parallelize at so many levels, this would really hurt performance, so it is not feasible in a form we'd be happy with.
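To make the trade-off concrete, here is a small Python analogy of a pool that can be swapped at runtime. It is not how Polars is implemented (Polars is written in Rust), and the `ResizablePool` class is hypothetical. It only shows that once the pool can be replaced, every submission has to go through a lock so that no caller sees a pool that is being torn down.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class ResizablePool:
    """A thread pool that can be replaced at runtime (illustrative only)."""

    def __init__(self, num_threads: int) -> None:
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=num_threads)
        self.lock_acquisitions = 0  # counts how often submit() took the lock

    def submit(self, fn, *args) -> Future:
        # Every access must take the lock, because resize() may be
        # swapping the underlying pool at the same moment.
        with self._lock:
            self.lock_acquisitions += 1
            return self._pool.submit(fn, *args)

    def resize(self, num_threads: int) -> None:
        # Swap in a new pool while holding the lock.
        with self._lock:
            old = self._pool
            self._pool = ThreadPoolExecutor(max_workers=num_threads)
        # Let work already queued on the old pool finish, outside the lock.
        old.shutdown(wait=True)


pool = ResizablePool(2)
futures = [pool.submit(pow, 2, i) for i in range(5)]
results = [f.result() for f in futures]
```

Each of the five submissions above acquires the lock once. In an engine that dispatches parallel work at many nested levels, that per-access synchronization is the overhead described in the comment above; a pool created once and never replaced can be shared without it.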