Closed pickfire closed 1 month ago
If I understand you correctly, you would like to read in data with twice as many threads as you use to process it? What do you think about using two separate pools to do that, i.e. roughly
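let cpu_pool = ThreadPoolBuilder::new().build().unwrap();
// Twice as many IO threads as CPU threads.
let io_pool = ThreadPoolBuilder::new().num_threads(cpu_pool.current_num_threads() * 2).build().unwrap();
io_pool.scope(|_io_scope| {
    // Reads run on the IO pool; glob yields a sequential iterator, hence par_bridge.
    glob(...).par_bridge().map(polars_read).for_each(|item| {
        // Hand each loaded item over to the CPU pool for processing.
        cpu_pool.in_place_scope(move |cpu_scope| {
            // Scope::spawn passes the scope to the closure, so it takes one argument.
            cpu_scope.spawn(move |_| polars_process(item));
        });
    });
});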
(I have not even compiled this; the code is only meant to illustrate using two thread pools to explicitly implement the two-IO-per-CPU-thread approach.)
Interesting, something like this could indeed work. But it seemed like a lot of work given that I only want to glob files. I guess I can close this issue now.
My use case is processing many files in a CPU-heavy workload. What I did was glob the files and par_bridge them into a CPU-intensive task, but that task first needs to load the file, which is IO-heavy rather than CPU-heavy.
I am wondering whether it is possible to use another thread to read the files into memory first, and only then par_iter over them?
The buffering part would read files into memory and prepare an extra set of items for each CPU-intensive function (polars_process) to process, without having to waste CPU time doing IO.
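For illustration, the "read ahead on another thread" idea can be sketched with only the standard library, leaving rayon out entirely. This is not what rayon does internally, just a minimal sketch under assumptions: `read_item`, `process_item` and `run_pipeline` are hypothetical names, and the two stand-in functions only imitate the real file loading and `polars_process` steps. One reader thread fills a bounded channel, so at most `buffer` loaded items sit in memory while the worker threads do the CPU-heavy part.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the IO-heavy load step (e.g. reading a file).
fn read_item(path: &str) -> Vec<u8> {
    path.as_bytes().to_vec()
}

/// Hypothetical stand-in for the CPU-heavy step (polars_process in this thread).
fn process_item(data: Vec<u8>) -> usize {
    data.iter().map(|&b| b as usize).sum()
}

/// Loads `paths` on one dedicated reader thread and processes the loaded
/// items on `workers` threads. At most `buffer` items are queued at a time.
fn run_pipeline(paths: Vec<String>, workers: usize, buffer: usize) -> Vec<usize> {
    let (tx, rx) = sync_channel::<Vec<u8>>(buffer);
    let rx = Arc::new(Mutex::new(rx));

    // Reader thread: does only IO and blocks whenever the buffer is full.
    let reader = thread::spawn(move || {
        for path in paths {
            if tx.send(read_item(&path)).is_err() {
                break; // every worker has exited
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // The lock is held only while receiving, not while processing.
                    let item = rx.lock().unwrap().recv();
                    match item {
                        Ok(data) => results.push(process_item(data)),
                        Err(_) => break, // channel closed and drained
                    }
                }
                results
            })
        })
        .collect();

    reader.join().unwrap();
    // Results arrive in completion order, not input order.
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

fn main() {
    let paths: Vec<String> = (0..8).map(|i| format!("file{}.csv", i)).collect();
    let results = run_pipeline(paths, 2, 4);
    println!("processed {} items", results.len());
}
```

The bounded channel is what keeps memory use in check: if reading outpaces processing, the reader simply waits instead of loading every file up front. Whether this beats the two-pool version depends on how IO-bound the loads really are, so it is worth measuring both.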