rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

Par bridge with optional buffering IO handling #1173

Closed pickfire closed 1 month ago

pickfire commented 3 months ago

My use case is processing many files in a CPU-heavy workload. I glob the files and par_bridge them into a CPU-intensive task, but that task first needs to load the file, which is IO-heavy rather than CPU-heavy.

Would it be possible to use another thread to read the file into memory first, and only then par_iter it?

glob(...).par_buffer(polars_read).par_iter(polars_process);

The buffering part reads each file into memory and prepares an extra set of items, so that each CPU-intensive polars_process call has data ready and does not waste CPU time doing IO.
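To make the idea concrete, here is a rough sketch of that kind of read-ahead buffer using only the standard library, not an existing rayon API. The names `read_input`, `process`, and `run_pipeline` are hypothetical stand-ins (for `polars_read`, `polars_process`, and the proposed `par_buffer` step), and the "files" are just strings so the sketch stays self-contained. One thread does the reads and pushes loaded items into a bounded channel, while worker threads only do the CPU part.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the IO-heavy `polars_read`: pretends to load a file.
fn read_input(path: &str) -> Vec<u8> {
    path.as_bytes().to_vec()
}

// Hypothetical stand-in for the CPU-heavy `polars_process`.
fn process(data: Vec<u8>) -> usize {
    data.iter().map(|&b| b as usize).sum()
}

// Reads on one dedicated thread and processes on `workers` threads.
// At most `buffer` loaded items wait in the channel at any time.
fn run_pipeline(paths: Vec<String>, buffer: usize, workers: usize) -> Vec<usize> {
    let (tx, rx) = sync_channel::<Vec<u8>>(buffer);
    let rx = Arc::new(Mutex::new(rx));

    // IO thread: `send` blocks while the buffer is full, which bounds memory use.
    let reader = thread::spawn(move || {
        for path in paths {
            if tx.send(read_input(&path)).is_err() {
                break; // every worker has exited
            }
        }
        // `tx` is dropped when this closure ends, which closes the channel.
    });

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // Hold the lock only while receiving, not while processing.
                    let next = rx.lock().unwrap().recv();
                    match next {
                        Ok(data) => results.push(process(data)),
                        Err(_) => break, // channel closed and drained
                    }
                }
                results
            })
        })
        .collect();

    reader.join().unwrap();
    let mut all: Vec<usize> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_unstable(); // worker order is nondeterministic, so sort for a stable result
    all
}
```

Calling `run_pipeline(paths, 2, 3)` would keep up to two loaded items queued ahead of three workers. With rayon, the worker half could presumably be replaced by bridging the receiving end, e.g. `rx.into_iter().par_bridge()`, since `par_bridge` accepts any `Send` iterator over `Send` items.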

adamreichold commented 2 months ago

If I understand you correctly, you would like to read in data with twice as many threads as you use to process it? What do you think about using two separate pools to do that, i.e. roughly

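use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

let cpu_pool = ThreadPoolBuilder::new().build().unwrap();

// Twice as many IO threads as CPU threads.
let io_pool = ThreadPoolBuilder::new().num_threads(cpu_pool.current_num_threads() * 2).build().unwrap();

io_pool.scope(|_io_scope| {
   // glob yields a sequential iterator, so bridge it into a parallel one.
   glob(...).par_bridge().map(polars_read).for_each(|item| {
       // Hand the loaded item to the CPU pool; this IO thread waits until it is processed.
       cpu_pool.in_place_scope(move |cpu_scope| {
           cpu_scope.spawn(move |_| polars_process(item));
       });
   });
});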

(I have not even compiled this; the code is only meant to illustrate using two thread pools to explicitly implement the two-IO-per-CPU-thread approach.)

pickfire commented 1 month ago

Interesting, something like this could indeed work. But it seems like a lot of work given that I only want to glob files. I guess I can close this issue now.