Inefficient scheduling for par_iter

rayon-rs / rayon

Rayon: A data parallelism library for Rust

Apache License 2.0

11.01k stars 501 forks source link

Inefficient scheduling for par_iter #1204

Open morgante opened 1 week ago

morgante commented 1 week ago

I have an iterator where the work to be done for each task varies widely—most items finish very quickly, but a few take orders of magnitude longer.

It appears from my quick experiment that Rayon is being somewhat inefficient here since spawn performs much better than the ParallelIterator. It looks like Rayon is still queuing up fast tasks behind the slow task on the same thread pool, or otherwise running fast tasks after the slow task.

The optimal strategy here, which spawn seems to support, is to continue chewing through tasks on other threads while 1 thread stays stuck on the slow task.

Is there a way to do this with par_iter which has much nicer iteration semantics, or do I need to manage spawning myself?

cuviper commented 1 week ago

It's a tough balancing act, because usually I hear that the adaptive scheduler was too aggressive. :)

If your input par_iter() is an IndexedParallelIterator, then you can force fine granularity by adding .with_max_len(1). That will get you roughly the same processing strategy as a bunch of individual spawns, but still based on join recursion. That has its own work-stealing gotchas with latency though -- see #1054.

morgante commented 1 week ago

Thanks for the pointer, I could probably get indexed iteration working but I'm actually not sure I understand what wins I'm getting there over just spawning each thread separately.

My ideal would actually be something like this:

I use the iterator/into_some_iter() with for_each.
The iterator is consumed for tasks into different threads.
When no thread is available for a task, we don't continue pulling from the iterator until there's actually a thread available for the next task.

Is there a way to do this with rayon or will I need to build my own scheduler / switch to tokio?

cuviper commented 1 week ago

but I'm actually not sure I understand what wins I'm getting there over just spawning each thread separately.

The benefit is that you would still be using the shared thread pool, even if your number of items is much larger than the number of threads. They'll just be "scheduled" individually if you add .with_max_len(1).

What is your source that you're calling par_iter() on? Many types do support the indexed API already.