Closed · glowcoil closed this 3 months ago
(Oh, this is a familiar name! 👋 )
Oh wow, I was not aware of this at all! I will attempt to correct any unfair comparison against Rayon, but it might take some days as I have a busy schedule this week and I'd also like to understand it fully.
> Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me.
I'll have to admit that I haven't used Rayon extensively before; I mainly knew about its overall design (work-stealing) and not its inner details. However, I did go to the documentation and did honestly attempt to pick the most efficient method:
My thinking process was as follows:

- "… `&mut [T]` slices (or vectors) in parallel." => Nope, I don't have a `&mut [T]`.
- (`split` isn't even mentioned on the initial page.)

And following the documentation for `join`, it does not in any way mention that there are any gotchas around it; rather, quite the opposite: it claims that the overhead is tiny.
Could we maybe fix the documentation? Highlight that `join` is more of an internal thing and that there are real benefits to using a `ParallelIterator`? For me, it was not obvious at all that the value of `ParallelIterator` went beyond working on iterators and also lay in its capability of splitting up work effectively.
Also, after reading a tiny bit about `iter::split` (I will read more; promise!) it sounds like it's slightly different from fork/join, no? It's more of a split/map/reduce model? Which intuitively feels like it should be pretty much equivalent to fork/join 🤔 What are the types of patterns which are not expressed well by `iter::split` but can be expressed with fork/join? Or why can't `rayon::join` fall back to executing the two closures inline (if that's what `split` essentially achieves) when there's no need to share work? I think I need to understand a bit more about how `split` makes its decision…
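(For my own understanding, here's roughly the shape I have in mind for the split/map/reduce model. This is just a hypothetical sketch on a plain integer range, with a made-up function name and cutoff, not the tree code from this thread.)

```rust
use rayon::iter::ParallelIterator;

// Rough sketch of the split/map/reduce shape: recursively split a range in half
// until the pieces are small, sum each piece sequentially, and let Rayon reduce
// the partial sums. Illustrative only.
fn sum_range(range: std::ops::Range<u64>) -> u64 {
    rayon::iter::split(range, |r| {
        let len = r.end - r.start;
        if len <= 1024 {
            (r, None) // small enough: stop splitting
        } else {
            let mid = r.start + len / 2;
            (r.start..mid, Some(mid..r.end)) // hand back the right half as a new piece
        }
    })
    .map(|r| r.sum::<u64>()) // sequential sum of each piece
    .sum() // parallel reduction of the partial sums
}
```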
> The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. […] However, the second thing is something that Rayon genuinely lacks the machinery for
Aaaaah, I think maybe I've drawn the wrong conclusion here. I've seen plenty of comments on forums/Reddit from people using Rayon and seeing their code become much slower. I attributed this to the overhead (the first point), but maybe all of these reports are actually about thread contention.
> (Oh, this is a familiar name! 👋 )
Nice running into you, it's been a while! 👋
> Could we maybe fix the documentation? Highlight that `join` is more of an internal thing and that there are real benefits to using a `ParallelIterator`? For me, it was not obvious at all that the value of `ParallelIterator` went beyond working on iterators and also lay in its capability of splitting up work effectively.
The docs do say that the high-level parallel constructs are "typically the most efficient", so to me there is an implication that you should use them if possible, and that you are incurring a bit more responsibility with regard to performance if you choose to use `join` or `scope` manually instead. But I agree that the messaging could be clarified.
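(To make that concrete with a trivial, hypothetical example that isn't from the benchmark here: when the data is already an ordinary collection, the idiomatic starting point is just a parallel iterator, and Rayon picks the splitting granularity for you.)

```rust
use rayon::prelude::*;

// The "high-level construct" route: a parallel iterator over a plain slice,
// with Rayon deciding how to divide the work across the pool.
fn sum_slice(values: &[i64]) -> i64 {
    values.par_iter().sum()
}
```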
> And following the documentation for `join`, it does not in any way mention that there are any gotchas around it; rather, quite the opposite: it claims that the overhead is tiny.
"Tiny" is relative. 🙂 No threads are spawned, there are usually no heap allocations (the data for both closures remains on the stack), and dynamic dispatch is avoided where possible. The overhead is not negligible in comparison to the size of the work items in the benchmark (essentially a single ADD instruction), or in comparison to the single atomic load that Spice's tick
function performs in the non-heartbeat case, but it is still very lightweight in comparison to e.g. thread::spawn
or sending a Box<FnOnce>
over a channel.
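(For a sense of scale, these are the kinds of heavyweight mechanisms being compared against; a rough standalone sketch with made-up function names, not Rayon internals.)

```rust
use std::sync::mpsc;
use std::thread;

// The expensive end of the spectrum: a fresh OS thread per task...
fn spawn_per_task(work: impl FnOnce() -> i64 + Send + 'static) -> i64 {
    thread::spawn(work).join().unwrap()
}

// ...or a heap-allocated, dynamically dispatched closure shipped to a worker
// thread over a channel.
fn send_to_worker(
    tx: &mpsc::Sender<Box<dyn FnOnce() + Send>>,
    work: impl FnOnce() + Send + 'static,
) {
    tx.send(Box::new(work)).unwrap();
}
```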
> why can't `rayon::join` fall back to executing the two closures inline (if that's what `split` essentially achieves) when there's no need to share work?
When called from a thread which is in the thread pool, `join` actually does always execute the first closure inline without any dynamic dispatch, and if the second closure doesn't get stolen while the first one is running, it executes that one inline as well. But it still has to do the work to advertise the second closure as available for stealing, then later determine whether it was stolen, and if it was, potentially fall back to working on other tasks while waiting for it to be finished. (This is described at a high level in the docs for `join`, though if you're curious for more detail, the source is pretty readable.)
You can more or less think of `join` as unconditionally doing the work that Spice's `tick` does only during a heartbeat. In the common case, the operations it performs are all non-blocking and are unlikely to be dealing with contention, so the overhead really is quite low; it's just that there isn't a mechanism like Spice's heartbeat scheduler to skip that work entirely most of the time, so it's intended to be used in a somewhat more coarse-grained way than Spice's `call`. If you're writing code using `join` directly, rather than wrapping every little piece of work in a `join` call and then letting the scheduler sort it out, Rayon leaves it up to you to determine which tasks are large enough to be split up using `join` and which tasks should just be executed as a single unit.
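(As a sketch of what I mean by coarse-grained: the classic pattern on a plain slice, with a hand-picked cutoff. This is hypothetical illustration code, not anything from this repo.)

```rust
// Coarse-grained join usage: keep forking while the pieces are large, and run
// anything below the cutoff as a single sequential unit.
fn par_sum(values: &[i64]) -> i64 {
    const CUTOFF: usize = 10_000; // picked by hand for this workload
    if values.len() <= CUTOFF {
        return values.iter().sum();
    }
    let (left, right) = values.split_at(values.len() / 2);
    let (a, b) = rayon::join(|| par_sum(left), || par_sum(right));
    a + b
}
```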
It doesn't seem like `iter::split` is the right tool for the problem of "summing a binary tree". It's perfect if you know that the tree is perfectly balanced, but that was originally not a constraint for what I was trying to solve.

To understand better how it works, I made a slight tweak to the tree to make it more unbalanced.
When I create a `make_unbalanced_tree(1, 100_000_000, 1000)`, the `split` implementation you posted here gets 0% performance improvement as you add more threads to the system. And this isn't very surprising: the `split` implementation is calling `sum` (the sequential one) internally, and once the scheduler "commits" to using `sum` on one subtree, it's stuck. This tree is initially perfectly balanced, but then there's one heavy subtree. The `split` functionality isn't capable of splitting anything inside that subtree because it can't see it.
(I'd also say that this type of uneven balance seems pretty common to me: imagine a webpage which has one huge table deeply nested somewhere. Overall the DOM is balanced and nice, but then there's this one heavy subtree, which is also internally nicely balanced.)
It seems that in order to use `split` effectively you have to be able to split your problem into roughly equal halves (in terms of work). However, what I struggle with is: why even use work-stealing then? The whole value proposition of work-stealing is that it's hard to balance tasks evenly across your threads, and therefore you make them capable of stealing more work when one of the threads runs out. If you have a way of evenly splitting your problem into halves, there's a much simpler implementation: split it in halves until you reach the number of threads `n`, send one task to each thread, and wait for the results. No shared queues, no lock contention, etc. Here even `n = 1000` should be extremely efficient.
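(Something along these lines is what I mean; a rough, hypothetical sketch of the static-partitioning idea on a slice, not code from Rayon or Spice.)

```rust
use std::thread;

// Static partitioning: cut the input into one chunk per thread up front,
// hand each thread its chunk, and add up the per-thread results.
fn static_sum(values: &[i64], n_threads: usize) -> i64 {
    let chunk_size = values.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```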
What Rayon's `join` is capable of achieving here is far more impressive: it's able to dynamically adapt to different types of inputs and keep parallelizing the problem regardless of how uneven the branches are.

I still think `iter::split` deserves to be highlighted in the README, but it's quite a different beast from Rayon's `join` and Spice, which are far more comparable and have similar nice properties.
> It doesn't seem like `iter::split` is the right tool for the problem of "summing a binary tree". It's perfect if you know that the tree is perfectly balanced, but that was originally not a constraint for what I was trying to solve.
You're right, `split` does make that assumption and it isn't appropriate for arbitrary unbalanced binary trees. I suppose that parallel iterators were a bit of a red herring in this context. However, I still maintain that the intended level of granularity for Rayon's `join` is coarser than the way it's used in `sum_rayon`, and parallel iterators, when they're applicable, use it at that correct level of granularity by default.
> It seems that in order to use `split` effectively you have to be able to split your problem into roughly equal halves (in terms of work). However, what I struggle with is: why even use work-stealing then?
Because your entire program probably doesn't consist of a single call to `iter::split`. Work-stealing will still effectively load-balance a workload consisting of multiple different uses of parallel iterators and custom uses of `join` all composed together.
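(A contrived, hypothetical sketch of what I mean by composition, not from the benchmark: several differently-sized jobs handled by one parallel iterator, each forking once more internally with `join`; work-stealing is what keeps every thread busy across the whole mix.)

```rust
use rayon::prelude::*;

// Composing the two styles: a parallel iterator over jobs, where each job
// internally forks again with join. Work-stealing balances the whole mix
// across the pool even though no single piece is split "evenly".
fn process_all(jobs: &[Vec<i64>]) -> i64 {
    jobs.par_iter()
        .map(|job| {
            let (left, right) = job.split_at(job.len() / 2);
            let (a, b) = rayon::join(|| left.iter().sum::<i64>(), || right.iter().sum::<i64>());
            a + b
        })
        .sum()
}
```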
Here's a simple modification to `sum_rayon` that has very similar performance characteristics to the Spice version, and I think illustrates my point about `join` granularity really clearly:
```rust
const THRESHOLD: usize = 10_000;

fn sum_timer_inner(pool: &rayon::ThreadPool, node: &Node<i64>, iter: &mut usize) -> i64 {
    *iter += 1;

    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        if *iter >= THRESHOLD {
            let (left_result, right_result) = pool.join(
                || sum_timer_inner(pool, left, &mut 0),
                || sum_timer_inner(pool, right, &mut 0),
            );
            return node.value + left_result + right_result;
        } else {
            return sum_timer_inner(pool, left, iter) + sum_timer_inner(pool, right, iter);
        }
    }

    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_timer_inner(pool, child, iter);
    }
    if let Some(child) = &node.right {
        result += sum_timer_inner(pool, child, iter);
    }
    return result;
}

pub fn sum_timer(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    sum_timer_inner(pool, node, &mut 0)
}
```
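(For reference, a hypothetical way to run this at a fixed core count, assuming the same `Node` type as in the benchmark; not the repo's actual benchmark harness.)

```rust
use rayon::ThreadPoolBuilder;

// Build a pool with a fixed number of threads and run sum_timer on it.
fn run(node: &Node<i64>, cores: usize) -> i64 {
    let pool = ThreadPoolBuilder::new().num_threads(cores).build().unwrap();
    sum_timer(&pool, node)
}
```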
Here are the results for 10,000,000 nodes:
| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 55.829 ms | 71.147 ms |
| 2 | | 37.123 ms |
| 4 | | 21.821 ms |
| 8 | | 18.665 ms |
| 16 | | 18.894 ms |
| 32 | | 20.672 ms |
1,000 nodes:
| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 1.9568 µs | 2.2209 µs |
| 2 | | 2.2116 µs |
| 4 | | 2.2277 µs |
| 8 | | 2.2165 µs |
| 16 | | 2.2141 µs |
| 32 | | 2.2412 µs |
And for `make_unbalanced_tree(1, 10_000_000, 10_000_000)`:
| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 109.60 ms | 141.29 ms |
| 2 | | 73.697 ms |
| 4 | | 42.734 ms |
| 8 | | 35.946 ms |
| 16 | | 36.363 ms |
| 32 | | 39.289 ms |
Here, the `iter` counter is effectively playing the same role as the heartbeat clock in Spice, by determining when a work item is large enough that parallelizing it is worth the overhead of participating in work stealing. The primary difference between Rayon and Spice here is that `THRESHOLD` has to be tuned manually, whereas Spice automatically chooses an appropriate granularity based on wall-clock time.
Yes! I agree with everything you're saying. The intention of this benchmark is to show an apples-to-apples comparison of using the same API, not to claim that it's impossible for Rayon to handle this problem. You're also proving the value proposition of Spice though: With Spice you don't have to think about any of this nor tweak any parameters.

I've opened a PR to clarify the comparison, but I'm also open to other suggestions for how to phrase the README. I find it a bit hard to balance between "explain everything" and "getting to the point", especially as different readers have different levels of understanding of the fork/join mechanism.
> And for `make_unbalanced_tree(1, 10_000_000, 10_000_000)`:
And a small note: Is this the actual tree? If so, it's not very unbalanced, since the last parameter is the size of the balanced left-child. I'm not sure that it matters though. It does look like this depth-based granularity control would be able to handle this type of unbalanced tree. But yeah, now you have a manual parameter to tweak.
> You're also proving the value proposition of Spice though: With Spice you don't have to think about any of this nor tweak any parameters.
Yeah, agreed.
> And for `make_unbalanced_tree(1, 10_000_000, 10_000_000)`:
>
> And a small note: Is this the actual tree? If so, it's not very unbalanced, since the last parameter is the size of the balanced left-child. I'm not sure that it matters though. It does look like this depth-based granularity control would be able to handle this type of unbalanced tree. But yeah, now you have a manual parameter to tweak.
Whoops, sorry, that was just a silly mistake on my part (I misread the arguments to `make_unbalanced_tree` in your earlier post). But `sum_timer_inner` handles `make_unbalanced_tree(1, 100_000_000, 1000)` just as well:
| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 538.35 ms | 670.90 ms |
| 2 | | 350.42 ms |
| 4 | | 197.66 ms |
| 8 | | 168.81 ms |
| 16 | | 166.30 ms |
| 32 | | 169.40 ms |
In fact, I would expect it to be able to handle any shape of unbalanced binary tree just about as well as the Spice version. The `iter` counter doesn't keep track of tree depth; it keeps track of the total number of nodes that the depth-first search has visited so far. `iter` ends up being a good proxy for the total amount of time spent so far on a task because this problem consists of a large number of subtasks with mostly identical cost, and I would expect this strategy to work well on any problem with that property. Of course, more heterogeneous tasks would require more manual effort to design heuristics, and not having to do so is the value proposition of heartbeat scheduling.
Ah, yes of course, you're right. Using the numbers from the benchmarks: this ensures that we're doing 10,000 of those ~7 ns operations (roughly 70 µs of work) before adding ~15 ns of overhead. That makes the overhead negligible.
(I think you're missing a `node.value +` in the `return sum_timer_inner(pool, left, iter) + sum_timer_inner(pool, right, iter);` branch, but that doesn't matter much.)
> Rayon has a `ParallelIterator` trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. `ParallelIterator` is fairly central to Rayon's overall design, and I would argue that the `sum_rayon` implementation in the benchmark here, which calls `join` for every single node in the tree, is really not idiomatic.
>
> Here's an alternate version that makes use of Rayon's `iter::split` function.
>
> When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses `split` is significantly faster than the version that calls `join` for every node, and is never slower than the baseline.
>
> However, it is true that the `split` version still shows the same poor scaling behavior for a tree with only 1,000 nodes. This is because the task is being eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make `split` bottom out for tasks below a certain size, and `IndexedParallelIterator` has methods like `with_min_len`/`with_max_len` and `by_uniform_blocks`/`by_exponential_blocks`. However, it's true that Rayon doesn't offer any completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.
>
> Essentially, I think that there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, that Spice is able to avoid the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks for single workers; and second, that Spice is able to preempt those low-overhead contiguous chunks of tasks after they're already running in order to split them adaptively. The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. However, the second thing is something that Rayon genuinely lacks the machinery for, which means that parallel iterators can sometimes make suboptimal splitting decisions, and so I think that aspect of the comparison is valid.