judofyr / spice

Fine-grained parallelism with sub-nanosecond overhead in Zig

Comparison with Rayon should mention `ParallelIterator` #5

glowcoil closed this issue 3 months ago

glowcoil commented 3 months ago

Rayon has a ParallelIterator trait which is based around recursively splitting tasks into subtasks until the set of workers in the pool is saturated. ParallelIterator is fairly central to Rayon's overall design, and I would argue that the sum_rayon implementation in the benchmark here, which calls join for every single node in the tree, is really not idiomatic:

```rust
pub fn sum_rayon(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        let (left_result, right_result) =
            pool.join(|| sum_rayon(pool, left), || sum_rayon(pool, right));
        return node.value + left_result + right_result;
    }

    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_rayon(pool, child);
    }
    if let Some(child) = &node.right {
        result += sum_rayon(pool, child);
    }
    return result;
}
```

Here's an alternate version that makes use of Rayon's iter::split function:

```rust
pub fn sum_split(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    use rayon::{iter, iter::ParallelIterator};

    pool.install(|| {
        iter::split(node, |node| {
            if let (Some(left), Some(right)) = (&node.left, &node.right) {
                (&**left, Some(&**right))
            } else {
                (node, None)
            }
        })
        .map(sum)
        .reduce(|| 0, |left, right| left + right)
    })
}
```

When I benchmark these two implementations on a tree with 10,000,000 nodes (on an Intel Core i5-8400 CPU with 6 cores), the version that uses split is significantly faster than the version that calls join for every node, and is never slower than the baseline:

| Threads | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 57.833 ms | 230.48 ms | 57.814 ms |
| 2 | | 115.98 ms | 30.517 ms |
| 4 | | 59.531 ms | 17.483 ms |
| 8 | | 41.588 ms | 15.079 ms |
| 16 | | 41.426 ms | 14.772 ms |
| 32 | | 41.431 ms | 14.799 ms |

However, it is true that the split version still shows the same poor scaling behavior for a tree with only 1,000 nodes:

| Cores | Baseline | Rayon | Rayon (split) |
|---|---|---|---|
| 1 | 2.4961 µs | 25.781 µs | 3.9588 µs |
| 2 | | 17.522 µs | 8.5428 µs |
| 4 | | 16.180 µs | 12.508 µs |
| 8 | | 20.689 µs | 18.707 µs |
| 16 | | 28.541 µs | 28.069 µs |
| 32 | | 50.097 µs | 52.966 µs |

This is because the task is being eagerly split up to saturate the available workers without any regard for how small it is. Of course, there are usually good ad hoc solutions to this in the context of any particular problem: in this case, you could introduce a heuristic to make split bottom out for tasks below a certain size, and IndexedParallelIterator has methods like with_min_len/with_max_len and by_uniform_blocks/by_exponential_blocks. However, it's true that Rayon doesn't offer any completely automatic solution here, and the fact that Spice does makes it genuinely different and interesting.
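For illustration, here is a rough sketch (not code from the benchmark) of what that kind of manual granularity control looks like; the function name `sum_squares` and the 10,000 threshold are just placeholders:

```rust
use rayon::prelude::*;

// Illustrative only: with_min_len keeps Rayon from splitting this indexed
// parallel iterator into pieces smaller than the given bound, which caps the
// scheduling overhead when the per-item work is tiny.
fn sum_squares(n: usize) -> u64 {
    (0..n)
        .into_par_iter()
        .with_min_len(10_000) // placeholder threshold; would need tuning
        .map(|i| (i as u64) * (i as u64))
        .sum()
}
```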

Essentially, I think that there are two main factors responsible for the differences in the benchmarks between Rayon and Spice: first, that Spice is able to avoid the overhead of dynamic task dispatch by scheduling contiguous chunks of tasks for single workers; and second, that Spice is able to preempt those low-overhead contiguous chunks of tasks after they're already running in order to split them adaptively. The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me. However, the second thing is something that Rayon genuinely lacks the machinery for, which means that parallel iterators can sometimes make suboptimal splitting decisions, and so I think that aspect of the comparison is valid.

judofyr commented 3 months ago

(Oh, this is a familiar name! 👋 )

Oh wow, I was not aware of this at all! I will attempt to correct any unfair comparison against Rayon, but it might take some days as I have a busy schedule this week and I'd also like to understand it fully.

> Given that Rayon's docs push you pretty heavily to make use of parallel iterators, ignoring them entirely makes that aspect of the comparison feel somewhat misleading to me.

I'll have to admit that I haven't used Rayon extensively before and I mainly knew about its overall design (work-stealing) and not its inner details. However, I did go to the documentation and did honestly attempt to pick the most efficient method:

*(screenshot of the Rayon documentation)*

My thinking process was as follows:

And the documentation for join does not in any way mention that there are any gotchas around it; quite the opposite, it claims that the overhead is tiny.

Could we maybe fix the documentation? Highlight that join is more of an internal thing and that there are real benefits to using a ParallelIterator? For me, it was not obvious at all that the value of ParallelIterator goes beyond working on iterators and also lies in its ability to split up work effectively.


Also, after reading a tiny bit about iter::split (I will read more; promise!) it sounds like it's slightly different from fork/join, no? It's more of a split/map/reduce model? Which intuitively feels like it should be pretty equivalent to fork/join 🤔 What are the types of patterns which are not expressed well by iter::split, but can be expressed with fork/join? Or why can't rayon::join fall back to executing the two closures inline (if that's what split essentially achieves) when there's no need to share work? I think I need to understand a bit more about how split makes its decisions…

> The first thing isn't actually unique to Spice, and is in fact the purpose of Rayon's parallel iterator abstraction. […] However, the second thing is something that Rayon genuinely lacks the machinery for

Aaaaah, I think maybe I've drawn the wrong conclusion here. I've seen plenty of comments on forums/Reddit from people using Rayon and seeing their code become much slower. I attributed this to the overhead (the first point), but maybe all of these reports are actually about thread contention.

glowcoil commented 3 months ago

> (Oh, this is a familiar name! 👋 )

Nice running into you, it's been a while! 👋

> Could we maybe fix the documentation? Highlight that join is more of an internal thing and that there are real benefits to using a ParallelIterator? For me, it was not obvious at all that the value of ParallelIterator goes beyond working on iterators and also lies in its ability to split up work effectively.

The docs do say that the high-level parallel constructs are "typically the most efficient", so to me there is an implication that you should use them if possible, and that you are incurring a bit more responsibility with regard to performance if you choose to use join or scope manually instead. But I agree that the messaging could be clarified.

> And the documentation for join does not in any way mention that there are any gotchas around it; quite the opposite, it claims that the overhead is tiny.

"Tiny" is relative. 🙂 No threads are spawned, there are usually no heap allocations (the data for both closures remains on the stack), and dynamic dispatch is avoided where possible. The overhead is not negligible in comparison to the size of the work items in the benchmark (essentially a single ADD instruction), or in comparison to the single atomic load that Spice's tick function performs in the non-heartbeat case, but it is still very lightweight in comparison to e.g. thread::spawn or sending a Box<FnOnce> over a channel.

> why can't rayon::join fall back to executing the two closures inline (if that's what split essentially achieves) when there's no need to share work?

When called from a thread which is in the thread pool, join actually does always execute the first closure inline without any dynamic dispatch, and if the second closure doesn't get stolen while the first one is running, it executes that one inline as well. But it still has to do the work to advertise the second closure as available for stealing, then later determine whether it was stolen, and if it was, potentially fall back to working on other tasks while waiting for it to be finished. (This is described at a high level in the docs for join, though if you're curious for more detail, the source is pretty readable.)

You can more or less think of join as unconditionally doing the work that Spice's tick does only during a heartbeat. In the common case, the operations it performs are all non-blocking and are unlikely to be dealing with contention, so the overhead really is quite low; it's just that there isn't a mechanism like Spice's heartbeat scheduler to skip that work entirely most of the time, so it's intended to be used in a somewhat more coarse-grained way than Spice's call. If you're writing code using join directly, rather than wrapping every little piece of work in a join call and then letting the scheduler sort it out, Rayon leaves it up to you to determine which tasks are large enough to be split up using join and which tasks should just be executed as a single unit.
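To make that concrete, here is a minimal sketch (not code from this repo or from Rayon's docs) of that coarser-grained usage: recurse with join only while a piece of work is large, and drop down to plain sequential code below a cutoff. The name `par_sum` and the cutoff value are just placeholders:

```rust
// Sketch only: the cutoff is a made-up value and would need tuning per workload.
fn par_sum(slice: &[i64]) -> i64 {
    const CUTOFF: usize = 10_000;
    if slice.len() <= CUTOFF {
        // Small enough: no task advertising, no stealing, just a plain loop.
        return slice.iter().sum();
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    // Large enough that the (small but nonzero) cost of join is worth paying.
    let (a, b) = rayon::join(|| par_sum(left), || par_sum(right));
    a + b
}
```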

judofyr commented 3 months ago

It doesn't seem like iter::split is the right tool for the problem of "summing a binary tree". It's perfect if you know that the tree is perfectly balanced, but that was originally not a constraint for what I was trying to solve.

To understand better how it works, I made a slight tweak to the tree to make it more unbalanced:

<details>
<summary>Rust code for making an unbalanced tree</summary>

The intention is for the tree to look something like this:

```
     / \
    / - \
   / - - \
  / - - - \
 / - - - - +
            / \
           / - \
          / - - \
```

Sorry if there are some off-by-one bugs.

```rust
pub fn make_unbalanced_tree(from: i64, to: i64, left_size: i64) -> Node<i64> {
    if left_size == 0 {
        return make_balanced_tree(from, to);
    }

    let value = from + left_size;
    return Node {
        value: value,
        left: Some(Box::new(make_balanced_tree(from, value - 1))),
        right: Some(Box::new(make_unbalanced_tree(value + 1, to, left_size / 2))),
    };
}
```

</details>

When I create a tree with make_unbalanced_tree(1, 100_000_000, 1000), the split implementation you posted here gets a 0% performance improvement as you add more threads to the system. And this isn't very surprising: the split implementation calls sum (the sequential one) internally, and once the scheduler "commits" to using sum on one subtree, it's stuck. This tree is initially perfectly balanced, but then there's one heavy subtree. The split functionality isn't capable of splitting anything inside that subtree because it can't see it.

(I'd also say that this type of uneven balance seems pretty common to me: imagine a webpage which has one huge table deeply nested somewhere. Overall the DOM is balanced and nice, but then there's this one heavy subtree, which is itself also nicely balanced internally.)

It seems that in order to use split effectively you have to be able to split your problem into roughly equal halves (in terms of work). However, what I struggle with is: Why even use work-stealing then? The whole value proposition of work-stealing is that it's hard to balance tasks evenly across your threads, and therefore you make them capable of stealing more work when one of the threads runs out. If you have a way of evenly splitting your problem into halves, there's a much simpler implementation: split it in halves until you have one task per thread, send each task to a thread, and wait for the results. No shared queues, no lock contention, etc. Here even the n=1000 case should be extremely efficient.
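For comparison, here's a rough sketch of that "much simpler implementation", assuming the work really can be pre-split evenly. A flat slice stands in for the tree (a Node tree can't be split evenly without knowing subtree sizes), and `sum_static` is just an illustrative name:

```rust
// Sketch only: one statically sized chunk per thread, no shared queues,
// no work-stealing. This only works well if the chunks cost roughly the same.
fn sum_static(values: &[i64], n_threads: usize) -> i64 {
    let chunk_size = values.len().div_ceil(n_threads).max(1);
    std::thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<i64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<i64>()
    })
}
```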

What Rayon's join is capable of achieving here is far more impressive: It's able to dynamically adapt to different types of inputs and keep parallelizing the problem regardless of how uneven the branches are.

I still think iter::split deserves to be highlighted in the README, but it's quite a different beast from Rayon's join and Spice – which are far more comparable and have similar nice properties.

glowcoil commented 3 months ago

> It doesn't seem like iter::split is the right tool for the problem of "summing a binary tree". It's perfect if you know that the tree is perfectly balanced, but that was originally not a constraint for what I was trying to solve.

You're right, split does make that assumption and it isn't appropriate for arbitrary unbalanced binary trees. I suppose that parallel iterators were a bit of a red herring in this context. However, I still maintain that the intended level of granularity for Rayon's join is coarser than the way it's used in sum_rayon, and parallel iterators, when they're applicable, use it at that correct level of granularity by default.

> It seems that in order to use split effectively you have to be able to split your problem into roughly equal halves (in terms of work). However, what I struggle with is: Why even use work-stealing then?

Because your entire program probably doesn't consist of a single call to iter::split. Work-stealing will still effectively load-balance a workload consisting of multiple different uses of parallel iterators and custom uses of join all composed together.
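As a small illustrative sketch of that kind of composition (not from the repo; `composed` is a made-up function), two unrelated parallel computations sharing one pool:

```rust
use rayon::prelude::*;

// The same work-stealing pool load-balances both halves of the join and the
// parallel iterators inside them, even if one side finishes much earlier.
fn composed(xs: &[i64], ys: &[f64]) -> (i64, f64) {
    rayon::join(
        || xs.par_iter().map(|x| x * x).sum::<i64>(),
        || ys.par_iter().map(|y| y.sqrt()).sum::<f64>(),
    )
}
```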

Here's a simple modification to sum_rayon that has very similar performance characteristics to the Spice version, and I think illustrates my point about join granularity really clearly:

```rust
const THRESHOLD: usize = 10_000;

fn sum_timer_inner(pool: &rayon::ThreadPool, node: &Node<i64>, iter: &mut usize) -> i64 {
    *iter += 1;

    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        if *iter >= THRESHOLD {
            let (left_result, right_result) = pool.join(
                || sum_timer_inner(pool, left, &mut 0),
                || sum_timer_inner(pool, right, &mut 0),
            );
            return node.value + left_result + right_result;
        } else {
            return sum_timer_inner(pool, left, iter) + sum_timer_inner(pool, right, iter);
        }
    }

    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_timer_inner(pool, child, iter);
    }
    if let Some(child) = &node.right {
        result += sum_timer_inner(pool, child, iter);
    }
    return result;
}

pub fn sum_timer(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    sum_timer_inner(pool, node, &mut 0)
}
```

Here are the results for 10,000,000 nodes:

| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 55.829 ms | 71.147 ms |
| 2 | | 37.123 ms |
| 4 | | 21.821 ms |
| 8 | | 18.665 ms |
| 16 | | 18.894 ms |
| 32 | | 20.672 ms |

1,000 nodes:

| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 1.9568 µs | 2.2209 µs |
| 2 | | 2.2116 µs |
| 4 | | 2.2277 µs |
| 8 | | 2.2165 µs |
| 16 | | 2.2141 µs |
| 32 | | 2.2412 µs |

And for make_unbalanced_tree(1, 10_000_000, 10_000_000):

| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 109.60 ms | 141.29 ms |
| 2 | | 73.697 ms |
| 4 | | 42.734 ms |
| 8 | | 35.946 ms |
| 16 | | 36.363 ms |
| 32 | | 39.289 ms |

Here, the iter counter is effectively playing the same role as the heartbeat clock in Spice, by determining when a work item is large enough that parallelizing it is worth the overhead of participating in work stealing. The primary difference between Rayon and Spice here is that THRESHOLD has to be tuned manually, whereas Spice automatically chooses an appropriate granularity based on wall-clock time.
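To illustrate the distinction, here is a rough sketch of how the counter could be replaced with a time-driven flag: a background thread raises it periodically, so splitting frequency is tied to elapsed time rather than to a hand-tuned node count. This is not how Spice is implemented (its heartbeat scheduler is per-worker and more involved), and `HEARTBEAT`, `start_heartbeat`, and `sum_heartbeat` are made-up names; Node and the pool are assumed from the benchmark code above:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

// Sketch only: a background thread raises the flag every `interval`; the
// traversal pays for a join only when the flag has been raised since the
// last split, so the common-case cost per node is a single relaxed load.
static HEARTBEAT: AtomicBool = AtomicBool::new(false);

fn start_heartbeat(interval: Duration) {
    std::thread::spawn(move || loop {
        std::thread::sleep(interval);
        HEARTBEAT.store(true, Ordering::Relaxed);
    });
}

fn sum_heartbeat(pool: &rayon::ThreadPool, node: &Node<i64>) -> i64 {
    if let (Some(left), Some(right)) = (&node.left, &node.right) {
        if HEARTBEAT.load(Ordering::Relaxed) {
            HEARTBEAT.store(false, Ordering::Relaxed);
            let (l, r) = pool.join(|| sum_heartbeat(pool, left), || sum_heartbeat(pool, right));
            return node.value + l + r;
        }
        return node.value + sum_heartbeat(pool, left) + sum_heartbeat(pool, right);
    }
    let mut result = node.value;
    if let Some(child) = &node.left {
        result += sum_heartbeat(pool, child);
    }
    if let Some(child) = &node.right {
        result += sum_heartbeat(pool, child);
    }
    result
}
```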

judofyr commented 3 months ago

Yes! I agree with everything you're saying. The intention of this benchmark is to show an apples-to-apples comparison of using the same API, not to claim that it's impossible for Rayon to handle this problem. You're also proving the value proposition of Spice though: With Spice you don't have to think about any of this nor tweak any parameters.

I've opened a PR to clarify the comparison, but I'm also open to other suggestions for how to phrase the README. I find it a bit hard to balance between "explaining everything" and "getting to the point", especially as different readers have different levels of understanding of the fork/join mechanism.

> And for make_unbalanced_tree(1, 10_000_000, 10_000_000):

And a small note: Is this the actual tree? In which case it's not very unbalanced since the last parameter is the size of the balanced left-child. I'm not sure that it matters though. It does look like this depth-based granularity control would be able to handle this type of unbalanced tree. But yea, now you have a manual parameter to tweak.

glowcoil commented 3 months ago

> You're also proving the value proposition of Spice though: With Spice you don't have to think about any of this nor tweak any parameters.

Yeah, agreed.

> > And for make_unbalanced_tree(1, 10_000_000, 10_000_000):
>
> And a small note: Is this the actual tree? In which case it's not very unbalanced since the last parameter is the size of the balanced left-child. I'm not sure that it matters though. It does look like this depth-based granularity control would be able to handle this type of unbalanced tree. But yea, now you have a manual parameter to tweak.

Whoops, sorry, that was just a silly mistake on my part (I misread the arguments to make_unbalanced_tree in your earlier post). But sum_timer_inner handles make_unbalanced_tree(1, 100_000_000, 1000) just as well:

| Cores | Baseline | Rayon (timer) |
|---|---|---|
| 1 | 538.35 ms | 670.90 ms |
| 2 | | 350.42 ms |
| 4 | | 197.66 ms |
| 8 | | 168.81 ms |
| 16 | | 166.30 ms |
| 32 | | 169.40 ms |

In fact, I would expect it to be able to handle any shape of unbalanced binary tree just about as well as the Spice version. The iter counter doesn't keep track of tree depth; it keeps track of the total number of nodes that the depth-first search has visited so far. iter ends up being a good proxy for the total amount of time spent so far on a task because this problem consists of a large number of subtasks with mostly identical cost, and I would expect this strategy to work well on any problem with that property. Of course, more heterogeneous tasks would require more manual effort to design heuristics, and not having to do so is the value proposition of heartbeat scheduling.

judofyr commented 3 months ago

Ah, yes of course, you're right. Using the numbers from the benchmarks: this ensures that we're doing 10,000 of those ~7 ns operations (roughly 70 µs of useful work) before adding ~15 ns of overhead, so the overhead is on the order of 0.02%. That makes it negligible.

(I think you're missing a `node.value +` in the `return sum_timer_inner(pool, left, iter) + sum_timer_inner(pool, right, iter);` branch, but that doesn't matter much.)