Parallelism without Rayon

Shnatsel commented 3 years ago

Rayon introduces a lot of unsafe code to the dependency graph - at about 2000 unsafe expressions according to cargo-geiger, accounting for over 80% of all unsafe code. This undermines the value proposition of the safe Rust decoder: the entire point of it is implementing a decoder in safe code; if you're OK with depending on unsafe code that's probably fine, you should just use libjpeg-turbo.

It would be nice to experiment and see if we can match the performance of Rayon without pulling in all of its dependency tree.

Right now jpeg-decoder uses Rayon for just one line of code: https://github.com/image-rs/jpeg-decoder/blob/97742b3fd692e04eaae4f368110275c7151eacf8/src/decoder.rs#L853

It should be fairly trivial to replicate what Rayon does here using manually spawned threads and an MPMC channel to distribute work items to the threads. flume is a 100% safe code implementation of an MPMC channel that's on par with crossbeam in the use cases that are relevant here. I'm not sure if that's the right approach, but it's probably worth a shot.

Shnatsel commented 3 years ago

I've taken an experimental stab at this. It seems we'll still need crossbeam-utils for scoped threads, but we could still get rid of crossbeam-deque and rayon dependencies which are ~500 unsafe expressions each.

Shnatsel commented 3 years ago

Which might be a very good thing, since crossbeam-deque is pulling some dubiously legal tricks: https://github.com/rayon-rs/rayon/issues/812

and all major single-threaded deque implementations in Rust had serious memory bugs, and crossbeam-deque has even less chance to be correct than the much simpler single-threaded impls...

Shnatsel commented 3 years ago

My initial implementation - full of unwrap(), but probably functional - can be found here: https://github.com/Shnatsel/jpeg-decoder/tree/rayonless

So far it's decoding https://commons.wikimedia.org/wiki/File:Sun_over_Lake_Hawea,_New_Zealand.jpg in 332ms as opposed to 308ms with Rayon, but the code is quite naive so far.

Profile on the same image: https://share.firefox.dev/36T6adS

This shows that the worker threads are under-utilized. Dividing the work into larger chunks should do the trick; I believe that's what rayon does as well.

Shnatsel commented 3 years ago

The addition of some naive batching has put the custom impl in line with Rayon, and even made it faster by a tiny bit - 305ms vs 308ms with Rayon. That means the approach is viable!

Code: https://github.com/image-rs/jpeg-decoder/compare/master...Shnatsel:rayonless?expand=1

fintelia commented 3 years ago

This is exciting to see! It looks like it will be possible to avoid the rayon dependency without too much added complexity and without incurring a performance regression.

Shnatsel commented 3 years ago

I've tried adding some heuristics for batching and choosing the number of CPUs, but so far all it did was regress my benchmarks across the board

image-rs / jpeg-decoder

Parallelism without Rayon #171