Open Shnatsel opened 3 years ago
I've taken an experimental stab at this. It seems we'll still need crossbeam-utils for scoped threads, but we could still get rid of crossbeam-deque
and rayon
dependencies which are ~500 unsafe expressions each.
Which might be a very good thing, since crossbeam-deque
is pulling some dubiously legal tricks: https://github.com/rayon-rs/rayon/issues/812
and all major single-threaded deque implementations in Rust had serious memory bugs, and crossbeam-deque
has even less chance to be correct than the much simpler single-threaded impls...
My initial implementation - full of unwrap()
, but probably functional - can be found here: https://github.com/Shnatsel/jpeg-decoder/tree/rayonless
So far it's decoding https://commons.wikimedia.org/wiki/File:Sun_over_Lake_Hawea,_New_Zealand.jpg in 332ms as opposed to 308ms with Rayon, but the code is quite naive so far.
Profile on the same image: https://share.firefox.dev/36T6adS
This shows that the worker threads are under-utilized. Dividing the work into larger chunks should do the trick; I believe that's what rayon does as well.
The addition of some naive batching has put the custom impl in line with Rayon, and even made it faster by a tiny bit - 305ms vs 308ms with Rayon. That means the approach is viable!
Code: https://github.com/image-rs/jpeg-decoder/compare/master...Shnatsel:rayonless?expand=1
This is exciting to see! It looks like it will be possible to avoid the rayon dependency without too much added complexity and without incurring a performance regression.
I've tried adding some heuristics for batching and choosing the number of CPUs, but so far all it did was regress my benchmarks across the board
Rayon introduces a lot of unsafe code to the dependency graph - at about 2000 unsafe expressions according to cargo-geiger, accounting for over 80% of all unsafe code. This undermines the value proposition of the safe Rust decoder: the entire point of it is implementing a decoder in safe code; if you're OK with depending on unsafe code that's probably fine, you should just use libjpeg-turbo.
It would be nice to experiment and see if we can match the performance of Rayon without pulling in all of its dependency tree.
Right now jpeg-decoder uses Rayon for just one line of code: https://github.com/image-rs/jpeg-decoder/blob/97742b3fd692e04eaae4f368110275c7151eacf8/src/decoder.rs#L853
It should be fairly trivial to replicate what Rayon does here using manually spawned threads and an MPMC channel to distribute work items to the threads. flume is a 100% safe code implementation of an MPMC channel that's on par with crossbeam in the use cases that are relevant here. I'm not sure if that's the right approach, but it's probably worth a shot.