image-rs / jpeg-decoder

JPEG decoder written in Rust
Apache License 2.0
150 stars 87 forks source link

Threaded rayon #246

Open HeroicKatora opened 2 years ago

HeroicKatora commented 2 years ago

Intends to address the issue of effectively serialized decode (#245), where all tasks end up being executed on the main thread instead of being distributed into other workers.

We had neglected that most work is scheduled in sync (apppend_row such as in decoder.rs:903 instead of apppend_rows). This meant most were executed with an immediate strategy.

The change pushes all items into a bounded task queue that is emptied and actively worked on when it reaches a capacity maximum, as well as when any component result is requested. This is in contrast to std::multithreading where items are worked on while decoding is in progress but task queueing itself has more overhead.

cc: @Shnatsel

Shnatsel commented 2 years ago

According to the profile, this did actually parallelize things, but now it requires Huffman decoding to complete before Rayon kicks in.

The performance results are a toss-up: on some images this approach is 20% faster; on others it is 25% slower. See corrected data below.

According to the profile, Huffman decoding completes before we start performing IDCT - in contrast to the rayonless version. Is there perhaps a way to combine the pipelining and Rayon's low-overhead scheduling? Say, spawn worker threads up front and have them pick tasks off the task queue immediately?

HeroicKatora commented 2 years ago

Say, spawn worker threads up front and have them pick tasks off the task queue immediately?

We can't, or at the very least I don't see how. rayon doesn't offer something a JoinHandle. We can either wait on the task immediately in the method that schedules it (rayon::scope etc.) or we schedule it at a later date. It sucks.

We can try adjusting the constants here, or adding specifically some serialization points though.

Shnatsel commented 2 years ago

Wait, I have actually measured it incorrectly. I need to turn off all the OTHER parallelization that Rayon does to make an apples-to-apples comparison. It might very well be that this approach with IDCT on Rayon is always slower than the DIY parallelization.

HeroicKatora commented 2 years ago

Actually, I have a devious and plainly stupid idea as well regarding join handles. I had assumed that scopes can only end in code by stepping out (side note: that was a longheld error in my thought pattern, in an totally unrelated topic as well) but with async fn that is no longer true. If we could convince rayon that async_scope is useful then we can indeed have join handles for (basically) free—hold on to the future that will finalize and terminate the scope when dropped.

Apart from being a very rough idea that I haven't present well.. that's …. interesting.

Shnatsel commented 2 years ago

Ah, turning off parallel color conversion also gated on rayon feature makes this PR 25% slower even on its previously "good" cases, and 30% slower on the large CMYK test image.

I'm afraid this approach strictly inferior to manual parallelization.

If parallelization overhead is a problem for small images, could we perhaps simply skip it heuristically based on the expected image size?