HeroicKatora opened 2 years ago
According to the profile, this did actually parallelize things, but now it requires Huffman decoding to complete before Rayon kicks in.
The performance results are a toss-up: on some images this approach is 20% faster; on others it is 25% slower. See corrected data below.
According to the profile, Huffman decoding completes before we start performing IDCT - in contrast to the rayonless version. Is there perhaps a way to combine the pipelining and Rayon's low-overhead scheduling? Say, spawn worker threads up front and have them pick tasks off the task queue immediately?
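Roughly the shape I have in mind (a sketch only, with illustrative names rather than the actual decoder code):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical task type standing in for one block's worth of IDCT work.
struct IdctTask {
    coefficients: Vec<i16>,
}

fn pipelined_decode(tasks: Vec<IdctTask>) {
    let (tx, rx) = mpsc::channel::<IdctTask>();

    // Worker started up front; it picks tasks off the queue as soon as they
    // arrive, overlapping IDCT with the Huffman decoding still in progress.
    let worker = thread::spawn(move || {
        for task in rx {
            // perform_idct(&task.coefficients) would run here
            let _ = task;
        }
    });

    // The "Huffman decoding" loop keeps producing tasks while the worker runs.
    for task in tasks {
        tx.send(task).unwrap();
    }
    drop(tx); // close the channel so the worker's loop ends
    worker.join().unwrap();
}
```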
We can't, or at the very least I don't see how. rayon doesn't offer something like a JoinHandle. We can either wait on the task immediately in the method that schedules it (rayon::scope etc.), or we schedule it at a later point. It sucks.
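To illustrate, here is a minimal sketch of the two shapes rayon does give us; the channel at the end is just a stand-in I'm using for demonstration, not something this PR does:

```rust
fn schedule_idct(blocks: Vec<Vec<i16>>) {
    // Option 1: rayon::scope waits for every spawned task before returning,
    // so scheduling and joining happen in the same call.
    rayon::scope(|s| {
        for block in &blocks {
            s.spawn(move |_| {
                // perform_idct(block) would run here
                let _ = block;
            });
        }
    }); // implicit join point; there is no handle to keep around

    // Option 2: rayon::spawn is fire-and-forget ('static), with no JoinHandle;
    // completion has to be signalled manually, e.g. through a channel.
    let (tx, rx) = std::sync::mpsc::channel();
    rayon::spawn(move || {
        // ... do the work ...
        let _ = tx.send(());
    });
    let _ = rx.recv(); // crude stand-in for JoinHandle::join
}
```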
We can try adjusting the constants here, or adding some specific serialization points, though.
Wait, I have actually measured it incorrectly. I need to turn off all the OTHER parallelization that Rayon does to make an apples-to-apples comparison. It might very well be that this approach with IDCT on Rayon is always slower than the DIY parallelization.
Actually, I have a devious and plainly stupid idea as well regarding join handles. I had assumed that scopes can only end in code by stepping out (side note: that was a long-held error in my thought pattern, in a totally unrelated topic as well), but with async fn that is no longer true. If we could convince rayon that an async_scope is useful, then we could indeed have join handles (basically) for free: hold on to the future that will finalize and terminate the scope when dropped.
Apart from being a very rough idea that I haven't presented well... that's interesting.
Ah, turning off the parallel color conversion (also gated on the rayon feature) makes this PR 25% slower even on its previously "good" cases, and 30% slower on the large CMYK test image.
I'm afraid this approach is strictly inferior to manual parallelization.
If parallelization overhead is a problem for small images, could we perhaps simply skip it heuristically based on the expected image size?
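Something along these lines, purely as a sketch; the threshold value is made up and would need benchmarking:

```rust
/// Illustrative threshold; the real cut-off would need to be measured.
const PARALLEL_PIXEL_THRESHOLD: usize = 256 * 256;

fn should_parallelize(width: u16, height: u16) -> bool {
    (width as usize) * (height as usize) >= PARALLEL_PIXEL_THRESHOLD
}

fn decode_dispatch(width: u16, height: u16) {
    if should_parallelize(width, height) {
        // hand the component tasks to the rayon-based worker
    } else {
        // decode on the current thread and skip the scheduling overhead
    }
}
```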
Intends to address the issue of effectively serialized decode (#245), where all tasks end up being executed on the main thread instead of being distributed to other workers.
We had neglected that most work is scheduled synchronously (append_row, e.g. in decoder.rs:903, instead of append_rows). This meant most tasks were executed with the immediate strategy.
The change pushes all items into a bounded task queue that is drained and actively worked on when it reaches its maximum capacity, as well as whenever any component result is requested. This is in contrast to the std-thread multithreaded worker, where items are worked on while decoding is in progress but task queueing itself has more overhead.
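As a rough sketch of that strategy (made-up names and capacity, not the PR's actual code):

```rust
/// Illustrative capacity; the real constant would need tuning.
const TASK_QUEUE_CAP: usize = 16;

struct BoundedTaskQueue<T> {
    tasks: Vec<T>,
}

impl<T: Send> BoundedTaskQueue<T> {
    fn new() -> Self {
        BoundedTaskQueue { tasks: Vec::with_capacity(TASK_QUEUE_CAP) }
    }

    /// Queue one unit of work; once the capacity maximum is reached, the
    /// whole batch is handed to rayon instead of running it immediately
    /// on the caller's thread.
    fn push(&mut self, task: T, run: impl Fn(T) + Sync) {
        self.tasks.push(task);
        if self.tasks.len() >= TASK_QUEUE_CAP {
            self.drain(&run);
        }
    }

    /// Also called whenever a component result is requested.
    fn drain(&mut self, run: &(impl Fn(T) + Sync)) {
        use rayon::prelude::*;
        let batch: Vec<T> = self.tasks.drain(..).collect();
        batch.into_par_iter().for_each(|t| run(t));
    }
}
```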
cc: @Shnatsel