image-rs / image-webp

Apache License 2.0
43 stars 9 forks source link

Parallelization opportunities #120

Open Shnatsel opened 1 month ago

Shnatsel commented 1 month ago

Part of the gap observed in #119 is due to webp-animation using multi-threading. libwebp also has a multi-threading option for still images. But right now image-webp is single-threaded.

I think there are some opportunities for easy wins from parallelization.

Decoding frames in parallel

Each frame is contained in its own ANMF chunk which specifies the length, so we should be able to read the raw bytes and spawn the decoding on a thread pool quite easily.

Complication: this may dramatically increase memory usage on machines with lots of cores, or if the decoding of the frames performs faster than compositing them and uncomposited frames pile up. We will need to introduce some sort of backpressure mechanism so that it doesn't happen. For example, decoding is only allowed to have at most X frames or at most Y megapixels in flight. We know all the frame dimensions so it's not too hard to enforce, but still adds some complexity to the code as opposed to just slapping .par_iter() on it.

Compositing frames in parallel

In the general case compositing is inherently sequential: it requires the previous frame to be fully composited before the next one can start. However, each row in the image can be processed independently, so we should be able to just use .par_chunks_mut() on the buffer and composite each row separately.

The profile in #119 shows that we're spending a lot of time just doing alpha blending, so this could translate to significant performance gains.

Still images?

libwebp also has optional multi-threading support for still images. I've benchmarked dwebp compared to dwebp -mt on a lossy image and only saw a 20% improvement in performance. The single-threaded bottleneck in #71 also makes it seem like we won't get much out of single-image parallelism. I don't think it's worth the trouble for static images.

Shnatsel commented 3 weeks ago

I think I've overcomplicated parallelizing animations. Having just two threads - one for decoding, one for compositing - is going to be almost as good as it's going to get.

Since compositing is inherently sequential, decoding many frames at the same time wouldn't get us improved performance. They will be consumed one by one anyway, so if we cap the amount of frames held in memory at the same time the pipeline is going to fill up to that amount and then process one frame at a time anyway.

And compositing in parallel isn't super useful because it only needs to go as fast as decoding, and beyond that we won't get any gains because the decoding thread becomes the bottleneck.

fintelia commented 3 weeks ago

if we cap the amount of frames held in memory at the same time the pipeline is going to fill up to that amount and then process one frame at a time anyway.

This isn't true. If decoding is 10x slower than compositing then once the pipeline fills up, we'll start by pulling one frame to composite and starting one frame to decode. But nine additional frames will be pulled from the pipeline for compositing before the first additional frame decode finishes.