Shnatsel opened this issue 4 days ago
I think it is worth taking a moment to think about precisely what use case we want to address.
The `fdeflate` parallelism was able to increase encoding speed from a couple hundred million pixels per second to about one billion pixels per second. It gets the same compression ratio as the single-threaded version -- roughly on par with QOI. Helpful if you have lots of image data to encode, but only if you don't care much about the compression ratio.
The `gzp` approach gets a compression ratio almost as good as the single-threaded one, but each core operates on 128 KB chunks of uncompressed image data. On small images there'd be no speedup at all, and even big images are probably going to be much slower than single-threaded `fdeflate`.
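For illustration, here is a minimal sketch of the pigz/`gzp`-style chunking idea, written with `rayon` and `flate2` rather than `gzp` itself; the 128 KB chunk size comes from the discussion above, while the function name and the use of multi-member gzip output are my assumptions. Producing the single zlib stream that PNG's IDAT needs additionally requires stitching the deflate blocks together and combining the Adler-32 checksums, which is what `gzp` handles internally.

```rust
use flate2::{write::GzEncoder, Compression};
use rayon::prelude::*;
use std::io::Write;

/// Hypothetical illustration: compress `data` in independent 128 KB chunks,
/// one chunk per rayon task. Each chunk becomes a complete gzip member, and
/// concatenating the members yields a valid multi-member gzip stream
/// (the same trick pigz and gzp's mgzip mode use). PNG's IDAT needs a single
/// zlib stream instead, which requires extra stitching not shown here.
fn compress_chunked_parallel(data: &[u8]) -> std::io::Result<Vec<u8>> {
    const CHUNK_SIZE: usize = 128 * 1024; // chunk size assumed from the discussion

    let members: Vec<Vec<u8>> = data
        .par_chunks(CHUNK_SIZE)
        .map(|chunk| {
            let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
            encoder.write_all(chunk)?;
            encoder.finish()
        })
        .collect::<std::io::Result<Vec<_>>>()?;

    // Concatenate the independently compressed members in order.
    Ok(members.concat())
}
```

The point this makes concrete: each chunk is compressed with no knowledge of the others, so an image smaller than one chunk sees no speedup at all, and losing cross-chunk back-references is part of why the ratio ends up slightly worse than a single-threaded stream.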
Streaming encode is another question. It is probably doable to implement, but would be more effort. Is it worth trying to design for it at the same time, or only if there's interest later?
The use case I had in mind was AI image generators. They typically run on beefy machines and need to encode the generated images losslessly, both to store later and to transmit over a network via a web UI. So both encoding speed and compression ratio matter.
Admittedly I'm not building one of those, but I think it is applicable to anything that needs a trade-off between encoding speed and compression ratio. That is, everything that doesn't crank compression up to 9.
Regarding streaming: it only introduces issues for filtering, and I'm not convinced that parallelizing filtering is even profitable. We can implement parallel zlib compression first, and then evaluate whether parallel filtering helps at all. If we find that it does, we can initially implement it for full-image encoding only, and support streaming later if/when someone asks.
That's helpful framing. It probably means images in the single-digit megapixel range, so enough to keep a modest number of cores fed with 128 KB chunks of raw data, but not so much that there's any concern about keeping one (or even several) copies of the full image in memory.
In this scenario, there might be some control over which image format to use. If so, it may make more sense to focus on the WebP encoder, which already does much better on the encoding speed vs. compression ratio tradeoff.
The reason to parallelize filtering is probably mostly to use memory bandwidth better. Right now, the code filters a row into a temporary buffer and then immediately compresses it; the filtered image data never exists in memory all at once. In #478, the parallelization applies to both steps, with each core allocating one row's worth of temporary buffer to filter into and then compress.
Every part of the decoding process for PNGs is inherently sequential, but there is some performance to be gained by parallelizing the encoding process.
## Zlib compression

We have two Zlib compressors that can be used: `fdeflate` and `flate2`.

https://github.com/image-rs/image-png/pull/478 implemented this for the `fdeflate` mode.

https://github.com/sstadick/crabz/ provides a parallelizing wrapper around `flate2`.

## Filtering
Filtering can be parallelized fairly easily, since it operates on each row individually. However, for the `up` and `paeth` filters each operation needs not one but two rows - the previous row and the current one.

If we had the entire image in memory at the same time this would be trivial, but implementing this in a streaming fashion may require somewhat more complex logic.
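To make the two-row dependency concrete, here is a rough sketch (not the actual encoder code; the function name and data layout are made up) of parallelizing per-row filtering with `rayon` when the whole image is in memory. The key observation is that the `up` filter reads the *raw* previous row, not the filtered one, so every row can be filtered independently:

```rust
use rayon::prelude::*;

/// Rough sketch, not the crate's real API: apply the `up` filter to every row
/// in parallel. Each output row depends only on the raw current and previous
/// rows, so all rows can be processed independently once the whole image is
/// in memory. A streaming encoder would instead have to keep the previous raw
/// row around while handing rows off to worker threads.
/// (Filter-type bytes and the other filter kinds are omitted for brevity.)
fn filter_up_parallel(raw: &[u8], row_len: usize) -> Vec<Vec<u8>> {
    raw.par_chunks(row_len)
        .enumerate()
        .map(|(y, row)| {
            // Previous raw row, or all zeroes for the first row (per the PNG spec).
            let prev: &[u8] = if y == 0 {
                &[]
            } else {
                &raw[(y - 1) * row_len..y * row_len]
            };
            row.iter()
                .enumerate()
                .map(|(x, &b)| b.wrapping_sub(*prev.get(x).unwrap_or(&0)))
                .collect()
        })
        .collect()
}
```

The `paeth` filter has the same shape of dependency (previous raw row plus the current one), so the same per-row parallelism applies; only a streaming encoder needs extra bookkeeping to keep the previous raw row alive.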
I don't expect huge gains here because filtering is extremely fast already, but it might become relevant once the rest of the process is parallelized.