image-rs / image-png

PNG decoding and encoding library in pure Rust
https://docs.rs/png
Apache License 2.0

Avoid 32kB decompression lag + compact less often. #447

Closed anforowicz closed 9 months ago

anforowicz commented 9 months ago

Avoiding 32kB decompression lag

Before this commit, decompressed data would be accumulated in ZlibStream::out_buffer and returned via image_data with a 32kB lag corresponding to CHUNCK_BUFFER_SIZE:

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    // Keep the last CHUNCK_BUFFER_SIZE (32kB) bytes in `out_buffer` and only
    // drain (and return via `image_data`) the data that comes before them.
    let safe = self.out_pos.saturating_sub(CHUNCK_BUFFER_SIZE);
    image_data.extend(self.out_buffer.drain(..safe));
    ...
```

32kB is a typical L1 cache size, so the lag would mean that the data passed to image_data.extend(...) would already be cold, having been evicted from the L1 cache.

This commit avoids the lag by always transferring into image_data all of the data accumulated in out_buffer so far (i.e. data up to out_pos):

```
fn transfer_finished_data(&mut self, image_data: &mut Vec<u8>) -> usize {
    // Copy everything that has been decompressed but not yet returned
    // (i.e. the bytes between `read_pos` and `out_pos`) into `image_data`.
    let transferred = &self.out_buffer[self.read_pos..self.out_pos];
    image_data.extend_from_slice(transferred);
    self.read_pos = self.out_pos;
    ...
```

Compacting less often

The changes above mean that Vec::drain no longer compacts out_buffer. Therefore this commit also refactors how this compaction works.

Before this commit, not-yet-returned data would be shifted to the beginning of out_buffer every time transfer_finished_data is called. In the worst case this could mean that for every 1 returned byte, N not-yet-returned bytes have to be copied during compaction.

After this commit, compaction is only done when its cost is offset by a large number of already-returned bytes - roughly 1 byte has to be copied during compaction for every 3 returned bytes.
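To make that trade-off concrete, here is a minimal, self-contained sketch of such a compaction policy. The struct only mirrors the fields used in the snippets above, and the 3:1 threshold and the LOOKBACK constant (the 32kB window that the decompressor still needs for back-references) are illustrative assumptions, not the exact code in this commit:

```
// Illustrative sketch only - not the actual ZlibStream code.
const LOOKBACK: usize = 32 * 1024; // deflate back-references can reach up to 32kB back

struct OutBuf {
    out_buffer: Vec<u8>, // decompressed data; also serves as the lookback window
    read_pos: usize,     // data before this has already been copied into image_data
    out_pos: usize,      // data before this has been produced by the decompressor
}

impl OutBuf {
    fn compact_if_needed(&mut self) {
        // Bytes that must stay in the buffer: everything not yet returned,
        // plus the trailing LOOKBACK window of already-produced output.
        let keep_from = self.out_pos.saturating_sub(LOOKBACK).min(self.read_pos);
        let kept = self.out_pos - keep_from;
        // Only compact when the discardable prefix is several times bigger than
        // the data that has to be copied (here: ~3 discarded bytes per copied
        // byte), so the amortized copy cost per returned byte stays small.
        if keep_from >= 3 * kept {
            self.out_buffer.copy_within(keep_from..self.out_pos, 0);
            self.out_pos -= keep_from;
            self.read_pos -= keep_from;
        }
    }
}

fn main() {
    // Example: 256kB produced so far, 200kB of it already returned to image_data.
    let mut buf = OutBuf {
        out_buffer: vec![0u8; 256 * 1024],
        read_pos: 200 * 1024,
        out_pos: 256 * 1024,
    };
    buf.compact_if_needed();
    // Only the 56kB that must be kept (the unread data, which here contains the
    // whole lookback window) was copied to the front of the buffer.
    assert_eq!(buf.out_pos, 56 * 1024);
    assert_eq!(buf.read_pos, 0);
    println!("compacted: out_pos={}, read_pos={}", buf.out_pos, buf.read_pos);
}
```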

Performance impact

The commit has a positive impact on performance, with a few exceptions visible in the results below (most notably the Transparency testcase).

The results below have been gathered by running the decoder benchmark. First a baseline was saved before this commit, and then a comparison against that baseline was run after the commit. This (saving the baseline + running the comparison) was repeated a total of 3 times. All results below show the relative impact on runtime; all of them have p = 0.00 < 0.05.

anforowicz commented 9 months ago

I think this may be worth landing - the performance improvements can be seen everywhere except the Transparency testcase.

I don't really understand what happens in the Transparency testcase. I can recover some of the regression with an additional commit in https://github.com/image-rs/image-png/compare/master...anforowicz:image-png:cap-incremental-decompression-size. Maybe we should also land this other commit (as a separate PR?). OTOH after this other commit the performance gains elsewhere are still there, but slightly smaller. I don't know how to decide between 1) landing just this commit/PR vs 2) landing this commit + the other commit. I think I am leaning toward just landing this commit/PR for now.

FWIW, I've also tested this commit on some small images from the top-500 benchmark described in https://github.com/image-rs/image-png/discussions/416#discussioncomment-7436871. I've measured 10% - 15% performance improvements for those small images (Transparency.png is also small, but apparently being small is not the deciding factor).

fintelia commented 9 months ago

The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens, fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

anforowicz commented 9 months ago

Thanks for looking!

> The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.

Ack.

I won't have time for it this week, but I plan to eventually try to understand https://github.com/image-rs/image-png/compare/master...anforowicz:image-png:cap-incremental-decompression-size better:

> While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens, fdeflate will set is_done and stop consuming additional input, and the ZlibStream then no longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.

Ack.

Question: is this something independent from the PR-under-review? Or is this a problem that you think the PR-under-review may be introducing or making worse?

I can help with adding a test for this and checking what happens. I think the new test can mostly mimic test_idat_bigger_than_image_size_from_ihdr (but instead of writing an IDAT chunk with a bigger image, it can just write a right-size image and append some extraneous bytes).
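Something along these lines - just a sketch of the shape of such a test, where write_chunk is a stand-in for the existing chunk-writing test helpers and the crc32fast / miniz_oxide crates are only used here to keep the sketch self-contained:

```
use std::io::Cursor;

// Hypothetical stand-in for the existing chunk-writing test helpers:
// length (big-endian), chunk type, data, CRC over type + data.
fn write_chunk(out: &mut Vec<u8>, kind: &[u8; 4], data: &[u8]) {
    out.extend_from_slice(&(data.len() as u32).to_be_bytes());
    out.extend_from_slice(kind);
    out.extend_from_slice(data);
    let mut crc = crc32fast::Hasher::new();
    crc.update(kind);
    crc.update(data);
    out.extend_from_slice(&crc.finalize().to_be_bytes());
}

fn main() {
    // 8x8, 8-bit grayscale image: each row is 1 filter byte (0 = None) + 8 pixel bytes.
    let raw: Vec<u8> = (0..8u8)
        .flat_map(|_| std::iter::once(0u8).chain(0..8u8))
        .collect();
    let mut idat = miniz_oxide::deflate::compress_to_vec_zlib(&raw, 6);
    idat.extend_from_slice(&[0xAA; 7]); // extraneous bytes *after* the zlib stream

    let mut png_bytes = Vec::new();
    png_bytes.extend_from_slice(&[0x89, b'P', b'N', b'G', b'\r', b'\n', 0x1A, b'\n']);
    let mut ihdr = Vec::new();
    ihdr.extend_from_slice(&8u32.to_be_bytes()); // width
    ihdr.extend_from_slice(&8u32.to_be_bytes()); // height
    ihdr.extend_from_slice(&[8, 0, 0, 0, 0]); // bit depth, grayscale, compression, filter, interlace
    write_chunk(&mut png_bytes, b"IHDR", &ihdr);
    write_chunk(&mut png_bytes, b"IDAT", &idat);
    write_chunk(&mut png_bytes, b"IEND", &[]);

    // The interesting question: does this decode successfully (libpng's "benign
    // error" behavior) or does it fail because of the trailing garbage?
    let decoder = png::Decoder::new(Cursor::new(png_bytes));
    let mut reader = decoder.read_info().unwrap();
    let mut buf = vec![0; reader.output_buffer_size()];
    let result = reader.next_frame(&mut buf);
    println!("decode result: {:?}", result.map(|info| (info.width, info.height)));
}
```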


BTW, I forgot to say that I hope that the PR-under-review doesn't get in the way of your idea from https://github.com/image-rs/image-png/pull/429#issuecomment-1875718076 of "changing ZlibStream to directly write into a caller provided buffer instead of using its own out_buffer". Avoiding the 32kB lag seems orthogonal to / independent of this idea. Changing how out_buffer compaction works is somewhat related, but hopefully still doesn't get in the way too much.

FWIW, I mostly got motivated to look here because of how the Read/BufRead PR happens to interfere with the L1 cache, hardware prefetching, etc. (https://github.com/image-rs/image-png/pull/427)

anforowicz commented 8 months ago

Just a quick follow-up to https://github.com/image-rs/image-png/pull/447#discussion_r1443537365 - I re-measured the performance and the results were mostly the same (i.e. the same as before addressing the comment / as with truncating and re-zeroing out_buffer). For the record, the results against the landed commit look as follows:

decode/kodim23.png: [-3.1886% -3.0384% -2.8277%] [-3.6101% -3.5022% -3.3806%] [-3.9165% -3.7417% -3.6003%]

decode/kodim07.png: [-1.1782% -1.0010% -0.8080%] [-1.8537% -1.7999% -1.7421%] [-1.5936% -1.3810% -1.2046%]

decode/kodim02.png: [-1.9088% -1.7352% -1.5557%] [-2.0096% -1.9522% -1.8895%] [-2.2580% -2.0380% -1.8558%]

decode/kodim17.png: [-0.9286% -0.7875% -0.5969%] [-0.6255% -0.5532% -0.4817%] [-1.5831% -1.3538% -1.1580%]

decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png: [-1.9583% -1.6850% -1.3973%] [-1.5138% -1.3103% -1.1106%] [-1.4271% -1.1675% -0.8888%]

decode/Transparency.png: [+14.271% +14.950% +15.689%] [+14.740% +15.057% +15.443%] [+13.514% +14.260% +15.206%]

generated-noncompressed-4k-idat/8x8.png: [-1.1170% -0.8610% -0.6001%] [-3.3542% -3.2026% -3.0460%] [-6.8021% -6.6249% -6.4542%]

generated-noncompressed-4k-idat/128x128.png: [-62.153% -62.038% -61.913%] [-60.042% -59.859% -59.711%] [-61.204% -61.130% -61.037%]

generated-noncompressed-4k-idat/2048x2048.png: [-67.161% -67.121% -67.082%] [-67.566% -67.470% -67.370%] [-67.706% -67.663% -67.614%]

generated-noncompressed-4k-idat/12288x12288.png: [-63.990% -63.956% -63.921%] [-63.580% -63.504% -63.427%] [-63.302% -63.266% -63.234%]

generated-noncompressed-64k-idat/128x128.png: [+28.550% +28.972% +29.330%] [+45.749% +46.366% +46.873%] [-13.531% -13.292% -13.085%]

generated-noncompressed-64k-idat/2048x2048.png: [-13.441% -12.212% -11.403%] [-22.578% -22.234% -21.879%] [-20.424% -20.222% -19.999%]

generated-noncompressed-64k-idat/12288x12288.png: [-45.895% -45.826% -45.749%] [-41.836% -41.724% -41.617%] [-41.325% -41.252% -41.165%]

generated-noncompressed-2g-idat/2048x2048.png: [-40.284% -39.676% -39.277%] [-38.249% -37.947% -37.701%] [-42.957% -42.806% -42.651%]

generated-noncompressed-2g-idat/12288x12288.png: [-38.550% -38.469% -38.388%] [-37.827% -37.691% -37.509%] [-37.335% -37.269% -37.207%]