Closed anforowicz closed 9 months ago
I think this may be worth landing - the performance improvements can be seen everywhere except the Transparency testcase.
I don't really understand what happens in the Transparency testcase. I can recover some of the regression with an additional commit in https://github.com/image-rs/image-png/compare/master...anforowicz:image-png:cap-incremental-decompression-size. Maybe we should also land this other commit (as a separate PR?). OTOH after this other commit the performance gains elsewhere are still there, but slightly smaller. I don't know how to decide between 1) landing just this commit/PR vs 2) landing this commit + the other commit. I think I am leaning toward just landing this commit/PR for now.
FWIW, I've also tested this commit on some small images from the top-500 benchmark described in https://github.com/image-rs/image-png/discussions/416#discussioncomment-7436871. I've measured 10% - 15% performance improvements for those small images (Transparency.jpg is also small, but apparently being small is not the deciding factor).
The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.
While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set is_done
and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.
Thanks for looking!
The Transparency regression is a bit weird, but I'm OK with merging this given all the other improvements.
Ack.
I won't have time for it this week, but I plan to eventually try to understand https://github.com/image-rs/image-png/compare/master...anforowicz:image-png:cap-incremental-decompression-size better:
Read
/BufRead
regressions and I don't want to advocate blindly for the other commit without understanding why it helps on its own)While looking at this PR, I noticed we're not properly handling images with extra data. Specifically, it is possible that there are additional bytes in the IDATs beyond the deflate checksum. If that happens fdeflate will set
is_done
and stop consuming additional input, and the ZlibStream then no-longer needs to maintain the lookback buffer. In libpng at least, this case is considered a "benign error" and by default doesn't prevent the image from being decoded.
Ack.
Question: is this something independent from the PR-under-review? Or is this a problem that you think the PR-under-review may be introducing or making worse?
I can help with adding a test for this and checking what happens. I think the new test can mostly mimic test_idat_bigger_than_image_size_from_ihdr
(but instead of writing an IDAT
chunk with a bigger image, it can just write the right-size-image and append some extraneous bytes).
BTW, I forgot to say that I hope that the PR-under-review doesn't get in the way for your idea from https://github.com/image-rs/image-png/pull/429#issuecomment-1875718076 of "changing ZlibStream to directly write into a caller provided buffer instead of using its own out_buffer
". Avoiding the 32kB lag seems orthogonal/independent from this idea. Changing how out_buffer
compaction works is somewhat related, but hopefully still doesn't get in the way too much.
FWIW, I mostly got motivated to look here because of how the Read
/BufRead
PR happens to interfere with L1 cache, hardware prefetches, etc. (https://github.com/image-rs/image-png/pull/427)
Just a quick follow-up to https://github.com/image-rs/image-png/pull/447#discussion_r1443537365 - I re-measured the performance and the results were mostly the same (i.e. as before addressing the comment / as with truncating and re-zero-ing out out_buffer
). For the record, the results against the landed commit look as follows:
decode/kodim23.png: [-3.1886% -3.0384% -2.8277%] [-3.6101% -3.5022% -3.3806%] [-3.9165% -3.7417% -3.6003%]
decode/kodim07.png: [-1.1782% -1.0010% -0.8080%] [-1.8537% -1.7999% -1.7421%] [-1.5936% -1.3810% -1.2046%]
decode/kodim02.png: [-1.9088% -1.7352% -1.5557%] [-2.0096% -1.9522% -1.8895%] [-2.2580% -2.0380% -1.8558%]
decode/kodim17.png: [-0.9286% -0.7875% -0.5969%] [-0.6255% -0.5532% -0.4817%] [-1.5831% -1.3538% -1.1580%]
decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png: [-1.9583% -1.6850% -1.3973%] [-1.5138% -1.3103% -1.1106%] [-1.4271% -1.1675% -0.8888%]
decode/Transparency.png: [+14.271% +14.950% +15.689%] [+14.740% +15.057% +15.443%] [+13.514% +14.260% +15.206%]
generated-noncompressed-4k-idat/8x8.png: [-1.1170% -0.8610% -0.6001%] [-3.3542% -3.2026% -3.0460%] [-6.8021% -6.6249% -6.4542%]
generated-noncompressed-4k-idat/128x128.png: [-62.153% -62.038% -61.913%] [-60.042% -59.859% -59.711%] [-61.204% -61.130% -61.037%]
generated-noncompressed-4k-idat/2048x2048.png: [-67.161% -67.121% -67.082%] [-67.566% -67.470% -67.370%] [-67.706% -67.663% -67.614%]
generated-noncompressed-4k-idat/12288x12288.png: [-63.990% -63.956% -63.921%] [-63.580% -63.504% -63.427%] [-63.302% -63.266% -63.234%]
generated-noncompressed-64k-idat/128x128.png: [+28.550% +28.972% +29.330%] [+45.749% +46.366% +46.873%] [-13.531% -13.292% -13.085%]
generated-noncompressed-64k-idat/2048x2048.png: [-13.441% -12.212% -11.403%] [-22.578% -22.234% -21.879%] [-20.424% -20.222% -19.999%
generated-noncompressed-64k-idat/12288x12288.png: [-45.895% -45.826% -45.749%] [-41.836% -41.724% -41.617%] [-41.325% -41.252% -41.165%]
generated-noncompressed-2g-idat/2048x2048.png: [-40.284% -39.676% -39.277%] [-38.249% -37.947% -37.701%] [-42.957% -42.806% -42.651%]
generated-noncompressed-2g-idat/12288x12288.png: [-38.550% -38.469% -38.388%] [-37.827% -37.691% -37.509%] [-37.335% -37.269% -37.207%]
Avoiding 32kB decompression lag
Before this commit, decompressed data would be accumulated in
ZlibStream::out_buffer
and returned viaimage_data
with 32kB lag corresponding toCHUNCK_BUFFER_SIZE
:32kB is a typical size of L1 cache, so the lag would mean that the data passed to
image_data.extend(...)
would be already cold and evicted from the L1 cache.This commit avoids the lag by always returning into
image_data
all the data fromout_buffer
(i.e. data up toout_pos
):Compacting less often
The changes above mean that
Vec::drain
no longer compactsout_buffer
. Therefore this commit also refactors how this compaction works.Before this commit, not-yet-returned data would be shifted to the beginning of
out_buffer
every timetransfer_finished_data
is called. This could potentially mean that for 1 returned byte, N bytes have to be copied during compaction.After this commit, compaction is only done when the compaction cost if offset by many read bytes - for 3 returned bytes 1 byte has to be copied during compaction.
Performance impact
The commit has a positive impact on performance, except for:
decode/Transparency.png
- regression between 15% and 20% is reported in 3-out-of-3 measurements.decode/kodim17.png
- a regression of 2.1% has been reported in 1-out-of-3 measurements (an improvement of 0.6% - 1.13% has been reported in the other 2-out-of-3 measurements).generated-noncompressed-64k-idat/128x128.png
- a regression of 25% has been reported in 1-out-of-3 measurements (an improvement of 21% - 29% has been reported in the other 2-out-of-3 measurements).The results below have been gathered by running the
decoder
benchmark. First a baseline was saved before this commit, and then a comparison was done after the commit. This (the baseline + the comparison) was repeated a total of 3 times. All results below are for the relative impact on the runtime. All results are with p = 0.00 < 0.05.decode/kodim23.png:
decode/kodim07.png:
decode/kodim02.png:
decode/kodim17.png:
decode/Lohengrin_-_Illustrated_Sporting_and_Dramatic_News.png:
decode/Transparency.png:
generated-noncompressed-4k-idat/8x8.png:
generated-noncompressed-4k-idat/128x128.png:
generated-noncompressed-4k-idat/2048x2048.png:
generated-noncompressed-4k-idat/12288x12288.png:
generated-noncompressed-64k-idat/128x128.png:
generated-noncompressed-64k-idat/2048x2048.png:
generated-noncompressed-64k-idat/12288x12288.png:
generated-noncompressed-2g-idat/2048x2048.png:
generated-noncompressed-2g-idat/12288x12288.png: