veluca93 / fpnge

Demo of a fast PNG encoder.
Apache License 2.0

Ideas for improving compression ratio? #7

Open animetosho opened 2 years ago

animetosho commented 2 years ago

fpnge is pretty fast, so I was wondering what, in theory, could be done to improve compression slightly without sacrificing too much speed. I was hoping you might have some ideas on the topic (regardless of if they'd ever be implemented).
Thoughts I had:

RLE

Currently only full vectors of zeroes are candidates for RLE.
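As I understand it, the current check amounts to a whole-register zero test, roughly like this sketch (assuming AVX2; is_zero_vector is a hypothetical helper, not fpnge's actual code):

#include <immintrin.h>
#include <cstdint>

// Hypothetical sketch (not fpnge's code): true if a whole 32-byte register is
// zero, i.e. the only kind of run the current RLE path considers.
static inline bool is_zero_vector(const uint8_t* p) {
  __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
  return _mm256_testz_si256(v, v);  // ZF is set when (v & v) == 0
}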

Unfortunately, due to how the Huffman table is currently constructed, the LZ77 symbols always get assigned rather long codes, making RLE unattractive (particularly with 0s) unless the sequence is rather long. Which leads to the next point.

Huffman encoding

The current implementation doesn't consider symbol counts for many symbols, as the Huffman table has to be specially crafted for the algorithm used. Trying to relax some length restrictions often leads to ComputeCodeLengths becoming rather slow - from what I can tell, it appears to be finding the optimal tree via a brute-force approach. Would you happen to know of a faster algorithm for this?

The other idea is to remove almost all length restrictions, which allows normal optimal Huffman tree algorithms to be used. I can't think of a quick way to allow the middle section (symbols 16-239) to have differing bit lengths, so you'd need to manipulate the symbol counts there a bit to ensure they all end up getting the same bit length. Otherwise, the unrestricted Huffman code could be handled with some slowdown, as, in addition to shuffle-lookup for the low bits, you'd need to do it for the high bits as well.

Another thing to consider might be having multiple deflate blocks and sampling the image at various points, to be able to adapt to changes which occur vertically down the image.
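To make that concrete, something along these lines could work (hypothetical sketch, not existing fpnge code): split the image into a few horizontal stripes, sample some rows in each stripe to build a per-stripe histogram, and emit one deflate block per stripe.

#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical sketch: build one byte histogram per horizontal stripe from a
// few sampled rows of the filtered data, so each stripe can get its own
// Huffman table and deflate block. rows_sampled_per_stripe must be > 0.
std::vector<std::array<uint64_t, 256>> SampleStripeHistograms(
    const uint8_t* filtered, size_t bytes_per_row, size_t height,
    size_t stripes, size_t rows_sampled_per_stripe) {
  std::vector<std::array<uint64_t, 256>> hist(stripes);
  for (auto& h : hist) h.fill(0);
  size_t rows_per_stripe = (height + stripes - 1) / stripes;
  for (size_t s = 0; s < stripes; ++s) {
    size_t first = s * rows_per_stripe;
    size_t last = std::min(height, first + rows_per_stripe);
    if (first >= last) break;
    size_t step = std::max<size_t>(1, (last - first) / rows_sampled_per_stripe);
    for (size_t y = first; y < last; y += step)
      for (size_t x = 0; x < bytes_per_row; ++x)
        hist[s][filtered[y * bytes_per_row + x]]++;
  }
  return hist;
}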

I suspect a more optimal Huffman code might generally give a low single-digit percentage gain most of the time, but at some speed cost.

LZ77

It seems that there's a fair bit of scope here to improve compression through better match finding, but this is also often the slowest part of deflate implementations. I can't think of a way to implement this without dropping to scalar code either.

Furthermore, I'm not sure that knowing the data is an image provides much benefit over generic LZ77 implementations - except that perhaps you could look for matches only at pixel boundaries instead of at every byte.

Any ideas here?

veluca93 commented 2 years ago

fpnge is pretty fast, so I was wondering what, in theory, could be done to improve compression slightly without sacrificing too much speed. I was hoping you might have some ideas on the topic (regardless of if they'd ever be implemented). Thoughts I had:

RLE

Currently only full vectors of zeroes are candidates for RLE.

  • Detecting sequences starting/ending partway through a vector might help density very slightly, with relatively little performance cost

That's true, but OTOH I'm not sure it would have significant compression benefits either.

  • Detecting sequences completely within a vector may be possible without too much cost, but I suspect would have little impact on compression, due to it being relatively short sequences

I don't think this would help much, if at all.

  • Detecting sequences of bytes other than 0 - not sure of the effectiveness

    • Perhaps use the distance of one pixel to the left instead of one byte?

I believe I had tried this without much success.

Unfortunately, due to how the Huffman table is currently constructed, the LZ77 symbols always get assigned rather long codes, making RLE unattractive (particularly with 0s) unless the sequence is rather long. Which leads to the next point.

Huffman encoding

The current implementation doesn't consider symbol counts for many symbols, as the Huffman table has to be specially crafted for the algorithm used. Trying to relax some length restrictions often leads to ComputeCodeLengths becoming rather slow - from what I can tell, it appears to be finding the optimal tree via a brute-force approach. Would you happen to know of a faster algorithm for this?

IIRC it's a quadratic-time dynamic programming algorithm, and I think it's the best you can do when you have both lower and upper bounds on symbol lengths. At least, it's the best I am aware of ;)

The other idea is to remove almost all length restrictions, which allows normal optimal Huffman tree algorithms to be used. I can't think of a quick way to allow the middle section (symbols 16-239) to have differing bit lengths, so you'd need to manipulate the symbol counts there a bit to ensure they all end up getting the same bit length.

I had initially tried exactly this, but I never found a way to make it work, so I went for the dynamic programming approach above :)

Otherwise, the unrestricted Huffman code could be handled with some slowdown, as, in addition to shuffle-lookup for the low bits, you'd need to do it for the high bits as well.

I'm not sure how you would SIMDfy that - the current shuffle lookup logic relies on there being at most 16 values to look up for any given vector, which is not the case here.
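For reference, the lookup I'm referring to is essentially a pshufb over a 16-byte table, roughly like this (generic illustration, not the actual fpnge tables):

#include <immintrin.h>
#include <cstdint>

// Generic illustration (not fpnge's actual tables): pshufb maps each low
// nibble of `bytes` through a 16-byte table in one instruction, which is why
// any single lookup is limited to 16 distinct values (assumes SSSE3).
static inline __m128i lookup_low_nibble(__m128i bytes, __m128i table16) {
  __m128i lo = _mm_and_si128(bytes, _mm_set1_epi8(0x0f));
  return _mm_shuffle_epi8(table16, lo);
}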

Another thing to consider might be having multiple deflate blocks and sampling the image at various points, to be able to adapt to changes which occur vertically down the image.

Yes, that would indeed be an option - I don't know how useful it would be though.

I suspect a more optimal Huffman code might generally give a low single-digit percentage gain most of the time, but at some speed cost.

LZ77

It seems that there's a fair bit of scope here to improve compression through better match finding, but this is also often the slowest part of deflate implementations. I can't think of a way to implement this without dropping to scalar code either.

Furthermore, I'm not sure that knowing the data is an image provides much benefit over generic LZ77 implementations - except that perhaps you could look for matches only at pixel boundaries instead of at every byte.

Any ideas here?

The simplest idea that comes to mind is to try to do LZ77 copies from the same pixel in the row above, which might work reasonably well in practice (probably mostly useful in combination with predictor 0 and on nonphoto content). I can't think of other ways to do LZ77 without significant encoding speed drops.
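Roughly, what I have in mind is something like this (hypothetical helper, not actual code): with the distance fixed to one row, match finding reduces to a running comparison against the previous row.

#include <cstddef>
#include <cstdint>

// Hypothetical sketch: length of the match against the same position in the
// previous (filtered) row. If len >= 3 (deflate's minimum match length), an
// LZ77 match with distance equal to the bytes per row could be emitted.
static size_t MatchLenFromRowAbove(const uint8_t* cur, const uint8_t* prev,
                                   size_t pos, size_t row_len) {
  size_t len = 0;
  while (pos + len < row_len && cur[pos + len] == prev[pos + len]) ++len;
  return len;
}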

animetosho commented 2 years ago

Thanks for the response! :)

I think it's the best you can do when you have both lower and upper bounds on symbol lengths.

Would it be easier if you removed the lower bound? (it's always 1, except for the length symbols to avoid clobbering the 16-239 symbols, but there's an easy workaround for that)

I'm not sure how you would SIMDfy that - the current shuffle lookup logic relies on there being at most 16 values to look up for any given vector, which is not the case here.

You'd still have 3x16 values, with symbols split in the same way, just that the low/high symbols can also exceed 8 bits, and the mid can exceed 8+4 bits.

The AVX-512 two vector VPERMB instruction could theoretically make a full 256-entry table lookup not be so bad, but I'm avoiding AVX-512 at this point.

The simplest idea that comes to mind is to try to do LZ77 copies from the same pixel in the row above, which might work reasonably well in practice

Yeah, I had also thought about limited searches around neighboring pixels, but thought it could be less useful after filtering. But since filtering needs to be per line, perhaps it could help with intra-line variations.
Also PNG's filters don't do much with diagonals - only Paeth even considers the top-left pixel, but nothing considers the top-right direction. Unfortunately LZ77 needs exact matches, limiting applicability of the idea.

veluca93 commented 2 years ago

Thanks for the response! :)

I think it's the best you can do when you have both lower and upper bounds on symbol lengths.

Would it be easier if you removed the lower bound? (it's always 1, except for the length symbols to avoid clobbering the 16-239 symbols, but there's an easy workaround for that)

I am not sure, the standard coin-collector algorithm only works if the length limit is global and not per-symbol, but I don't remember if it can be adapted.

That said, IIRC there should be a lower limit of 4 on the middle values, and they should use proper symbol counts ... I don't remember if there was a specific reason not to do that.

I'm not sure how you would SIMDfy that - the current shuffle lookup logic relies on there being at most 16 values to look up for any given vector, which is not the case here.

You'd still have 3x16 values, with symbols split in the same way, just that the low/high symbols can also exceed 8 bits, and the mid can exceed 8+4 bits.

Then I'm not sure I understand what you're proposing exactly. IIRC the important restriction here is that every chunk of 16 consecutive symbols in the middle has the same leading bits, and 0000 to 1111 as the lowest bits. This allows having just 3x16 tables to look up into.

I don't think there's any hard requirement for the middle part to all have the same length, except for potential shenanigans with canonical codes (but I think it might be enough to guarantee that all the symbols in the middle part have different lengths from every other symbol)

The AVX-512 two vector VPERMB instruction could theoretically make a full 256-entry table lookup not be so bad, but I'm avoiding AVX-512 at this point.

The simplest idea that comes to mind is to try to do LZ77 copies from the same pixel in the row above, which might work reasonably well in practice

Yeah, I had also thought about limited searches around neighboring pixels, but thought it could be less useful after filtering. But since filtering needs to be per line, perhaps it could help with intra-line variations. Also PNG's filters don't do much with diagonals - only Paeth even considers the top-left pixel, but nothing considers the top-right direction. Unfortunately LZ77 needs exact matches, limiting applicability of the idea.

Yep, probably filtering counteracts things somewhat. However, it might in some cases be useful to do predictor 0 + something like trying to LZ77 from pixel offsets, say, (-1, -1), (0, -1), (1, -1), (0, -2), (-1, 0) or something like that.

animetosho commented 2 years ago

That said, IIRC there should be a lower limit of 4 on the middle values

Any length chosen would automatically get +4 due to appending the lower nibble of the symbol, so I'm not sure why there should specifically be a lower limit for the upper nibble?
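To illustrate the point (hypothetical stand-in code, not fpnge's actual writer): a middle symbol is emitted as the code chosen for its upper-nibble group followed by the raw lower nibble, so a group code of length L always costs L + 4 bits in total.

#include <cstdint>

struct MiniBitWriter {  // minimal stand-in for illustration only
  uint64_t bits = 0;
  int count = 0;
  void Write(uint32_t value, int n) { bits |= uint64_t(value) << count; count += n; }
};

// Hypothetical illustration: a middle symbol (16..239) costs the length of
// its upper-nibble group code plus 4 bits for the raw lower nibble.
static inline void EmitMiddleSymbol(uint8_t sym, uint32_t group_code,
                                    int group_len, MiniBitWriter& bw) {
  bw.Write(group_code, group_len);  // Huffman code for the upper-nibble group
  bw.Write(sym & 0x0f, 4);          // lower nibble appended verbatim
}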

IIRC the important restriction here is that every chunk of 16 consecutive symbols in the middle has the same leading bits, and 0000 to 1111 as the lowest bits

If they all have the same length, the canonical code assignment guarantees that all codes are sequential, which means they could actually be computed from the base value (for that length) without strictly needing to meet those requirements.
For example, if the code for symbol 16 is X, then the code for symbol 30 would be X+14 (with all of these bit-reversed).
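In code, that derivation could look something like this (hypothetical helpers, assuming deflate's canonical, LSB-first code assignment):

#include <cstdint>

// Reverse the low `len` bits of `code`, as deflate emits codes LSB-first.
static inline uint32_t BitReverse(uint32_t code, int len) {
  uint32_t r = 0;
  for (int i = 0; i < len; i++) r |= ((code >> i) & 1u) << (len - 1 - i);
  return r;
}

// Hypothetical helper: if symbols 16..239 all share the same length `len`,
// their canonical codes are consecutive, so any middle symbol's code can be
// derived from the (non-reversed) code of symbol 16.
static inline uint32_t MiddleSymbolCode(uint32_t code16, int len, int sym) {
  return BitReverse(code16 + static_cast<uint32_t>(sym - 16), len);  // symbol 30 -> code16 + 14
}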

it might in some cases be useful to do predictor 0 + something like trying to LZ77 from pixel offsets

Wouldn't the SUB predictor almost always be superior to NOOP? If the pixels are identical such that LZ77 works, applying the SUB filter doesn't change that, and it should be more beneficial elsewhere.


I used spng to experiment with what zlib can do with various settings, to give a rough idea of what may be possible.
Testing using this image, I get the following:

                          (bytes)
fpnge                     5012152
fpnge-2                   5410265   +7.9%
fpnge-4                   5191472   +3.6%

fpng                      4825539   -3.7%
fpng -s                   4555412   -9.1%

zlib level 1 (32K dict):  3090939  -38.3%
zlib level 3 (32K dict):  2962963  -40.9%
zlib level 6 (32K dict):  2976506  -40.6%
zlib level 9 (32K dict):  2895722  -42.2%
zlib level 1 (0.5K dict): 3599504  -28.2%
zlib level 3 (0.5K dict): 3565979  -28.9%
zlib level 6 (0.5K dict): 3553788  -29.1%
zlib level 9 (0.5K dict): 3553613  -29.1%
zlib RLE:                 4279816  -14.6%
zlib Huffman only:        4479814  -10.6%
zlib Filtered level 1:    3090939  -38.3%
zlib Filtered level 9:    3223912  -35.7%

Considering the gains just from zlib's Huffman/RLE modes, I suspect that, with the right image, there could be noticeable improvements without adopting more-than-RLE LZ77.

veluca93 commented 2 years ago

That said, IIRC there should be a lower limit of 4 on the middle values

Any length chosen would automatically get +4 due to appending the lower nibble of the symbol, so I'm not sure why there should specifically be a lower limit for the upper nibble?

IIRC the important restriction here is that every chunk of 16 consecutive symbols in the middle has the same leading bits, and 0000 to 1111 as the lowest bits

If they all have the same length, the canonical code assignment guarantees that all codes are sequential, which means they could actually be computed from the base value (for that length) without strictly needing to meet those requirements. For example, if the code for symbol 16 is X, then the code for symbol 30 would be X+14 (with all of these bit-reversed).

I think I might have confused myself :) do you propose to still force same bit lengths for "middle" symbols that share the upper nibble, but relax the maximum length restriction for them? I'm not entirely sure that it would give significant benefits compared to leaving the max-8 bit length.

I think just relaxing the current restriction that all middle symbols be equally long should be possible, basically just by adding a variable-length shift and possibly a table, or not?

it might in some cases be useful to do predictor 0 + something like trying to LZ77 from pixel offsets

Wouldn't the SUB predictor almost always be superior to NOOP? If the pixels are identical such that LZ77 works, applying the SUB filter doesn't change that, and it should be more beneficial elsewhere.

I think that wouldn't be the case when, for example, the row below is offset by 1 pixel: subtracting TOP basically makes the row equivalent to the residuals of LEFT on the previous row, no?

I used spng to experiment with what zlib can do with various settings, to give a rough idea of what may be possible. Testing using this image, I get the following:

                          (bytes)
fpnge                     5012152
fpnge-2                   5410265   +7.9%
fpnge-4                   5191472   +3.6%

fpng                      4825539   -3.7%
fpng -s                   4555412   -9.1%

zlib level 1 (32K dict):  3090939  -38.3%
zlib level 3 (32K dict):  2962963  -40.9%
zlib level 6 (32K dict):  2976506  -40.6%
zlib level 9 (32K dict):  2895722  -42.2%
zlib level 1 (0.5K dict): 3599504  -28.2%
zlib level 3 (0.5K dict): 3565979  -28.9%
zlib level 6 (0.5K dict): 3553788  -29.1%
zlib level 9 (0.5K dict): 3553613  -29.1%
zlib RLE:                 4279816  -14.6%
zlib Huffman only:        4479814  -10.6%
zlib Filtered level 1:    3090939  -38.3%
zlib Filtered level 9:    3223912  -35.7%

Considering the gains just from zlib's Huffman/RLE modes, I suspect that, with the right image, there could be noticeable improvements without adopting more-than-RLE LZ77.

The gains from Huffman/RLE are quite large - I wonder where exactly they come from; is there a reasonable way to inspect the zlib bitstream to look at Huffman tables?

animetosho commented 2 years ago

do you propose to still force same bit lengths for "middle" symbols that share the upper nibble, but relax the maximum length restriction for them?

Yes - all the middle symbols would still have the same length.

I'm not entirely sure that it would give significant benefits compared to leaving the max-8 bit length.

I don't think so either. The main point was to allow the usage of the usual Huffman table construction, instead of the current quadratic time algorithm.

I think just relaxing the current restriction that all middle symbols be equally long should be possible, basically just by adding a variable-length shift and possibly a table, or not?

I can't think of a relatively efficient way to do it. The problem is that the canonical codes for 16, 32, 48 etc could be very different, if they're not the same length, so determining the code for, e.g. symbol 34, is a little tricky.

I think that wouldn't be the case when, for example, the row below is offset by 1 pixel: subtracting TOP basically makes the row equivalent to the residuals of LEFT on the previous row, no?

If we have something like:

abcdef
bcdef

Then TOP probably wouldn't be that useful, but SUB (or LEFT) wouldn't break the duplication (other than for the first pixel perhaps), so LZ77 would still work.
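For example (assuming one byte per pixel), the SUB-filtered rows would be:

a  b-a  c-b  d-c  e-d  f-e
b  c-b  d-c  e-d  f-e

Everything past the first pixel of the second row still matches the first row at the same distance as before, so the run survives the filter.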

is there a reasonable way to inspect the zlib bitstream to look at Huffman tables?

I imagine it might be easier to step through it in a debugger than to decode the bitstream. I might try that a bit later.
From memory, zlib creates new deflate blocks every 8 or 16KB of output or thereabouts, so it probably wouldn't be just one Huffman table.

veluca93 commented 2 years ago


do you propose to still force same bit lengths for "middle" symbols that share the upper nibble, but relax the maximum length restriction for them?

Yes - all the middle symbols would still have the same length.

I'm not entirely sure that it would give significant benefits compared to leaving the max-8 bit length.

I don't think so either. The main point was to allow the usage of the usual Huffman table construction, instead of the current quadratic time algorithm.

Does that take a significant amount of time? Last I tried, the number of symbols was not big enough for this to be a significant fraction of runtime.

I think just relaxing the current restriction that all middle symbols be equally long should be possible, basically just by adding a variable-length shift and possibly a table, or not?

I can't think of a relatively efficient way to do it. The problem is that the canonical codes for 16, 32, 48 etc could be very different, if they're not the same length, so determining the code for, e.g. symbol 34, is a little tricky.

I think that wouldn't be the case when, for example, the row below is offset by 1 pixel: subtracting TOP basically makes the row equivalent to the residuals of LEFT on the previous row, no?

If we have something like:

abcdef
bcdef

Then TOP probably wouldn't be that useful, but SUB (or LEFT) wouldn't break the duplication (other than for the first pixel perhaps), so LZ77 would still work.

It might be worth trying both, I guess - the question is how much it is possible to do without significantly harming speed, and whether it would be a good idea to allow setting an "effort" flag...

is there a reasonable way to inspect the zlib bitstream to look at Huffman tables?

I imagine it might be easier to step through it in a debugger than to decode the bitstream. I might try that a bit later.


animetosho commented 2 years ago

Does that take a significant amount of time? Last I tried, the number of symbols was not big enough for this to be a significant fraction of runtime.

Depends on the image I suppose. With the current parameters, not so much, but if you increase the number of symbols covered, and expand the maximum length from 8 to 15, it ends up taking most of the time.

I made this experiment which removes all restrictions on the Huffman tree except that symbols 16-239 must have the same length. The tree build is very unoptimised, so it ends up being quite slow, but I'm hoping the speed hit during the image encode isn't too bad.
Seems to have minimal effect on the above image, but I get ~6% smaller files on some others. I suspect the above image needs shorter encodings for some symbols in the 16-239 range - perhaps special-casing more symbols (other than just 0-15 and 240-255) could help, or maybe allowing different lengths for each group of 16 within 16-239 (though that feels like too much of a speed hit).

whether it would be a good idea to allow setting an "effort" flag...

A 'compression level' flag sounds like it could indeed be useful.

veluca93 commented 2 years ago

Does that take a significant amount of time? Last I tried, the number of symbols was not big enough for this to be a significant fraction of runtime.

Depends on the image I suppose. With the current parameters, not so much, but if you increase the number of symbols covered, and expand the maximum length from 8 to 15, it ends up taking most of the time.

What would be the motivation for doing either of those?

whether it would be a good idea to allow setting an "effort" flag...

A 'compression level' flag sounds like it could indeed be useful.

animetosho commented 2 years ago

I am not sure, the standard coin-collector algorithm only works if the length limit is global and not per-symbol, but I don't remember if it can be adapted.

Looking at the package-merge algorithm, it seems like symbol-specific length limits can easily be implemented by introducing the length-limited symbols only during later iterations of the algorithm.
I don't know if this guarantees an optimal tree, but intuitively it seems to be so.
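Roughly what I have in mind, as an unoptimised sketch (assuming the per-symbol limits are feasible; PackageMergeLengths is hypothetical, not fpnge's ComputeCodeLengths):

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of package-merge with per-symbol maximum code lengths:
// a symbol only contributes "coins" at depths up to its own limit, i.e. it is
// only introduced in the later iterations. freq[i] > 0, 1 <= max_len[i] <= limit.
std::vector<int> PackageMergeLengths(const std::vector<uint64_t>& freq,
                                     const std::vector<int>& max_len,
                                     int limit) {
  struct Item {
    uint64_t weight;
    std::vector<int> symbols;  // leaves contained in this package
  };
  size_t n = freq.size();
  std::vector<int> lengths(n, 0);
  std::vector<Item> prev;  // packages carried over from the deeper level
  for (int depth = limit; depth >= 1; --depth) {
    std::vector<Item> cur;
    // Length-limited symbols are introduced only once depth <= their limit.
    for (size_t i = 0; i < n; ++i)
      if (depth <= max_len[i]) cur.push_back({freq[i], {static_cast<int>(i)}});
    // Pair up items from the deeper level into packages for this level.
    for (size_t i = 0; i + 1 < prev.size(); i += 2) {
      Item pkg{prev[i].weight + prev[i + 1].weight, prev[i].symbols};
      pkg.symbols.insert(pkg.symbols.end(), prev[i + 1].symbols.begin(),
                         prev[i + 1].symbols.end());
      cur.push_back(std::move(pkg));
    }
    std::sort(cur.begin(), cur.end(),
              [](const Item& a, const Item& b) { return a.weight < b.weight; });
    prev = std::move(cur);
  }
  // Take the 2*(n-1) cheapest items; each leaf occurrence adds one bit to
  // that symbol's code length.
  size_t take = std::min(prev.size(), 2 * (n - 1));
  for (size_t i = 0; i < take; ++i)
    for (int s : prev[i].symbols) lengths[s]++;
  return lengths;
}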

On the other hand, keeping the mid lengths the same is more challenging. One thing I noted is that the 14x16 symbols can also be grouped as 7x32. If the symbols are to have the same length, then there needs to be something to pair with the 7 virtual symbols (giving 8 leaf nodes, which can be arranged into a 3-level balanced binary tree).
My current strategy is: if giving the mid symbols the same probability doesn't yield the same length, try the following alternative:

Take the average length of the 7 mid symbols yielded by the above package-merge approach - this will be the 'ideal' length for the mid symbols. Then construct a regular unbounded Huffman tree, go through all nodes that could sit at the same level as the mid symbols, and find the one whose probability is closest to the total mid count divided by 7.
The found node is grouped with the 7 mid symbols to form a single virtual symbol, which then goes through the package-merge algorithm to build the rest of the tree.
This doesn't feel like it's optimal, but should be close.

The speed hit isn't too bad overall, and the change can lead to some minor improvement:

Image 1 - fpnge (original)
   268.592 MP/s
    10.787 bits/pixel
Image 1 - fpnge (optimized)
   384.446 MP/s
    10.787 bits/pixel
Image 1 - new Huffman strat
   352.421 MP/s
    10.128 bits/pixel

Image 2 - fpnge (original)
   355.623 MP/s
    16.240 bits/pixel
Image 2 - fpnge (optimized)
   414.372 MP/s
    16.240 bits/pixel
Image 2 - new Huffman strat
   369.830 MP/s
    16.278 bits/pixel

Image 3 - fpnge (original)
   383.806 MP/s
    19.365 bits/pixel
Image 3 - fpnge (optimized)
   446.519 MP/s
    19.365 bits/pixel
Image 3 - new Huffman strat
   370.413 MP/s
    19.028 bits/pixel

Image 4 - fpnge (original)
   413.264 MP/s
     3.599 bits/pixel
Image 4 - fpnge (optimized)
   526.447 MP/s
     3.590 bits/pixel
Image 4 - new Huffman strat
   473.573 MP/s
     3.659 bits/pixel

The regressions are interesting though - this seems to be from the chosen sampling, because if I change it to always sample the whole image, it always seems to improve:

Image 1 - fpnge (optimized)
   170.762 MP/s
    10.780 bits/pixel
Image 1 - new Huffman strat
   160.922 MP/s
    10.128 bits/pixel

Image 2 - fpnge (optimized)
   257.951 MP/s
    16.237 bits/pixel
Image 2 - new Huffman strat
   236.648 MP/s
    16.100 bits/pixel

Image 3 - fpnge (optimized)
   264.737 MP/s
    19.344 bits/pixel
Image 3 - new Huffman strat
   240.575 MP/s
    18.557 bits/pixel

Image 4 - fpnge (optimized)
   296.054 MP/s
     3.571 bits/pixel
Image 4 - new Huffman strat
   279.037 MP/s
     3.539 bits/pixel

Perhaps splitting the image into multiple deflate blocks can get it closer to zlib's Huffman-only.