golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.8k stars 17.51k forks source link

compress/flate: add 3-byte matching for higher compression levels #15622

Open dsnet opened 8 years ago

dsnet commented 8 years ago

Using 9af83462

The recent changes to compress/flate has provided pretty good improvements overall to the compression speeds, but I have noticed a few extraordinary cases where the compression ratio suffers significantly.

This spreadsheet compares the output sizes from go1.6 to current go.tip on a number of compression benchmarks. All tests were done using the flate.DefaultCompression setting.

Most notably, the clegg.tiff image from the ACT corpus experiences a 44% increase in output size (511928 to 741046 bytes).

I think it would be worthwhile to understand what causes this fairly significant regression in compression ratio before the go1.7 release.

cc @nigeltao @klauspost

dsnet commented 8 years ago

I spent some time analyzing clegg.tiff, here's a histogram of the backwards references clustered by backwards distance and backwards copy length.

Lengths:           go1.6   go.tip
    1..1:          0       0
    2..2:          0       0
    3..3:          96798   0
    4..4:          19939   19534
    5..5:          3708    3683
    6..6:          19950   19349
    7..7:          4116    4691
    8..11:         2464    2557
    12..15:        874     873
    16..23:        157     162
    24..31:        17      18
    32..47:        5       6
    48..63:        44      44
    64..95:        46      46
    96..127:       57      57
    128..191:      7       5
    192..255:      313     315
    256..383:      4348    4348

Distances:
    1..1:          261     258
    2..2:          1       1
    3..3:          135100  38991
    4..4:          4       1
    5..5:          0       0
    6..6:          0       0
    7..7:          1       0
    8..11:         51      0
    12..15:        516     400
    16..23:        344     267
    24..31:        409     299
    32..47:        407     319
    48..63:        345     288
    64..95:        282     246
    96..127:       188     171
    128..191:      213     176
    192..255:      117     94
    256..383:      153     130
    384..511:      83      69
    512..767:      193     164
    768..1023:     186     160
    1024..1535:    373     305
    1536..2047:    348     301
    2048..3071:    7809    7502
    3072..4095:    168     126
    4096..6143:    70      70
    6144..8191:    1302    1267
    8192..12287:   858     873
    12288..16383:  498     516
    16384..24575:  1590    1674
    24576..32767:  973     1020

It seems that the ratio hit occurs because the new hash-chain algorithm only operates on 4-byte segments, rather than 3-byte segments. This tends to be pathological for bitmap-based images, which represent a pixel as a tuple of 3 bytes for (R, G, B). This explains why there is such a large number of {Length:3, Distance:3} references in go1.6 that cannot be generated in go.tip.

Given the dramatic increase in performance using 4-byte hashing, I support keeping 4-byte hashing even if it causes some inputs to suffer (especially RGB based images, and possibly other inputs).

@nigeltao, your thoughts?

nigeltao commented 8 years ago

Thanks for the analysis. It certainly sounds plausible that these are RGB 3-tuples in an uncompressed TIFF.

I agree that we don't need to revert 4-byte hashing yet.

For example, TIFF is not as common as it used to be, and more modern image formats use 2D-image-specific filtering before passing to a general purpose compressor like DEFLATE; those filters already take out any easy repeated-RGB-3-tuple wins. Using the standard library's image/png encoder yields 501111 bytes, smaller than both 511928 and 741046. WEBP Lossless (generated by the cwebp tool) is 379612 bytes.

klauspost commented 8 years ago

For "BestCompression" it could be feasible to have a version that could to 3-byte matches. The token writer could then be modified to choose between 3-length matches and 3-byte literals depending on the encoded size.

dsnet commented 8 years ago

Sounds good. Possibly adding 3-byte matching to BestCompression, might not be a bad idea. Since it seems we're in consensus about keeping the 4-byte matching, I'm going to relabel the issue to possibly adding 3-byte matching to high compression levels.

nigeltao commented 8 years ago

I ran both the Go tip (i.e. Go 1.7) and Go 1.6 image/png encoders over a corpus of 20,000 or so PNG images, comparing the output size of a decode-and-then-re-encode. I don't think image/png changed much; it should all be due to compress/flate. Below are the summary statistics (grouped by image type) of the ratios of tip size over 1.6 size; higher means tip is worse, a value of 1.000 means no change.

Gray      n=1252   avg=0.984  p25=0.977  p50=0.987  p75=0.994  p90=0.998  max=1.031
Gray16    n=7      avg=0.998  p25=0.985  p50=1.004  p75=1.009  p90=1.015  max=1.015
NRGBA     n=5365   avg=1.013  p25=0.985  p50=1.007  p75=1.039  p90=1.051  max=1.427
NRGBA64   n=154    avg=1.000  p25=0.994  p50=1.000  p75=1.005  p90=1.013  max=1.243
Paletted  n=3181   avg=0.989  p25=0.984  p50=0.990  p75=0.994  p90=0.998  max=1.023
RGBA      n=11644  avg=1.022  p25=0.991  p50=1.006  p75=1.036  p90=1.082  max=1.696
RGBA64    n=89     avg=1.006  p25=0.996  p50=1.001  p75=1.010  p90=1.032  max=1.118
overall   n=21692  avg=1.013  p25=0.986  p50=0.999  p75=1.027  p90=1.058  max=1.696

pXX are the percentiles, e.g. p50 is the median ratio. It's not too surprising that image.RGBA and image.NRGBA Go image types (i.e. what PNG calls RGB and RGBA images) miss the 3-byte matches the most. Nonetheless, I think the increase in compressed size, in the order of 1-2% on average and less than 1% on median, is acceptable for Go 1.7. Sure, we can always find specific examples (such as the TIFF in the original post) that re-encode relatively badly, but I think we're OK in general, at least for PNG.

In Go 1.8, we can consider re-introducing 3-byte matches, and see if that will bring the outliers back into the fold.