phoboslab / qoi

The “Quite OK Image Format” for fast, lossless image compression
MIT License
6.91k stars 330 forks source link

Recent changes in experimental branch #71

Closed phoboslab closed 2 years ago

phoboslab commented 2 years ago

With all that we learned through the analysis and ideas of a lot of people here, I refined QOI quite a bit. More than I thought I would.

The current state is in the experimental branch.

First of all, benchmark results for the new test suite using qoibench 1 images/ --nopng --onlytotals

## Total for images/textures_photo/
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
master:       8.2        11.6        127.43         90.52      2522   61.6%
experi:       5.9         8.2        178.14        127.56      1981   48.4%

## Total for images/textures_pk01/
master:       0.7         1.0        186.23        126.11       184   36.4%
experi:       0.6         0.9        214.67        145.87       178   35.2%

## Total for images/screenshot_game/
master:       2.7         3.9        231.42        162.40       534   21.6%
experi:       2.6         3.4        245.06        187.25       519   21.0%

## Total for images/textures_pk/
master:       0.3         0.5        138.31         93.64        83   48.1%
experi:       0.3         0.4        159.63        110.87        75   43.5%

## Total for images/textures_pk02/
master:       2.0         2.8        155.27        110.31       504   42.5%
experi:       1.7         2.3        182.73        133.00       479   40.4%

## Total for images/icon_64/
master:       0.0         0.0        251.06        163.38         4   28.3%
experi:       0.0         0.0        343.60        266.60         5   31.3%

## Total for images/icon_512/
master:       0.6         0.9        474.50        308.36        80    7.8%
experi:       0.6         0.7        474.62        378.76       102   10.1%

## Total for images/photo_kodak/
master:       2.9         4.2        137.76         92.77       771   50.2%
experi:       2.4         3.5        166.17        111.66       671   43.7%

## Total for images/textures_plants/
master:       3.8         6.2        281.64        170.37       951   22.9%
experi:       3.3         5.0        324.00        211.07       922   22.2%

## Total for images/screenshot_web/
master:      18.1        28.2        449.27        287.79      2775    8.7%
experi:      17.5        23.2        464.81        350.15      2649    8.3%

## Total for images/pngimg/
master:       6.5        10.0        279.91        180.57      1415   20.0%
experi:       5.9         8.6        307.44        210.93      1445   20.5%

## Total for images/photo_tecnick/
master:      10.1        15.2        142.74         95.00      2710   48.2%
experi:       8.8        13.6        163.36        105.69      2527   44.9%

## Total for images/photo_wikipedia/
master:       7.8        11.7        138.75         92.50      2260   53.4%
experi:       6.7        10.4        161.91        104.37      2102   49.6%

# Grand total for images/
master:       2.1         3.1        220.85        148.50       485   26.8%
experi:       1.9         2.7        245.67        173.24       465   25.7%

As you can see throughput improved a lot, as did the compression ratio for all files without an alpha channel (icon_*/ and pngimg/ suffered a bit, but the overall compression ratio for these files is already quite high. textures_plants/ still saw improvements). For photos or photo-like images QOI now often beats libpng!

What changed? After I switched the tags for QOI_RUN (previously 2-bit tag) and QOI_GDIFF_16 (previously 4-bit tag) I noticed that QOI_GDIFF covered almost all(!) cases that were previously encoded by QOI_DIFF_16/24. So... why not remove them?

#define QOI_OP_INDEX  0x00 // 00xxxxxx
#define QOI_OP_DIFF   0x40 // 01xxxxxx (aka QOI_DIFF_8)
#define QOI_OP_LUMA   0x80 // 10xxxxxx (aka QOI_GDIFF_16)
#define QOI_OP_RUN    0xc0 // 11xxxxxx
#define QOI_OP_RGB    0xfe // 11111110 (aka QOI_COLOR with RGB)
#define QOI_OP_RGBA   0xff // 11111111 (aka QOI_COLOR with RGBA)

(see the experimental file format documentation for the details)

That is, most tags are now 2-bit, while the run-length is limited to 62 and thus leaves some room for the two 8-bit QOI_OP_RGB and QOI_OP_RGBA tags. So QOI would be even simpler than before and (probably?) gain a lot more possibilities for performance improvements:

Yes, it means that a change in the alpha channel will always be encoded as a 5-byte QOI_OP_RGBA, but using the current test suit of images, this seems to be totally fine. The alpha channel is mostly either 255 or 0. The famous dice.png and FLIF's fish.png seem to be awfully "artificial" uses of PNG. (For comparison, in the experimental branch with the original tag-layout and QOI_DIFF_16/24 still present, the overal compression ratio was at 24.6% - but the win in simplicity and performance is imho worth this 1%).

The hash function changed to the following:

#define QOI_COLOR_HASH(C) (C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7)

This is seriously the best performing hash function I could find and I tried quite a few. This also ignores the alpha channel, making it even more of a second-class citizen.

You may not like it (and I'm truly sorry for all the work that would need to be done in existent implementations), but I strongly believe that this is The Right Thing To Do™.

Thoughts?

Wulf0x67E7 commented 2 years ago

The only nitpick I can think of would be the GRB- instead of RGB-bit-order of QOI_OP_LUMA. Is there any technical reason to use the green channel as the guide instead of the red one?

phoboslab commented 2 years ago

The idea was that changes in lightness (luma) would be most represented in the green channel, since that is the color that the human eye is most sensitive to. I just tried it with a 6-bit red channel instead and got consistently worse results — but not by much. Overall compression rate would be at 25.9%.

Maybe it's worth changing it anyway, for clarity's sake.

Wulf0x67E7 commented 2 years ago

Huh, I didn't think that it would have any impact at all on a lossless format and that those human-perception-specific tricks only really matter when compressing lossily.

I'd personally change it, but it really is incredibly nitpicky and doesn't matter too much either way. ¯\_(ツ)_/¯

oscardssmith commented 2 years ago

I think the reason it matters is probably that it is fairly common for images to get subtly lighter or darker while maintaining the same hue, and in those cases, green on average is what changes the most.

rmn20 commented 2 years ago

Wouldn't removing DIFF_24 greatly degrade the compression ratio of semi-transparent textures? (like hair textures, or maybe water textures) I think these changes make qoi useless for semi-transparent images. Maybe in that case its better to stop encoding alpha levels at all and only write rgb + 0/1 alpha?

phoboslab commented 2 years ago

Yes, it performs worse for these, but I am not convinced that this is an issue.

From my game dev experience even textures for water, clouds, dust etc. use fairly distinct alpha values. Many of these don't even have an alpha channel at all and are just blended with GL_ONE GL_ONE (fire, flares, etc.) or GL_ZERO GL_SRC_COLOR (water, smoke, etc.).

The two worst performing examples I found are actually the ones most prominently featured by other image formats: dice.png and fish.png. As I said before, I believe these are fairly "artificial" examples.

## images/misc/fish.png size: 1969x1307
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      28.1       207.8         91.44         12.39       520    5.2%
stbi:        18.5       197.8        139.20         13.01       804    8.0%
qoi_master:   5.9         9.2        438.07        280.36       862    8.6%
qoi_experi:   5.6         6.9        457.53        372.52      1148   11.4%

## images/misc/dice.png size: 800x600
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:       6.7        51.6         71.92          9.30       237   12.7%
stbi:         5.2        45.8         92.34         10.48       341   18.2%
qoi_master:   1.3         1.8        362.65        263.83       355   19.0%
qoi_experi:   1.2         1.3        414.23        356.69       508   27.1%

Compression rate wasn't really great before (compared to libpng/stbi), but it's still very good compared to uncompressed RGBA.

oscardssmith commented 2 years ago

One last change that might be worth trying is changing QOI_INDEX to actually use a LRU cache. This will probably be a little slower, but it would remove the effect of hash collisions on the file size.

vsonnier commented 2 years ago

Sorry @phoboslab I think there is a mistake in computiing the color differences:

char vr = px.rgba.r - px_prev.rgba.r;
char vg = px.rgba.g - px_prev.rgba.g;
char vb = px.rgba.b - px_prev.rgba.b;

char vg_r = vr - vg;
char vg_b = vb - vg;

You need int to hold [-255; 255] differences in vxxx, else someting funky will happen when differences are beyond +-127. I suppose the computation itself is fine by integer promotion ? Note that the master version indeed holds the differences in ints.

Edit : Got it the wrap around arithmetic, although the signed char troubled me a loong time: I finally thought of it in terms of

type Wrap_Around_Difference_Type is mod 256; 

in Ada, then everything was clear.

chocolate42 commented 2 years ago

Here's a tweak to the index, tested against the old data set because ISP's suck. qoi-exluma: Experimental branch baseline qoi-luma61: The index table is reduced to 61 values (prime) and the hash function is simply px.v%61. No other changes. Possible union endian issues that can be resolved with a compiler check (?) qoi-luma61.run64: The RGB and RGBA tags have been moved to the end of the index chunk so rle has a full 64 values, which seems to help decode speed a surprising amount (something something power of two?). There are still two unused 8 bit codes, but with a 6 bit RUN_8 there's no need for RUN_16 and I didn't want to theorycraft further here.

## Benchmarking ../qoitests/images/tango512/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.1        20.5         84.87         12.76        51
stbi:                2.1        20.6        124.21         12.74        69
qoi-exluma:          0.7         0.8        353.49        336.18       102
qoi-luma61:          0.7         0.8        371.70        319.96        86
qoi-luma61.run64:    0.6         0.8        457.27        331.28        86

## Benchmarking ../qoitests/images/kodak/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              9.6       179.6         41.10          2.19       717
stbi:               10.1        98.9         39.05          3.98       979
qoi-exluma:          3.1         4.2        127.20         94.29       671
qoi-luma61:          2.9         4.5        135.15         88.22       675
qoi-luma61.run64:    2.6         4.4        148.52         89.88       675

## Benchmarking ../qoitests/images/misc/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             12.0       102.1         74.30          8.71       283
stbi:               10.4        97.2         85.39          9.14       415
qoi-exluma:          2.8         3.3        316.36        270.04       513
qoi-luma61:          2.8         3.3        322.46        272.58       504
qoi-luma61.run64:    2.4         3.2        364.61        280.07       503

## Benchmarking ../qoitests/images/textures/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:              3.0        41.7         42.96          3.12       163
stbi:                2.7        23.3         48.05          5.57       232
qoi-exluma:          0.8         1.1        163.08        120.30       178
qoi-luma61:          0.7         1.1        181.62        115.97       179
qoi-luma61.run64:    0.6         1.1        201.48        119.41       179

## Benchmarking ../qoitests/images/screenshots/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:             59.4       620.1        138.56         13.27      2219
stbi:               52.0       755.2        158.20         10.90      2821
qoi-exluma:         28.5        26.1        288.85        315.97      2476
qoi-luma61:         27.2        27.3        302.62        301.02      2478
qoi-luma61.run64:   23.9        27.1        344.47        303.73      2475

## Benchmarking ../qoitests/images/wallpaper/*.png -- 1 runs
## Totals (AVG) size: 0x0
               decode ms   encode ms   decode mpps   encode mpps   size kb
libpng:            191.4      2922.5         48.97          3.21      9224
stbi:              220.4      1814.6         42.53          5.16     13299
qoi-exluma:         76.9        85.8        121.91        109.21      9923
qoi-luma61:         75.4        94.0        124.24         99.72      9965
qoi-luma61.run64:   64.2        85.0        146.09        110.21      9964

qoi.luma61.run64.h.txt

phoboslab commented 2 years ago

Very cool. Reducing the index to 61 is rather clever! And I'd really like to know why a 64 run-length would help that much with performance.

@vsonnier this should be fine, as long as the decoder wraps these values around, too.

Edit: I can't really reproduce these throughput numbers. qoi-luma61.run64 performs slightly worse on my machine.

# Grand total for images
                decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi-exluma:           1.9         2.7        244.79        172.72       465   25.7%
qoi-luma61.run64:     1.9         2.7        239.50        169.05       463   25.6%

Compression is better though. Images with alpha suffer quite a bit when the hash doesn't include alpha value. Might be worse to change it to:

#define QOI_COLOR_HASH(C) (C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7 + C.rgba.a * 11)

(or just C.v % 61 as you proposed)

nigeltao commented 2 years ago

Some numbers vs qoi-demo10. qoi-experi is commit 28954f7. qoi-l61r64 is the luma61.run64 mentioned above.

For compressed size, qoi-demo10 wins on icon_512, icon_64 and pngimg, which all involve alpha. qoi-experi wins everywhere else, including on textures_plants (which also involves alpha). I know simplicity of format is a goal, but perhaps consider having two slightly different sets of opcodes (or opcode bit assignments) depending on whether channels == 3 or channels == 4. Edit: now that I've added the luma61.run64 numbers, the qoi-demo10 vs qoi-l61r64 gap is noticably smaller on icon_512, icon_64 and pngimg.

One more quick request is to raise QOI_PADDING from 4 to 7, so that the decoder can always do an 8 byte (64 bit) read.

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
libpng:       5.0        33.9         52.74          7.74        51    5.0%
stbi:         3.9        32.8         66.47          8.00        69    6.8%
qoi-experi:   1.5         1.2        180.47        224.24       102   10.1%
qoi-demo10:   0.8         1.2        309.23        218.27        76    7.5%
qoi-l61r64:   1.0         1.4        254.71        190.54        86    8.4%

## Total for images/icon_64
libpng:       0.2         0.9         23.26          4.35         3   23.6%
stbi:         0.1         0.9         28.05          4.45         4   27.9%
qoi-experi:   0.0         0.0        103.12        106.03         5   31.3%
qoi-demo10:   0.0         0.0        148.43        110.04         4   27.6%
qoi-l61r64:   0.0         0.0        141.35        106.50         4   28.8%

## Total for images/photo_kodak
libpng:      15.2       295.7         25.88          1.33       717   46.7%
stbi:        16.9       157.8         23.33          2.49       979   63.8%
qoi-experi:   5.7         6.7         69.14         58.89       671   43.7%
qoi-demo10:   4.1         7.1         95.72         55.76       772   50.3%
qoi-l61r64:   4.5         6.6         87.26         59.26       675   44.0%

## Total for images/photo_tecnick
libpng:      44.9      1143.8         32.08          1.26      2414   42.9%
stbi:        57.9       608.3         24.88          2.37      3533   62.8%
qoi-experi:  20.9        26.3         68.99         54.67      2527   44.9%
qoi-demo10:  13.0        25.1        111.04         57.44      2737   48.7%
qoi-l61r64:  17.2        25.3         83.93         57.02      2529   45.0%

## Total for images/photo_wikipedia
libpng:      42.9       865.8         25.28          1.25      2046   48.3%
stbi:        54.6       456.6         19.87          2.38      2893   68.3%
qoi-experi:  15.7        19.9         69.17         54.39      2102   49.6%
qoi-demo10:   9.5        19.2        113.99         56.49      2289   54.0%
qoi-l61r64:  12.8        19.3         84.51         56.30      2103   49.7%

## Total for images/pngimg
libpng:      56.4       533.3         32.08          3.39      1201   17.0%
stbi:        59.6       399.9         30.35          4.52      1751   24.8%
qoi-experi:  15.5        15.7        116.99        114.90      1445   20.5%
qoi-demo10:   9.0        15.7        201.83        114.98      1429   20.2%
qoi-l61r64:  11.7        16.4        154.61        110.63      1437   20.3%

## Total for images/screenshot_game
libpng:      18.6       216.7         34.08          2.92       448   18.1%
stbi:        20.9       150.9         30.24          4.19       634   25.7%
qoi-experi:   6.3         6.3        101.13        100.93       519   21.0%
qoi-demo10:   4.1         6.1        156.06        104.53       535   21.7%
qoi-l61r64:   4.9         6.5        130.18         97.85       517   20.9%

## Total for images/screenshot_web
libpng:      92.0      1060.7         88.33          7.66      2402    7.6%
stbi:        78.8      1210.3        103.12          6.71      3076    9.7%
qoi-experi:  50.5        38.8        160.98        209.22      2649    8.3%
qoi-demo10:  28.3        40.4        287.55        201.06      2680    8.4%
qoi-l61r64:  37.7        44.7        215.30        181.93      2649    8.3%

## Total for images/textures_photo
libpng:      39.6       725.0         26.45          1.45      1977   48.3%
stbi:        46.6       370.7         22.49          2.83      2554   62.4%
qoi-experi:  13.9        15.8         75.30         66.35      1981   48.4%
qoi-demo10:  10.1        18.3        104.22         57.40      2506   61.2%
qoi-l61r64:  11.2        15.1         93.39         69.39      1990   48.6%

## Total for images/textures_pk
libpng:       0.9        24.6         47.24          1.81        89   51.5%
stbi:         0.8        17.0         55.03          2.62       121   70.0%
qoi-experi:   0.7         0.8         61.19         58.01        75   43.5%
qoi-demo10:   0.6         0.7         77.52         60.26        78   45.1%
qoi-l61r64:   0.6         0.8         79.75         58.52        75   43.3%

## Total for images/textures_pk01
libpng:       5.3        66.5         24.65          1.95       163   32.3%
stbi:         5.1        37.1         25.61          3.50       232   45.8%
qoi-experi:   1.6         1.7         81.33         77.75       178   35.2%
qoi-demo10:   1.0         1.6        136.54         81.54       180   35.6%
qoi-l61r64:   1.1         1.6        116.92         78.89       179   35.3%

## Total for images/textures_pk02
libpng:      11.7       205.0         25.97          1.48       427   36.1%
stbi:        12.1       100.1         25.07          3.03       623   52.5%
qoi-experi:   4.2         4.4         71.95         69.25       479   40.4%
qoi-demo10:   2.8         4.3        108.01         70.79       492   41.5%
qoi-l61r64:   3.1         4.3         96.66         70.61       481   40.5%

## Total for images/textures_plants
libpng:      31.9       340.6         33.35          3.12       857   20.6%
stbi:        30.7       241.5         34.66          4.41      1191   28.7%
qoi-experi:   8.0         9.2        132.21        116.02       922   22.2%
qoi-demo10:   4.3         9.5        247.12        111.57       957   23.0%
qoi-l61r64:   6.0         9.4        177.48        113.29       922   22.2%

# Grand total for images
libpng:      13.5       187.9         34.47          2.47       423   23.3%
stbi:        14.7       121.4         31.53          3.82       601   33.2%
qoi-experi:   4.7         5.0         98.37         92.79       465   25.7%
qoi-demo10:   3.0         4.9        156.82         94.45       484   26.7%
qoi-l61r64:   3.7         5.1        127.03         91.58       463   25.6%

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

This was on a mid-range x86_64 laptop (2016, Skylake):

$ cat /proc/cpuinfo | grep model.name | uniq
model name  : Intel(R) Core(TM) m3-6Y30 CPU @ 0.90GHz
phoboslab commented 2 years ago

Regarding a 7-Byte padding: wouldn't it make sense to put and "end-marker" there, instead of just 0x00 bytes?

If we then use 0xfd = QOI_RGB and 0xfe = QOI_RGBA we could pad the stream with 0xff bytes. Since QOI dosn't allow for more than 4 consecutive 0xff bytes anywhere in the stream, this could be a nice replacement for the now-missing size header, when streaming images.

Edit: thinking about it some more, a 7-byte 0x00 shouldn't occur in a stream either, since repeated QOI_INDEX chunks should be encoded as QOI_RUN instaed.

phoboslab commented 2 years ago

Further changes in experimental:

Lokathor commented 2 years ago

Just adding a size field to the header would be far mor efficient than making people who want to skip the image iterate every byte looking for a run of four 0xFF or eight 0x00.

phoboslab commented 2 years ago

We've had this discussion (see https://github.com/phoboslab/qoi/issues/28#issuecomment-980619128) and I believe the benefit of allowing a "streaming" encoder that does not need seeking outweigh the drawbacks of not having the size in the header.

oscardssmith commented 2 years ago

What about allowing encoders to put 0 as the size if they don't want it? (I.E. if a decoder reads a size of zero, it has to scan, but if it sees a nonzero size it can skip?)

Lokathor commented 2 years ago

Yeah if the encoder absolutely needs to never go back it can just put a zero, but that's probably very rare. More likely these are files being put on disk, and you can just adjust an earlier part of the file once the compression is completed.

ikskuh commented 2 years ago

What about allowing encoders to put 0 as the size if they don't want it? (I.E. if a decoder reads a size of zero, it has to scan, but if it sees a nonzero size it can skip?)

That would completly defeat the idea of a streaming reader as you then don't know the file size beforehand, but if you want to skip images in a stream, you would need to have a defined header. So you can also just leave it out.

Yeah if the encoder absolutely needs to never go back it can just put a zero, but that's probably very rare. More likely these are files being put on disk, and you can just adjust an earlier part of the file once the compression is completed.

It would make it harder for, let's say, embedded devices that have low memory anyways to require the device to store the image both encoded and unencoded in RAM just for the sake of a stream field. Parsing/seeking the full image stream also doesn't take much time on most machines, especially if no writes are done.

A stronger use case for streaming encoders is stuff like additional compression/encryption where you convert some bytes, push them into the compressor/encrypter and pull some bytes from them. All compression/encryption APIs i know provide this pattern and it would be cool to just not have to allocate additional memory for the compressed image. Especially as it might require either overallocation or reallocation which are both costly. Imho, status quo is already good enough also for special use cases. The tagging of four consecutive 0xFF is imho a good middle ground, especially as a conforming non-streaming parser doesn't have to handle that case at all (it's always a invalid)

@phoboslab: The changes for color hash and colorspace are good :+1:

Lokathor commented 2 years ago

That would completly defeat the idea of a streaming reader as you then don't know the file size beforehand, but if you want to skip images in a stream, you would need to have a defined header.

If the size is zero, you have to walk the bytes to find the end. If the size is non-zero you can skip that many bytes immediately.

But if there's no size at all then you also have to walk the bytes to find the end, so things are just as bad as if someone had put a size of zero.

It would make it harder for, let's say, embedded devices that have low memory anyways to require the device to store the image both encoded and unencoded in RAM just for the sake of a stream field.

Why would you store the entire encoded image in ram just to have the total encoded size? A counter of how many encoded bytes have been written would be plenty. Then you write that value down in the file after the encoding is done (or, again, if you can't go back for some reason you'd know that ahead of time and mark a size of 0 in your output).

ikskuh commented 2 years ago

Some reasoning on why i don't think the size field is a good idea:

If the size is zero, you have to walk the bytes to find the end. If the size is non-zero you can skip that many bytes immediately.

That very much depends. I can already see people embedded different QOI images in QOI images by specifying invalid size fields that won't skip the full image. If you want to write safe software, you have to run the decoder anyways to see if the size field is actually valid and can be trusted as a measure to seek forward. Otherwise people might craft malicious streams that will crash your decoder. So removing the field prevents an attack vector for maliciously crafted data.

Why would you store the entire encoded image in ram just to have the total encoded size?

I can't because my (fictional, but realistic) device has only 32k RAM. This means i can neither store the full source image (but have to stream it from somewhere) nor can i store the full output image (but have to stream it somewhere). So in that use case, it would require me to always write zero to that field.

But to implement something like seeking back to disk, you have to create two interfaces to your application: One that allows streaming encoding, and one that allows non-streaming encoding. The latter one would require a bit different API for non-generic purpose implementations and would make the general library/application harder without much benefit in general (i won't argue that there isn't a benefit, i just don't value it that much)

The only use case where such a field is from use is: When you want to skip over the QOI data and the source data is seekable (aka. disk or memory buffer). For all other use cases we either have to decode the data anyways (so no benefit from the size field) or we don't have seeking capabilities (so the size field isn't worth much either, as we have to read the data anyways and skip-decoding QOI is way faster than memory or even I/O devices). And in case of disk i/o, we already read way more data from the disk than we actually need (read_ahead_kb is 128kB for my primary disk), this means we can already decode 128kB of data without I/O penalty. If we seek, we will pay the additional cost for a syscall which might also take longer than just decoding the data we already got anyways

But the last word is at @phoboslab. I have stated my case and don't have more arguments here.

Lokathor commented 2 years ago

The decoder should of course not blindly trust a size field and simply jump that far ahead in memory. It still has to check how big the current buffer is and not jump past the end of the buffer. This is 101 level stuff. However, if we're speaking about hostile input there's nothing ensuring that the input will ever signal an end of the stream, so it could trick a simplistic decoder into some sort of overflow that way as well.

And I would say that "the source data is seekable" is the overwhelmingly common case. Most things are on disk or are pulled completely into memory via network. Usually you've got seekable data.

phoboslab commented 2 years ago

I initially put the size field in the header in the first version, because I was immensely frustrated that you absolutely can not know if you found the end of a frame in MPEG-TS. A simple END_OF_DATA would have solved this.

I do not like the optional size header. I goes against my desire for simplicity. All these "maybe it's like that, but maybe not... you have to check" things is what makes PNG, JPEG and many other formats way more complicated than they should be. So far, I tried hard to avoid all these optional features.

Anyway, what's the real use-cases for when you need to skip a QOI image? 1) you want to have some custom data after the image 2) you have a bunch of QOI files lumped together 3) you receive a stream of QOI images

Solutions: 1) put a fixed size trailer at the end of the file, that tells you where to find your custom data 2) use a container format that provides a size field or index of each image 3) prefix each image with a size field in the stream. If this is a video stream, you need a custom header anyway to denote framerate and whatnot

Wulf0x67E7 commented 2 years ago

Then why not also get rid of width and/or height?

You could always get one by dividing the final number of decoded pixels by the other.

Right now you already have to check that there aren't too many/few pixels encoded inside a stream for a given size, how would an encoded byte size be different?

And wouldn't a format that doesn't require you to use some external (and likely needlessly complicated and non-standard) addon to actually use it for a big portion of its use-cases, just because its header is missing one simple value, be the simpler one?

ikskuh commented 2 years ago

Then why not also get rid of width and/or height? You could always get one by dividing the final number of decoded pixels by the other.

This is a good question! For one, we need a single control mechanism for stream integrity. Width and Height are usually information of interest, in a header, even if you don't care for the data around (think of the file tool which can show you image sizes). As the QOI stream doesn't have a end of file signal, we need other means to compute that.

Right now you already have to check that there aren't too many/few pixels encoded inside a stream for a given size, how would an encoded byte size be different?

You have to check if your stream is still in the image anyways, so a secondary buffer that might even be wrong (or optional) doesn't help here. It also reduces decoding speed (my implementation was 10% faster for not checking that additional value :open_mouth:). Also you gave a cool idea:

Proposal: If the stream ends with less than width×height pixels decoded, the remaining pixels are filled with the last pixel value. This might bring a huge reduction for image files like logos, icons, … which have a "bar" at the bottom. This is cheap optimization for files or known-size memory buffers.

And wouldn't a format that doesn't require you to use some external (and likely needlessly complicated and non-standard) addon to actually use it for a big portion of its use-cases, just because its header is missing one simple value, be the simpler one?

If you really want that field to exist, it must be non-optional and thus remove the possibility of stream-encode QOI images. This is a huge decision to have either/or as a feature. If it's non-optional, people will just put a 0 there, because they are lazy.

oscardssmith commented 2 years ago

Your proposal isn't a huge optimization. It saves a maximum of 2 bytes, since a single QOI_RUN_16 tag will handle this.

phoboslab commented 2 years ago

Then why not also get rid of width and/or height?

Non sequitur. The width/height is mandated to be always present; a decoder can rely on these.

actually use it for a big portion of its use-cases

Which are?

It saves a maximum of 2 bytes, since a single QOI_RUN_16 tag will handle this

There is no QOI_RUN_16 anymore (in the experimental branch). The longest run is now 62. But your point still stands and I'd rather be explicit: if there's pixels in the image, they need to be encoded.

oscardssmith commented 2 years ago

I was relying on the assumption that by the time we finalize, we will either have special handling for multiple QOI_RUN_8 tags or a QOI_RUN_16/QOI_RUN_24 with an 8 bit tag that would deal with this.

Wulf0x67E7 commented 2 years ago

@MasterQ32 The field being optional means any valid data stream with it has to also be valid without it! You can, in fact, just completely ignore it if your use case doesn't need/can't use it!

IT DOESN'T PREVENT STREAMING IN ANY WAY, SHAPE, OR FORM WHATSOEVER!

It not strictly being needed, I can understand. But saying that it would completely prevent streaming is just nonsense.

@phoboslab

Non sequitur. The width/height is mandated to be always present; a decoder can rely on these.

Exactly my point! Every single argument for size being unreliable also makes width and height be unreliable as well, but they are obviously needed and (in one way or another) already being checked, so size, which doesn't actually need to be checked or even considered when not needed, should be even less problematic.

actually use it for a big portion of its use-cases

Which are?

Any time you want to decode more than one image at once (parallel), without using custom/third-party add-ons like additional headers or indices.

F.e. You have a large number of small images (game textures?) and want to avoid the overheads of both working with lots of small files or having to fully decode a single enormous meta-file to find a specific image. You could use some kind of archive, but that would either be completely custom and non-standard or third-party and too general/overly complicated. With a size field, you could simply pack them one after the other and only need to iterate headers while staying fully within the vanilla spec.

I agree that it is more of a (very) nice to have than an absolute necessity.

The main reason I'm still arguing is because of the reason for its exclusion. Because the statement that it would prevent streaming is one which I, as stated above, strongly disagree with.

If you decide to not include a size field because of complexity, then that is ultimately fine.

phoboslab commented 2 years ago

Well yes, there is no mandatory size field because it would prevent streaming encode. There is no optional size field because I don't like the idea of it being optional.

you could simply pack them one after the other

That's a rather constructed scenario. Not having an index in this case seems like a bad idea. Even Doom WADs have one :)

I was relying on the assumption that by the time we finalize, we will either have special handling for multiple QOI_RUN_8 tags or a QOI_RUN_16/QOI_RUN_24 with an 8 bit tag that would deal with this.

It's not needed. A run length of 62 is sufficient. The benefits of not having variable-width chunks (or fewer tag types) far outweigh the gain in compression rate. I just tested it again: screenshot_web has the biggest(!) win in compression rate, being 0.2% better than current experimental. It does even less for textures_*, icon_* and pngimg and absolutely nothing for photos.

## Total for images/textures_photo
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
ex:             0         8.0             0        130.28      1981   48.4%
ex.r16:         0         8.6             0        122.39      1981   48.4%

## Total for images/textures_pk01
ex:             0         0.9             0        150.01       178   35.2%
ex.r16:         0         0.9             0        141.63       178   35.1%

## Total for images/screenshot_game
ex:             0         3.1             0        201.80       519   21.0%
ex.r16:         0         3.5             0        179.90       516   20.9%

## Total for images/textures_pk
ex:             0         0.4             0        115.40        75   43.5%
ex.r16:         0         0.4             0        106.81        75   43.5%

## Total for images/textures_pk02
ex:             0         2.2             0        135.83       479   40.4%
ex.r16:         0         2.4             0        126.77       478   40.4%

## Total for images/icon_64
ex:             0         0.0             0        252.73         4   28.7%
ex.r16:         0         0.0             0        221.30         4   28.7%

## Total for images/icon_512
ex:             0         0.6             0        468.22        85    8.4%
ex.r16:         0         0.7             0        386.56        83    8.2%

## Total for images/photo_kodak
ex:             0         3.4             0        116.76       671   43.7%
ex.r16:         0         3.7             0        107.62       671   43.7%

## Total for images/textures_plants
ex:             0         4.5             0        235.19       922   22.2%
ex.r16:         0         5.1             0        208.62       914   22.0%

## Total for images/screenshot_web
ex:            0        18.6             0         437.47      2649    8.3%
ex.r16:        0        23.4             0         347.76      2576    8.1%

## Total for images/pngimg
ex:             0         7.7             0        235.06      1436   20.3%
ex.r16:         0         8.7             0        207.12      1422   20.1%

## Total for images/photo_tecnick
ex:             0        13.3             0        108.04      2527   44.9%
ex.r16:         0        14.3             0        100.99      2527   44.9%

## Total for images/photo_wikipedia
ex:             0        10.2             0        106.29      2102   49.6%
ex.r16:         0        10.9             0         99.41      2102   49.6%

# Grand total for images
ex:             0         2.5             0        185.77       463   25.6%
ex.r16:         0         2.8             0        167.49       461   25.5%
chocolate42 commented 2 years ago

F.e. You have a large number of small images (game textures?) and want to avoid the overheads of both working with lots of small files or having to fully decode a single enormous meta-file to find a specific image. You could use some kind of archive, but that would either be completely custom and non-standard or third-party and too general/overly complicated. With a size field, you could simply pack them one after the other and only need to iterate headers while staying fully within the vanilla spec.

Your use case for a size field is basically to have a slightly more compact tarball format that can only store qoi images and without any metadata like filename so you'll need custom nonsense to find a specific image anyway. Better to just use a tarball IMO. Tar is the standard that should have been instead of every game company delighting in creating their own format, but it's not universally suitable thanks to how it works. In an ideal world there'd be a second standard archive format for compact indexed archival (a binaryily-searchable index of UTF-8 filenames with uint64_t LE file sizes, plus a file count and that's it), but there isn't AFAIK.

erikcorry commented 2 years ago

I'm a bit worried about the demotion of alpha here. I would like to use the format to encode anti-aliased logos and icons. In this sort of file you often don't have very many different colors, but at the edge it fades from opaque to transparent, and back.

For an example, take a look at the blue triangle near the top of http://www.theinformedillustrator.com/2021/01/smooth-operations-anti-aliasing.html

In this case I was counting on using QOI_COLOR with only the a bit set, which would cost only 2 bytes for each change of alpha. In the new format that will be 5 bytes for each change of alpha, or 10 bytes for a typical edge where the alpha goes 0->intermediate->255 or 255->intermediate->0.

Do you have images like this in your test suite?

phoboslab commented 2 years ago

There's still QOI_INDEX to handle few distinct colors. textures_plants/ and pngimg/ have anti-aliased alpha. Compression is still quite ok. But for palleted images with very few colors, QOI is not the best choice anyway.

erikcorry commented 2 years ago

I took the parrot from https://en.wikipedia.org/wiki/File:Creative-Tail-Animal-parrot.svg, rendered as a 192x192 PNG, and the Youtube logo from the Templarian collection, rendered at 96x96. Here are the results:

-rw-rw-r-- 1 erik erik 10861 Dec 8 18:57 Creative-Tail-Animal-parrot.svg.png -rw-rw-r-- 1 erik erik 9590 Dec 8 20:53 parrot-experimental.qoi -rw-rw-r-- 1 erik erik 9019 Dec 8 20:53 parrot-main.qoi

-rw-rw-r-- 1 erik erik 1565 Dec 8 20:54 youtube-experimental.qoi -rw-rw-r-- 1 erik erik 942 Dec 8 20:55 youtube-main.qoi -rw-rw-r-- 1 erik erik 1466 Dec 8 20:50 youtube.png

They both got worse - quite a bit worse in the case of the youtube logo. Creative-Tail-Animal-parrot svg youtube

chocolate42 commented 2 years ago

A third 8 bit tag QOI_OP_COLOR_A might be useful. I tried it on the old data set but it was only used something like 0.05% of the time there.

nigeltao commented 2 years ago

If we use a "mod 61" hash, we could assign the other 3 opcodes in that 64-opcode block as:

Alternatively, r4g4b4a4 could be r3g3b2a8 or s2g4c2a8, where s=r-g and c=b-g.

nigeltao commented 2 years ago

Or, another crazy (and probably too-complicated) idea: separate into two 'planes'. Encode all the RGB data first, using pretty much the scheme so far minus any alpha-related opcodes. Encode the A channel (if present) after that, with a separate bytecode scheme (or even just a basic RLE compression).

rmn20 commented 2 years ago

I completely agree with erikcorry and nigeltao, removing alpha deltas makes it almost pointless to support the alpha channel in qoi, since it will compress really poorly. Maybe we can still somehow add another opcode similar to the old QOI_DIFF_24 like nigeltao suggested?

phoboslab commented 2 years ago

removing alpha deltas makes it almost pointless to support the alpha channel in qoi, since it will compress really poorly

This is absolutely not the case.

## Total for images/pngimg
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      30.2       292.0         59.96          6.19      1201   17.0%
stbi:        27.2       234.7         66.50          7.71      1751   24.8%
qoi.master:   6.9        10.5        262.50        172.09      1415   20.0%
qoi.exp:      6.2         7.8        290.55        231.44      1436   20.3%
nigeltao commented 2 years ago

A quick experiment, starting with luma61-run64. Call the original l61r64 and this experiment l61v02.

What's new is using two of the previously-unused 8-bit opcodes (QOI_OP_Z comes from the QOI_OP_DIFF block, QOI_OP_A comes from the QOI_OP_INDEX block). For opaque source images (with a = 0xFF everywhere), these two opcodes won't be used, so there's no difference in the output .qoi file for l61r64 and l61v02.

#define QOI_OP_INDEX   0x00 // 00xxxxxx  +0 bytes  i6
#define QOI_OP_DIFF    0x40 // 01xxxxxx  +0 bytes  r2g2b2
#define QOI_OP_LUMA    0x80 // 10xxxxxx  +1 bytes  g6s4c4 (s=r-g, c=b-g)
#define QOI_OP_RUN     0xc0 // 11xxxxxx  +0 bytes  n6
#define QOI_OP_Z       0x6a // 01101010  +0 bytes  transparent black (all 0)
#define QOI_OP_A       0x3d // 00111101  +1 bytes  a8
#define QOI_OP_RGB     0x3e // 00111110  +3 bytes  r8g8b8
#define QOI_OP_RGBA    0x3f // 00111111  +4 bytes  r8g8b8a8

On @erikcorry's two images:

10861 parrot.png
 9019 parrot-master.qoi
 9590 parrot-experi.qoi
 9553 parrot-l61r64.qoi
 9202 parrot-l61v02.qoi

1466 youtube.png
 942 youtube-master.qoi
1565 youtube-experi.qoi
1593 youtube-l61r64.qoi
 930 youtube-l61v02.qoi

On the non-opaque sub-directories of @phoboslab's test suite (the "size kb" is the important column; the mpps numbers could be optimized):

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

## Total for images/icon_512
qoi-master:   1.5         1.5        170.74        173.94        80    7.8%
qoi-experi:   1.5         1.2        180.47        224.24       102   10.1%
qoi-l61r64:   1.0         1.4        254.71        190.54        86    8.4%
qoi-l61v02:   1.2         1.3        226.15        206.77        79    7.7%

## Total for images/icon_64
qoi-master:   0.0         0.0        117.66         99.81         4   28.3%
qoi-experi:   0.0         0.0        103.12        106.03         5   31.3%
qoi-l61r64:   0.0         0.0        141.35        106.50         4   28.8%
qoi-l61v02:   0.0         0.0        133.57        103.37         4   26.9%

## Total for images/pngimg
qoi-master:  16.8        19.4        107.53         93.19      1415   20.0%
qoi-experi:  15.5        15.7        116.99        114.90      1445   20.5%
qoi-l61r64:  11.7        16.4        154.61        110.63      1437   20.3%
qoi-l61v02:  13.6        16.8        133.24        107.41      1422   20.1%

## Total for images/textures_pk02
qoi-master:   4.4         5.5         68.97         55.69       504   42.5%
qoi-experi:   4.2         4.4         71.95         69.25       479   40.4%
qoi-l61r64:   3.1         4.3         96.66         70.61       481   40.5%
qoi-l61v02:   3.6         4.5         85.54         66.95       480   40.5%

## Total for images/textures_plants
qoi-master:   8.9        11.5        120.21         92.40       951   22.9%
qoi-experi:   8.0         9.2        132.21        116.02       922   22.2%
qoi-l61r64:   6.0         9.4        177.48        113.29       922   22.2%
qoi-l61v02:   7.2         9.8        148.25        108.32       915   22.0%

[other images/* results not shown]

# Grand total for images
qoi-master:   5.0         6.1         92.49         76.72       485   26.8%
qoi-experi:   4.7         5.0         98.37         92.79       465   25.7%
qoi-l61r64:   3.7         5.1        127.03         91.58       463   25.6%
qoi-l61v02:   4.1         5.2        112.78         89.77       462   25.5%

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate

qoi-l61v02.h.txt

chocolate42 commented 2 years ago

What's new is using two of the previously-unused 8-bit opcodes (QOI_OP_Z comes from the QOI_OP_DIFF block, QOI_OP_A comes from the QOI_OP_INDEX block).

I've been guilty of it in earlier iterations but have come to the conclusion that overlapping encodings should be avoided (unless it's nailed down in the spec that one should be used). By that I mean your encoder might emit QOI_OP_RUN instead of QOI_OP_DIFF, but a different encoder might not so a QOI_OP_DIFF representing no change should not have an overlapping encoding. The 3 tags at the end of a 61 value index are fair game because no conforming encoder should use them for anything else.

There is a replacement for QOI_OP_DIFF that compresses the corpus better and only uses 63 values, so in the above case there would be space for QOI_OP_Z. Lets call it QOI_OP_LUMA1. It's a 6 bit mashup of LUMA's vg_r/vg_b and ANS coding the values (like GDELTA) so we can use ranges other than powers of 2 (which we do to increase green's range and better fit around 0). vg_r=-1..1, vg=-3..3, vg_b=-1..1. Apologies for lack of example but the code below should be relatively plug and play. The last value is unused.

Encode

    if (
        vg_r > -2 && vg_r < 2 &&
        vg > -4 && vg < 4 && 
        vg_b > -2 && vg_b < 2
    ) {
        bytes[p++] = QOI_OP_LUMA1 | (((vg_b + 1) * 21) + ((vg_r + 1) * 7) + (vg + 3));
    }

Decode

else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA1) {
    b1 ^= QOI_OP_LUMA1;
    int vg = (b1 % 7) - 3;
    b1/=7;
    px.rgba.r += vg - 1 + (b1 % 3);
    px.rgba.g += vg;
    b1/=3;
    px.rgba.b += vg - 1 + (b1 % 3);
}

Just changing the latest experi commit from DIFF to LUMA1 does this

# Grand total for ../qoi_benchmark_suite/images
qoi-ex2103:      2.3         3.2        199.36        144.40       463   25.6%
qoi-gd8b:      2.8         3.4        168.58        138.35       459   25.4%

There are similar gains to be had giving LUMA the ANS treatment (but smaller, the 8 bit version is more critical) but that's bikeshed territory for now.

edit: It may look like processing speed is tanked but perhaps not much when optimised, delta7 in the bikeshed uses a similar but wider 7 bit LUMA/ANS encoding and still achieves decent speeds.

edit: Actually as a mix of luma and ans it should be called lama

nigeltao commented 2 years ago

It may look like processing speed is tanked but perhaps not much when optimised, delta7 in the bikeshed uses a similar but wider 7 bit LUMA/ANS encoding and still achieves decent speeds.

In case anyone else is wondering, here's a link to delta7 code and benchmark numbers.

jmi2k commented 2 years ago

Edit: thinking about it some more, a 7-byte 0x00 shouldn't occur in a stream either, since repeated QOI_INDEX chunks should be encoded as QOI_RUN instaed.

Is this an explicit condition? I recall reading somewhere around here that an encoder should be free to make any kind of decision about how to encode the data as long as valid opcodes are used (an extreme case would be an encoder emitting QOI_OP_RGB for each pixel. Stupid but technically correct). Looks like 8 zeros in a row would be a valid QOI opcode sequence, even if useless. What do you think?

chocolate42 commented 2 years ago

Speaking of padding, wouldn't requiring the total size to be a multiple of 8 bytes zero padded be better than always outputting 7 zero bytes? Making the header 16 bytes should also eliminate unaligned loads and/or make handling slightly simpler (wouldn't it? I don't know much about the nitty gritty of optimising).

If the above is true why not go one step further and require total size to be a multiple of 32 so SIMD can make some safe optimising assumptions (in particular for AVX2, but many SIMD are in this sort of range or variable which should still benefit)?

Lokathor commented 2 years ago

There is nearly nothing that can be made SIMD about this format, other than maybe a memcpy.

phoboslab commented 2 years ago

Is this an explicit condition? I recall reading somewhere around here that an encoder should be free to make any kind of decision about how to encode the data as long as valid opcodes are used (an extreme case would be an encoder emitting QOI_OP_RGB for each pixel. Stupid but technically correct). Looks like 8 zeros in a row would be a valid QOI opcode sequence, even if useless. What do you think?

Technically correct. I'm hesitant to sacrifice one more bit of QOI_INDEX for the end-code, though. So I guess we should just make that explicit: "Consecutive QOI_INDEX chunks with the same index value are illegal and must be encoded as a QOI_RUN instead."

Or we switch the QOI_RUN and QOI_INDEX tags and make the 61 values wide as proposed by @chocolate42 - which after some testing had no compression benefit and is somewhat aesthetically less pleasing to me :/

Edit: Narf. Specifying that an encoder must not produce consecutive QOI_INDEX chunks would actually not be enough to have a distinct "ending tag". Even one QOI_INDEX: 0 may be misleading: If a stream ends with a QOI_INDEX: 0 and is then followed by the 8-bytes padding, a decoder that attempts to split QOI files will split one byte early.

chocolate42 commented 2 years ago

Prime numbers are always aesthetic ;)

Powers of two are useful for performance and for some ops it makes sense, IMO index and run are two ops where powers of two do not make sense. I'm experimenting with grouping run/index/small-ops into a single mask to optimise the opcode distribution. How does an index size of 47 tickle your fancy?

phoboslab commented 2 years ago

I'm generally against introducing more opcodes if they do not increase compression by a lot. Without refactoring QOI to use some notion of blocks and/or more complicated encoding schemes, we will not beat a well optimized PNG anyway. QOI is all about simplicity with a reasonable compression rate.

Thinking about the end-code some more: even if the padding is specified to be an otherwise unused 8-bit tag a decoder may still split QOI files too early if they just happen to end with a QOI_RGBA where the color value(s) match this tag. So the end-code must be a multi-byte sequence that can not occur within the stream 🤔

Lokathor commented 2 years ago

I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times.

Fast operations with decent compression seems like a better goal point. So whatever makes things go fast without hurting compression should probably be favored. Eg: using powers of 2 over prime numbers.

ikskuh commented 2 years ago

I think simplicity is maybe the wrong metric to aim for. You only implement the encoder/decoder some small number of times.

If we go that route, we'll end up PNG in the end. As you can always argue that something might improve compression rate. Simplicity is nice as you can use QOI to explain people how some basic compression works in general and you can easily proof correctness of implementation.

I prefer simple and small libs with marginal drawbacks to complexer ones nowadays because i can manage and maintain them. I don't wanna maintain a libpng, but libqoi is just so tiny, i can do that

Lokathor commented 2 years ago

You say that, but PNG does at least two things that are not helpful for compression ratio or decompression speed. Possibly more. It's really not so great, it's just what we all use at the moment.

As you can always argue that something might improve compression rate.

I would expect some concrete evidence if someone was making a specific claim of increased compression or better decompression speed.

The example of powers of two compared to primes is just an avenue to investigate, I don't think anyone was trying to advocate a specific spec change without evidence.