phoboslab / qoi

The “Quite OK Image Format” for fast, lossless image compression
MIT License

Future spec revision maybe? #191

Closed RogerSanders closed 2 years ago

RogerSanders commented 2 years ago

A colleague pointed me to this recent spec a couple of days ago. I've also done work on high-speed PNG-like compression, in my case for streaming from a server to a client over the internet, where low latency is the goal and a combination of data size and compression/decompression speed is key. I took a different route, using a method based on Shannon-Fano coding and a 24->18 bit downsampling. I've been very impressed with QOI. The hash-based table lookup and very well crafted difference modes do a fantastic job of encoding at compression ratios that are close to my algorithm (carbon), without the lossy downsampling, and with faster encoding times.

I did some profiling of QOI however, and found it has weaknesses in two areas.

  1. Long run encoding
  2. Uncompressed successive RGB literals

These correspond to the best and worst case scenarios (most compressible data, and least compressible data). If you take a sampling, you'll find that run lengths in the 1-16 count range are used reasonably frequently, but this drops off pretty well exponentially, until you hit run lengths of 62, which are used a LOT in most "synthetic" images. Right now, QOI can encode a run of 62 in a single byte, but longer runs can't do any better than one byte per 62 pixels. The other weakness is "solid" data that can't be compressed. Right now an RGB triple goes from 3 bytes to 4 in all cases, meaning you're looking at a worst case scenario of a 33% size increase.

I know the spec has gone "final" now, but I've taken QOI and made a slight tweak to the encoding. In my shorthand notation, here's a comparison of QOI vs my tweaked "QOI2":

QOI:
// 00 IIIIII = Color table entry I
// 01 RRGGBB = Diff from previous pixel, -2 to 1
// 10 GGGGGG RRRR BBBB = Diff from previous pixel, G -32 to 31, R and B diff to green diff -8 to 7
// 11 YYYYYY = Repeat previous pixel Y+1 times, up to 62.
// 11111110 RRRRRRRR GGGGGGGG BBBBBBBB = Literal RGB value
// 11111111 RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA = Literal RGBA value

QOI2:
// 00 IIIIII = Color table entry I
// 01 RRGGBB = Diff from previous pixel, -2 to 1
// 10 GGGGGG RRRR BBBB = Diff from previous pixel, G -32 to 31, R and B diff to green diff -8 to 7
// 110 YYYYY = Repeat previous pixel Y+1 times, up to 32.
// 111 YYYYY = Repeat previous pixel (Y+2)*32 times, up to YYYYY=11011
// 111111 XX RRRRRRRR GGGGGGGG BBBBBBBB = X+1 literal RGB values, up to XX=10
// 11111111 RRRRRRRR GGGGGGGG BBBBBBBB AAAAAAAA = Literal RGBA value
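One way to sanity-check that this opcode layout is unambiguous is a classifier that tests prefixes from most specific to least specific. This is purely an illustrative sketch of the QOI2 proposal above (the enum and function names are mine, not from the QOI reference code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative tag dispatch for the proposed QOI2 byte layout. */
typedef enum {
    OP_INDEX, OP_DIFF, OP_LUMA, OP_RUN, OP_RUNEX, OP_RGB, OP_RGBA
} qoi2_op;

static qoi2_op qoi2_classify(uint8_t b)
{
    if (b == 0xFF)          return OP_RGBA;  /* 11111111                 */
    if ((b & 0xFC) == 0xFC) return OP_RGB;   /* 111111XX, XX <= 10       */
    if ((b & 0xE0) == 0xE0) return OP_RUNEX; /* 111YYYYY, YYYYY <= 11011 */
    if ((b & 0xE0) == 0xC0) return OP_RUN;   /* 110YYYYY                 */
    if ((b & 0xC0) == 0x80) return OP_LUMA;  /* 10GGGGGG ...             */
    if ((b & 0xC0) == 0x40) return OP_DIFF;  /* 01RRGGBB                 */
    return OP_INDEX;                         /* 00IIIIII                 */
}
```

Because 0xFF and 0xFC..0xFE are peeled off first, the RunEx branch only ever sees YYYYY values 0..27 (binary 11011), matching the listing.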

What I've done here is to make better use of the upper end of the space occupied by the "RUN" chunk, to allow very large runs of up to 928 repeated values to be encoded into a single byte. I've also carved out space to allow up to three successive RGB values to be encoded with a single leading byte, taking our worst case size blowout from 33% down to 11%. This is easy to implement in the encoder, you basically just keep on counting a run until it ends, then emit zero or more "RunEx" chunks to swallow large values, and emit the remainder using a "Run" chunk. Successive RGB values are similarly easily emitted, and it adds basically no overhead to the decoder.
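The flush logic described above could be sketched as follows; this is my own hedged interpretation of the proposal (function name and buffer handling invented here), emitting RunEx chunks of (Y+2)*32 = 64..928 pixels until the remainder fits in plain Run chunks of 1..32:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flush a run of `count` pixels as QOI2 chunks. Returns bytes written. */
static size_t qoi2_emit_run(uint8_t *out, uint32_t count)
{
    size_t n = 0;
    /* 111 YYYYY: repeat (Y+2)*32 pixels, Y in 0..27, i.e. 64..928. */
    while (count >= 64) {
        uint32_t blocks = count / 32;   /* 32-pixel multiples covered */
        if (blocks > 29) blocks = 29;   /* cap at (27+2)*32 = 928     */
        out[n++] = (uint8_t)(0xE0 | (blocks - 2));
        count -= blocks * 32;
    }
    /* 110 YYYYY: repeat Y+1 pixels, 1..32 (runs of 33..63 need two). */
    while (count > 0) {
        uint32_t r = count > 32 ? 32 : count;
        out[n++] = (uint8_t)(0xC0 | (r - 1));
        count -= r;
    }
    return n;
}
```

For example, a 928-pixel run becomes the single byte 0xFB, and a 239-pixel run becomes one RunEx (224 pixels) plus one Run (15 pixels).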

The end result of these changes is higher compression rates on every single image in the benchmark suite, with minimal effect on encoding/decoding times.

Here are a couple of images semi-randomly picked out of the test suite for reference, with a breakdown of the usage counts of each chunk type shown below, to show the frequency of chunks being used ("RunEx" being runs of 32-pixel multiples, "RGB+" being RGB chunks which encode more than one successive value):

## qoi_benchmark_suite\images/screenshot_game/ADR1FT_gameplay_screenshot.png size: 1111x625
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:      14.0       212.1         49.47          3.27       966   47.5%
stbi:        15.1       154.0         46.11          4.51      1363   67.0%
qoi:          5.6         9.6        123.49         72.23      1208   59.4%
qoi2:         6.4         9.8        109.31         70.54      1172   57.7%
carbon:       3.8        14.6        183.02         47.60      1138   55.9%
qoi:    INDEX   DIFF    LUMA    RUN     RGB     RGBA
        27983   110726  363194  57295   78668   0
qoi2:   INDEX   DIFF    LUMA    RUN     RGB     RGBA    RunEx   RGB+
        27983   110726  363194  57296   21074   0       0       21590

## qoi_benchmark_suite\images/screenshot_game/Ai_dungeon_with_a_custom_setting.png size: 1200x678
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:       3.9        25.6        210.24         31.77        46    1.9%
stbi:         4.0        53.6        201.89         15.18        69    2.9%
qoi:          1.6         5.0        524.13        164.00        62    2.6%
qoi2:         1.5         5.4        553.21        151.54        45    1.9%
carbon:       1.1         5.9        719.12        137.85        37    1.6%
qoi:    INDEX   DIFF    LUMA    RUN     RGB     RGBA
        5339    1317    4009    14258   8695    0
qoi2:   INDEX   DIFF    LUMA    RUN     RGB     RGBA    RunEx   RGB+
        5339    1317    4009    1730    541     0       907     2839

And here's the overall statistics for the entire benchmark suite (10 runs), with my old algorithm (carbon) thrown in for good measure:

# Grand total for qoi_benchmark_suite\images
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
libpng:       7.6        92.9         60.76          5.00       398   24.2%
stbi:         8.4        75.9         55.19          6.12       561   34.2%
qoi:          2.6         3.6        178.40        130.67       463   28.2%
qoi2:         2.6         3.7        175.60        126.34       449   27.3%
carbon:       1.9         7.5        244.32         61.74       362   22.1%

I think this pretty well maximizes the compression potential of QOI without dropping down to sub-byte chunk alignment. Thought I'd post it here to share, and for future consideration. I'm most likely going to adopt this modified QOI implementation in my work.

sectokia commented 2 years ago

Great idea.

It can be improved further. Because your changes avoid the 1111 111x RUN collision, you can make use of more bits for the literal count.

Example:

// 110 XXXXX = Repeat previous pixel 1 to 32 times.
// 1110 XXXX = 1 to 16 RGB literals
// 11110 XXX = 1 to 8 RGBA literals
// 11111 XXX = Repeat previous pixel 64, 128, ..., 8192 times (2^(6+x))

The above should achieve slightly better overall performance, while in the worst case reducing bloat from QOI's 33% down to under 6%.

RogerSanders commented 2 years ago

> Great idea.
>
> It can be improved further. Because your changes avoid the 1111 111x RUN collision, you can make use of more bits for the literal count.
>
> Example:
>
> // 110 XXXXX = Repeat previous pixel 1 to 32 times.
> // 1110 XXXX = 1 to 16 RGB literals
> // 11110 XXX = 1 to 8 RGBA literals
> // 11111 XXX = Repeat previous pixel 64, 128, ..., 8192 times (2^(6+x))
>
> The above should achieve slightly better overall performance, while in the worst case reducing bloat from QOI's 33% down to under 6%.

Thanks for your feedback. Based on my profiling, I'm not convinced longer runs of literal RGB values yield much gain. Because the diff modes in QOI are so powerful, and especially when combined with the recent values hash table, I found it was uncommon for literal RGB runs to exceed 3-4 pixels, in the benchmark images at least. Likewise, exceedingly long repeats of previous values (8192) are of more limited utility, because we can already do 928 in one byte with this new encoding, so with 9 bytes we can already exceed 8192 pixels repeated. Such a long encoding would only save 8 bytes on this extreme case, while requiring more redundant bytes to "hit" the correct value with more common, shorter odd run lengths, like 419. That would require 2 bytes with my proposed encoding, but 4 bytes with your proposed changes. I can run an experiment, but I suspect I'll find a heavier hit on encoding/decoding times, with most likely a worse outcome outside of images with frequent alpha changes (which I was less concerned with).
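The byte counts in the paragraph above can be checked with two throwaway cost models; this is my own sketch, which assumes both proposals flush runs greedily from the largest chunk downward:

```c
#include <assert.h>
#include <stdint.h>

/* QOI2 proposal: RunEx covers (Y+2)*32 = 64..928 pixels, Run covers 1..32. */
static int run_bytes_qoi2(uint32_t count)
{
    int bytes = 0;
    while (count >= 64) {
        uint32_t blocks = count / 32;
        if (blocks > 29) blocks = 29;   /* max 928 pixels per byte */
        count -= blocks * 32;
        bytes++;
    }
    while (count > 0) {
        count -= count > 32 ? 32 : count;
        bytes++;
    }
    return bytes;
}

/* Counter-proposal: runs of 2^(6+x) = 64..8192 pixels, plus 1..32. */
static int run_bytes_pow2(uint32_t count)
{
    int bytes = 0;
    while (count >= 64) {
        uint32_t p = 8192;
        while (p > count) p >>= 1;      /* largest power of two <= count */
        count -= p;
        bytes++;
    }
    while (count > 0) {
        count -= count > 32 ? 32 : count;
        bytes++;
    }
    return bytes;
}
```

Under these models a 419-pixel run costs 2 bytes (416 + 3) with the QOI2 scheme versus 4 bytes (256 + 128 + 32 + 3) with the power-of-two scheme, while the extreme 8192-pixel case costs 9 bytes versus 1, matching the trade-off described above.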

sectokia commented 2 years ago

Yeah, you are right. I am going to apply what you have and go from there. Look at the hit rate for LUMA, INDEX, DIFF, and RUN; it's amazingly high, but I think there could be room for another LUMA DIFF type operator to drop RGB literals even further.

rivitoz commented 2 years ago

Roger, I too went for long run-length changes. I like yours better. I will also try clumping RGBA updates. They are triggered by high-gradient changes in color components. If you decode a QOI image, outputting only the pixels for the RGB and RGBA codes on black, you get a nice image containing the big color edges in the image.

In effect, the QOI algorithm works like an edge detector: it's great for smooth images, but abrupt color changes are its Achilles heel (like PNG). It has a hard time with salt-and-pepper noise and dithered images.

Some extra thoughts (I have removed original points 1, 3, 5, and 6 as I no longer support them - RVT 20221304):

2) A code for when just A changes could also be implemented, for an output of 2 bytes instead of 5, at the cost of a compare of RGB. This is only relevant for RGBA images, but it would subsume RGBA codes for images with a variable alpha mask on top of a (nearly) constant meaningful colour.

4) For 1-byte diffs, code 6A (vr=vg=vb=0) will currently never be issued, as it implies RGBA1=RGBA2, which is a run. Code 6A could be used for the new 2-byte A code. Using this means A can be removed from the hash, as A is then handled as an independent variable.

7) A simple, easy-to-implement swizzle on input and output allows handling of pixel layouts other than RGBA (e.g. little-endian is almost all BGRA, but ARGB exists too) into and out of QOI in the core algorithm. This is only needed as a parameter on encode/decode, and it can enable multiple input/output formats to/from the same QOI data. Testing shows no discernible impact on runtime, as constant offsets become variable offsets within 3- or 4-byte sections, all likely in cache.
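The swizzle idea could look roughly like this; the struct, table values, and function name are my own illustration, not from any QOI implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of each canonical channel within a 4-byte source pixel. */
typedef struct { uint8_t r, g, b, a; } qoi_swizzle;

static const qoi_swizzle SWIZZLE_RGBA = {0, 1, 2, 3};
static const qoi_swizzle SWIZZLE_BGRA = {2, 1, 0, 3};
static const qoi_swizzle SWIZZLE_ARGB = {1, 2, 3, 0};

/* Load one pixel from an arbitrary 4-byte layout into canonical RGBA. */
static void load_px(const uint8_t *src, const qoi_swizzle *s, uint8_t px[4])
{
    px[0] = src[s->r];
    px[1] = src[s->g];
    px[2] = src[s->b];
    px[3] = src[s->a];
}
```

The encode/decode loops would read and write pixels only through such a table, so the layout becomes a per-call parameter rather than a compile-time assumption.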

8) A lossy variant is easily implemented by shifting input bytes right on encoding (and back left on decoding) using the same algorithm. This parameter, say Fidelity, equivalent to the number of bits retained (8 to 1), should be saved with the encoded image, as it describes the encoded data and is needed on decoding. As the number of relevant bits decreases, index hits and runs should increase, thus achieving higher compression at lower fidelity. This still leaves a lot of 0 bits to be eliminated, though, so to be really successful it may need some further bit-shifting of the output, or perhaps Huffman encoding too. Yet to be tested.
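A minimal sketch of the Fidelity knob, assuming the parameter counts retained bits (1 to 8) as described; names are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Drop the (8 - fidelity) low bits before encoding. Lossy by design. */
static uint8_t quantize(uint8_t v, int fidelity)    /* fidelity in 1..8 */
{
    return (uint8_t)(v >> (8 - fidelity));
}

/* Shift back after decoding; the dropped low bits come back as zeros. */
static uint8_t dequantize(uint8_t v, int fidelity)
{
    return (uint8_t)(v << (8 - fidelity));
}
```

At fidelity 8 the round trip is lossless; at fidelity 4, for example, 200 quantizes to 12 and dequantizes to 192, which is the loss the shifted representation accepts in exchange for more index hits and longer runs.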

cheers Ron

verdy-p commented 2 years ago

Note also that for HDR, or for lossless compression with a different number of significant channels, you can largely improve the compression while remaining lossless by compressing images in chunks of a given maximum size (e.g. 64K pixels): process the chunk by taking only 1 or 4 bits of color in only 1 channel, and process channels separately in successive steps (as many as required to reach the number of color planes and their maximum bit width). The image header would need to indicate the total number of bits in all color planes, and how many bits are taken in which color plane. A good compression is then achievable for each chunk of 64K pixels, and a higher number of pixel chunks for large images will not affect the average compression ratio. It can then be used for encoding videos as well (a video can be seen as a very tall image where video frames are just stacked vertically, or compressed by outputting them in sequence; the video headers would determine how the frames are split in the output, and maybe could also specify the maximum size of the pixel chunks).

verdy-p commented 2 years ago

Also note that for RGBA with alpha=0, the effective value of the RGB channels does not matter at all, so you should not even depend on their value: a transparent pixel is still fully transparent if you change the base RGB color. So the initial color value could be RGBA(0,0,0,0) or RGBA(255,255,255,0), and it should have NO effect on the encoding of the first pixel, whether that pixel is fully transparent, a plain color, or a half-transparent pixel with alpha between 1 and 254.

For this reason, the alpha channel should be treated specially for any pixel with alpha=0 so that the encoder or decoder does not alter the RGB components color of the previous pixel seen. And if that previous pixel seen is the initial color and that color is fully transparent, the 1st pixel of the actual image can only be encoded using a RLE code if it is fully transparent, or will be an explicit RGB data.

Treating image pixels with alpha=0 specially will benefit many situations for images with transparent backgrounds and non-rectangular shapes, typically having a non-fully-transparent bordering pixel which is typically the same color all around the non-rectangular border: that border can use diff encoding. And diff encoding of non-transparent pixels should not be affected by the presence of intermediate transparent pixels (which are treated and encoded separately, without affecting the encoder state for all other colors).

RunDevelopment commented 2 years ago

> Also note that RGBA, with alpha=0 the effective value of RGB channels does not matter at all

That's a big assumption. E.g., in game development, the alpha channel of textures is sometimes used to store a height map.

I also want to point out that throwing away the color information of A=0 pixels invalidates the requirement of QOI being a lossless image compression format.

RogerSanders commented 2 years ago

I agree, we shouldn't assume we can discard data when A=0. It's a good point that @rivitoz made about the 6A byte prefix being free though, and the numerous 2-byte prefix codes that are available. Does anyone else have ideas about what the single-byte prefix could be used for which would help general images, while being simple/low overhead? I can't think of a better idea than the proposed alpha-only change right now. I'd personally make it override the alpha value of the NEXT pixel that's output though, meaning you could do an alpha-only change combined with a diff or hash lookup (or a run), to get compressed alpha+RGB changes combined without having to resort to a full RGBA literal. Still only helps for images with alpha changes, but it would potentially help them a LOT. Might also allow you to drop the alpha value from the hash, potentially speeding up both compression and decompression. And arguably with a 2-byte flexible alpha change you could drop RGBA literals entirely perhaps, and extend the max run of RGB literals from 3 to 4, which might help non-alpha images a little.

Can anyone else think of something that might be better use for the only remaining 1-byte prefix code?

verdy-p commented 2 years ago

> > Also note that RGBA, with alpha=0 the effective value of RGB channels does not matter at all
>
> That's a big assumption. E.g., in game development, the alpha channel of textures is sometimes used to store a height map.
>
> I also want to point out that throwing away the color information of A=0 pixels invalidates the requirement of QOI being a lossless image compression format.

My assumption is still valid because it only applies to the 1st pixel of the image, and only to encoding it. This means it applies only to the initialization value of alpha, for which there's no reason to think that it is plain black. If the first effective pixel of the image is plain black with alpha=255, it will be encoded as it is today with an RGB(0,0,0) chunk. If there are n (>=2) plain black pixels, it will still encode the first pixel as an RGB(0,0,0) chunk, followed by a RLE(n-1) chunk.

Only when the image starts with fully transparent pixels should a file starting with a RLE(n) chunk be valid: it encodes a number of "blank" (fully transparent) pixels (whose internal RGB values would still be 0,0,0 by default).

And because of this assumption, the cache of RGB values should also contain the value (0,0,0) in its hashed position; as the hashed position for (r,g,b)=(0,0,0) is 0, this does not change the initialization of the cache as a basic vector like:

BYTE cacheRGB[64][3] = {0, 0, 0}; // in C

or

BYTE cacheR[64] = {0}, cacheG[64] = {0}, cacheB[64] = {0}; // in C

MarkJeronimus commented 2 years ago

Getting back on track,

O hai, Roger, And devs,

I too found out experimentally that the 'highest' run length is used too often, and my proposal is not to use two different OPs for run length, but one that can recurse. (Side note: all along, it was technically never 'run length' but 'repeat count'.)

With recurse, I mean that repeated instances of OP_RUN will not add up, but work as a sort of little-endian positional quantity.

With that I mean, repeat counts can be encoded like this:

1..62: OP_RUN(0..61)
63..3906: OP_RUN(0..61) OP_RUN(0..61)
3907..242234: OP_RUN(0..61) OP_RUN(0..61) OP_RUN(0..61)
etc...

(This is NOT positional base 62. It's bijective base 62. Think excel columns.)

Concrete example (from testcard.png): Repeat count = 239.

Encoder (numbers are akin to line numbers, i.e. they hint on looping):

1. repeatCount--;
2. emit OP_RUN(repeatCount % 62); // = OP_RUN(52)
3. repeatCount = (repeatCount - (repeatCount % 62)) / 62 - 1;
4. repeatCount still > 0? Yes
2. emit OP_RUN(repeatCount % 62); // = OP_RUN(2)
3. repeatCount = (repeatCount - (repeatCount % 62)) / 62 - 1;
4. repeatCount still > 0? No
5. repeatCount = 0; // Must be done explicitly as ends with -1.

This results in OP_RUN(52) OP_RUN(2) for a total of (52+1) + (2+1) × 62 = 239

Decoder:

OP_RUN(52) encountered. Emit (52+1)*repeatMultiplier = 53 pixels of the previous color.
repeatMultiplier *= 62;
OP_RUN(2) encountered. Emit (2+1)*repeatMultiplier = 186 pixels of the previous color.
repeatMultiplier *= 62;
239 total pixels have been emitted.
Another OP encountered
repeatMultiplier = 1; // Reset because this OP isn't OP_RUN

The way this is calculated is by keeping track of a repeatMultiplier, which grows by a factor of 62 every iteration. For the encoder, this multiplier can be implicit, as you nibble away the least significant quantity (with modulo) and divide the remainder by 62. For the decoder, this multiplier must be explicit, and always reset unless the OP is OP_RUN. (This resetting behavior can be in each if-branch, or preferably with a boolean that defaults to true.)
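Putting the encoder and decoder together, here is a runnable sketch of the bijective base-62 scheme; a digit array of values 0..61 stands in for the 6-bit arguments of the OP_RUN byte stream (function names are mine):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode repeatCount >= 1 as bijective base-62 digits, least significant
 * first. Each digit d means "d+1" at its position. Returns digit count. */
static size_t run_encode(uint32_t repeatCount, uint8_t *digits)
{
    size_t n = 0;
    int64_t rc = (int64_t)repeatCount - 1;  /* OP_RUN(d) alone means d+1 */
    do {
        digits[n++] = (uint8_t)(rc % 62);
        rc = rc / 62 - 1;                   /* the -1 makes it bijective */
    } while (rc >= 0);
    return n;
}

/* Decode by accumulating (d+1) * multiplier, multiplier growing by 62. */
static uint32_t run_decode(const uint8_t *digits, size_t n)
{
    uint32_t total = 0, mult = 1;
    for (size_t i = 0; i < n; i++) {
        total += (uint32_t)(digits[i] + 1) * mult;
        mult *= 62;
    }
    return total;
}
```

With these, 239 encodes to the digits (52, 2), i.e. OP_RUN(52) OP_RUN(2), and decodes back to (52+1) + (2+1)*62 = 239, matching the worked example above; 62 stays a single digit and 63 becomes (0, 0).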

MarkJeronimus commented 2 years ago

Somehow I can't edit my post above anymore. (GitHub bug?)

I did some analysis on what affects the compression ratio. Average "compression" compared to QOI: 93.65% (max: 123.27%)

[edit] Looks like I made some mistakes, and this is the correct chart. Conclusions don't change. [chart image]

If QOI already compresses fairly well, this modification makes it significantly better. I opened a few such images and they all have large contiguous areas of the same color. Figures.

There are four outliers that perform significantly better despite the original QOI compression being average. They were screenshots of tiled games like Dwarf Fortress (colorized ascii art). Somehow QOI handles these relatively poorly.

m1lkweed commented 2 years ago

It's off-topic but I'd like to propose two further changes if this gets read:

  1. Increasing the header size from 14 bytes to 16 bytes for alignment and the ability to add more data to the header in the future. Perhaps new versions could be marked using a different magic string like "QOIF" to prevent the two new bytes from being misread.

  2. Image rotation data in the header, to better compress images that repeat vertically or would be more compressible if flipped. Storing the rotation and flipping information would only take four bits. In extreme cases this can improve compression 50x.

MarkJeronimus commented 2 years ago

> It's off-topic but I'd like to propose two further changes if this gets read:
>
> 1. Increasing the header size from 14 bytes to 16 bytes for alignment and the ability to add more data to the header in the future. Perhaps new versions could be marked using a different magic string like "QOIF" to prevent the two new bytes from being misread.
>
> 2. Image rotation data in the header, to better compress images that repeat vertically or would be more compressible if flipped. Storing the rotation and flipping information would only take four bits. In extreme cases this can improve compression 50x.
  1. I thought of that too.
  2. That would significantly slow down encoding and decoding, and be less suited for microcontroller applications. I'll take this suggestion for my new project QoiFlow (sort of like a PNGCrush for QOI, but changes the format significantly)