The final(?) specification

phoboslab commented 2 years ago

I want to apologize.

I may have been too quick with announcing the file format to be finished. I'm frankly overwhelmed with the attention this is getting. With all the implementations already out there, I thought it was a good idea to finalize the specification ASAP. I'm no longer sure if that was the right decision.

QOI is probably good enough the way it is now, but I'm wondering if there are things that could be done better — without sacrificing the simplicity or performance of this format.

One of these things is the fact that QOI_RUN_16 was determined to be pretty useless, and QOI could become even simpler by just removing it. Maybe there are more easy wins with a different hash function or by distributing some bits differently? I don't know.

At the risk of annoying everyone: how do you all feel about giving QOI a bit more time to mature?

To be clear, the things I'd be willing to discuss here are fairly limited:

I don't want more features (higher bit depth, custom headers, more meta info...).
I'm generally against any ideas that make the format more complex (e.g. mode switches, transforms into YUV colorspace...)
I don't want to make decoding of the chunks dependent on some information in the header (e.g. different behaviours for 3 or 4 channels)

What I'm looking for specifically is:

Changes that make the format simpler
Changes that would yield better performance when en-/decoding
Changes that improve compression without making QOI more complex

Should we set a deadline in 2-3 weeks to produce the really-final (pinky promise) specification? Or should we just leave it as it is?

Again, I'm very sorry for the confusing messaging!

Edit: Thanks for your feedback. Let's produce the final spec by 2021.12.20.

ikskuh commented 2 years ago

Things where I see potential for improvement:

Improve the "hash" function by incorporating all bits. Currently, 0x00, 0x40, 0x80, 0xC0, 0xFF all map to index 0 as the hash ignores the 2 most significant bits
Removal of QOI_RUN_16 seems like a good idea after reading the issue, which means that QOI_RUN_8 gets one additional bit for encoding 1..64
Making a 1 run encode with QOI_INDEX and increasing QOI_RUN_8 to 2..65 is probably not so bad either
I would remove the color space field in the header, but I have no strong feelings here
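The collisions described in the first point are easy to reproduce. A minimal sketch, assuming the current hash is the XOR of the four channels reduced modulo 64 (the index has 64 slots); the function name is illustrative.

```c
#include <stdint.h>

/* Assumed current hash: XOR all four channels, then reduce to a 0..63 slot.
   Bits 6 and 7 of the XOR are thrown away by the % 64. */
static unsigned xor_hash(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (unsigned)(r ^ g ^ b ^ a) % 64;
}
```

Any pixel whose four channels are equal, such as opaque white `(0xFF, 0xFF, 0xFF, 0xFF)`, cancels to slot 0; a channel value of `0x40`, `0x80` or `0xC0` contributes nothing once `% 64` drops bits 6 and 7; and swapping two channels never changes the result.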

Apart from that, I'm already super-happy with the format, as it is very CPU-friendly in terms of branching (my impl has 0.9% branch misses, which surprised me), and it looks like people have already streamed QOI video to the Oculus Quest 1 at 50 FPS, so it seems to be usable for such heavy-load use cases as well.

And I agree that the format doesn't need any new features. It's really good at what it already does.

oscardssmith commented 2 years ago

IMO, QOI version 1 is a really good first version. I think it makes sense to wait for a month or so before committing to a second version, since we want to make sure there is time to find good ideas.

https://github.com/nigeltao/qoi2-bikeshed/issues/14 has some really good analysis of opcode frequency which will be very useful for optimizing opcodes. While I understand why you don't want different 3 vs 4 channel behavior, I think we should look carefully at it. Separating them can give a pretty easy 10% increase in compression ratio because you simply have shorter opcodes, so you can fit more data in. This comes at a minor complexity cost, but I think it is somewhat offset by the fact that it simplifies the QOI_COLOR opcode. The 4 bits for which of RGBA to store is a lot of complexity that only gets used 0.7% of the time (other than for removing the alpha channel)

laurelkeys commented 2 years ago

I would remove the color space field in the header, but I have no strong feelings here

I think it's really nice to be able to differentiate between linearly encoded channels and sRGB-encoded RGB channels with a linear alpha channel (as this is the most common case for "images in the wild").

However, I'd like to point out that having the default 0x00 case mean all channels are "gamma compressed" -- including alpha -- is quite error-prone. For instance, see https://github.com/floooh/qoiview/issues/3 (and the linked PR adding QOI support to tev for more information on this).

I'm not sure what the best way to handle this is. IMO, 0x00 should mean QOI_SRGB_LINEAR_ALPHA, but maybe it's easier/simpler to "swap" the current meaning of the colorspace bits such that a 0-bit indicates linear and a 1-bit indicates sRGB (?)

This would lead to 0x00 == QOI_LINEAR, 0x01 == QOI_SRGB and 0x0f == QOI_SRGB_LINEAR_ALPHA, signaling that the choice of sRGB was intentional (i.e. the qoi_desc struct wasn't simply zero-initialized).

oscardssmith commented 2 years ago

In terms of a better hash function, I think the best approach would be to just interpret the RGBA value as a 32-bit uint, and perform an xor and a multiply with two 32-bit prime numbers. This should give much better randomization at very low cost.

While we're improving the index, it also might make sense to try adding a 16 bit instruction that is a combination of index with delta (QOI_INDEX_DELTA?). You could use 4 bits for the tag, 6 bits to store the index, and have 6 bits left to store the lower 2 bits of RGB values. This could be encoded and decoded relatively easily by having a second table where values are hashed by their upper 6 bits, which would allow O(1) lookup at the cost of a little memory.

I think this instruction would be really valuable for things like photographs and dithered images where there is a little noise that will lower the number of exact matches. Currently for these types of images, we use a 32 bit QOI_COLOR tag roughly 14% of the time, and QOI_DIFF_24 roughly 12% of the time. For images without alpha, this tag would reduce those sizes by 50% and 33% when used, which would be a really big advantage.

ikskuh commented 2 years ago

While we're improving the index, it also might make sense to try adding a 16 bit instruction that is a combination of index with delta (QOI_INDEX_DELTA?). You could use 4 bits for the tag, 6 bits to store the index, and have 6 bits left to store the lower 2 bits of RGB values. This could be encoded and decoded relatively easily by having a second table where values are hashed by their upper 6 bits, which would allow O(1) lookup at the cost of a little memory.

This sounds very complex to me, and even if it might improve the compression rate, the compression speed will heavily suffer from it, as you now have to search the index array.

In terms of a better hash function, I think the best approach would be to just interpret the RGBA value as a 32-bit uint, and perform an xor and a multiply with two 32-bit prime numbers. This should give much better randomization at very low cost.

True, I think we could go with something easier though, as multiplication is still costly on lower-end hardware. Maybe an adapted Pearson hash for 6 bits with a 64-byte LUT, but the LUT makes my performance-drop sense tingle. I think some xor-shifting that doesn't discard any bits might already be sufficient.

oscardssmith commented 2 years ago

you now have to search the index array

No, you don't. For compression, the code is roughly:

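// Assumes px.v packs r, g, b, a from the low byte up (the {r, g, b, a} union on a
// little-endian machine) and that index_trunc is an array of 64 unsigned ints.
unsigned int pxtrunc = px.v & 0xfffcfcfc;       // clear the low 2 bits of r, g, b; keep alpha whole
index_pos = QOI_COLOR_HASH(pxtrunc) % 64;
if (index_trunc[index_pos] == pxtrunc) {
    unsigned int diff = px.v - pxtrunc;          // only the dropped low bits remain
    unsigned char low_bits = (diff & 0x3)
                           | ((diff & 0x300) >> 6)
                           | ((diff & 0x30000) >> 12);
    bytes[p++] = QOI_INDEX_DELTA | (index_pos >> 2);    // 4-bit tag + top 4 index bits
    bytes[p++] = ((index_pos & 0x3) << 6) | low_bits;   // low 2 index bits + 6 delta bits
}
else {
    index_trunc[index_pos] = pxtrunc;            // miss: remember it and fall back to the other ops
}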

The key here is that you store a second index where the values are hashed based on their upper 6 bits rather than the whole value. This keeps the lookup at O(1) at the expense of some extra memory.
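A self-contained sketch of both directions of the proposed QOI_INDEX_DELTA op, using the bit layout described above (4-bit tag, 6-bit index, 2 low bits each of R, G and B). The tag value `0xB0`, the `rgba_t` type, the helper names and the placeholder hash are all hypothetical; it works on explicit channels rather than `px.v`, so it does not depend on byte order.

```c
#include <stdint.h>

#define OP_INDEX_DELTA 0xB0u /* hypothetical 4-bit tag 1011 */

typedef struct { uint8_t r, g, b, a; } rgba_t;

/* Clear the low 2 bits of r, g, b; alpha is kept whole (the op has no alpha bits). */
static rgba_t truncate_px(rgba_t p) {
    p.r &= 0xFC; p.g &= 0xFC; p.b &= 0xFC;
    return p;
}

/* Placeholder hash; any hash works as long as encoder and decoder agree. */
static unsigned trunc_hash(rgba_t t) {
    return (unsigned)((t.r >> 2) ^ (t.g >> 2) ^ (t.b >> 2) ^ (t.a >> 2)) % 64;
}

static int same_px(rgba_t x, rgba_t y) {
    return x.r == y.r && x.g == y.g && x.b == y.b && x.a == y.a;
}

/* The decoder calls this for every pixel it produces by any other op. */
static void remember_px(rgba_t table[64], rgba_t px) {
    rgba_t t = truncate_px(px);
    table[trunc_hash(t)] = t;
}

/* On a hit, writes a 2-byte chunk to out and returns 1. On a miss, stores the
   truncated pixel and returns 0 so the caller can use another op. */
static int encode_index_delta(rgba_t table[64], rgba_t px, uint8_t out[2]) {
    rgba_t t = truncate_px(px);
    unsigned pos = trunc_hash(t);
    if (same_px(table[pos], t)) {
        unsigned low = (px.r & 3u) | ((px.g & 3u) << 2) | ((px.b & 3u) << 4);
        out[0] = (uint8_t)(OP_INDEX_DELTA | (pos >> 2)); /* tag + top 4 index bits */
        out[1] = (uint8_t)(((pos & 3u) << 6) | low);     /* low 2 index bits + 6 delta bits */
        return 1;
    }
    table[pos] = t;
    return 0;
}

/* O(1) decode: rebuild the index position, then OR the low bits back in. */
static rgba_t decode_index_delta(const rgba_t table[64], const uint8_t in[2]) {
    unsigned pos = ((in[0] & 0x0Fu) << 2) | (in[1] >> 6);
    rgba_t p = table[pos];
    p.r |= in[1] & 3u;
    p.g |= (in[1] >> 2) & 3u;
    p.b |= (in[1] >> 4) & 3u;
    return p;
}
```

Both lookups are a single table access, which is the point of hashing on the truncated value: no search over the index is needed on either side.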

phoboslab commented 2 years ago

Ok, I guess we still have some things to iron out. Let's set the deadline for the final spec to 2021.12.20.

Thanks for your patience <3

andrewmd5 commented 2 years ago

Was there a technical reason the spec chose to adopt big endian over little?

oscardssmith commented 2 years ago

I think separate channels is likely a really bad idea. If you have separate channels, it becomes impossible to compress while keeping byte alignment which is really bad for performance. Delta coding also does increase the size by 1 bit since you go from 0:255 to -255:255, which again is probably really bad. Tiling might be a good idea.

phoboslab commented 2 years ago

Was there a technical reason the spec chose to adopt big endian over little?

No. Big endian seemed like the right thing to do for a file format. I guess I'll have to revisit https://github.com/phoboslab/qoi/issues/36 - but there's tradeoffs either way.

separate channels tiling use delta

All interesting ideas, but the result would be a different file format. Outside of the scope of what I'm willing to change here.

QOI_INDEX_DELTA (...) this tag would reduce those sizes by 50% and 33%

How did you arrive at these numbers? I did a quick and dirty test and got a ~3% smaller file size for the Kodak images.

oscardssmith commented 2 years ago

For QOI_INDEX_DELTA, I just meant that when it's applicable, it replaces a 3- or 4-byte value with a 2-byte value, not that that would be the effect over the whole image. For the test, did you use the current r^g^b^a or one of the more complicated hashes described above?

oscardssmith commented 2 years ago

It does expand the file though. See my above comment.

gamblevore commented 2 years ago

I think separate channels is likely a really bad idea. If you have separate channels, it becomes impossible to compress while keeping byte alignment which is really bad for performance. Delta coding also does increase the size by 1 bit since you go from 0:255 to -255:255, which again is probably really bad. Tiling might be a good idea.

No... that's not how it works. In theory, yes, but in practice, no. I guess it wraps around: you just consider that going from 255 to 0 you add 1, and going from 0 to 255 you subtract 1. So it's like a loop of 0-255 values. Imagine them wrapped around a tube.
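The wraparound described here is arithmetic modulo 256 on 8-bit values. A minimal sketch; the function names are illustrative.

```c
#include <stdint.h>

/* Difference between two channel values, taken modulo 256. */
static uint8_t wrap_diff(uint8_t prev, uint8_t cur) {
    return (uint8_t)(cur - prev);
}

/* Inverse: add the stored difference back, again modulo 256. */
static uint8_t wrap_apply(uint8_t prev, uint8_t diff) {
    return (uint8_t)(prev + diff);
}

/* Read the same byte as a signed step in -128..127 (two's complement). */
static int as_signed_step(uint8_t diff) {
    return diff < 128 ? (int)diff : (int)diff - 256;
}
```

So the step from 255 to 0 is stored as 1 and the step from 0 to 255 as 255, which reads as -1 when interpreted as a signed byte. Every difference fits in 8 bits, and small steps in either direction stay small.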

phoboslab commented 2 years ago

For the test, did you use the current r^g^b^a or one of the more complicated hashes described above?

Since the lower 2 bits are omitted for the trunc_index, I did: trunc_index_pos = (px.rgba.r >> 2) ^ (px.rgba.g >> 2) ^ (px.rgba.b >> 2) ^ (px.rgba.a >> 2);

oscardssmith commented 2 years ago

Can you try a version with trunc_index_pos = 0x1071b495*(px.v & 0x92458355)+0xcb533df9;? I think it will give much better mixing. (Similarly, testing index_pos = 0x92458355*px.v+0xcb533df9; would be interesting). The magic constants are randomly chosen 32 bit primes.
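A detail worth checking when trying these: if the product is reduced with `% 64`, only the low 6 bits of the result survive, and the low 6 bits of `a*v + c` depend only on the low 6 bits of `v`. With the usual little-endian `{r, g, b, a}` union, that would be the low 6 bits of red alone. This sketch compares `% 64` with keeping the top 6 bits instead; the function names are illustrative and the constants are the ones quoted above.

```c
#include <stdint.h>

/* 0x92458355*v + 0xcb533df9, reduced to an index with % 64. */
static unsigned hash_mod64(uint32_t v) {
    return (uint32_t)(0x92458355u * v + 0xcb533df9u) % 64;
}

/* Same product, keeping the top 6 bits. In a product, each input bit can only
   affect result bits at or above its own position, so the top bits are the
   ones that can depend on all of v. */
static unsigned hash_top6(uint32_t v) {
    return (uint32_t)(0x92458355u * v + 0xcb533df9u) >> 26;
}
```

For example, `hash_mod64(0x12)` and `hash_mod64(0xABCDEF12)` are always equal, because both inputs share the same low 6 bits.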

gamblevore commented 2 years ago

I think separate channels is likely a really bad idea. If you have separate channels, it becomes impossible to compress while keeping byte alignment which is really bad for performance.

How?

normally this: RGBA,RGBA now this: RR, GG, BB, AA

8 bytes in both approaches. Where is the loss of byte alignment?

ikskuh commented 2 years ago

normally this: RGBA,RGBA now this: RR, GG, BB, AA

This means you have to sweep the output memory four times instead of once, hurting your cache locality a lot. One of the reasons why QOI is so fast is that it touches every memory cell in both input and output exactly once. If we change this, we will get huge performance losses that aren't recoverable by any "win" in the compression rate.

gamblevore commented 2 years ago

Decompression speed is usually more important. Compression speed should still be fast enough. An operation like "channel splitting" could work at near-memcpy speeds.


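#include <stddef.h>
#include <stdint.h>

// restrict means the pointers don't overlap, so the compiler is free to do more opts.
// Assumes each pixel is packed with R in the low byte (RGBA bytes on a little-endian machine).
void channel_split(const uint32_t* restrict Px, size_t n, uint8_t* restrict R, uint8_t* restrict G, uint8_t* restrict B, uint8_t* restrict A) {
    while (n-- > 0) {
        uint32_t P = *Px++;
        *R++ = (uint8_t)P;
        *G++ = (uint8_t)(P >> 8);
        *B++ = (uint8_t)(P >> 16);
        *A++ = (uint8_t)(P >> 24);
    }
}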

A good compiler schedules the R, G, B and A writes so they all happen in the same cycle, and would also unroll the loop.

You won't lose much more speed than a memcpy would have cost you.
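On the decoding side, which is argued above to matter more, the planes have to be interleaved again. A minimal sketch of that inverse, assuming the same R-in-the-low-byte packing as `channel_split`; `channel_merge` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Re-interleave four channel planes into packed pixels (R in the low byte). */
void channel_merge(const uint8_t* restrict R, const uint8_t* restrict G,
                   const uint8_t* restrict B, const uint8_t* restrict A,
                   size_t n, uint32_t* restrict Px) {
    while (n-- > 0) {
        *Px++ = (uint32_t)*R++ | ((uint32_t)*G++ << 8) |
                ((uint32_t)*B++ << 16) | ((uint32_t)*A++ << 24);
    }
}
```

Like `channel_split`, it reads and writes each byte once, so it should also be close to memory-bound.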

slembcke commented 2 years ago

Letting it mature sounds like a good idea I guess. Though on the other hand, even if you had 3 different revisions of the format it would still be relatively simple. ;)

Maybe the best way to go about it is to encourage people to use it as a simple intermediate format for now and NOT a long term storage format? Undoubtedly there will be subtle issues that will change your mind about how qoi should work, or even what it is. You've obviously found a nice local maximum in the design space, give it a little time to climb all the way to the top. :)

I don't personally care about compatibility because I'm just using it as an intermediate format. QOI's defining trait that made me say "why not?" was its tiny, hackable implementation. It was a one-line make rule to drop it into my asset pipeline, and a nearly 1:1 replacement for stb_image with some (mild) benefits. Why not indeed. Being easily hackable, I even pushed the alpha pre-multiplication into the encoder and don't have to care whether it's considered "standard" or not. (Though I am curious if there's a technical reason for that beyond "keep it simple", which is a good reason.)

mgmalheiros commented 2 years ago

As a humble suggestion to better define the scope of this format, perhaps it could be restated as:

"a simple and fast run-length encoding of byte groups"

Thus, if a group is actually a RGB triplet, then it is outside of the format scope to specify the colorspace (that's semantic info which should be somewhere else). Likewise, if the groups are RGBA quads, the actual alpha format is up to the application using it.

Following my naïve interpretation, there are only three info fields needed in the header besides the magic: width, height and channels.

I believe these are useful as a convenience (when holding images, you have the resolution and channels) but are actually required when pre-allocating the buffer where decoding will happen. And the number of channels is needed as the "byte group" size for hashing/encoding/decoding.

Therefore, covering the 1 and 2-channel cases could be both very simple to support and useful, as the format would be able to hold grayscale images, indexed pixelart (actual palette info is outside the scope), tilemap info and 2-channel textures (common in certain Physically-Based Rendering pipelines).

Just my two cents.

phoboslab commented 2 years ago

In the experimental branch I have removed QOI_RUN_16 and added a new operation, based on the idea that changes in luma would affect all three RGB channels in the same direction: QOI_GDIFF_16 (needs a better name).

 - QOI_GDIFF_16 ------------------------------------
|         Byte[0]         |         Byte[1]         |
|  7  6  5  4  3  2  1  0 |  7  6  5  4  3  2  1  0 |
|-------------+--------+-------------------+--------|
|  1  1  0  1 |  dr-dg |    green diff     |  db-dg |

4-bit tag b1101
3-bit   red channel difference minus green channel difference -4..3
6-bit green channel difference from the previous pixel -32..31
3-bit  blue channel difference minus green channel difference -4..3

The green channel is used to indicate the general direction of change and gets
a few more bits. dr and db base their diffs off of the green channel diff. E.g.
  dr = (last_px.r - cur_px.r) - (last_px.g - cur_px.g)
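The op can be sketched in C as follows, using the sign convention of the formula above (last minus current). This assumes the three fields are stored with a bias (+4 for the 3-bit fields, +32 for green), as QOI's existing DIFF ops do, and that alpha must be unchanged; the function names and `rgba_t` are illustrative, not the experimental branch's code.

```c
#include <stdint.h>

#define OP_GDIFF_16 0xd0u /* 1101xxxx */

typedef struct { uint8_t r, g, b, a; } rgba_t;

/* Returns 1 and writes two bytes if cur fits QOI_GDIFF_16 relative to last. */
static int gdiff16_encode(rgba_t last, rgba_t cur, uint8_t out[2]) {
    int dg = last.g - cur.g;
    int dr_dg = (last.r - cur.r) - dg;
    int db_dg = (last.b - cur.b) - dg;
    if (cur.a != last.a || dg < -32 || dg > 31 ||
        dr_dg < -4 || dr_dg > 3 || db_dg < -4 || db_dg > 3)
        return 0;
    /* 16-bit word: tag(4) | dr-dg(3) | green diff(6) | db-dg(3), all biased */
    unsigned w = (OP_GDIFF_16 << 8) | ((unsigned)(dr_dg + 4) << 9) |
                 ((unsigned)(dg + 32) << 3) | (unsigned)(db_dg + 4);
    out[0] = (uint8_t)(w >> 8);
    out[1] = (uint8_t)w;
    return 1;
}

/* Undo the bias, then apply the green diff to all three channels. */
static rgba_t gdiff16_decode(rgba_t last, const uint8_t in[2]) {
    unsigned w = ((unsigned)in[0] << 8) | in[1];
    int dr_dg = (int)((w >> 9) & 7u) - 4;
    int dg = (int)((w >> 3) & 63u) - 32;
    int db_dg = (int)(w & 7u) - 4;
    rgba_t cur = last;
    cur.g = (uint8_t)(last.g - dg);
    cur.r = (uint8_t)(last.r - (dr_dg + dg));
    cur.b = (uint8_t)(last.b - (db_dg + dg));
    return cur;
}
```

For example, going from `(100, 100, 100)` to `(92, 90, 88)` gives a green diff of 10 with `dr-dg = -2` and `db-dg = 2`, which fits, whereas a green step of 60 does not.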

Encoded sizes (avg, kb)

             master  experimental
wallpaper     10640         10170
kodak           771           700
misc            400           407
textures        184           179
screenshots    2582          2491

Screenshots and misc suffer a bit from the removal of QOI_RUN_16. Sadly, misc also doesn't gain much from this new op. It seems to be best for photos or "natural" images.

I have also aligned the chunk prefixes so that they are either 2-bit or 4-bit. No more 3-bit codes. This could probably lead to some performance improvement for the decoder.

#define QOI_INDEX     0x00 // 00xxxxxx
#define QOI_RUN       0x40 // 01xxxxxx
#define QOI_DIFF_8    0x80 // 10xxxxxx
#define QOI_DIFF_16   0xc0 // 1100xxxx
#define QOI_GDIFF_16  0xd0 // 1101xxxx
#define QOI_DIFF_24   0xe0 // 1110xxxx
#define QOI_COLOR     0xf0 // 1111xxxx
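With every tag now 2 or 4 bits long, a decoder can classify a chunk's first byte by its top two bits and only look at the next two bits for the 11xxxxxx group. A sketch of that dispatch; the enum and function names are illustrative, and whether it is actually faster than a chain of masked comparisons would need benchmarking.

```c
#include <stdint.h>

enum op { OP_INDEX, OP_RUN, OP_DIFF_8, OP_DIFF_16, OP_GDIFF_16, OP_DIFF_24, OP_COLOR };

/* The top two bits select a 2-bit tag; 11xxxxxx falls through to the 4-bit tags. */
static enum op classify(uint8_t b) {
    switch (b >> 6) {
    case 0:  return OP_INDEX;    /* 00xxxxxx */
    case 1:  return OP_RUN;      /* 01xxxxxx */
    case 2:  return OP_DIFF_8;   /* 10xxxxxx */
    default: break;              /* 11xxxxxx */
    }
    switch ((b >> 4) & 3) {
    case 0:  return OP_DIFF_16;  /* 1100xxxx */
    case 1:  return OP_GDIFF_16; /* 1101xxxx */
    case 2:  return OP_DIFF_24;  /* 1110xxxx */
    default: return OP_COLOR;    /* 1111xxxx */
    }
}
```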

Note that I rearranged the if statements in the encoding function to make it easier to experiment, but it makes encoding a bit slower.

I think that's an easy win!? The new operation is very easy to implement with just two more subtractions.

A rather unsuccessful experiment was to encode the "acceleration" of change for each channel instead of the "velocity" (diff) of change. It helps, but not much. QOI_GDIFF yielded far better results.

We also really need a better test suite of images. As noted elsewhere, the only images with an alpha channel are the few ones in misc/. But I also believe that dice.png and fish.png are rather "artificial" examples. I would assume that most textures used in games (and also most use-cases for PNG on the web) have sharper edges for alpha, rather than the semi-transparent gradients presented in those two files.

Anyway, I probably won't have time to work much on this in the next few days, but I'm hyped to get back to it :)

oscardssmith commented 2 years ago

If I made a PR to experimental to update the QOI_RUN_8 to an implementation more like https://github.com/phoboslab/qoi/pull/41 where repeated RUN_8 instructions are used more efficiently, would you be willing to merge it? Doing so should be a notable speedup for decompression, and give better file size.

nigeltao commented 2 years ago

In the experimental branch I have removed QOI_RUN_16 and added a new operation, based on the idea that changes in luma would affect all three RGB channels in the same direction: QOI_GDIFF_16 (needs a better name).

If we're throwing experiments out there, see also https://github.com/nigeltao/qoi2-bikeshed/issues/20

toddjonker commented 2 years ago

This sounds very complex to me, and even if it might improve the compression rate, the compression speed will heavily suffer from it, as you now have to search the index array.

It sounds attractive to have a simple-to-understand and fast-to-decode format that allows for more complexity (and innovation) on the encoder side. If a clever encoder wants to spend a lot of wall-clock time to get great compression, that doesn't harm the simplicity of the standard format.

amber8706 commented 2 years ago

Hi! I like the idea of this format. I would like to see an optional section for metadata as part of the format. In general terms, it would be great to store additional information about the image in the format. I consider the metadata section optional in the specification. The "general" metadata properties could be the following: date and time of image creation, location, author, general description, etc. In addition to the "general" metadata, it should be possible to add any other metadata in "key=value" format. If the metadata section is too large, it should be possible to compress it (for example, using LZMA or PPM). I repeat: the metadata section is optional. The operation of the QOI format should not depend on it.

notnullnotvoid commented 2 years ago

If you need real-world alpha textures for the benchmark suite, I have a whole game's worth of high-res painted alpha sprites you could pick through and freely use. The slowness of the PNG format was a huge impediment in our asset pipeline, so having a fast format that performs reasonably well on sprites with transparency is something I'm quite personally invested in. Let me know and I can send a big zip of the relevant images.

rmn20 commented 2 years ago

I think that abandoning 3-bit tags could be a bad move, as it reduces the delta sizes in DIFF_16, which is one of the most used opcodes at the moment. (Maybe GDIFF_16 would work even better alongside a 5/4/4-bit DIFF_16?) Another way might be to remove DIFF_16 altogether and enlarge the GDIFF_16 deltas to 4/6/4, 4/7/3 or 3/7/3 bits.

DamonHD commented 2 years ago

Maybe look at some of the same lossless encoding test images as JXL?

https://github.com/libjxl/libjxl/issues/727

Lokathor commented 2 years ago

I think having a version value in the header is probably a good plan, and you've got two extra bytes to play with while keeping the full header within a nice and round 16 byte total.

taotao54321 commented 2 years ago

I wrote a QOI compression visualizer for the current specification. Indeed, QOI_RUN_16 seems to be quite rare except for very simple images (e.g. retro games).

jazzykoala commented 2 years ago

Hey Dominic, kudos for this breath of fresh air in the (quite conservative) world of image compression. As we all know, brilliant ideas are always concise, like E=mc^2, and yours looks like one, too. Now, if I may throw a couple of possible minor improvements into the mix:

[x] Hashing speed & quality. The fastest non-cryptographic hashing algorithm out there is probably xxHash; it claims to be "working at RAM speed limit". Doing r^g^b^a is a bit collision-prone because x^y = y^x. Even such a straightforward optimization as multiplying each component by an odd prime number, (r*3 + g*5 + b*7 + a*11) mod HashTableSize, can boost the hash's collision resistance.
[x] I second your aspiration for simplicity and brevity: no colormaps, no metadata etc. - just compress an array of bytes (actually, not necessarily imagery). For all those bells and whistles, a resource container can be leveraged, such as RIFF (see AVI, WAV). If I needed to encode something with QOI right now, I'd just introduce a new RIFF chunk type 'QOI1' ('1' is to distinguish from future versions of this format) and enjoy all the benefits of having other chunk types for storing whatever my particular project needs.
[x] To deal with vertical stripes and other non-horizontal shapes, some kind of tiling could do the trick. Perhaps splitting images into 8x8 blocks (processable in parallel), or (at the cost of losing the general simplicity) why not fold 2 dimensions to 1 dimension with Z-curves (cheaper to compute, good quality) / Hilbert curves (harder to compute, best quality) - it's a matter of speed vs. size trade-offs. To become universal/prominent, the algo should deal with all kinds of images more or less evenly, i.e. compress vertical/diagonal content quite as successfully as horizontal content.
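A sketch of the prime-weighted hash from the first point, as a drop-in for `r^g^b^a` with a 64-entry table; the function name is illustrative.

```c
#include <stdint.h>

/* Weight each channel by a different odd prime before summing, then reduce
   to a 0..63 index slot. */
static unsigned weighted_hash(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return ((unsigned)r * 3 + (unsigned)g * 5 + (unsigned)b * 7 + (unsigned)a * 11) % 64;
}
```

Unlike XOR, swapping two channel values can change the result: `weighted_hash(1, 2, 0, 255)` is 2 while `weighted_hash(2, 1, 0, 255)` is 0. The final `% 64` still only sees each channel modulo 64, though, so `(0x40, 0, 0, 0)` and `(0, 0, 0, 0)` still collide.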

Sorry for not providing any patches/benchmarks here - C isn't my language, alas.

oscardssmith commented 2 years ago

@taotao54321 any chance you can test against the experimental branch? It has a few improvements.

taotao54321 commented 2 years ago

@taotao54321 any chance you can test against the experimental branch? It has a few improvements.

I may try it if I have time, though I don't know whether I can keep up with the development of that branch :)

ThomasMertes commented 2 years ago

Great that the specification is not final for now. A period of several weeks is IMHO necessary to consider and test different ideas to improve the format. I agree that the basic idea of simplicity should be kept. Otherwise there is the danger of a "designed by committee" format.

My suggestion would be:

Add the size field back to the header. I consider a size in the header as very important. See the link for my arguments.
I still think that header versioning would make much sense.
Regarding removing QOI_RUN_16: every use of it can save considerable space, so it is not the number of QOI_RUN_16 usages that should be counted, but the space they save. If I understand it correctly, one QOI_RUN_16 can encode up to 8223 pixels. It would take 129 improved QOI_RUN_8 chunks (with a count up to 64) to do the same. In photographs QOI_RUN_16 is probably rare, but I guess that in screenshots it can be quite common.
Enlarge the image test suite (Take that with a grain of salt: I did not check the image suite QOI uses). It should reflect the average images found in the internet. So schemas, drawings etc should also be part of it.

oscardssmith commented 2 years ago

There are 2 possible good solutions to QOI_RUN_16. The first is https://github.com/phoboslab/qoi/pull/53 which makes it so that consecutive QOI_RUN_8 codes work differently to encode larger numbers. The second is https://github.com/nigeltao/qoi2-bikeshed/issues/20 which has QOI_RUN_16 and QOI_RUN_24 tags that both use 8 bit tags. Both of these solutions are very effective at encoding large runs efficiently without using up valuable tag space.

makapuf commented 2 years ago

About the colorspace metadata, I'm +1 on removing it (storing it out of band). But if it's kept, given that the format is good at storing four 8-bit channels, why not use it as an enum, merge it with channels (fixed is good), and allow it to store format info like RGBA (sRGB/linear alpha), RGB linear, YUVA or Lab (as well as 0 for unknown?)

jmi2k commented 2 years ago

Great, now I can bikeshed here too! What do you all think about making color differences unbiased, just two's complement? It will be easier for the hardware (not like it's hard right now, but if the goal is making the format simpler, let's remove some unneeded biases and adders from HW implementations). More details here https://github.com/nigeltao/qoi2-bikeshed/issues/11
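To make the biased vs. two's complement question concrete, here is a sketch of both encodings for a 4-bit diff field (range -8..7); the field width and function names are illustrative.

```c
#include <stdint.h>

/* Biased: store d + 8, so -8..7 maps to 0..15; decode by subtracting 8. */
static uint8_t enc_biased4(int d)     { return (uint8_t)(d + 8); }
static int     dec_biased4(uint8_t f) { return (int)(f & 15u) - 8; }

/* Two's complement: keep the low 4 bits of d; decode by sign-extending bit 3. */
static uint8_t enc_twos4(int d)     { return (uint8_t)((unsigned)d & 15u); }
static int     dec_twos4(uint8_t f) { return (f & 8u) ? (int)(f & 15u) - 16 : (int)(f & 15u); }
```

With a bias of exactly half the field's range, the two forms differ only in the field's top bit (biased = two's complement XOR 8), and the sign-extended two's complement value can be added to the 8-bit channel without a separate bias subtraction.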

jmi2k commented 2 years ago

I second your aspiration for simplicity and brevity: no colormaps, no metadata etc.

just compress an array of bytes (actually, not necessarily imagery). For all those bells and whistles, a resource container can be leveraged, such as RIFF (see AVI, WAV). If I needed to encode something with QOI right now, I'd just introduce a new RIFF chunk type 'QOI1' ('1' is to distinguish from future versions of this format) and enjoy all the benefits of having other chunk types for storing whatever my particular project needs.

This is currently my approach when using QOI: forget about the header and go directly to the compressed byte stream. I consider that the header format isn't even that important, as a QOI stream can be put inside any container in future implementations. I'm far, far more interested in the quality and independence of that RGB(A) compressed stream spec.

Lokathor commented 2 years ago

One thing that isn't yet specified is if the "origin" of the image is top left or bottom left.

nigeltao commented 2 years ago

As the format has been unfrozen, I'll have more to say about the header in a separate comment. Just on the opcodes, here are some pretty sweet throughput numbers on my laptop (with similar compression sizes) for a proof of concept, backwards incompatible with the current opcode bit-packing, called "demo 10".

        decode ms   encode ms   decode mpps   encode mpps   size kb
images/kodak
qoi-master:   6.4         8.4         61.15         47.01       771
qoi-experi:   6.9         9.4         57.36         41.91       700
qoi-demo10:   4.0         6.8         98.25         58.03       772
demo10/master ratio:                   1.61x         1.23x
images/misc
qoi-master:   5.7         6.0        154.59        147.09       400
qoi-experi:   6.0         6.6        148.95        134.45       407
qoi-demo10:   2.7         4.3        327.75        204.86       400
demo10/master ratio:                   2.12x         1.39x
images/screenshots
qoi-master:  54.4        48.8        151.33        168.53      2582
qoi-experi:  53.9        51.3        152.61        160.47      2491
qoi-demo10:  28.1        33.6        292.57        244.87      2481
demo10/master ratio:                   1.93x         1.45x
images/textures
qoi-master:   1.8         2.1         74.17         62.86       184
qoi-experi:   1.7         2.3         75.03         56.17       179
qoi-demo10:   0.9         1.5        138.48         85.30       180
demo10/master ratio:                   1.87x         1.36x
images/wallpaper
qoi-master: 138.5       154.1         67.69         60.82     10640
qoi-experi: 143.3       155.7         65.38         60.19     10170
qoi-demo10: 101.1       124.1         92.70         75.50     10669
demo10/master ratio:                   1.37x         1.24x

Details:

qoi-master is the phoboslab/master branch, commit e9069e1.
qoi-experi is the phoboslab/experimental branch, commit cbb62ea.
qoi-demo10 is the little-endian index-7 "demo 10".

chocolate42 commented 2 years ago

I think the following QOI_RUN format makes the most sense, it's likely what someone has suggested already but it's hard to tell so here it is explicitly. QOI_RUN is 2 bit tag 6 bit data, QOI_RUN_16 doesn't exist (as in current experimental branch). Sequential QOI_RUN bytes are used to encode larger runs efficiently. Two consecutive QOI_RUN bytes can encode a 12 bit run, 3 bytes can encode an 18 bit run, etc. Simple and effective, doesn't pollute tags, scales. It seems ideal but maybe I'm missing something?
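A minimal sketch of this chained encoding, assuming the 6 data bits of each byte are concatenated most-significant group first and store the run length minus one. The cap `RUN_MAX_BYTES` is a hypothetical addition that bounds how far a decoder has to look ahead; with a cap, an encoder that must split an over-long run emits a full-length chain first, so the decoder knows where one run ends.

```c
#include <stddef.h>
#include <stdint.h>

#define RUN_TAG       0x40u /* 01xxxxxx */
#define RUN_MAX_BYTES 5     /* hypothetical cap: at most 30 bits of run length */

/* Encode a run of n >= 1 pixels; returns bytes written, or 0 if n exceeds the cap
   (the caller would then split the run). */
static size_t run_encode(uint64_t n, uint8_t *out) {
    uint64_t v = n - 1;
    size_t len = 1;
    while (len < RUN_MAX_BYTES && (v >> (6 * len)) != 0)
        len++;
    if ((v >> (6 * len)) != 0)
        return 0;
    for (size_t i = 0; i < len; i++)
        out[i] = (uint8_t)(RUN_TAG | ((v >> (6 * (len - 1 - i))) & 63));
    return len;
}

/* Decode a chain starting at in[0] (the caller has checked it is a run byte).
   Stops at the cap or at the first byte that is not a run byte. */
static uint64_t run_decode(const uint8_t *in, size_t avail, size_t *used) {
    uint64_t v = 0;
    size_t i = 0;
    while (i < avail && i < RUN_MAX_BYTES && (in[i] & 0xC0u) == RUN_TAG) {
        v = (v << 6) | (in[i] & 63u);
        i++;
    }
    *used = i;
    return v + 1;
}
```

A run of 64 still takes one byte (`0x7F`), while a run of 65 takes two (`0x41 0x40`).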

oscardssmith commented 2 years ago

With sequential QOI_RUN as a feature, it works better to use 4 bit tag 4 bit data. If you have more than 16 of the same pixel in a row, shrinking that more won't gain you as much as lowering the number of bytes for other stuff.

jmi2k commented 2 years ago

I think the following QOI_RUN format makes the most sense, it's likely what someone has suggested already but it's hard to tell so here it is explicitly. QOI_RUN is 2 bit tag 6 bit data, QOI_RUN_16 doesn't exist (as in current experimental branch). Sequential QOI_RUN bytes are used to encode larger runs efficiently. Two consecutive QOI_RUN bytes can encode a 12 bit run, 3 bytes can encode an 18 bit run, etc. Simple and effective, doesn't pollute tags, scales. It seems ideal but maybe I'm missing something?

I see a problem with this: there would be no upper limit on how long a run can be. Imagine, dunno, 100 QOI_RUN bytes in a row. That would represent a 600-bit run length! Also, it would require reading ahead an unknown number of bytes before you know what the run length actually is. I think both reasons would make implementations more difficult.

Lokathor commented 2 years ago

Allowing a maximum of 16 (tag4) or 18 (tag2) bit runs seems sufficiently generous, and keeps an upper bound on how much the parser might have to look ahead.

oscardssmith commented 2 years ago

For screenshots, it can be useful to have more than that. @jmi2k my solution to this was to artificially limit the maximum run to 2^31 on the encoding side, and equivalently, a maximum of 8 QOI_RUN commands on the decoding side.

Lokathor commented 2 years ago

Yeah 31 bits also works, the exact number isn't too important as long as there's some cap. The data stream can just contain additional runs if an image really needs to go above that.

phoboslab commented 2 years ago

I agree that supporting longer runs is a good idea. I'm leaning towards nigeltao's solution here:

if the run length in the QOI_RUN byte is 63 read one more byte for a run length of 63..318
if the run length in the QOI_RUN byte is 64 read two more bytes for a run length of 319..65854

I believe that supporting longer runs than that has very diminishing returns.
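A sketch of this extended run, assuming the 6-bit field of QOI_RUN stores the run length minus one (so "63" and "64" are its two highest codes) and that the extra bytes are big-endian like the header; the names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define RUN_TAG 0x40u /* 01xxxxxx; data bits hold run length - 1 */

/* Write a run of n pixels (1..65854); returns bytes written, or 0 if out of range. */
static size_t ext_run_encode(uint32_t n, uint8_t *out) {
    if (n >= 1 && n <= 62) {          /* direct: data 0..61 */
        out[0] = (uint8_t)(RUN_TAG | (n - 1));
        return 1;
    }
    if (n >= 63 && n <= 318) {        /* length 63 marker, then one byte */
        out[0] = (uint8_t)(RUN_TAG | 62);
        out[1] = (uint8_t)(n - 63);
        return 2;
    }
    if (n >= 319 && n <= 65854) {     /* length 64 marker, then two bytes */
        uint32_t x = n - 319;
        out[0] = (uint8_t)(RUN_TAG | 63);
        out[1] = (uint8_t)(x >> 8);   /* assumed big-endian */
        out[2] = (uint8_t)x;
        return 3;
    }
    return 0;
}

/* Read a run starting at in[0] (tag already checked); *used gets the byte count. */
static uint32_t ext_run_decode(const uint8_t *in, size_t *used) {
    uint32_t len = (uint32_t)(in[0] & 63u) + 1;
    if (len == 63) { *used = 2; return 63 + in[1]; }
    if (len == 64) { *used = 3; return 319 + (((uint32_t)in[1] << 8) | in[2]); }
    *used = 1;
    return len;
}
```

The decoder never looks past the third byte, so lookahead stays bounded.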

Also, swapping the tag bits for QOI_RUN and QOI_GDIFF looks like a good idea (sorry, forgot who suggested it). With a 4bit tag for the QOI_RUN we would only have a max run length of 14 in one byte, but since runs represent a very large compression rate anyway, spending one or two more bytes for runs longer than 14 shouldn't be a big deal.

Just to respond quickly to some other comments:

I still want to look into a better hash function, but so far the simpler ones that I tried showed no real (consistent) improvement. Seems to be quite dependent on the particular image.
The colorspace field should probably just be an ENUM
Z-Curves or some block ordering is super interesting, but I think it would complicate QOI too much. It also opens this whole other can of worms about CPU cache lines (that I frankly don't know much about). So, the pixel ordering will stay as it is: left to right, top to bottom.

I'll be back working on QOI next week. First order of business will be setting up a proper test-suite with some more/better images.

Thanks for all the suggestions and experiments everyone <3

jmi2k commented 2 years ago

I, personally, wouldn't rely on multiple QOI_RUN to encode larger run lengths. It forces lookahead and complicates the decoding (now the behavior doesn't only depend on the tag type, but also on its contents and the previous opcodes found). However, I'm aware that the potential gains are big, and there are good reasons for doing it like you're proposing. My desire for making a simple HW decoder is getting in the way.

tl;dr: I don't like the behavior you're proposing, but don't listen to me too much. Do what's best :)

phoboslab commented 2 years ago

@jmi2k can you elaborate on this? A 16bit or 24bit QOI_RUN would be distinguishable in the first byte (1101 1110 = 1 more byte, 1101 1111 = 2 more bytes (with the 4bit tag 1101)). Needing to look at the "contents" of the first byte is a requirement already for QOI_COLOR anyway!?

Edit: I guess I misunderstood and you were referring to another solution for representing longer runs.

jmi2k commented 2 years ago

No you're right, and looks like I'm bikeshedding. There is no (technical) disadvantage, what you proposed in your last comment is fine. Maybe it makes the implementation just a tiny bit more complicated (I would have to try implementing it) but if the gains are considerable it's fine.

phoboslab / qoi

The final(?) specification #48