phoboslab / qoi

The “Quite OK Image Format” for fast, lossless image compression
MIT License

The final(?) specification #48

Closed phoboslab closed 2 years ago

phoboslab commented 2 years ago

I want to apologize.

I may have been too quick with announcing the file format to be finished. I'm frankly overwhelmed with the attention this is getting. With all the implementations already out there, I thought it was a good idea to finalize the specification ASAP. I'm no longer sure if that was the right decision.

QOI is probably good enough the way it is now, but I'm wondering if there are things that could be done better — without sacrificing the simplicity or performance of this format.

One of these things is the fact that QOI_RUN_16 was determined to be pretty useless; QOI could become even simpler by just removing it. Maybe there are more easy wins with a different hash function or by distributing some bits differently? I don't know.

At the risk of annoying everyone: how do you all feel about giving QOI a bit more time to mature?

To be clear, the things I'd be willing to discuss here are fairly limited:

What I'm looking for specifically is:

Should we set a deadline in 2-3 weeks to produce the really-final (pinky promise) specification? Or should we just leave it as it is?

Again, I'm very sorry for the confusing messaging!

Edit: Thanks for your feedback. Let's produce the final spec by 2021.12.20.

chocolate42 commented 2 years ago

There's no need to read ahead to know the full length of a run; you can decode incrementally by keeping track of the written count. I.e., encounter QOI_RUN, treat it as a 6-bit value X and write that many pixels; encounter a sequential QOI_RUN, combine into a 12-bit value Y and write (Y-X) more, etc. All you need to track is whether the last code was of type QOI_RUN and the number of pixels written for this run so far.
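A minimal sketch of that incremental scheme in C (the chunk layout and function name are hypothetical, for illustration only):

```c
#include <stddef.h>

/* Incrementally decode a chain of QOI_RUN codes without reading ahead.
   Each successive code widens the combined run value by 6 bits; only the
   pixels not yet written are emitted at each step. */
size_t decode_chained_runs(const unsigned char *chunks, size_t n) {
    size_t combined = 0; /* Y: run length implied by the codes seen so far */
    size_t written = 0;  /* X: pixels already emitted for this run         */
    for (size_t i = 0; i < n; i++) {
        combined = (combined << 6) | (chunks[i] & 0x3F);
        /* ...emit (combined - written) pixels of the current color... */
        written = combined;
    }
    return written;
}
```

Two chained codes carrying 1 and 2 combine into (1 << 6) | 2 = 66 pixels total, with only 65 of them emitted at the second step.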

Here's some rough QOI_RUN estimates based on this run length data: https://github.com/nigeltao/qoi2-bikeshed/issues/14#issuecomment-981312967

The numbers below are the totals needed to encode each corpus with a 4-, 5-, and 6-bit data length QOI_RUN, respectively. Length-1 runs are discarded, 8192+ runs are treated as <=16384 length (benefitting smaller data bit counts), and the data possibly isn't pure, as the rows are off by one by being upper-exclusive (benefitting larger data bit counts). Not exact, but a reasonable estimate:

             |    4-bit |    5-bit |    6-bit |
-------------|----------|----------|----------|
kodak        |   388284 |   386826 |   385959 |
misc         |   257187 |   248727 |   241836 |
screenshots  |  8714649 |  7671192 |  6998421 |
textures     |   978129 |   925929 |   897111 |
wallpaper    | 43648476 | 42295716 | 41814129 |

I'd argue that if QOI_RUN gets bumped from a 2-bit tag by other strong encodings, a 3-bit tag is likely the sweet spot. 4-bit data is a big handicap, not for the rare long runs but for the common medium runs. It'd take better, more representative data to figure out, IMO.

nigeltao's special-casing of 63 and 64 also works; I'm not sure whether that's more or less complex, as it's effectively adding two 8-bit tags.

oscardssmith commented 2 years ago

The argument for a 4-bit tag for QOI_RUN is that a smaller tag only benefits images with a ton of runs longer than 16. Since a run of 16 is already 64x compression, and the next best codes (QOI_INDEX or QOI_DIFF_8) give 4x compression, the maximal regression is 6% (and typically the benefit will be much smaller). The TL;DR is that you get more compression from fixing the stuff that's working poorly than the stuff that's working well.

notnullnotvoid commented 2 years ago

I agree, I feel it's a bit misguided to focus so much on improving cases where the compression is already good. I think it's more important to focus on improving the cases where the compressor is struggling.

tychism commented 2 years ago

Congrats on the soon-to-be-set spec; it looks great. It's both clever and easy to understand, a rare combo. Glad you are keeping longer QOI_RUNs in; there are all kinds of use cases for that: screenshots, slideshows, graphs and charts, text, any picture with a uniform background or border.

One thing I would urge is to keep a few codes in reserve for future versions, so that those versions can be backward compatible with this one. As mentioned above, maybe make the RUN code longer (you could use only length-4 codes, except for DIFF_8 and INDEX, which stay length 2).

As for vertical compression, in the next version you could always do something simple like have QOI_VERTICAL_REPEAT_8 and/or QOI_VERTICAL_DIFF_16 with the pixel(s) above (you just need to know the image width). Don't be tempted by those z-order curve sirens ;).

DamonHD commented 2 years ago

I think a lesson from WEBP and JXL and so on is that a prediction based on the average of left and above is often effective. But I am no expert here.
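For reference, the left/above average predictor works like this (a sketch of the general technique, as in PNG's Average filter; not part of QOI):

```c
/* Predict a channel value from the mean of the left and above neighbors;
   the codec then stores only the residual. A sketch, not part of QOI. */
unsigned char predict_avg(unsigned char left, unsigned char above) {
    return (unsigned char)(((unsigned)left + (unsigned)above) / 2);
}

/* Residual to store; wraps mod 256, as PNG filters do. */
unsigned char residual(unsigned char actual, unsigned char predicted) {
    return (unsigned char)(actual - predicted);
}
```

Like any vertical scheme, this needs the previous scanline available, which is the main cost discussed elsewhere in this thread.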

chocolate42 commented 2 years ago

I agree, I feel it's a bit misguided to focus so much on improving cases where the compression is already good. I think it's more important to focus on improving the cases where the compressor is struggling.

After testing it properly I agree, even going to 5 bit tag QOI_RUN seems fine in the presence of QOI_RUN_16 and QOI_RUN_24

As for vertical compression, in the next version you could always do something simple like have QOI_VERTICAL_REPEAT_8 and/or QOI_VERTICAL_DIFF_16 with the pixel(s) above (you just need to know the image width). Don't be tempted by those z-order curve sirens ;).

I'm toying with an 8-bit code QOI_UP, which uses the value above if identical, but even this simple case is questionable. The reason is that the width is variable, so the lookup table is variable, so you need to use the heap to keep even just the previous line. The spec could limit dimensions to something like 2^16 to avoid the heap, but that's still way too high a burden for things like FPGAs etc. I don't see a way to incorporate another axis without violating KISS.
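To make the burden concrete, a hypothetical QOI_UP check needs the full previous scanline buffered (names and layout here are mine, not from any draft):

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if the pixel at column x equals the pixel directly above it.
   prev_row must hold the whole previous scanline -- this width * 4 bytes
   of state is exactly the burden a vertical code forces on decoders. */
static int matches_above(const uint32_t *prev_row, uint32_t px, size_t x) {
    return prev_row[x] == px;
}

/* The encoder/decoder maintains the row as it goes: */
static void remember_pixel(uint32_t *prev_row, uint32_t px, size_t x) {
    prev_row[x] = px;
}
```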

jazzykoala commented 2 years ago

As for vertical compression, in the next version you could always do something simple like have QOI_VERTICAL_REPEAT_8 and/or QOI_VERTICAL_DIFF_16 with the pixel(s) above (you just need to know the image width).

Is there a way to detect vertical patterns - i.e. find a reason to switch from the horizontal order of one-pass pixel processing to the vertical one - without AI?

The only flaw I see in the current algo is its strong bias towards horizontal color series. All the rest is amazingly cool and simple and still has room for optimization, but if there are no color series detected, there's nothing to compress. And if there's nothing to compress, it no longer matters whether caching can be leveraged, whether hash collisions are frequent, etc. Also, decoding speed usually takes precedence over encoding speed, because the former is performed in real time by end users. In our case the algo has no 'faster/worse compression vs. slower/better compression' knobs, which means all we can do is rely on heuristics.

So the problem is how to deal with all kinds of images (detect color series) more or less evenly in terms of compression ratio. I've heard other image formats tackle this by traversing pixels in a zig-zag order, which is neither vertical nor horizontal and yields good results at a small computational cost. Hilbert curves are cool (Google uses them in their S2 library to divide Earth into cells), but are believed to require more CPU resources.

Tom94 commented 2 years ago

Hilbert curves have even fewer jumps than z-order.

That said -- and I believe this has also already been discussed in other issues -- any non-memory-order access will hurt cache coherence, so there needs to be some trickery involved in making such space-filling curves work well. May not be in the spirit of keeping the format simple.

DanielMagen commented 2 years ago

I might be a bit out of my depth here, but I do want to comment a bit regarding the hashing function and QOI_DIFF.

It seems to me that we want a higher probability of cache hits when the absolute difference between the px and px_prev RGBA values is high. When the difference between px and px_prev is relatively small, we pay a small price for saving the diff using one of the QOI_DIFFs; the smaller the distance, the smaller the price. But if the distance is large and we get a cache miss, we pay the full price of saving the entire pixel value.

This is purely my intuition, but maybe we can afford more cache misses for values closer in range? I.e., maybe the hash function could give similar values to numbers that are close in absolute difference?

I propose trying a hash function like taking the 2 most significant bits of each of r, g, b to get a 6-bit number: ((r >> 6) << 4) | ((g >> 6) << 2) | (b >> 6)

Or, if you want to involve more bits, you can use xor instead of or, something like (for example) ((r >> 5) << 3) ^ ((g >> 5) << 2) ^ ((b >> 5) << 1) ^ (a >> 5)
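Written out as C macros (the names are mine), the two proposals are:

```c
/* Top-2-bits-per-channel hash: a 6-bit index, ignoring alpha. */
#define HASH_TOP2(r, g, b) \
    ((((r) >> 6) << 4) | (((g) >> 6) << 2) | ((b) >> 6))

/* XOR variant mixing in alpha via the top 3 bits of each channel;
   each term stays below 64, so no final mask is needed. */
#define HASH_XOR_TOP3(r, g, b, a) \
    ((((r) >> 5) << 3) ^ (((g) >> 5) << 2) ^ (((b) >> 5) << 1) ^ ((a) >> 5))
```

Note that HASH_TOP2 deliberately maps nearby colors to the same slot, which is the "similar values for close pixels" property argued for above.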

notnullnotvoid commented 2 years ago

After testing it properly I agree, even going to 5 bit tag QOI_RUN seems fine in the presence of QOI_RUN_16 and QOI_RUN_24

Maybe I didn't explain myself properly, so let me clarify. What I mean is that this entire discussion about how best to encode very long runs is itself more or less bikeshedding, and I don't believe qoi needs to be able to encode very long runs at all. Simply encoding consecutive 64-pixel runs is fine. Consider this argument:

If I understand it correctly, one QOI_RUN_16 can encode up to 8223 pixels. It would take 129 of an improved QOI_RUN_8 (with a count up to 64) to do the same.

When you pitch this as "a 129x improvement" (actually 64x when you take the size of the tag into account), it sounds impressive. But you have to realize that this only concerns the absolute best case: it raises the highest possible compression ratio from 1 - 1/(4*64) = 99.61% up to 1 - 2/(4*8223) = 99.99%, a total gain of about 0.4% in the best possible case. Sure, it's a 64x reduction in size (for large blank single-color images), but who cares? The compression ratio is already well past 99%.

By comparison, the kodim set only gets 50% compression compared to raw RGBA data, and only 33% compared to raw RGB. A mere 1.01x improvement on these images would save more space per pixel than the 64x improvement that you are proposing. With a general-purpose image compression algorithm, you should care a lot more about improvements to the compression ratio for images that don't already compress well, because those are the images that take the most space to store.

The compression gains for very long runs are not worth the time we are spending worrying about them, and they are also not worth the extra complexity and slowdown in encoding and decoding. Just get rid of QOI_RUN_16 and give the extra bit to QOI_RUN_8. It's the simplest solution and it's more than good enough.
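The best-case arithmetic used in this argument, spelled out (4 bytes per raw RGBA pixel):

```c
/* Best-case compression ratio for a run code: code_bytes are spent to
   represent `pixels` raw RGBA pixels of 4 bytes each. */
double best_case_ratio(double code_bytes, double pixels) {
    return 1.0 - code_bytes / (4.0 * pixels);
}
```

best_case_ratio(1, 64) ≈ 0.9961 for a 1-byte 64-pixel run, and best_case_ratio(2, 8223) ≈ 0.9999 for a 2-byte QOI_RUN_16, matching the figures above.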

oscardssmith commented 2 years ago

The reason this matters a little more is that while a run of 64 is probably enough, the ability to efficiently encode longer runs lets us drop the run length to 16, which is probably too short to go without a second way of encoding runs.

DanielMagen commented 2 years ago

Thinking about the hash function a little more:

1) Actually, the preferred usage of the hash function is a bit more complicated. Locally, it's safe to assume that neighboring pixels will have RGBA values similar to your own. Not designing the hash function to take proper advantage of that, and delegating the responsibility to the QOI_DIFFs, can be really wasteful.

I think the way to think about this is: if you can use QOI_DIFF_8, then using the hash table is wasteful, and you should almost certainly not use it in that case. (The hash table would be filled with data that doesn't contribute to the overall size of the compressed image; those table cells could be better utilized holding other pixel values.)

If you can use QOI_DIFF_16 or QOI_DIFF_24, then using the hash can be good, but be careful. Since we expect neighboring pixels to have values similar to the current one, it would be good to have those close-valued pixels in the hash table. But we don't want to fill the hash table with those closely packed values and "fail" in case we encounter pixels with more distant values.

If you can't use any QOI_DIFF, then a cache hit is the best outcome, so using the hash table in that case is really good.

The difficulty here is achieving a good interplay between using the hash table for relatively close-valued pixels and for far-valued pixels.

2) A good hash function must of course produce a distribution as close to uniform as possible. So any proposed hash function must be tested with a variety of images to see what sort of key distributions it generates.

Designing a good hash function that takes all of that into consideration is hard. You also want the final result to be as simple as possible, and even if we find a good hash function, it might not be as simple as needed.

So my new proposed hash function/procedure is this: before using the hash table, first check whether you can use QOI_DIFF_8; if you can, great, use it and move on. Then, to hash the RGBA value, hash only the 6 most significant bits and use that as your hash, i.e. hash(rgba) = (r^g^b^a) >> 2

Edit: I checked the hash function/procedure and it's not helpful.

chocolate42 commented 2 years ago

These numbers were obtained from the average bench size of a variant, but should be widely applicable (the RUN_8 baseline changes with the variant, but the run-encoding deltas do not; the variant used is in the ballpark of all the others that exist to date). 6 builds were tested, half of them with a RUN_8 capacity of 16 values, the other half with 8 values. In each set there's a build with RUN_16 and RUN_24 enabled, one with just RUN_16 enabled, and one with neither enabled. RUN_16 and RUN_24 are not full opcodes themselves; each is an 8-bit tag taking up 1/256 of the opcode space, indicating that an 8- or 16-bit integer follows, representing an additional 256/65536 values respectively beyond any smaller encodings.

Any builds that use fewer than 18 opcodes in the chart have unused opcodes that could improve the variant further.

             | RUN_8(8) | RUN_8(8) | RUN_8(8) | RUN_8(16) | RUN_8(16) | RUN_8(16) |
             |          | RUN_16   | RUN_16   |           | RUN_16    | RUN_16    |
             |          |          | RUN_24   |           |           | RUN_24    |
-------------|----------|----------|----------|-----------|-----------|-----------|
tango512     |       95 |       78 |       78 |        83 |        76 |        76 |
kodak        |      690 |      689 |      689 |       689 |       689 |       689 |
misc         |      478 |      408 |      405 |       439 |       407 |       404 |
textures     |      172 |      169 |      169 |       170 |       169 |       169 |
screenshots  |     3018 |     2336 |     2315 |      2606 |      2313 |      2293 |
wallpaper    |    10455 |    10384 |    10383 |     10369 |     10348 |     10348 |
opcodes taken|    8/256 |    9/256 |   10/256 |    16/256 |    17/256 |    18/256 |

RUN_8(8)+RUN_16 clearly looks like the best solution to me, but if there are 8-bit opcodes to spare, adding RUN_24 is a small improvement. I couldn't test RUN_8(64), as that uses more opcode space than was conveniently available; maybe I'll rejig and retest later.
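A sketch of how the RUN_16/RUN_24 scheme described above decodes (the tag values and the RUN_8 field width are hypothetical; the point is that each wider encoding starts counting beyond the previous one's maximum):

```c
#include <stdint.h>
#include <stddef.h>

enum { RUN8_MAX = 8 }; /* RUN_8(8): runs of 1..8 in a single byte */

/* Decode one run opcode; *consumed reports bytes eaten. */
size_t decode_run(const uint8_t *p, size_t *consumed) {
    if (p[0] == 0xFE) {               /* RUN_16: tag + 8-bit integer   */
        *consumed = 2;
        return RUN8_MAX + 1 + p[1];   /* 256 values beyond RUN_8       */
    }
    if (p[0] == 0xFF) {               /* RUN_24: tag + 16-bit integer  */
        *consumed = 3;
        return RUN8_MAX + 256 + 1 + (size_t)(p[1] | (p[2] << 8));
    }
    *consumed = 1;                    /* RUN_8: length in low 3 bits   */
    return (size_t)(p[0] & 0x07) + 1;
}
```

With this offsetting, no run length has two encodings, so each tag's 256/65536 follow-up values are all new.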

oscardssmith commented 2 years ago

I hadn't anticipated how good RUN_8(8) is with RUN_16.

HappySeaFox commented 2 years ago

A couple of feature requests. Maybe for the next spec, not necessarily the current one:

  1. Add version to the header to support future specifications
  2. Allow user meta data to be embedded into QOI files. Maybe start with a list of strings, or string -> string mappings?
  3. Allow ICC profiles to be embedded into QOI files. An ICC profile is just binary data. Similar to what libjpeg or libpng do. They just extract (or embed) ICC profiles as binary data. It's up to the user how to interpret it.
  4. Allow image resolution (per inch) to be embedded into QOI files

Thanks! :+1:

tclarke commented 2 years ago

Since there isn't a chat channel or mailing list, I'll mention here that I've started a Verilog implementation. It's currently tracking the experimental branch qoi.h spec. https://github.com/tclarke/qoi_verilog

nigeltao commented 2 years ago

Add version to the header to support future specifications

Some discussion is at #12.

As for metadata (EXIF, ICC profiles, etc), https://github.com/phoboslab/qoi/issues/28#issuecomment-985320231 is one suggestion.

wbd73 commented 2 years ago

For the color hash function you could try Knuth's multiplicative hash.

#define QOI_COLOR_HASH(C) ((C.v * 2654435761) >> 26)

Since it is only a single multiplication, it should be pretty fast. I tried it with only a few images, and with those the hash function seemed to work best with the master code and less well with your experimental code.

Edit: It seems the multiplication value above doesn't work very well with grayscale images. Something more in line with your experimental code seems to work better: #define QOI_COLOR_HASH(C) ((C.v * 0x0b070503) >> 24) But I realize a hash like this will give different results if the endianness of the encoding computer differs from that of the one decoding.
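For concreteness, the first proposal as a function over the packed 32-bit pixel (the packed value, and therefore the hash, depends on byte order, which is the portability concern noted above):

```c
#include <stdint.h>

/* Knuth-style multiplicative hash: multiply the packed pixel by a large
   odd constant and keep the top 6 bits for a 64-entry table. */
static unsigned hash_knuth(uint32_t v) {
    return (unsigned)((v * 2654435761u) >> 26);
}
```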

tychism commented 2 years ago

I think the QOI_OP_LUMA compression idea is a great one and might possibly be extended for extra compression. For photos, the RGB changes from the previous pixel are often very highly correlated, even for quite large changes (see https://github.com/tychism/qoi_exploring for sample color1 vs. color2 scatterplots), with B&W photos showing perfect correlation. Thus, it might be better for the 2-byte QOI_OP_LUMA to have 8-bit Green data [i.e. any green value] with two 3-bit Red and Blue differentials (vs. Green). Or there could be another 3-byte OP_LUMA with 8-bit green data and 4- or 5-bit B and R data, which might make QOI_OP_RGB pretty rare in some image types (saving 1 byte on each replacement).

Another interesting observation from photos is that when one color has a small difference (e.g. B+1 from the previous pixel), the other two may be less likely to have a 0 difference than a 1 or 2 difference (e.g. see https://github.com/tychism/qoi_exploring/blob/main/RedDifferenceFromPreviousPixel_OutsideOfARun.png). This, along with the expected symmetry of the difference histograms, might make the QOI_OP_DIFF specification "2-bit red channel difference from the previous pixel between -2..1" better replaced with "2-bit red channel difference from the previous pixel, one of -2, -1, +1, +2". In other words, have the 2 bits skip encoding 0 (for all 3 colors, of course). That might catch slightly more pixels for 1-byte compression. This change would also remove the possibility of encoding R+0,G+0,B+0 w.r.t. the previous pixel, which can't happen anyway (it's just part of a run).
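The skip-zero mapping suggested above can be sketched as follows (not part of any spec draft; the function name is mine):

```c
/* Map a 2-bit field to a nonzero diff in {-2, -1, +1, +2}.
   Zero is excluded because an all-zero diff is already covered by a run. */
static int diff_from_bits(unsigned bits) {  /* bits in 0..3 */
    int d = (int)bits - 2;                  /* -2..1          */
    return d >= 0 ? d + 1 : d;              /* skip 0         */
}
```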

Archaistic commented 2 years ago

  • [x] Hashing speed & quality. The fastest non-cryptographic hashing algorithm out there is probably xxHash; it claims to be "working at the RAM speed limit". Doing r^g^b^a is a bit collision-prone because x^y = y^x. Even a straightforward optimization such as multiplying each component by an odd prime number, (r*3 + g*5 + b*7 + a*11) mod HashTableSize, can boost the hash's collision resistance.

This seems to be what's in the current standard draft (as of Dec 11th). How much does hash quality matter? Do collisions cause image corruption, or do they simply increase filesize slightly? I saw no evidence of image corruption from hash collisions before the method changed.

This multiplication-based hash is inconvenient to implement on any CPU architecture without a multiplication instruction. The original XOR-based hash could be implemented in assembly on basically any CPU, down to 1970s 8-bit CPUs, since XOR is an extremely basic function that's in almost every ALU; with the XOR hash you can get away with as few as two registers (or maybe even one, depending on whether you can XOR a register with a memory location). The multiplication can be made faster with shifts, but never as fast as the XOR setup. Some architectures don't even have barrel shifters.

Here's pseudocode for assembly on a generic CPU with 16-bit registers, computing the pre-modulo value for a pixel with no multiply instruction in its ISA and no barrel shifter, shifting only one position per instruction. The MSP430 shifts this way, along with some others. Even if a CPU can shift multiple positions per instruction, simple ones may only shift one position per clock. RISC-V lacks a multiply instruction in the base ISA, and whether the shifter is multi-cycle or barrel is implementation-defined. Values are assumed already loaded into the CPU's registers, since preloading the registers is pretty much always going to look the same regardless of whether we're multiplying or XORing or whatever else.

reg 0 = R
reg 1 = G
reg 2 = B
reg 3 = A
reg 4 = initially just a scratchpad, so it's zero
reg 5 = hex 003F (for the modulo. 003f vs just 3F since the calculations require registers at least 16-bits long...)

Reg 4 <- Reg 0
Reg 4 << 1 (reg4 is now R*2. Shifting left by only one position, since that's how this CPU does it.)
Reg 4 << 1 (reg4 is now R*4)
Reg 4 <- Reg 4 - Reg 0 (reg4 is now R*3. R*4-R*1 =R*3)
Reg 0 <- Reg 4 (reg 0 is gonna have our final pre-modulo hash)

Reg 4 <- Reg 1 (we're now doing the green value)
Reg 4 << 1 
Reg 4 << 1 
Reg 4 <- Reg 4 + Reg 1 (reg4 is now G*5. G*4+G*1 = G*5)
Reg 0 <- Reg 0 + Reg 4

Reg 4 <- Reg 2 (we're now doing blue)
Reg 4 << 1
Reg 4 << 1
Reg 4 << 1 (reg 4 is now B*8)
Reg 4 <- Reg 4 - Reg 2 (reg 4 is now B*7. B*8-B*1=B*7)
Reg 0 <- Reg 0 + Reg 4

Reg 4 <- Reg 3 (we're now doing alpha)
Reg 4 << 1
Reg 4 << 1 
Reg 3 <- Reg 4 - Reg 3 (reg 3 is now A*3)
Reg 4 << 1
Reg 4 <- Reg 4 + Reg 3 (reg 4 is now A*11, A*8+A*3=A*11)
Reg 0 <- Reg 0 + Reg 4 (reg 0 is now our final pre-modulo value)

Reg 0 <- Reg 0 & Reg 5 (modulo operation)

The resulting hash is in Register 0 after 24 instructions.
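The shift-and-add sequence above can be checked against the plain multiply form in C:

```c
/* Shift/add/subtract version of (r*3 + g*5 + b*7 + a*11) & 0x3F,
   mirroring the register pseudocode above. */
static unsigned hash_shift_add(unsigned r, unsigned g,
                               unsigned b, unsigned a) {
    unsigned h, t;
    t = (r << 2) - r;               /* R*4 - R = R*3        */
    h = t;
    t = (g << 2) + g;               /* G*4 + G = G*5        */
    h += t;
    t = (b << 3) - b;               /* B*8 - B = B*7        */
    h += t;
    t = ((a << 2) - a) + (a << 3);  /* A*3 + A*8 = A*11     */
    h += t;
    return h & 0x3F;
}
```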

Here's the multiply-hash pseudocode with a multiply instruction available (i.e. not base RISC-V), in a slightly different format and approach in general (I didn't write this one; thank you @GithubPrankster):

/* clear r0 */
xor r0, r0

/* load red */
load r1, r
mul r1, 0x3
add r0, r1

/* load green */
load r1, g
mul r1, 0x5
add r0, r1

/* load blue */
load r1, b
mul r1, 0x7
add r0, r1

/* load alpha */
load r1, a
mul r1, 0xB
add r0, r1

/* clear out top 2 bits */
and r0, 0x003F

(Notice that this is 14 instructions. Multiplies taking multiple cycles may make this slower than the shift-add-sub approach even on some CPUs with a multiply instruction.)

Meanwhile, here's pseudocode for the XOR hash. This even works on CPUs with 8-bit registers, and with modifications could use as few as two registers even on a load/store-architecture CPU, though I chose to give my fictional 8-bit CPU 5 registers here, for the sake of being pretty much the same environment as the first pseudocode example:

reg 0 = R
reg 1 = G
reg 2 = B
reg 3 = A
reg 4 = hex 3F

Reg 0 <- Reg 0 ^ Reg 1
Reg 0 <- Reg 0 ^ Reg 2
Reg 0 <- Reg 0 ^ Reg 3
Reg 0 <- Reg 0 & Reg 4

And your hash is in Reg 0 after 4 instructions of calculation. I'm only aware of one CPU (the Zilog Z80) where an 8-bit XOR would take more than one cycle.
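For comparison, the XOR hash is a one-liner in C. Note that, for example, opaque white and fully transparent black both hash to 0:

```c
/* The original XOR-based index hash, masked to the 64-entry table. */
static unsigned hash_xor(unsigned char r, unsigned char g,
                         unsigned char b, unsigned char a) {
    return (unsigned)(r ^ g ^ b ^ a) & 0x3F;
}
```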

Keep in mind that while this may seem like only a problem for low-end CPUs, the hash as currently intended in the draft also means hardware QOI decoders (I was thinking about making one in Verilog, for FPGAs) have to include shifters or multipliers, dramatically increasing logic usage vs. a simple XOR function.

If it's possible to revert to the XOR hash (or another one that uses only simple logic operations), I believe it would be ideal to do so, for the benefit of resource-constrained use cases, even at the extremely slight detriment of less-constrained ones.

wbd73 commented 2 years ago

@Jpaul999 Even on a simple CPU you don't need 24 instructions to calculate (r*3 + g*5 + b*7 + a*11) & 0x3f. Here's some pseudocode (h is the hash) using 14 instructions.

h = a
h + h

h + g
h + b
h + h

h + r
h + b
h + a
h + h

h + r
h + g
h + b
h + a

h & 3f

If you also allow for subtraction, you can do it with 13 instructions (1 fewer):

h = b
h + a
h + h

h + r
h + g
h + a

h + h
h + h

h - r
h + g
h - b
h - a

h & 3f

If the CPU also supports shift instructions, you can replace the two subsequent h + h lines with one shift instruction, h << 2, resulting in 12 operations.

I believe some CPUs can also do something like the pseudocode below (t is an additional working register). From what I understand, Arm can do it, and on x86 you could use the LEA instruction for the second and fifth lines.

t = b + a
h = r + (t << 1)
h = h + g
h = h + a
h = g + (h << 2)
h = h - r
h = h - t
h = h & 3f
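That 8-step sequence can be verified in C against the multiply form:

```c
/* wbd73's add/shift sequence written out; t is the extra register.
   Running totals are noted at each step. */
static unsigned hash_seq(unsigned r, unsigned g, unsigned b, unsigned a) {
    unsigned t = b + a;
    unsigned h = r + (t << 1);   /* r + 2b + 2a         */
    h = h + g;                   /* r + g + 2b + 2a     */
    h = h + a;                   /* r + g + 2b + 3a     */
    h = g + (h << 2);            /* 4r + 5g + 8b + 12a  */
    h = h - r;                   /* 3r + 5g + 8b + 12a  */
    h = h - t;                   /* 3r + 5g + 7b + 11a  */
    return h & 0x3F;
}
```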

Archaistic commented 2 years ago

That's an interesting optimization trick. The 24-instruction pseudocode was a relatively naive conversion of the multiplication to shifts and adds/subtracts, so I apologize. That algorithm actually works out more nicely than the multiplication-instruction algorithm given. It still works out to substantially more instructions than the XOR hash method, and would still be trickier for hardware implementations IMO, as well as requiring register lengths greater than 8 bits in software, or software workarounds for 8-bit CPUs to do 16-bit arithmetic.

Edit: Actually, since we're only interested in the bottom 6 bits, it would probably work fine with 8-bit arithmetic, as long as you're fine with the register overflowing repeatedly.

GithubPrankster commented 2 years ago

Since I helped @Jpaul999 with the example pseudo-asm, I wanted to also mention how the XOR-based hash ties into the simplicity of the format.

It only requires 3 XORs, instead of 4 multiplies and 3 adds. It is also not as immediately clear how the latter hash works, never mind that more work is technically discarded when the top 2 bits are cleared away.

ikskuh commented 2 years ago

It only requires 3 XORs, instead of 4 multiplies and 3 adds. It is also not as immediately clear how the latter hash works, never mind that more work is technically discarded when the top 2 bits are cleared away.

Yes, but the XOR-based hash performs badly, as it only ever looks at 24 bits out of 32. It will have a high collision count simply because it ignores the rest of the format. Even (r^g^b^a)>>2 would be a better hash, as we have single-byte low-diff encoding and single-byte indexing, but that hash would collide for white and black, for example. What @wbd73 proposed looks like a nice alternative that doesn't suffer these problems and uses prime multiplication factors, which are optimal.

Archaistic commented 2 years ago

There are two versions of the XOR hash: one looks at 24 bits and the other at 32. The 32-bit version was actually the initial implementation in qoi. It seems that overall the 24-bit (RGB-only) version outperforms the 32-bit version in most cases, with only files that aren't completely opaque taking a tiny increase in filesize. Though fewer bits of XOR is obviously less logic, it's not meaningfully more logic to go to 32, and from a philosophical (simplicity, etc.) perspective the 32-bit version seems like the preferable method to me. Images with highly variable partial transparency (i.e. none of the ones in the current test suite) may benefit from it; it would certainly hurt to encode an alpha-channel gradient if you have to encode 5 bytes every time the alpha channel changes. https://github.com/nigeltao/qoi2-bikeshed/issues/19

It still hasn't been clearly shown that collisions are a big problem. In fact, despite what's currently at the top of qoi.h ("index_position = (r*3 + g*5 + b*7 + a*11) % 64"), it seems the current hash implementation ignores the alpha channel, or at least that ignoring the alpha channel in the hash is the intended way forward: https://github.com/phoboslab/qoi/issues/71 If collisions are just a size thing, surely the size difference between the XOR and prime-multiply methodologies isn't very much.

As stated by some others in issue 71 who probably needn't be named, "QOI is all about simplicity with a reasonable compression rate", and "trading simplicity for small gains in compression ratio is imho not the right choice for this codec."

Meanwhile implementing multipliers and adders (or maybe with optimizations, just a bunch of adders) in hardware for an FPGA qoi codec vs a handful of XOR gates is definitely trading simplicity for compression ratio. Not to mention the hash speed change in software...

wbd73 commented 2 years ago

I updated my previous post with some additional information on optimizations of the current hash calculation. I do agree, of course, that the initial XOR was far simpler. A collision means that a different tag has to be chosen when encoding, which requires more output bytes. So the fewer collisions, the better.

fgenesis commented 2 years ago

Since tiling/Hilbert/space-filling curves have been mentioned a few times, I'll just add my 2 cents: that should definitely not be part of the file format. QOI is best as a dumb (2D) RGBA format and it should stay that way. Hilbert/tiling/etc. can be done as a preprocessing step in your engine, and nobody prevents you from slapping on a custom header with the extra info required in your project. Ideally, split the spec into a mandated raw stream format and a canonical, optional header specifically for images. The compressed format is nifty enough and can be set in stone, and if so desired a user can specify a custom header (one that's actually little-endian, has 3+ image dimensions instead of 2, colorspace info, encoding direction, tiling, animation info, whatever). In other words, keep the raw data format separate from the interpretation of that data.

Splitting RGBARGBA into RRGGBBAA is almost like splitting the 8 (or more?) bits of each channel into separately compressed bitplanes. Please don't do such things. Split channels might work nicely with a QOI-like codec that's optimized for grayscale and that you can RGB-combine on your own later (but that will be slower). Maybe such a format will happen eventually, but let's keep this one simple and learn from it for a v2, whenever that happens. (Actually, when there's only one color channel we're basically ending up at normal byte-wise compression, aka deflate, zstd, ...; so that is already covered, no?)

I'll just leave this here: https://cbloomrants.blogspot.com/2015/09/library-writing-realizations.html

That said, I'm looking forward to the final version. Definitely going to use this. Thank you all.

TL;DR: leave the header and pre/post-processing to the user. They know their needs and the properties of their data best.

chocolate42 commented 2 years ago

Meanwhile implementing multipliers and adders (or maybe with optimizations, just a bunch of adders) in hardware for an FPGA qoi codec vs a handful of XOR gates is definitely trading simplicity for compression ratio.

Speaking of complexity when it comes to FPGAs and/or small embedded devices: what is the burden of requiring storage for the index cache? How big can a cache be made while still allowing everything reasonable to implement it? I'm experimenting with a secondary index cache that stores 256+ values (using a 2-byte encoding), so that's at least 1 KiB of extra memory required. Where's the limit?

Archaistic commented 2 years ago

Speaking of complexity when it comes to FPGAs and/or small embedded devices: what is the burden of requiring storage for the index cache? How big can a cache be made while still allowing everything reasonable to implement it? I'm experimenting with a secondary index cache that stores 256+ values (using a 2-byte encoding), so that's at least 1 KiB of extra memory required. Where's the limit?

Depends on the FPGA. As is, the QOI 4-byte (32 bit) by 64-entry cache should fit basically any non-ancient FPGA, even the cheap ones, especially since QOI's access patterns (for decoding) seems fine for use with a singleported RAM. FPGAs have several different types of internal memory, each with different tradeoffs, but overall to get the most out of an FPGA you want to minimize logic and memory usage as much as you possibly can.

Most FPGAs have block RAMs, physical chunks of SRAM in silicon on the die itself. Cheap FPGAs tend to only have about 64 kilobytes of this (in the form of 32 18kbit chunks for a Spartan-6 LX16), or potentially less (ICE40HX8k which has an opensource toolchain only has 16 kilobytes of BRAM, in the form of 32 4kbit chunks). A kilobyte of BRAM you use for an image decoder cache is a BRAM chunk (xilinx spartan-6: 18kbit, pretty much 2 kilobytes) you can't use as CPU cache, or as part of a display linebuffer, or an audio output fifo, but at least it won't use up the reconfigurable FPGA logic itself. Strictly speaking on a Spartan they can be "split" to 2x9Kbit chunks each but this comes with some serious limitations on access patterns.

Distributed RAM is extremely limited in size (and eats logic since it's implemented on the reconfigurable logic) but can be as much as quadported if desired. A dualported 256 entry x 32-bit distributed ram block would eat about 1/32nd of the available LUTs on a Xilinx Spartan-6 LX16, according to the datasheet. For the spartans, it seems 1 LUT -> 1 singleported 64-bit RAM, 2 luts -> 1 dualported 64-bit RAM, 4 luts -> 1 quadported 64-bit RAM. So, a singleported 64-entry x 32-bit cache would be about 32 LUTs, or about 1/128th of a Spartan-6 LX16. Quadrupling the size of that would quadruple the number of LUTs, so 128 LUTs or about 1/32nd of the same FPGA. There'd also be an additional tiny logic usage increase associated with the size increase since it would need to increase the size of the address decoder but that shouldn't matter much.

A 32-bit by 256 value cache in distributed-RAM should be fine for pretty much any FPGA if the FPGA is only doing the image decoding but that comes with the tradeoff that you've now used a bunch of logic you can't use for anything else, like adders or distributed ram for other units, or address decoder logic, or something like a UART.

Of course, if you hook up an external memory chip, implement a memory controller for it, and use the external memory as a scratchpad you have practically unlimited space, but then you're almost certainly going to be bottlenecked by that chip's latency, vs the on-fpga memory which can usually work at very high clock speeds with pretty much no latency.

As for small embedded devices, even most of the ATTiny family has enough RAM for the current pixel cache. A 1KB buffer would also fit most places I imagine, but if the efficiency increase isn't much it may not be worth it. On extremely small devices like the ATTiny (ones where the pixel cache won't fit) I wouldn't expect anyone to be trying to do image decoding anyway, unless they had additional external storage that holds the framebuffer. Meanwhile, I could plausibly expect somebody to implement a QOI decoder in 6502/Z80/8088/68k assembly for their homebrew computer, or on an MSP430 driving a picture frame or a digital sign or something. The increased number of cycles for the per-pixel hash would add up quickly if your CPU is only ~16MHz, vs decoding on an octacore 5GHz processor where it's completely unnoticeable and there may not even be a cycle count difference at all, thanks to microarchitectural features like OOO, macro-op fusion, etc.

If you really wanted, it would be perfectly possible to decode (but not really display) QOI on an Atari 2600, as long as you had one of the carts that has the extra 256 bytes of RAM inside.

Edit: Oh, I see. I misunderstood what you said. What you meant was "Why should we make the format less complex to implement if it makes the file a little bigger?", what I heard initially was "How big could the pixel cache get before it becomes infeasible in hardware?". Answer: QOI is already pretty near an ideally constructed format for even extremely constrained use-cases, and changing back to XOR would make it unequivocally the best lossless compression format to use in those scenarios, even if it does mean the file ends up a percentage point larger than with the prime-multiply hash.

At bare minimum, for hardware implementations the prime-multiply hash will have the gate delay of a (5-bit? 6-bit?) adder as opposed to one XOR gate feeding another XOR gate, will potentially be multicycle rather than the tiny-hardware-size single-cycle hash calculation that could be done with the XOR method, and will use substantially more logic, especially if it's not made multicycle. It would have to be multicycle for the 5/6-bit adder gate delay to hold; if it were implemented in a single-cycle manner, it would require a severe number of adders fed into each other, increasing gate delay (reducing achievable clock speed) and dramatically increasing logic usage. If the hash is the RGB one it would only require 12 XOR gates to implement (single-cycle, 6 bits R ^ (6 bits B ^ 6 bits G)), and even though the RGBA one increases that number to 18, that would still be hardly any logic at all relative to the prime-multiply design.
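For a software-side reference point, the two candidate hashes can be written as follows (a minimal sketch; the function names are mine, not from qoi.h):

```c
#include <stdint.h>

/* prime-multiply hash, as currently specified */
static uint8_t hash_mul(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint8_t)((r * 3 + g * 5 + b * 7 + a * 11) % 64);
}

/* xor hash: in hardware, two XOR gate delays and no adders at all */
static uint8_t hash_xor(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint8_t)((r ^ g ^ b ^ a) & 63);
}
```

Both map any pixel into the 64-entry index range; the entire debate is which one buys more compression per gate.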

chocolate42 commented 2 years ago

Edit: Oh, I see. I misunderstood what you said. What you meant was "Why should we make the format less complex to implement if it makes the file a little bigger?", what I heard initially was "How big could the pixel cache get before it becomes infeasable in hardware?"

You were right the first time. My experimentation is focused on compression that doesn't heavily impact performance, to optimise for that it seems reasonable to utilise everything easily available to the lowest common denominator. In theory one bonus of a second cache is that we get more mileage out of the color hash, in practice TBD. Thank you for the insight.

MKCG commented 2 years ago

Dear all,

I've been working on an AVX2 implementation based on a change of the hash function, in order to vectorize it more easily:

static unsigned char qoi_color_index(unsigned int v) {
    unsigned int hash = v;
    hash ^= hash >> 19;
    hash ^= hash << 5;
    hash ^= hash >> 13;
    hash &= 63;

    return hash;
}

Then it can be used like this:

index_pos = qoi_color_index(px.v);

This allows me to compute 8 hashes at once using AVX2 instructions:

static void qoi_compute_color_indexes(unsigned char *indexes, const unsigned char *pixels) {
    __m256i hashes = _mm256_lddqu_si256((__m256i *) pixels);

    __m256i shift19 = _mm256_srli_epi32(hashes, 19);
    hashes  = _mm256_xor_si256(hashes, shift19);

    __m256i shift5 = _mm256_slli_epi32(hashes, 5);
    hashes = _mm256_xor_si256(hashes, shift5);

    __m256i shift13 = _mm256_srli_epi32(hashes, 13);
    hashes = _mm256_xor_si256(hashes, shift13);

    hashes = _mm256_and_si256(hashes, _mm256_set1_epi8(63));

    _mm256_storeu_si256((__m256i *) indexes, hashes);
}

I hope to be able to submit this pull request by Friday night.

wbd73 commented 2 years ago

Then it can be used like this :

index_pos = qoi_color_index(px.v);

As far as I understand, a hash function like that on the 32 bit px.v value will break compatibility between big endian and little endian systems.

ikskuh commented 2 years ago
static unsigned char qoi_color_index(unsigned int v)

We should define the hash via a 4-tuple of bytes (r,g,b,a) and not via a system-specific variable (unsigned int is not portable between systems as it might be 16 bit; uint32_t is only portable between same-endian systems). Also, the byte order used in that hash is not visible.

EDIT: Damn, @wbd73 ninja'd me
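To make the endianness point concrete, here is a small sketch (not qoi.h code; the helper names are mine). Hashing the packed 32-bit value depends on the host byte order, while a hash defined over the (r,g,b,a) byte 4-tuple does not:

```c
#include <stdint.h>
#include <string.h>

/* Load {r,g,b,a} bytes as the host sees them: a little-endian machine and
   a big-endian machine get different integers, so any hash of this value
   differs between them. */
static uint32_t load_native(const uint8_t rgba[4]) {
    uint32_t v;
    memcpy(&v, rgba, 4); /* host byte order: not a portable hash input */
    return v;
}

/* Defining the hash over the individual bytes sidesteps the issue. */
static uint8_t hash_bytes(const uint8_t rgba[4]) {
    return (uint8_t)((rgba[0] * 3 + rgba[1] * 5 +
                      rgba[2] * 7 + rgba[3] * 11) % 64);
}
```

A little-endian hash function is still usable on a big-endian host, it just has to byte-swap before hashing, as noted later in the thread.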

phoboslab commented 2 years ago

I chose r*3+g*5+b*7+a*11 because it had the best compression ratio of all hashes I tried, and it is fast and simple to explain. The fact that it may be harder to implement in hardware didn't occur to me. But considering the implementation that @wbd73 showed - is that really the case?

The proposed qoi_color_index (qoi.ci) by @MKCG performs worse than that. In particular it seems to have problems with images with few colors (icon_*) where even the original r^g^b^a (qoi.xor) performs better.

## Total for images/textures_photo
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   6.3         8.9        167.46        117.48      1981   48.4%
qoi.xor:      6.4         8.1        162.60        129.77      1994   48.7%
qoi.ci:       6.2         9.1        169.81        115.82      1989   48.6%

## Total for images/textures_pk01
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   0.6         0.9        201.52        136.78       178   35.2%
qoi.xor:      0.6         0.9        200.30        151.50       180   35.5%
qoi.ci:       0.6         0.9        212.05        136.93       178   35.2%

## Total for images/screenshot_game
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   2.8         3.6        224.07        174.05       519   21.0%
qoi.xor:      2.8         3.3        226.40        191.40       524   21.2%
qoi.ci:       2.7         3.6        231.76        174.62       518   21.0%

## Total for images/textures_pk
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   0.3         0.4        139.53         99.37        75   43.5%
qoi.xor:      0.3         0.4        146.81        111.44        77   44.6%
qoi.ci:       0.3         0.4        155.30        103.63        75   43.5%

## Total for images/textures_pk02
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   1.9         2.6        162.19        118.54       479   40.4%
qoi.xor:      1.9         2.3        162.26        130.51       495   41.7%
qoi.ci:       1.7         2.5        176.48        121.71       484   40.9%

## Total for images/icon_64
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   0.0         0.0        288.35        223.40         4   28.7%
qoi.xor:      0.0         0.0        277.39        238.46         4   28.9%
qoi.ci:       0.0         0.0        300.01        218.89         4   30.4%

## Total for images/icon_512
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   0.6         0.7        433.47        375.16        85    8.4%
qoi.xor:      0.6         0.7        418.88        378.80        85    8.4%
qoi.ci:       0.6         0.7        441.00        380.94        98    9.6%

## Total for images/photo_kodak
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   2.5         3.7        157.04        105.83       671   43.7%
qoi.xor:      2.5         3.4        155.30        115.68       682   44.4%
qoi.ci:       2.4         3.7        162.03        105.21       674   43.9%

## Total for images/textures_plants
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   3.4         5.1        309.92        207.49       922   22.2%
qoi.xor:      3.5         4.7        303.13        225.82       925   22.3%
qoi.ci:       3.5         5.2        306.36        203.61       922   22.2%

## Total for images/screenshot_web
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:  18.4        22.7        442.01        357.91      2649    8.3%
qoi.xor:     18.7        21.6        434.99        376.16      2645    8.3%
qoi.ci:      18.8        22.6        432.74        359.79      2645    8.3%

## Total for images/pngimg
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   6.4         8.9        283.73        202.49      1436   20.3%
qoi.xor:      6.3         8.1        287.96        224.57      1445   20.5%
qoi.ci:       6.2         8.9        292.10        204.34      1441   20.4%

## Total for images/photo_tecnick
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   9.3        14.4        154.29         99.67      2527   44.9%
qoi.xor:      9.4        12.8        153.44        112.52      2550   45.3%
qoi.ci:       9.1        14.5        158.45         99.16      2531   45.0%

## Total for images/photo_wikipedia
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   7.0        11.0        154.96         98.64      2102   49.6%
qoi.xor:      7.0         9.7        154.27        111.66      2119   50.0%
qoi.ci:       7.0        11.4        153.85         94.82      2104   49.7%

# Grand total for images
        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi.master:   2.0         2.8        226.85        163.09       463   25.6%
qoi.xor:      2.0         2.6        227.89        180.37       469   25.9%
qoi.ci:       2.0         2.8        234.24        163.54       465   25.7%

I'm not married to r*3+g*5+b*7+a*11, but it currently shows the best tradeoffs.

Also, it has been shown that QOI_OP_INDEX often doesn't contribute much to the compression ratio. I'm still pondering if there might exist a totally different strategy to refer to previous pixels than what we currently do. We're currently focusing on some micro-optimizations here. If you have the time to tinker, I'd encourage you to step back a little and look at the bigger picture :)

jmi2k commented 2 years ago

To be honest, I was a bit afraid of multiplications because of potential HW implementations, but then realized that

3n = (n<<1) + n
5n = (n<<2) + n
7n = (3n<<1) + n
11n = (5n<<1) + n

Bit shifting is free and addition is not a big deal (at least not as big as a general-purpose multiplier).
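The decomposition above is easy to sanity-check in C (a sketch; the helper names are mine, not from qoi.h):

```c
#include <stdint.h>

/* the spec's multiply hash... */
static uint8_t hash_mul(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint8_t)((r * 3 + g * 5 + b * 7 + a * 11) % 64);
}

/* ...rewritten with only shifts and adds, per the identities above */
static uint8_t hash_shift_add(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    unsigned r3  = ((unsigned)r << 1) + r;              /* 3n  = (n<<1) + n  */
    unsigned g5  = ((unsigned)g << 2) + g;              /* 5n  = (n<<2) + n  */
    unsigned b7  = ((((unsigned)b << 1) + b) << 1) + b; /* 7n  = (3n<<1) + n */
    unsigned a11 = ((((unsigned)a << 2) + a) << 1) + a; /* 11n = (5n<<1) + n */
    return (uint8_t)((r3 + g5 + b7 + a11) & 63);        /* %64 is free too   */
}
```

The final %64 also costs nothing in hardware: it's just keeping the low 6 bits.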

chocolate42 commented 2 years ago

Is it a good idea to enforce in the spec that 3 channel data cannot use any opcodes involving alpha? A bad encoder would have to emit a QOI_OP_RGBA instead of something else including QOI_OP_RGB so it should never make sense to do this anyway. Enforcing it allows for this simple optimisation to the decoder (the encoder was always free to optimise this way):

# Grand total for qoi/qoi_benchmark_suite/images
qoi-master91cc: 2.133       2.582        217.62        179.81       463   25.6%
qoi-hoist:      1.988       2.121        233.52        218.86       463   25.6%

qoi-hoist.h.txt

Archaistic commented 2 years ago

jmi2k

To be honest, I was a bit afraid of multiplications because of potential HW implementations, but then realized that

3n = (n<<1) + n
5n = (n<<2) + n
7n = (3n<<1) + n
11n = (5n<<1) + n

Bit shifting is free and addition is not a big deal (at least not as big as a general-purpose multiplier).

It's definitely true that it's not as bad as implementing four entire general-purpose multipliers (that would be horrendous!), however it's not quite as easy to work around this as it is in software since you can really only change a register's value once per cycle. If you wanted to spend multiple cycles on the hash (and implement it the way you just put) you pretty much could do something like:

cycle 1:
hash = ((R<<1) + R)
cycle 2:
hash = hash + ((G<<2) + G)
cycle 3: 
hash = hash + ((((B<<1) + B) << 1) + B)
cycle 4:
hash = hash + ((((A<<2) + A) << 1) + A)

Though this straightforward implementation uses as many adders as the single-cycle one below, because it doesn't do all the additions in one cycle it only has the gate delay of two adders in series, which is an improvement over single-cycle shift-addition if that ends up becoming the limiting clock-speed factor. The AND can be skipped in the multicycle add-shift, single-cycle add-shift, and xor hash as long as your hash register is only 6 bits long to begin with; the top two bits will be clipped off automatically. It'd also be trivial to pipeline and have 4 hashes in flight at once, though for normal QOI operation that would be almost completely unhelpful.

but in a parallel implementation (single-cycle, overall preferable if possible since QOI is almost completely serial so pipelining hashing won't help) it'd look more like

hash = (((R<<1) + R) + ((G<<2) + G)) + (((((B<<1) + B) << 1) + B) + ((((A<<2) + A) << 1) + A))
(the parentheses don't have to be this strict but I prefer the order of operations being completely unambiguous)

which has the maximum gate delay of (((A+A) + B) + (R+G))
(R+G being its own adder, has only one adder delay vs (A+A)+B's 2 and as such doesn't increase delay overall) So 3 adder delays. The adders would probably be 5 or 6 bits long and for each adder, and if it's a typical ripple-carry adder the bits have to carry through each full-adder in them, which in a 6-bit adder comes out to 10 gate delays per adder. (5 bits to carry into, 5 bits to carry out of) So, with the 3 adders that comes out to about 30 gate delays' worth of settle time.

With the XOR hash, the single-cycle parallel implementation looks like

hash = (R^G)^(B^A)

And for any given bit it's (bit_R^bit_G)^(bit_B^bit_A), which has a maximum gate delay of about 2 gates, aka the delay of one XOR gate feeding into another.

Strictly speaking the shift-add isn't hard to implement in hardware, it just uses dramatically more logic and is dramatically slower than XOR in terms of logic latency.

PhobosLab

## Total for images/textures_photo
etc

These are all very interesting. I'd be willing to take the tradeoff of the less efficient XOR hash for sure personally. It's not much of a difference in terms of compression ratio.

Chocolate42

Is it a good idea to enforce in the spec that 3 channel data cannot use any opcodes involving alpha?

I like this idea personally. It seems like an obvious optimization to make.

It would also potentially allow using the RGB hash instead of the RGBA hash for images without transparency, which I would expect to slightly improve compression for images without transparency. RGB image vs RGBA image could be indicated in the header, or the pixel stream could just always start with either a QOI_OP_RGB or QOI_OP_RGBA and the decoder could determine whether the image has transparency based on that. The RGB hash calculation is the same equation as the RGBA hash calculation if for no-transparency images the alpha channel is calculated as if it's always 0x00. The results are different of course, but the calculation is the same.

R*3 + G*5 + B*7 + A*11, A=0 == R*3 + G*5 + B*7
R^G^B^A, A=0 == R^G^B

(not too tricky to control decoding opaque vs alpha-channeled images this way in hardware, you'd just add a single bit configuration register. Either have the processor set the bit based on what it finds in the header, or have it set automatically based on bit 0 of the first byte of the pixel stream if the OP_RGBx opcode is forced to be first. If the bit is 0, use 0x00 as the alpha value for hash calculation.) That would raise the potential problem of misbehaving or malicious encoders, which say they have an RGB image and then encode a QOI_OP_RGBA opcode somewhere, but that's a relatively trivial thing to catch while decoding, probably by just ignoring the encoded alpha value.

chocolate42 commented 2 years ago

RGB image vs RGBA image could be indicated in the header,

It already is by the channel count being 3 or 4

The RGB hash calculation is the same equation as the RGBA hash calculation if for no-transparency images the alpha channel is calculated as if it's always 0x00. The results are different of course, but the calculation is the same.

You can keep the results the same for both paths with A=0xff pretty simply, just add 53 to the prime sum instead of +A*11 and if I'm understanding correctly for the xor hash replace ^A with ~. The format handling RGB and RGBA differently would be a pain.
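Both folds check out (a sketch, names are mine): 0xff*11 = 2805 ≡ 53 (mod 64), and XORing with 0xff is a bitwise NOT:

```c
#include <stdint.h>

/* RGBA prime hash with alpha pinned to 0xff... */
static uint8_t prime_hash_a255(uint8_t r, uint8_t g, uint8_t b) {
    return (uint8_t)((r * 3 + g * 5 + b * 7 + 0xff * 11) % 64);
}

/* ...equals the RGB-only sum plus the folded constant 53 */
static uint8_t prime_hash_folded(uint8_t r, uint8_t g, uint8_t b) {
    return (uint8_t)((r * 3 + g * 5 + b * 7 + 53) % 64);
}

/* xor hash with alpha pinned to 0xff... */
static uint8_t xor_hash_a255(uint8_t r, uint8_t g, uint8_t b) {
    return (uint8_t)((r ^ g ^ b ^ 0xff) & 63);
}

/* ...is just a bitwise NOT of r^g^b */
static uint8_t xor_hash_folded(uint8_t r, uint8_t g, uint8_t b) {
    return (uint8_t)(~(r ^ g ^ b) & 63);
}
```

So an encoder/decoder pair can treat 3-channel data as 4-channel data with a constant alpha and still compute identical indexes, without branching on the channel count per pixel.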

Archaistic commented 2 years ago

RGB image vs RGBA image could be indicated in the header,

It already is by the channel count being 3 or 4

The RGB hash calculation is the same equation as the RGBA hash calculation if for no-transparency images the alpha channel is calculated as if it's always 0x00. The results are different of course, but the calculation is the same.

You can keep the results the same for both paths with A=0xff pretty simply, just add 53 to the prime sum instead of +A*11 and if I'm understanding correctly for the xor hash replace ^A with ~. The format handling RGB and RGBA differently would be a pain.

You're right on all counts. I hadn't looked closely at the header since I intended the CPU to parse it and configure the decoder based on it, so I kinda forgot it has a channel number. The truly critical part for the hardware was the pixel processing itself (opcodes and hash function). I like your ideas about the hash too, yeah you can just invert the XOR. For hardware it'd probably be better to conditionally xor B with either Alpha or FF than it would be to conditionally invert the output of a (R^G)^B hash. I guess it doesn't really matter what the A-constant is for RGB hashing as long as it truly is constant in images lacking the alpha channel.

notnullnotvoid commented 2 years ago

As far as I understand, a hash function like that on the 32 bit px.v value will break compatibility between big endian and little endian systems.

In practice, that's not really a problem. Of course the spec needs to be explicit about endianness in a case like this, but given the non-existence of relevant big-endian hardware these days, designing for little-endian hardware is perfectly fine. If a little-endian hash function is chosen and qoi ever ends up on an old big-endian system, it'll just have to do a byte swap before hashing, which is not a big deal.

nigeltao commented 2 years ago

Lots of discussion about hash functions.

Here's another experiment, called "qoi-fifo2e".

Instead of tweaking the hash function, we remove the hash function and just use a FIFO cache: keep the 64 most recently written (not necessarily most recently used/read) colors. Update the cache on DIFF, LUMA, RGB and RGBA ops. Leave the cache unchanged on INDEX and RUN ops.
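A minimal sketch of that FIFO variant (my naming, not the qoi-fifo2e source):

```c
#include <stdint.h>

/* 64-entry ring of the most recently *written* colors */
typedef struct { uint32_t colors[64]; int head; } qoi_fifo;

/* update on DIFF, LUMA, RGB and RGBA ops; INDEX and RUN leave it alone */
static void fifo_push(qoi_fifo *c, uint32_t px) {
    c->colors[c->head] = px;
    c->head = (c->head + 1) & 63;
}

/* encoder side: a linear search replaces the O(1) hash lookup */
static int fifo_find(const qoi_fifo *c, uint32_t px) {
    for (int i = 0; i < 64; i++)
        if (c->colors[i] == px) return i;
    return -1;
}
```

The decoder only ever does the O(1) push, while the encoder pays for the linear search, which is consistent with decode getting faster and encode getting slower in the benchmark.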

master is commit 2ee2169 (2021-12-11).

        decode ms   encode ms   decode mpps   encode mpps   size kb    rate
# Grand total for images
libpng:      13.5       187.9         34.47          2.47       423   23.3%
qoi-master:   5.1         5.2         91.89         89.38       463   25.6%
qoi-fifo2e:   4.2        12.7        109.58         36.57       460   25.4%

See https://github.com/nigeltao/qoi2-bikeshed/issues/19#issuecomment-994241022 for details.

notnullnotvoid commented 2 years ago

Wow, that's a very enticing possibility. The drop in encode speed is a bit hard to swallow, but there are two things that make me think it's actually not as bad as it looks:

  1. the cache search is a perfect candidate for SIMD
  2. a conforming encoder has the option to trade size for speed by searching only a portion of the cache

I'll see if I can run some numbers on both of those ideas, unless you get to them first.
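The second option is trivial to sketch (my naming; since cache positions are fixed by write order, a partial search only costs missed matches, never wrong output):

```c
#include <stdint.h>

/* A conforming encoder may search only the first `limit` slots of the
   64-entry cache, trading a few missed INDEX ops for encode speed. */
static int cache_find_limited(const uint32_t cache[64], uint32_t px, int limit) {
    for (int i = 0; i < limit; i++)
        if (cache[i] == px) return i;
    return -1; /* miss: fall back to DIFF/LUMA/RGB(A) */
}
```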

wbd73 commented 2 years ago

Instead of tweaking the hash function, we remove the hash function and just use a FIFO cache: keep the 64 most recently written (not necessarily most recently used/read) colors. Update the cache on DIFF, LUMA, RGB and RGBA ops. Leave the cache unchanged on INDEX and RUN ops.

Interesting approach. A little tweak could be every time a color is found in the cache and is not at the first position of the cache, to exchange it with the one before it so it moves one more to the front of the cache. It would impact performance a bit I guess but it would keep often used colors longer inside the cache.
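That tweak might look like this (a sketch, my naming; note that encoder and decoder would both have to apply the identical swap after a hit, or the index stream desynchronizes):

```c
#include <stdint.h>

/* On a hit at slot i > 0, swap the color one slot toward the front so
   frequently used colors survive longer. Returns the pre-swap index,
   which is what the encoder would emit. */
static int find_and_promote(uint32_t cache[64], uint32_t px) {
    for (int i = 0; i < 64; i++) {
        if (cache[i] != px) continue;
        if (i > 0) {
            cache[i] = cache[i - 1];
            cache[i - 1] = px;
        }
        return i;
    }
    return -1;
}
```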

Wulf0x67E7 commented 2 years ago

@phoboslab How about replacing QOI_INDEX with something like QOI_INDEX_LUMA?

By having the hash function only consider the upper 4 bits we are likely to have hash-collisions with small diffs, which we can exploit with a LUMA byte.

Note that because the alpha of an indexed pixel is completely disregarded both on en- and decode, not having it influence the hash is both faster and (should be) more effective.

I don't have the necessary headers/libs to actually test this idea right now, but have tried implementing the logic for it below for those who do. (@nigeltao ?)

EDIT: WSL to the rescue! I have now run a quick first test, but it's not looking too good:

./qoibench 1 images/ --nowarmup --nopng

# Grand total for images
                       decode ms   encode ms   decode mpps   encode mpps   size kb    rate
qoi_master (ae07396):     2.5         3.2        183.06        146.33       463   28.2%
qoi_il_3bits:             2.6         3.3        177.13        142.10       497   30.3%
qoi_il_4bits:             2.7         3.3        173.41        140.56       492   30.0%
qoi_il_5bits:             2.6         3.3        178.19        142.38       489   29.8%
qoi_il_6bits:             2.7         3.3        174.07        141.85       490   29.9%
qoi_il2_5bits:            2.7         3.7        170.11        127.09       480   29.2%
qoi_il2a_4bits:           2.6         3.8        181.72        121.65       483   29.4%
qoi_il2a_5bits:           2.6         3.9        177.59        119.93       478   29.1%

Maybe more tinkering with the hash, how it handles alpha, or splitting QOI_INDEX(_LUMA) into two 3-bit-tag variants might help; needs further testing.

Edit: replaced code block with attachment: qoi_il.h.txt qoi_il2a.h.txt

y-guyon commented 2 years ago

Hi, I'm Yannis from the WebP team. First of all, good job on researching and developing this interesting QOI format! While working on WebP2, research was done on a generic container specification. The goal is to have a small, simple and safe header. Based on that, we thought we could give some suggestions on the design of the QOI container.

Current state

For convenience, at the time of writing, the current QOI container looks like this:

4 bytes "qoif"
4 bytes width
4 bytes height
1 byte channels
1 byte colorspace
X bytes of encoded pixels
8 bytes set to zero

Header

This would lead to the following bit-packed array:

24 bits "qoi"
14 bits width
14 bits height
1 bit channels
1 bit colorspace
2 bits padding

We agree that this is slightly more complex to read and write but we would argue that the specification tightness is worth it for security reasons. There is no room for format abuse this way. All possible bit sequences are valid. Size-wise, it also goes from the current 14 bytes of header to only 7 bytes.
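A sketch of how that 7-byte header could be packed and read (this is the proposal, not the actual QOI header; function names are mine):

```c
#include <stdint.h>

/* Bit layout across 56 bits, MSB first:
   24b "qoi" | 14b width | 14b height | 1b has_alpha | 1b is_srgb | 2b pad */
static void pack_header7(uint8_t out[7], uint32_t w, uint32_t h,
                         int has_alpha, int is_srgb) {
    uint64_t v = ((uint64_t)'q' << 48) | ((uint64_t)'o' << 40) | ((uint64_t)'i' << 32)
               | ((uint64_t)(w & 0x3fff) << 18)
               | ((uint64_t)(h & 0x3fff) << 4)
               | ((uint64_t)(has_alpha & 1) << 3)
               | ((uint64_t)(is_srgb & 1) << 2);
    for (int i = 0; i < 7; i++)
        out[i] = (uint8_t)(v >> (8 * (6 - i)));
}

static uint32_t header7_width(const uint8_t in[7]) {
    uint64_t v = 0;
    for (int i = 0; i < 7; i++) v = (v << 8) | in[i];
    return (uint32_t)((v >> 18) & 0x3fff);
}
```

Note that 14-bit fields cap dimensions at 16383 pixels, which is exactly what the following comments push back on.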

Metadata

If needed, metadata such as ICC, XMP and/or EXIF could be inserted into this format.

Encoded pixels

Tom94 commented 2 years ago

As mentioned many times by others, I don't think limiting the width/height of the image is wise.

For example: I am currently working with gigapixel images which routinely have width/height over 16,384 pixels. JPG and PNG can handle these just fine, despite being old. It would be shocking if QOI can't.

Also consider very wide or tall images, which could be used for all kinds of visualization purposes. Extreme example: visualizing a vector of 1 billion 16-byte entries as a 1000000000x4 image. It happens rarely, but qoi could handle it if not limited by the header.

ikskuh commented 2 years ago

I agree with @Tom94 here, restricting the image size is a bad idea, even to 16 bit in general. Yes, my images don't exceed that maximum yet, but I often work with images in the 8192×… area, so it's likely they will grow by some small factor (2 would be enough to already be at the limit). A typical sky cube with 2048² is encoded as 2048×12288 pixels, assuming we want to have sharp images on a normal screen. When we go 4K, we'd better use 4096×24576, which would be out of bounds already while "only" using 400 megs of storage.

chocolate42 commented 2 years ago

The width and height should be clamped to reasonable values. Currently a QOI image could be 2^33 pixels: is it supported by the current implementation?

No, but it's a trivial change as it depends only on the width of an indexing integer (which should be hardened by using a uint64_t or refactored).

30 bit w/h instead of 14, targeting 64 bit int safety instead of 32 bit, solves the problem? Conforms to the generic spec and allows for billion pixel dimensions (but not two billion, that would be crazy).
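The indexing hazard is easy to demonstrate (a sketch): with 30-bit dimensions, a pixel-count or byte-offset computation no longer fits in 32 bits, so it must be widened before multiplying.

```c
#include <stdint.h>

/* Widen to 64 bits *before* multiplying; w * h * channels computed in
   32-bit arithmetic silently wraps for billion-pixel images. */
static uint64_t image_bytes(uint32_t w, uint32_t h, uint32_t channels) {
    return (uint64_t)w * h * channels;
}
```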

phoboslab commented 2 years ago

I appreciate your input, but I fail to understand why some of the points would be a problem and disagree with the others.

But there's a risk of confusion with regular ASCII files.

Confusion by whom, in which cases?

Replacing the byte of 3 or 4 channels by a single bit "has_alpha" (...) (...) the colorspace could be a single bit "is_srgb" too

A library can refuse to decode QOI images if channels is not 3 or 4 or colorspace is not 0 or 1. That said, decoding of the data stream is independent of the channels and colorspace info.

The byte savings of a smaller header are irrelevant.

The width and height should be clamped to reasonable values

There are legitimate uses for larger values, as mentioned by others. It's debatable if this library should limit memory allocation to e.g. 1GB by default - but I'm leaning against it. If you accept untrusted input, it's your responsibility to check if the image dimensions are within the range you want to deal with. (Granted, this library is currently missing a qoi_read_header() function for that case - I'll get to it!)

metadata such as ICC, XMP and/or EXIF could be inserted into this format.

Not a fan. I have strong feelings about software that tries to please everyone.

We would recommend prepending the encoded pixels with the number of bytes,

We had this discussion here a number of times and imho the benefits of not having to store this information outweigh the drawbacks. The padding will be made reliable for more esoteric use-cases.

wbd73 commented 2 years ago

I was wondering if anyone has tried only storing the RGB channels for the QOI_OP_INDEX tag and changing the QOI_OP_RGBA tag into a 2-byte QOI_OP_A combination tag which indicates a change in alpha value. At the moment, any change in alpha value results in a 5-byte QOI_OP_RGBA tag. A QOI_OP_A combined with a QOI_OP_RGB would require 6 bytes in total, but there would also be a chance that QOI_OP_A could be combined with QOI_OP_INDEX, QOI_OP_DIFF or QOI_OP_LUMA, which would require fewer than the current 5 bytes for QOI_OP_RGBA. It would also simplify the color hash function, since the alpha channel would no longer be required in the hash.