Here are some pretty sweet throughput numbers on my laptop (with similar compression sizes) for a proof of concept, combining little-endianness (#3) with 7-bit indexes (#20). Little-endianness in particular allows us to remove a lot of branching in the decoder's inner loops.

I can't think of a good name for it right now, so I'm just going to call it "Demo 10".

        decode ms   encode ms   decode mpps   encode mpps   size kb
images/kodak
qoi-master:   6.4         8.4         61.15         47.01       771
qoi-experi:   6.9         9.4         57.36         41.91       700
qoi-demo10:   4.0         6.8         98.25         58.03       772
demo10/master ratio:                   1.61x         1.23x
images/misc
qoi-master:   5.7         6.0        154.59        147.09       400
qoi-experi:   6.0         6.6        148.95        134.45       407
qoi-demo10:   2.7         4.3        327.75        204.86       400
demo10/master ratio:                   2.12x         1.39x
images/screenshots
qoi-master:  54.4        48.8        151.33        168.53      2582
qoi-experi:  53.9        51.3        152.61        160.47      2491
qoi-demo10:  28.1        33.6        292.57        244.87      2481
demo10/master ratio:                   1.93x         1.45x
images/textures
qoi-master:   1.8         2.1         74.17         62.86       184
qoi-experi:   1.7         2.3         75.03         56.17       179
qoi-demo10:   0.9         1.5        138.48         85.30       180
demo10/master ratio:                   1.87x         1.36x
images/wallpaper
qoi-master: 138.5       154.1         67.69         60.82     10640
qoi-experi: 143.3       155.7         65.38         60.19     10170
qoi-demo10: 101.1       124.1         92.70         75.50     10669
demo10/master ratio:                   1.37x         1.24x

Details:

qoi-master is the phoboslab/master branch, commit e9069e1.
qoi-experi is the phoboslab/experimental branch, commit cbb62ea.
qoi-demo10 is the little-endian index-7 "demo 10". Source code is in qoi-demo10.h.txt (it replaces qoi.h).

any idea why the size is worse than experimental? is that fixable?

It's not strictly worse, though. On the "size kb" column, qoi-experi is smaller on kodak, textures and wallpaper (which are all opaque and, on average, more photographic) but qoi-demo10 is smaller on misc and screenshots.

I'm not sure about the underlying reasons other than there's a fixed amount of 8-bit opcode space and giving more opcodes to one thing means taking away opcodes from another thing. demo10 has a larger QOI_INDEX size. experi has a whole new op category (QOI_GDIFF). Any particular trade-off might look better on one test suite but worse on another.

As a data point, I added a simple "subtract green" transform to demo10's qoi_encode (and its inverse to qoi_decode):

--- qoi-demo10.h    2021-12-04 08:53:03.115172245 +1100
+++ qoi-subg10.h    2021-12-04 09:00:34.854176340 +1100
@@ -454,6 +454,8 @@
            px.rgba.g = pixels[px_pos+1];
            px.rgba.b = pixels[px_pos+2];
        }
+       px.rgba.r -= px.rgba.g;
+       px.rgba.b -= px.rgba.g;

        if (px.v == px_prev.v) {
            run++;
@@ -947,6 +949,11 @@
        }
    }

+   for (int px_pos = 0; px_pos < px_len; px_pos += channels) {
+       pixels[px_pos + 0] += pixels[px_pos + 1];
+       pixels[px_pos + 2] += pixels[px_pos + 1];
+   }
+
    return pixels;
 }

Again, there's not an outright winner. Some compression sizes got slightly better. Some got slightly worse.

It also tanked the throughput numbers, though. I'm not sure if that's fixable. This was a very quick evaluation focuing on file sizes.

        decode ms   encode ms   decode mpps   encode mpps   size kb
images/kodak
qoi-demo10:   4.0         6.8         98.25         58.03       772
qoi-subg10:   5.0         8.7         79.06         45.08       756
images/misc
qoi-demo10:   2.7         4.3        327.75        204.86       400
qoi-subg10:   4.3         6.9        206.77        129.22       402
images/screenshots
qoi-demo10:  28.1        33.6        292.57        244.87      2481
qoi-subg10:  42.7        60.1        192.93        137.04      2446
images/textures
qoi-demo10:   0.9         1.5        138.48         85.30       180
qoi-subg10:   1.2         1.9        106.70         66.81       177
images/wallpaper
qoi-demo10: 101.1       124.1         92.70         75.50     10669
qoi-subg10: 123.4       163.2         75.95         57.43     10680

Another data point, on the 512x512 pixel editions of the Tango Icon Library icons (as suggested in #17). 213 PNG images. Average PNG size is 51 KiB.

        decode ms   encode ms   decode mpps   encode mpps   size kb
tango-icon-library-pngs/png-512
qoi-master:   1.5         1.5        174.95        176.58        80
qoi-experi:   1.5         1.5        176.91        178.29        79
qoi-demo10:   0.8         1.0        316.69        259.58        76
demo10/master ratio:                   1.81x         1.47x

Honestly I didn't understand why increasing the color cache should affect the compression ratio much. This helps mostly for palletized images, but not for regular images where colors are not often reused.

Here's a patch to count opcode histograms:

--- a/qoi-demo10.h  2021-12-04 08:53:03.115172245 +1100
+++ b/qoi-demo10.h  2021-12-06 09:22:55.592694824 +1100
@@ -1,3 +1,15 @@
+int histogram[9] = {0};
+const char* histnames[9] = {
+"QOI_INDEX1",
+"QOI_DIFF1",
+"QOI_DIFF2",
+"QOI_DIFF3",
+"QOI_DIFF4",
+"QOI_DIFF5",
+"QOI_RUN1",
+"QOI_RUN2",
+"QOI_RUN3",
+};
 /*

 QOI - The "Quite OK Image" format for fast, lossless image compression
@@ -386,19 +398,23 @@

 static inline void qoi_encode_run(unsigned char *bytes, int *pptr, int run) {
    if (run == 1) {
+histogram[1]++;
        poke_u8(bytes+*pptr, QOI_DIFF1 | 0xA8);
        *pptr += 1;
    }
    else if (run < 30) {
+histogram[6]++;
        poke_u8(bytes+*pptr, QOI_RUN1 | (run << 3));
        *pptr += 1;
    }
    else if (run < 256) {
+histogram[7]++;
        poke_u8(bytes+*pptr+0, QOI_RUN2);
        poke_u8(bytes+*pptr+1, run);
        *pptr += 2;
    }
    else {
+histogram[8]++;
        poke_u8(bytes+*pptr+0, QOI_RUN3);
        poke_u8(bytes+*pptr+1, run>>0);
        poke_u8(bytes+*pptr+2, run>>8);
@@ -471,6 +487,7 @@
            int index_pos = QOI_COLOR_HASH(px);

            if (index[index_pos].v == px.v) {
+histogram[0]++;
                poke_u8(bytes+p, QOI_INDEX1 | (index_pos << 1));
                p += 1;
            }
@@ -488,6 +505,7 @@
                        ((uint8_t)(dg + 2u) <= 3u) &&
                        ((uint8_t)(db + 2u) <= 3u)
                    ) {
+histogram[1]++;
                        poke_u8(bytes+p, QOI_DIFF1 |
                                ((uint8_t)(dr + 2u) << 2u) |
                                ((uint8_t)(dg + 2u) << 4u) |
@@ -499,6 +517,7 @@
                        ((uint8_t)(dg + 8u) <= 15u) &&
                        ((uint8_t)(db + 8u) <= 15u)
                    ) {
+histogram[2]++;
                        poke_u16le(bytes+p, QOI_DIFF2 |
                                ((uint8_t)(dr + 8u) << 4u) |
                                ((uint8_t)(dg + 8u) << 8u) |
@@ -510,6 +529,7 @@
                        ((uint8_t)(dg + 16u) <= 31u) &&
                        ((uint8_t)(db + 16u) <= 31u)
                    ) {
+histogram[3]++;
                        poke_u24le(bytes+p, QOI_DIFF3 |
                                ((uint8_t)(dr + 16u) << 4u) |
                                ((uint8_t)(dg + 16u) << 9u) |
@@ -518,6 +538,7 @@
                        p += 3;
                    }
                    else {
+histogram[4]++;
                        bytes[p++] = QOI_DIFF4;
                        bytes[p++] = dr;
                        bytes[p++] = dg;
@@ -531,6 +552,7 @@
                        ((uint8_t)(db + 16u) <= 31u) &&
                        ((uint8_t)(da + 16u) <= 31u)
                    ) {
+histogram[3]++;
                            poke_u24le(bytes+p, QOI_DIFF3 |
                                    ((uint8_t)(dr + 16u) << 4u) |
                                    ((uint8_t)(dg + 16u) << 9u) |
@@ -539,6 +561,7 @@
                            p += 3;
                    }
                    else {
+histogram[5]++;
                        bytes[p++] = QOI_DIFF5;
                        bytes[p++] = dr;
                        bytes[p++] = dg;

and

--- a/qoibench.c    2021-12-06 09:29:35.576698449 +1100
+++ b/qoibench.c    2021-12-06 09:30:58.729699203 +1100
@@ -519,5 +519,12 @@

    benchmark_print_result("Totals (AVG)", totals);

+int histotal = 0;
+for (int i = 0; i < 9; i++) {
+  histotal += histogram[i];
+}
+for (int i = 0; i < 9; i++) {
+  printf("%-12s%12d   %0.4f\n", histnames[i], histogram[i], (double)histogram[i]/(double)histotal);
+}
    return 0;
 }

Opcode Histograms

Edit: I split out an artificial QOI_REP1 row to count "a 1-length run", which demo10 can encode in 1 byte either as a QOI_INDEX1 or a QOI_DIFF1.

I might be triple-counting (calling qoi_encode thrice), but it's the fractions that matter, not the absolute counts.

NAME               COUNT FRACTION

images/kodak
QOI_INDEX1       5244627   0.1911
QOI_REP1         1577745   0.0575
QOI_DIFF1        2229174   0.0812
QOI_DIFF2       10353426   0.3772
QOI_DIFF3        3843036   0.1400
QOI_DIFF4        3814401   0.1390
QOI_DIFF5              0   0.0000
QOI_RUN1          383481   0.0140
QOI_RUN2            1407   0.0001
QOI_RUN3             297   0.0000

images/misc
QOI_INDEX1        335304   0.1300
QOI_REP1          220257   0.0854
QOI_DIFF1         679683   0.2635
QOI_DIFF2         292371   0.1133
QOI_DIFF3         586974   0.2275
QOI_DIFF4          98868   0.0383
QOI_DIFF5         137346   0.0532
QOI_RUN1          211008   0.0818
QOI_RUN2           12822   0.0050
QOI_RUN3            5088   0.0020

images/screenshots
QOI_INDEX1      12828252   0.2432
QOI_REP1         2584725   0.0490
QOI_DIFF1        9000384   0.1706
QOI_DIFF2        8655897   0.1641
QOI_DIFF3        3652713   0.0692
QOI_DIFF4        9636693   0.1827
QOI_DIFF5              0   0.0000
QOI_RUN1         5087340   0.0964
QOI_RUN2         1122582   0.0213
QOI_RUN3          182223   0.0035

images/textures
QOI_INDEX1      11350341   0.3444
QOI_REP1         1147431   0.0348
QOI_DIFF1        1980981   0.0601
QOI_DIFF2        9417486   0.2857
QOI_DIFF3        4313643   0.1309
QOI_DIFF4        3876123   0.1176
QOI_DIFF5              0   0.0000
QOI_RUN1          819174   0.0249
QOI_RUN2           53082   0.0016
QOI_RUN3            1122   0.0000

images/wallpaper
QOI_INDEX1     199880241   0.3101
QOI_REP1        59868390   0.0929
QOI_DIFF1      110251467   0.1711
QOI_DIFF2      160711896   0.2494
QOI_DIFF3       39464934   0.0612
QOI_DIFF4       32742819   0.0508
QOI_DIFF5              0   0.0000
QOI_RUN1        40708647   0.0632
QOI_RUN2          869475   0.0013
QOI_RUN3           14805   0.0000

tango-icon-library-pngs/png-512
QOI_INDEX1      10806669   0.3053
QOI_REP1         3333795   0.0942
QOI_DIFF1        8349252   0.2359
QOI_DIFF2        1361142   0.0385
QOI_DIFF3        1861113   0.0526
QOI_DIFF4        1803873   0.0510
QOI_DIFF5         927729   0.0262
QOI_RUN1         6305679   0.1781
QOI_RUN2          573573   0.0162
QOI_RUN3           74694   0.0021

tango-icon-library-pngs/png-64
QOI_INDEX1        420957   0.3175
QOI_REP1           69408   0.0523
QOI_DIFF1         151905   0.1146
QOI_DIFF2         119769   0.0903
QOI_DIFF3         113607   0.0857
QOI_DIFF4         175122   0.1321
QOI_DIFF5         166518   0.1256
QOI_RUN1           98715   0.0744
QOI_RUN2            9330   0.0070
QOI_RUN3             594   0.0004

Honestly I didn't understand why increasing the color cache should affect the compression ratio much. This helps mostly for palletized images, but not for regular images where colors are not often reused.

I'm not entirely sure either, but FWIW here's some data on how often QOI_INDEX1 gets used as I dial down the 127 in this line:

#define QOI_COLOR_HASH(C) ((C.rgba.r ^ C.rgba.g ^ C.rgba.b ^ C.rgba.a) & 127)

Color Cache Hit Rate

images/kodak
QOI_INDEX1        921066   0.0336   1-bit hash
QOI_INDEX1       1884114   0.0686   2-bit hash
QOI_INDEX1       3071772   0.1119   3-bit hash
QOI_INDEX1       3932514   0.1433   4-bit hash
QOI_INDEX1       4478970   0.1632   5-bit hash
QOI_INDEX1       4840920   0.1764   6-bit hash
QOI_INDEX1       5244627   0.1911   7-bit hash

images/misc
QOI_INDEX1         76332   0.0296   1-bit hash
QOI_INDEX1        128883   0.0500   2-bit hash
QOI_INDEX1        169314   0.0656   3-bit hash
QOI_INDEX1        201573   0.0781   4-bit hash
QOI_INDEX1        235701   0.0914   5-bit hash
QOI_INDEX1        278769   0.1081   6-bit hash
QOI_INDEX1        335304   0.1300   7-bit hash

images/screenshots
QOI_INDEX1       1484016   0.0281   1-bit hash
QOI_INDEX1       3179232   0.0603   2-bit hash
QOI_INDEX1       4798965   0.0910   3-bit hash
QOI_INDEX1       6541278   0.1240   4-bit hash
QOI_INDEX1       8495481   0.1610   5-bit hash
QOI_INDEX1      10678482   0.2024   6-bit hash
QOI_INDEX1      12828252   0.2432   7-bit hash

images/textures
QOI_INDEX1       1594854   0.0484   1-bit hash
QOI_INDEX1       2712801   0.0823   2-bit hash
QOI_INDEX1       4055679   0.1231   3-bit hash
QOI_INDEX1       5755113   0.1746   4-bit hash
QOI_INDEX1       7795578   0.2365   5-bit hash
QOI_INDEX1       9923466   0.3011   6-bit hash
QOI_INDEX1      11350341   0.3444   7-bit hash

images/wallpaper
QOI_INDEX1      46937343   0.0728   1-bit hash
QOI_INDEX1     102437733   0.1589   2-bit hash
QOI_INDEX1     147887331   0.2295   3-bit hash
QOI_INDEX1     177376335   0.2752   4-bit hash
QOI_INDEX1     190361811   0.2954   5-bit hash
QOI_INDEX1     196686504   0.3052   6-bit hash
QOI_INDEX1     199880241   0.3101   7-bit hash

tango-icon-library-pngs/png-512
QOI_INDEX1        445926   0.0126   1-bit hash
QOI_INDEX1        905001   0.0256   2-bit hash
QOI_INDEX1       1629585   0.0460   3-bit hash
QOI_INDEX1       2944923   0.0832   4-bit hash
QOI_INDEX1       5267601   0.1488   5-bit hash
QOI_INDEX1       8354949   0.2360   6-bit hash
QOI_INDEX1      10806669   0.3053   7-bit hash

tango-icon-library-pngs/png-64
QOI_INDEX1         17541   0.0132   1-bit hash
QOI_INDEX1         39786   0.0300   2-bit hash
QOI_INDEX1         75480   0.0569   3-bit hash
QOI_INDEX1        147141   0.1110   4-bit hash
QOI_INDEX1        253395   0.1911   5-bit hash
QOI_INDEX1        355503   0.2681   6-bit hash
QOI_INDEX1        420957   0.3175   7-bit hash

In WebP lossless I use a 'double RLE' mode. The double RLE mode can repeat the previous pixel, or copy N pixels of the previous row at respective locations. I choose the longer of the RLEs. Having both RLEs instead of one turns out to be helpful on a large variety of images. In addition to it, I try the full LZ77 mode, and choose the one that generates less entropy in initial analysis.

The double RLE mode uses only distance codes 0 and 1, reducing the entropy of distance codes (See https://developers.google.com/speed/webp/docs/webp_lossless_bitstream_specification#42_encoding_of_image_data) into one bit.

nigeltao / qoi2-bikeshed

Demo 10: a Proof of Concept #22

Opcode Histograms

Color Cache Hit Rate