Open nigeltao opened 2 years ago
any idea why the size is worse than experimental? is that fixable?
It's not strictly worse, though. On the "size kb" column, qoi-experi is smaller on kodak, textures and wallpaper (which are all opaque and, on average, more photographic) but qoi-demo10 is smaller on misc and screenshots.
I'm not sure about the underlying reasons other than there's a fixed amount of 8-bit opcode space and giving more opcodes to one thing means taking away opcodes from another thing. demo10 has a larger QOI_INDEX size. experi has a whole new op category (QOI_GDIFF). Any particular trade-off might look better on one test suite but worse on another.
As a data point, I added a simple "subtract green" transform to demo10's qoi_encode (and its inverse to qoi_decode):
--- qoi-demo10.h 2021-12-04 08:53:03.115172245 +1100
+++ qoi-subg10.h 2021-12-04 09:00:34.854176340 +1100
@@ -454,6 +454,8 @@
px.rgba.g = pixels[px_pos+1];
px.rgba.b = pixels[px_pos+2];
}
+ px.rgba.r -= px.rgba.g;
+ px.rgba.b -= px.rgba.g;
if (px.v == px_prev.v) {
run++;
@@ -947,6 +949,11 @@
}
}
+ for (int px_pos = 0; px_pos < px_len; px_pos += channels) {
+ pixels[px_pos + 0] += pixels[px_pos + 1];
+ pixels[px_pos + 2] += pixels[px_pos + 1];
+ }
+
return pixels;
}
Again, there's not an outright winner. Some compression sizes got slightly better. Some got slightly worse.
It also tanked the throughput numbers, though. I'm not sure if that's fixable. This was a very quick evaluation focuing on file sizes.
decode ms encode ms decode mpps encode mpps size kb
images/kodak
qoi-demo10: 4.0 6.8 98.25 58.03 772
qoi-subg10: 5.0 8.7 79.06 45.08 756
images/misc
qoi-demo10: 2.7 4.3 327.75 204.86 400
qoi-subg10: 4.3 6.9 206.77 129.22 402
images/screenshots
qoi-demo10: 28.1 33.6 292.57 244.87 2481
qoi-subg10: 42.7 60.1 192.93 137.04 2446
images/textures
qoi-demo10: 0.9 1.5 138.48 85.30 180
qoi-subg10: 1.2 1.9 106.70 66.81 177
images/wallpaper
qoi-demo10: 101.1 124.1 92.70 75.50 10669
qoi-subg10: 123.4 163.2 75.95 57.43 10680
Another data point, on the 512x512 pixel editions of the Tango Icon Library icons (as suggested in #17). 213 PNG images. Average PNG size is 51 KiB.
decode ms encode ms decode mpps encode mpps size kb
tango-icon-library-pngs/png-512
qoi-master: 1.5 1.5 174.95 176.58 80
qoi-experi: 1.5 1.5 176.91 178.29 79
qoi-demo10: 0.8 1.0 316.69 259.58 76
demo10/master ratio: 1.81x 1.47x
Honestly I didn't understand why increasing the color cache should affect the compression ratio much. This helps mostly for palletized images, but not for regular images where colors are not often reused.
Here's a patch to count opcode histograms:
--- a/qoi-demo10.h 2021-12-04 08:53:03.115172245 +1100
+++ b/qoi-demo10.h 2021-12-06 09:22:55.592694824 +1100
@@ -1,3 +1,15 @@
+int histogram[9] = {0};
+const char* histnames[9] = {
+"QOI_INDEX1",
+"QOI_DIFF1",
+"QOI_DIFF2",
+"QOI_DIFF3",
+"QOI_DIFF4",
+"QOI_DIFF5",
+"QOI_RUN1",
+"QOI_RUN2",
+"QOI_RUN3",
+};
/*
QOI - The "Quite OK Image" format for fast, lossless image compression
@@ -386,19 +398,23 @@
static inline void qoi_encode_run(unsigned char *bytes, int *pptr, int run) {
if (run == 1) {
+histogram[1]++;
poke_u8(bytes+*pptr, QOI_DIFF1 | 0xA8);
*pptr += 1;
}
else if (run < 30) {
+histogram[6]++;
poke_u8(bytes+*pptr, QOI_RUN1 | (run << 3));
*pptr += 1;
}
else if (run < 256) {
+histogram[7]++;
poke_u8(bytes+*pptr+0, QOI_RUN2);
poke_u8(bytes+*pptr+1, run);
*pptr += 2;
}
else {
+histogram[8]++;
poke_u8(bytes+*pptr+0, QOI_RUN3);
poke_u8(bytes+*pptr+1, run>>0);
poke_u8(bytes+*pptr+2, run>>8);
@@ -471,6 +487,7 @@
int index_pos = QOI_COLOR_HASH(px);
if (index[index_pos].v == px.v) {
+histogram[0]++;
poke_u8(bytes+p, QOI_INDEX1 | (index_pos << 1));
p += 1;
}
@@ -488,6 +505,7 @@
((uint8_t)(dg + 2u) <= 3u) &&
((uint8_t)(db + 2u) <= 3u)
) {
+histogram[1]++;
poke_u8(bytes+p, QOI_DIFF1 |
((uint8_t)(dr + 2u) << 2u) |
((uint8_t)(dg + 2u) << 4u) |
@@ -499,6 +517,7 @@
((uint8_t)(dg + 8u) <= 15u) &&
((uint8_t)(db + 8u) <= 15u)
) {
+histogram[2]++;
poke_u16le(bytes+p, QOI_DIFF2 |
((uint8_t)(dr + 8u) << 4u) |
((uint8_t)(dg + 8u) << 8u) |
@@ -510,6 +529,7 @@
((uint8_t)(dg + 16u) <= 31u) &&
((uint8_t)(db + 16u) <= 31u)
) {
+histogram[3]++;
poke_u24le(bytes+p, QOI_DIFF3 |
((uint8_t)(dr + 16u) << 4u) |
((uint8_t)(dg + 16u) << 9u) |
@@ -518,6 +538,7 @@
p += 3;
}
else {
+histogram[4]++;
bytes[p++] = QOI_DIFF4;
bytes[p++] = dr;
bytes[p++] = dg;
@@ -531,6 +552,7 @@
((uint8_t)(db + 16u) <= 31u) &&
((uint8_t)(da + 16u) <= 31u)
) {
+histogram[3]++;
poke_u24le(bytes+p, QOI_DIFF3 |
((uint8_t)(dr + 16u) << 4u) |
((uint8_t)(dg + 16u) << 9u) |
@@ -539,6 +561,7 @@
p += 3;
}
else {
+histogram[5]++;
bytes[p++] = QOI_DIFF5;
bytes[p++] = dr;
bytes[p++] = dg;
and
--- a/qoibench.c 2021-12-06 09:29:35.576698449 +1100
+++ b/qoibench.c 2021-12-06 09:30:58.729699203 +1100
@@ -519,5 +519,12 @@
benchmark_print_result("Totals (AVG)", totals);
+int histotal = 0;
+for (int i = 0; i < 9; i++) {
+ histotal += histogram[i];
+}
+for (int i = 0; i < 9; i++) {
+ printf("%-12s%12d %0.4f\n", histnames[i], histogram[i], (double)histogram[i]/(double)histotal);
+}
return 0;
}
Edit: I split out an artificial QOI_REP1 row to count "a 1-length run", which demo10 can encode in 1 byte either as a QOI_INDEX1 or a QOI_DIFF1.
I might be triple-counting (calling qoi_encode thrice), but it's the fractions that matter, not the absolute counts.
NAME COUNT FRACTION
images/kodak
QOI_INDEX1 5244627 0.1911
QOI_REP1 1577745 0.0575
QOI_DIFF1 2229174 0.0812
QOI_DIFF2 10353426 0.3772
QOI_DIFF3 3843036 0.1400
QOI_DIFF4 3814401 0.1390
QOI_DIFF5 0 0.0000
QOI_RUN1 383481 0.0140
QOI_RUN2 1407 0.0001
QOI_RUN3 297 0.0000
images/misc
QOI_INDEX1 335304 0.1300
QOI_REP1 220257 0.0854
QOI_DIFF1 679683 0.2635
QOI_DIFF2 292371 0.1133
QOI_DIFF3 586974 0.2275
QOI_DIFF4 98868 0.0383
QOI_DIFF5 137346 0.0532
QOI_RUN1 211008 0.0818
QOI_RUN2 12822 0.0050
QOI_RUN3 5088 0.0020
images/screenshots
QOI_INDEX1 12828252 0.2432
QOI_REP1 2584725 0.0490
QOI_DIFF1 9000384 0.1706
QOI_DIFF2 8655897 0.1641
QOI_DIFF3 3652713 0.0692
QOI_DIFF4 9636693 0.1827
QOI_DIFF5 0 0.0000
QOI_RUN1 5087340 0.0964
QOI_RUN2 1122582 0.0213
QOI_RUN3 182223 0.0035
images/textures
QOI_INDEX1 11350341 0.3444
QOI_REP1 1147431 0.0348
QOI_DIFF1 1980981 0.0601
QOI_DIFF2 9417486 0.2857
QOI_DIFF3 4313643 0.1309
QOI_DIFF4 3876123 0.1176
QOI_DIFF5 0 0.0000
QOI_RUN1 819174 0.0249
QOI_RUN2 53082 0.0016
QOI_RUN3 1122 0.0000
images/wallpaper
QOI_INDEX1 199880241 0.3101
QOI_REP1 59868390 0.0929
QOI_DIFF1 110251467 0.1711
QOI_DIFF2 160711896 0.2494
QOI_DIFF3 39464934 0.0612
QOI_DIFF4 32742819 0.0508
QOI_DIFF5 0 0.0000
QOI_RUN1 40708647 0.0632
QOI_RUN2 869475 0.0013
QOI_RUN3 14805 0.0000
tango-icon-library-pngs/png-512
QOI_INDEX1 10806669 0.3053
QOI_REP1 3333795 0.0942
QOI_DIFF1 8349252 0.2359
QOI_DIFF2 1361142 0.0385
QOI_DIFF3 1861113 0.0526
QOI_DIFF4 1803873 0.0510
QOI_DIFF5 927729 0.0262
QOI_RUN1 6305679 0.1781
QOI_RUN2 573573 0.0162
QOI_RUN3 74694 0.0021
tango-icon-library-pngs/png-64
QOI_INDEX1 420957 0.3175
QOI_REP1 69408 0.0523
QOI_DIFF1 151905 0.1146
QOI_DIFF2 119769 0.0903
QOI_DIFF3 113607 0.0857
QOI_DIFF4 175122 0.1321
QOI_DIFF5 166518 0.1256
QOI_RUN1 98715 0.0744
QOI_RUN2 9330 0.0070
QOI_RUN3 594 0.0004
Honestly I didn't understand why increasing the color cache should affect the compression ratio much. This helps mostly for palletized images, but not for regular images where colors are not often reused.
I'm not entirely sure either, but FWIW here's some data on how often QOI_INDEX1
gets used as I dial down the 127 in this line:
#define QOI_COLOR_HASH(C) ((C.rgba.r ^ C.rgba.g ^ C.rgba.b ^ C.rgba.a) & 127)
images/kodak
QOI_INDEX1 921066 0.0336 1-bit hash
QOI_INDEX1 1884114 0.0686 2-bit hash
QOI_INDEX1 3071772 0.1119 3-bit hash
QOI_INDEX1 3932514 0.1433 4-bit hash
QOI_INDEX1 4478970 0.1632 5-bit hash
QOI_INDEX1 4840920 0.1764 6-bit hash
QOI_INDEX1 5244627 0.1911 7-bit hash
images/misc
QOI_INDEX1 76332 0.0296 1-bit hash
QOI_INDEX1 128883 0.0500 2-bit hash
QOI_INDEX1 169314 0.0656 3-bit hash
QOI_INDEX1 201573 0.0781 4-bit hash
QOI_INDEX1 235701 0.0914 5-bit hash
QOI_INDEX1 278769 0.1081 6-bit hash
QOI_INDEX1 335304 0.1300 7-bit hash
images/screenshots
QOI_INDEX1 1484016 0.0281 1-bit hash
QOI_INDEX1 3179232 0.0603 2-bit hash
QOI_INDEX1 4798965 0.0910 3-bit hash
QOI_INDEX1 6541278 0.1240 4-bit hash
QOI_INDEX1 8495481 0.1610 5-bit hash
QOI_INDEX1 10678482 0.2024 6-bit hash
QOI_INDEX1 12828252 0.2432 7-bit hash
images/textures
QOI_INDEX1 1594854 0.0484 1-bit hash
QOI_INDEX1 2712801 0.0823 2-bit hash
QOI_INDEX1 4055679 0.1231 3-bit hash
QOI_INDEX1 5755113 0.1746 4-bit hash
QOI_INDEX1 7795578 0.2365 5-bit hash
QOI_INDEX1 9923466 0.3011 6-bit hash
QOI_INDEX1 11350341 0.3444 7-bit hash
images/wallpaper
QOI_INDEX1 46937343 0.0728 1-bit hash
QOI_INDEX1 102437733 0.1589 2-bit hash
QOI_INDEX1 147887331 0.2295 3-bit hash
QOI_INDEX1 177376335 0.2752 4-bit hash
QOI_INDEX1 190361811 0.2954 5-bit hash
QOI_INDEX1 196686504 0.3052 6-bit hash
QOI_INDEX1 199880241 0.3101 7-bit hash
tango-icon-library-pngs/png-512
QOI_INDEX1 445926 0.0126 1-bit hash
QOI_INDEX1 905001 0.0256 2-bit hash
QOI_INDEX1 1629585 0.0460 3-bit hash
QOI_INDEX1 2944923 0.0832 4-bit hash
QOI_INDEX1 5267601 0.1488 5-bit hash
QOI_INDEX1 8354949 0.2360 6-bit hash
QOI_INDEX1 10806669 0.3053 7-bit hash
tango-icon-library-pngs/png-64
QOI_INDEX1 17541 0.0132 1-bit hash
QOI_INDEX1 39786 0.0300 2-bit hash
QOI_INDEX1 75480 0.0569 3-bit hash
QOI_INDEX1 147141 0.1110 4-bit hash
QOI_INDEX1 253395 0.1911 5-bit hash
QOI_INDEX1 355503 0.2681 6-bit hash
QOI_INDEX1 420957 0.3175 7-bit hash
In WebP lossless I use a 'double RLE' mode. The double RLE mode can repeat the previous pixel, or copy N pixels of the previous row at respective locations. I choose the longer of the RLEs. Having both RLEs instead of one turns out to be helpful on a large variety of images. In addition to it, I try the full LZ77 mode, and choose the one that generates less entropy in initial analysis.
The double RLE mode uses only distance codes 0 and 1, reducing the entropy of distance codes (See https://developers.google.com/speed/webp/docs/webp_lossless_bitstream_specification#42_encoding_of_image_data) into one bit.
Here are some pretty sweet throughput numbers on my laptop (with similar compression sizes) for a proof of concept, combining little-endianness (#3) with 7-bit indexes (#20). Little-endianness in particular allows us to remove a lot of branching in the decoder's inner loops.
I can't think of a good name for it right now, so I'm just going to call it "Demo 10".
Details: