Open nigeltao opened 2 years ago
For what it's worth, this trivial change to the hash function:
diff --git a/qoi.h b/qoi.h
index 0aec728..da278a8 100644
--- a/qoi.h
+++ b/qoi.h
@@ -321,7 +321,7 @@ void *qoi_decode(const void *data, int size, qoi_desc *desc, int channels);
#define QOI_MASK_3 0xe0 // 11100000
#define QOI_MASK_4 0xf0 // 11110000
-#define QOI_COLOR_HASH(C) (C.rgba.r ^ C.rgba.g ^ C.rgba.b ^ C.rgba.a)
+#define QOI_COLOR_HASH(C) (C.rgba.r ^ C.rgba.g ^ C.rgba.b)
#define QOI_MAGIC \
(((unsigned int)'q') << 24 | ((unsigned int)'o') << 16 | \
((unsigned int)'i') << 8 | ((unsigned int)'f'))
Has a noticable improvement in encode throughput on my laptop, for images/misc (1.18x) and images/screenshots (1.21x). The images/misc set (the only one with not-fully-opaque images) also shows slightly worse compression size (1.01x).
decode ms encode ms decode mpps encode mpps size kb
images/kodak
qoi-before: 6.4 8.3 61.02 47.11 771
qoi-after: 6.4 8.7 61.12 45.10 771
images/misc
qoi-before: 5.8 6.1 154.48 146.05 400
qoi-after: 5.7 5.2 155.45 172.22 404
images/screenshots
qoi-before: 53.9 48.6 152.63 169.25 2582
qoi-after: 54.4 40.3 151.31 204.22 2582
images/textures
qoi-before: 1.6 2.0 80.17 65.13 184
qoi-after: 1.6 2.0 79.42 65.67 184
images/wallpaper
qoi-before: 137.6 153.6 68.10 61.02 10640
qoi-after: 139.1 151.8 67.38 61.75 10640
Here's another experiment, called "qoi-fifo2e".
Instead of tweaking the hash function, we remove the hash function and just use a FIFO cache: keep the 64 most recently written (not necessarily most recently used/read) colors. Update the cache on DIFF, LUMA, RGB and RGBA ops. Leave the cache unchanged on INDEX and RUN ops.
Here's the diff (the code is a little awkward because I was trying to minimize the diff):
--- qoi-master-2ee2169.h 2021-12-12 08:44:59.203920322 +1100
+++ qoi-fifo2e.h 2021-12-15 13:29:18.194335467 +1100
@@ -387,6 +387,7 @@
const unsigned char *pixels = (const unsigned char *)data;
qoi_rgba_t index[64] = {0};
+ unsigned int index_wpos = 0;
int run = 0;
qoi_rgba_t px_prev = {.rgba = {.r = 0, .g = 0, .b = 0, .a = 0}};
@@ -419,13 +420,15 @@
run = 0;
}
- int index_pos = QOI_COLOR_HASH(px) % 64;
-
+ for (int index_pos = 0;; index_pos++) {
+ if (index_pos < 64) {
if (index[index_pos].v == px.v) {
bytes[p++] = QOI_OP_INDEX | index_pos;
+ break;
+ }
}
else {
- index[index_pos] = px;
+ index[index_wpos++ & 63] = px;
if (px.rgba.a == px_prev.rgba.a) {
char vr = px.rgba.r - px_prev.rgba.r;
@@ -464,7 +467,10 @@
bytes[p++] = px.rgba.b;
bytes[p++] = px.rgba.a;
}
+ break;
}
+ } // end-for (int index_pos = 0;; index_pos++)
+
}
px_prev = px;
}
@@ -516,6 +522,7 @@
qoi_rgba_t px = {.rgba = {.r = 0, .g = 0, .b = 0, .a = 0}};
qoi_rgba_t index[64] = {0};
+ unsigned int index_wpos = 0;
int run = 0;
int chunks_len = size - QOI_PADDING;
@@ -530,12 +537,14 @@
px.rgba.r = bytes[p++];
px.rgba.g = bytes[p++];
px.rgba.b = bytes[p++];
+ index[index_wpos++ & 63] = px;
}
else if (b1 == QOI_OP_RGBA) {
px.rgba.r = bytes[p++];
px.rgba.g = bytes[p++];
px.rgba.b = bytes[p++];
px.rgba.a = bytes[p++];
+ index[index_wpos++ & 63] = px;
}
else if ((b1 & QOI_MASK_2) == QOI_OP_INDEX) {
px = index[b1];
@@ -544,6 +553,7 @@
px.rgba.r += ((b1 >> 4) & 0x03) - 2;
px.rgba.g += ((b1 >> 2) & 0x03) - 2;
px.rgba.b += ( b1 & 0x03) - 2;
+ index[index_wpos++ & 63] = px;
}
else if ((b1 & QOI_MASK_2) == QOI_OP_LUMA) {
int b2 = bytes[p++];
@@ -551,12 +561,11 @@
px.rgba.r += vg - 8 + ((b2 >> 4) & 0x0f);
px.rgba.g += vg;
px.rgba.b += vg - 8 + (b2 & 0x0f);
+ index[index_wpos++ & 63] = px;
}
else if ((b1 & QOI_MASK_2) == QOI_OP_RUN) {
run = (b1 & 0x3f);
}
-
- index[QOI_COLOR_HASH(px) % 64] = px;
}
if (channels == 4) {
Here's the numbers. Encode speed gets 2.4x worse because we now have to compare against all 64 cache entries (instead of just 1) on each non-run pixel. Decode speed gets 1.2x better because index_wpos++
is faster than calculating the hash function, because we no longer update the index for INDEX and RUN ops (which was a no-op conceptually but still took CPU time, before the change) and maybe because the QOI files get a tiny bit smaller.
master is commit 2ee2169 (2021-12-11).
decode ms encode ms decode mpps encode mpps size kb rate
## Total for images/icon_512
libpng: 5.0 33.9 52.74 7.74 51 5.0%
qoi-master: 1.5 1.3 171.27 198.88 85 8.4%
qoi-fifo2e: 1.3 2.7 199.95 98.18 87 8.6%
## Total for images/icon_64
libpng: 0.2 0.9 23.26 4.35 3 23.6%
qoi-master: 0.0 0.0 104.20 100.77 4 28.7%
qoi-fifo2e: 0.0 0.1 115.03 38.60 4 28.2%
## Total for images/photo_kodak
libpng: 15.2 295.7 25.88 1.33 717 46.7%
qoi-master: 6.1 6.7 64.95 58.81 671 43.7%
qoi-fifo2e: 4.9 17.5 79.58 22.45 668 43.5%
## Total for images/photo_tecnick
libpng: 44.9 1143.8 32.08 1.26 2414 42.9%
qoi-master: 22.6 26.3 63.73 54.84 2527 44.9%
qoi-fifo2e: 17.9 66.6 80.23 21.61 2523 44.9%
## Total for images/photo_wikipedia
libpng: 42.9 865.8 25.28 1.25 2046 48.3%
qoi-master: 17.0 20.2 63.66 53.81 2102 49.6%
qoi-fifo2e: 14.5 54.3 74.86 19.99 2099 49.6%
## Total for images/pngimg
libpng: 56.4 533.3 32.08 3.39 1201 17.0%
qoi-master: 16.4 16.7 110.41 108.00 1436 20.3%
qoi-fifo2e: 14.3 38.8 126.59 46.68 1431 20.3%
## Total for images/screenshot_game
libpng: 18.6 216.7 34.08 2.92 448 18.1%
qoi-master: 6.7 6.5 93.95 97.27 519 21.0%
qoi-fifo2e: 5.7 15.6 110.87 40.59 511 20.7%
## Total for images/screenshot_web
libpng: 92.0 1060.7 88.33 7.66 2402 7.6%
qoi-master: 55.7 44.8 145.98 181.17 2649 8.3%
qoi-fifo2e: 48.4 88.0 167.98 92.34 2622 8.3%
## Total for images/textures_photo
libpng: 39.6 725.0 26.45 1.45 1977 48.3%
qoi-master: 15.0 16.0 69.87 65.44 1981 48.4%
qoi-fifo2e: 12.1 47.5 86.65 22.07 1984 48.4%
## Total for images/textures_pk
libpng: 0.9 24.6 47.24 1.81 89 51.5%
qoi-master: 0.8 0.8 59.21 57.16 75 43.5%
qoi-fifo2e: 0.6 1.8 78.65 24.52 72 41.6%
## Total for images/textures_pk01
libpng: 5.3 66.5 24.65 1.95 163 32.3%
qoi-master: 1.6 1.7 80.68 75.78 178 35.2%
qoi-fifo2e: 1.4 5.0 90.33 26.17 178 35.1%
## Total for images/textures_pk02
libpng: 11.7 205.0 25.97 1.48 427 36.1%
qoi-master: 4.5 4.5 67.71 67.38 479 40.4%
qoi-fifo2e: 3.7 12.8 82.66 23.80 482 40.7%
## Total for images/textures_plants
libpng: 31.9 340.6 33.35 3.12 857 20.6%
qoi-master: 8.6 9.9 123.62 107.79 922 22.2%
qoi-fifo2e: 7.3 22.9 144.92 46.53 920 22.1%
# Grand total for images
libpng: 13.5 187.9 34.47 2.47 423 23.3%
qoi-master: 5.1 5.2 91.89 89.38 463 25.6%
qoi-fifo2e: 4.2 12.7 109.58 36.57 460 25.4%
decode ms encode ms decode mpps encode mpps size kb rate
Refactored a little and tried making an AVX2 path with intrinsics. @notnullnotvoid this may not be optimal as I have zero experience, don't let this existing deter you from rolling your own and/or benchmarking. This exists to learn and to apply it to a variant, I'm not benchmarking further.
Compile with -mavx2 for avx2. Uses a gcc builtin to count trailing zeroes, msvc and maybe others will need their own definitions.
4700u gcc 11.2.0 -O3
# Grand total for qoi_benchmark_suite/images
decode ms encode ms decode mpps encode mpps size kb rate
libpng: 6.596 77.717 70.38 5.97 398 24.2%
stbi: 6.542 59.617 70.95 7.79 561 34.2%
qoi-masterae07: 2.123 2.574 218.61 180.31 463 25.6%
qoi-fifo2e-refac: 1.811 6.110 256.27 75.97 460 25.4%
qoi-fifo2e-refac-avx2:1.844 2.975 251.77 156.02 460 25.4%
Can you add a row for qoi-master to this benchmark?
Done. Didn't redo the other results like you're supposed to but the environmental conditions are the same as half an hour ago ;)
Currently, it's:
This is about as simple as it gets, which has its own virtues. There may be better ones, although be aware that better compression (size) might mean worse performance (throughput).
Some discussion (amongst other things) is at https://github.com/phoboslab/qoi/issues/48