kjn / lbzip2

Parallel bzip2 utility
GNU General Public License v3.0
132 stars 17 forks

Speedup opportunity via vector shuffle #31

Open amonakov opened 2 years ago

amonakov commented 2 years ago

@bonzini forwarded me a request to contribute this here. Thanks Paolo, and thank you Mikolaj for lbzip2.

There's an opportunity to significantly speed up the common case in the inverse MTF transform where the reordered index falls in the first sliding window (i.e. 0..15). Instead of taking a poorly predictable branch, we can load the entire window, shuffle it in registers, and write it back. With SSSE3 this boils down to a load-pshufb-store sequence, but the same could be done by shifting and masking in general-purpose registers (obviously more instructions, but likely still an improvement over the current branchy code).
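For reference, the move-to-front step that the branchy switch implements is equivalent to the following scalar sketch (a hypothetical helper; `pp`/`nn` only mirror the patch's naming, this is not lbzip2's actual code):

```c
#include <stdint.h>
#include <string.h>

/* Move the byte at index nn (0..15) of the 16-byte window pp to the
   front, shifting bytes 0..nn-1 up by one position. */
static void mtf16_scalar(uint8_t *pp, unsigned nn)
{
    uint8_t c = pp[nn];
    memmove(pp + 1, pp, nn);  /* shift the preceding bytes up by one */
    pp[0] = c;
}
```

The vector version replaces the variable-length move with a single constant-time shuffle selected by `nn`.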

I coded the following as a personal exercise. It is probably not upstreamable as-is; I would like to hear what the requirements for a proper patch might be. If you try this patch, keep in mind that CFLAGS needs to enable SSSE3 (-mssse3, or something like -march=native).

--- ../lbzip2/src/decode.c  2022-07-06 00:57:22.365988349 +0300
+++ src/decode.c    2022-07-06 09:54:45.420303970 +0300
@@ -452,6 +452,34 @@ mtf_one(uint8_t **imtf_row, uint8_t *imt
        version is given in #else clause.
      */
 #if ROW_WIDTH == 16
+#if defined(__SSSE3__)
+    {
+      typedef char c8v16 __attribute__((vector_size(16)));
+      static const c8v16 tab[] = {
+        { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 2, 0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 3, 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 4, 0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 5, 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 6, 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 7, 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 8, 0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15 },
+        { 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15 },
+        { 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15 },
+        { 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15 },
+        { 12, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15 },
+        { 13, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 },
+        { 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15 },
+        { 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 },
+      };
+      c8v16 t;
+      memcpy(&t, pp, sizeof t);
+      t = __builtin_ia32_pshufb128(t, tab[nn]);
+      memcpy(pp, &t, sizeof t);
+      return c;
+    }
+#endif
     switch (nn) {
     default:
       abort();
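For a variant in "general terms" without intrinsics, the same permutation-table idea can be written in plain C: gather the window through the permutation row for `nn`. Modern GCC/Clang can often auto-vectorize such a fixed-length loop into a byte shuffle when SSSE3 is enabled. This is a hypothetical, untested sketch (the helper name and passing the table as a parameter are assumptions, not lbzip2 code; the table contents match the patch above):

```c
#include <stdint.h>
#include <string.h>

/* Portable byte-shuffle version of the move-to-front step:
   s[i] = t[tab[nn][i]], where row nn of tab is
   { nn, 0, 1, ..., nn-1, nn+1, ..., 15 }. */
static void mtf16_shuffle(uint8_t *pp, unsigned nn,
                          const uint8_t tab[16][16])
{
    uint8_t t[16], s[16];
    memcpy(t, pp, 16);
    for (int i = 0; i < 16; i++)
        s[i] = t[tab[nn][i]];
    memcpy(pp, s, 16);
}
```

On targets without SSSE3 (e.g. ARM), the compiler either emits the equivalent NEON shuffle or falls back to scalar code, so the same source stays testable everywhere.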
tansy commented 2 years ago

I tried to test it, but despite cpuid claiming SSSE3 support it wouldn't compile. Also, there is no SSE on ARM. Would you mind releasing it in "general terms", so it could be tested there as well?

Ed.: I used `-mssse3` and it compiled. The test results turned out to be interesting:

$ time lbzip2-2.5  -t -n1  silesia.tar.bzip2

real    0m15.470s
user    0m14.405s
sys     0m1.073s

$ time lbzip2-2.5-i31  -t -n1  silesia.tar.bzip2

real    0m15.300s
user    0m14.034s
sys     0m1.116s

$ time lbzip2-2.5  -t -n1  silesia.tar.lbzip2 

real    0m15.496s
user    0m14.267s
sys     0m1.071s

$ time lbzip2-2.5-i31  -t -n1  silesia.tar.lbzip2

real    0m18.341s
user    0m16.430s
sys     0m1.099s

So with the bzip2-created archive there was no difference, while with the lbzip2-created archive it was about 18% slower (15.5 s vs. 18.3 s wall time).