powturbo / Turbo-Base64

Turbo Base64 - Fastest Base64 SIMD:SSE/AVX2/AVX512/Neon/Altivec - Faster than memcpy!
GNU General Public License v3.0

Benchmark doesn't test the usual scenario #12

Closed · ilyakurdyukov closed this issue 2 years ago

ilyakurdyukov commented 2 years ago

It's great that you can convert gigabytes of data per second, but that's if all those gigabytes are in one long line!

This is a rare scenario. The usual scenario is processing a large set of small strings (and 1 KB isn't very small). Some people might think that processing a large set of strings takes the same amount of time as processing one long string, but in reality the difference is huge. Try processing random sizes from, say, 1 to 64 bytes until N megabytes have been consumed.

Although prefetching gives a speedup on big data, for small sizes it fills the cache with data that the application might not need. Worst of all, if the prefetch touches an unallocated page (which could be the next one), a fault is raised; even though the OS handles and ignores it, that takes time.

So there should also be a test where the input addresses are not consecutive but randomized. (Then mis-prefetching will only make things worse.)

And the best thing would be a graph showing the cost of each size, starting at 1 and going up until the graph turns into a straight line. That's how people show the performance of memcpy. Like this.
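
To make the request concrete, here is a minimal sketch of such a harness (illustrative only: the pool size, RNG and totals are arbitrary choices, and "codec" stands for any function with the (in, inlen, out) signature used in this library; for a decoder the lengths would normally be rounded to multiples of 4):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef size_t (*b64_fn)(const unsigned char *in, size_t inlen, unsigned char *out);

    /* Process many small strings of random length (1..64 bytes) taken from
       randomized, non-consecutive offsets of a large pool, until total_bytes
       of input have been consumed. */
    size_t bench_small_strings(b64_fn codec, size_t total_bytes) {
        enum { POOL = 1 << 24 };                      /* 16 MB pool of inputs       */
        unsigned char *pool = malloc(POOL + 64);
        unsigned char out[2 * 64];                    /* room for the largest output */
        size_t done = 0, sink = 0;

        memset(pool, 'A', POOL + 64);                 /* 'A' is valid base64 input   */
        srand(12345);
        while (done < total_bytes) {
            size_t len = 1 + (size_t)(rand() % 64);   /* random size, 1..64 bytes    */
            size_t off = (size_t)rand() % POOL;       /* randomized start address    */
            sink += codec(pool + off, len, out);
            done += len;
        }
        free(pool);
        return sink;                                  /* keep the work observable    */
    }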

powturbo commented 2 years ago

If you read the readme carefully, you will also notice a benchmark with short strings. The benchmark is not trying to show every scenario or the worst case, but to compare Turbo-Base64 to other base64 libraries. Each user can benchmark with their own data. The ClickHouse people have evaluated several base64 libraries and have chosen Turbo-Base64. I'm also using very large I/O buffers to avoid the cached scenario shown in other benchmarks.

ilyakurdyukov commented 2 years ago

The problem, again, is the comparison to memcpy, because memcpy testing standards are much higher than the cases your benchmark measures. Just google the memcpy comparison images. With more thorough testing you can find weaknesses, and I think your code can be improved to handle those cases. But I don't see that you care at all. You have already claimed victory, that you are the best of the best, and most people believed you. Since they already believe it and none of the contenders have better benchmarks, you are happy with that and don't make any improvements against the possible weaknesses of your code.

I wouldn't criticize you if you weren't comparing to memcpy and claiming superiority, but you do that.

powturbo commented 2 years ago

In contrast to most of the software on GitHub, as you can see from all my repositories, I'm still improving and updating the software. I also have ideas for improving Turbo-Base64. Yes, there is actually no other faster base64 library, and that's the point. I'm only interpreting the benchmark numbers, and I'm providing benchmark programs to directly compare Turbo-Base64 with other libraries, so everyone can run the software. The comparison to memcpy is provided as a reference point and only for orientation; the use case of base64 is different, because memcpy is not doing base64. You rarely do base64 with very small lengths like 4 or 5 characters. You can't find other benchmarks comparing Turbo-Base64; they simply ignore its existence.

ilyakurdyukov commented 2 years ago

You can't find other benchmarks comparing Turbo-Base64.

I'll make my own to check these cases.

Your benchmark was recommended to me as a reference. And although your base64 SIMD core is very good, the benchmark is no good at all, because it only checks the cases that are beneficial to you, where your code gives excellent results. That's what I see.

And I'll say it again: I think that mis-prefetching data (that "s + 1024") is a bad thing.

powturbo commented 2 years ago

Prefetching is not used in the short-string versions _tb64avx2enc/_tb64avx2dec. I'll check that for the other functions in the next update; it is possible that with modern hardware it's no longer needed at all.

ilyakurdyukov commented 2 years ago

Prefetching gives a slight speedup (I checked it).

I think you can do something like Prefetch(min(src + 1024, src_end - 1)); that shouldn't add extra delay, but it will not prefetch unneeded data. The compiler may use cmov here.
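
A minimal sketch of that clamped prefetch (assuming a plain __builtin_prefetch; the distance of 1024 and the helper name are placeholders for the library's own PREFETCH macro):

    #include <stddef.h>

    /* Prefetch 1024 bytes ahead, but never past the last valid input byte,
       so the prefetch can't touch a page that may not be mapped. */
    static inline void prefetch_clamped(const unsigned char *p,
                                        const unsigned char *end) {
        const unsigned char *q = p + 1024;
        if (q > end - 1) q = end - 1;   /* the branch is likely compiled to cmov */
        __builtin_prefetch(q, 0 /* read */, 0 /* low temporal locality */);
    }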

ilyakurdyukov commented 2 years ago

Also, I think that the leftover bytes at the end:

    size_t rc = _tb64xd(ip, inlen-(ip-in), op);
    if(!rc || cx) return 0;
    return (op-out)+rc;

can be read into a vector and processed without falling back to _tb64xd(), especially since AVX2 has _mm256_maskload_epi32. (The remaining 2-3 bytes are a bit of a problem, but it can be solved.)

And I want to try to implement it (but not for your project).
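
A sketch of that idea, assuming AVX2 only (illustrative code, not the project's): whole 32-bit lanes come from _mm256_maskload_epi32, and the trailing 1-3 bytes are patched in separately.

    #include <immintrin.h>
    #include <string.h>

    /* Read up to 32 leftover bytes (n < 32) into a YMM register without
       touching memory past the end of the input. */
    static __m256i load_tail32(const unsigned char *p, size_t n) {
        __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32((int)(n / 4)), idx);
        __m256i v    = _mm256_maskload_epi32((const int *)p, mask);

        size_t rem = n & 3;
        if (rem) {                                    /* patch in the last 1..3 bytes */
            int last = 0, lanes[8];
            memcpy(&last, p + (n & ~(size_t)3), rem);
            _mm256_storeu_si256((__m256i *)lanes, v);
            lanes[n / 4] = last;
            v = _mm256_loadu_si256((const __m256i *)lanes);
        }
        return v;
    }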

ilyakurdyukov commented 2 years ago

If you provided results for small strings, then why is there no option for running the benchmark in this mode? Because I don't see this feature. (And that's what I asked for.)

But I have already written my own benchmark, which checks blocks of power-of-two sizes from 1 to 1 << 20, and also in different modes (same block, consecutive blocks, or random order).

And I did beat your results at some sizes/modes with my base64 modification. I haven't used loop unrolling yet; it might give my code some boost if I do. And the idea of handling leftovers in a vector really helps on x86 CPUs for small inputs (it makes things worse for aarch64).

powturbo commented 2 years ago

Congratulations. The short-string mode was included because of a request from the ClickHouse people; I've not spent much time on it. I also have ideas for improving the processing for several representative sizes. I'll do that in the next update, when I find some spare time and a recent CPU for testing. I also have a not-yet-published Python interface.

ilyakurdyukov commented 2 years ago

So if you need to compete with someone to improve your code, I'll give it to you.

And the users of your code will benefit if you make it better. So this is not a useless competition.

I still want to experiment more with my code. I also plan to add an .svg image writer to my benchmark so that the results can be viewed in a more convenient form; I want it to look like those memcpy comparisons I talked about.

ilyakurdyukov commented 2 years ago

Or maybe I didn't win at that many sizes, because I mixed up the AVX2 TB64 short-string/regular functions.

However, when trying to test TB64 on Aarch64, I found that TB64 performance is really poor.

Like what's that?

$ ./tb64app 
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
   443.50       1336 133.60%   501.97 tb64s        1000
   787.83       1336 133.60%   758.23 tb64x        1000
   611.15       1336 133.60%   460.90 tb64sse        1000
   609.07       1336 133.60%   459.97 tb64auto        1000
  6803.48       1000 100.00%  6899.79 memcpy        1000
size=10Kb (Kb=1.000)
   457.21      13336 133.36%   511.54 tb64s       10000
   830.79      13336 133.36%   795.15 tb64x       10000
   662.40      13336 133.36%   475.43 tb64sse       10000
   662.19      13336 133.36%   475.12 tb64auto       10000
  7936.60      10000 100.00%  7937.80 memcpy       10000
size=100Kb (Kb=1.000)
   442.34     133336 133.34%   482.84 tb64s      100000
   774.62     133336 133.34%   707.64 tb64x      100000
   658.56     133336 133.34%   447.41 tb64sse      100000
   658.55     133336 133.34%   447.37 tb64auto      100000
  6512.73     100000 100.00%  6558.52 memcpy      100000
size=1Mb (Mb=1.000.000)
   443.07    1333336 133.33%   433.75 tb64s     1000000
   803.85    1333336 133.33%   488.02 tb64x     1000000
   609.80    1333336 133.33%   412.00 tb64sse     1000000
   606.65    1333336 133.33%   411.73 tb64auto     1000000
  1527.66    1000000 100.00%  1474.18 memcpy     1000000
size=10Mb (Mb=1.000.000)
   442.00   13333336 133.33%   430.60 tb64s    10000000
   798.73   13333336 133.33%   486.04 tb64x    10000000
   608.90   13333336 133.33%   411.39 tb64sse    10000000
   603.82   13333336 133.33%   411.56 tb64auto    10000000
  1518.03   10000000 100.00%  1543.03 memcpy    10000000
size=20Mb (Mb=1.000.000)
   441.81   26666668 133.33%   409.75 tb64s    20000000
   765.84   26666668 133.33%   492.27 tb64x    20000000
   406.05   26666668 133.33%   375.99 tb64sse    20000000
   405.88   26666668 133.33%   376.18 tb64auto    20000000
  1508.51   20000000 100.00%  1497.61 memcpy    20000000

The __ARM_NEON macro appears to be defined, and "detected simd (id=38->'arm_neon')" is shown.

But why is tb64x faster than tb64sse, which is where the Neon vector code should be?

Why is the encoder faster, even when I compile with make NCHECK=1?

This is an Allwinner H616, and for my own Neon vector code I got a good speedup on it.

powturbo commented 2 years ago

Yes, depending on the CPU, the hardware and the instructions used, scalar code can be faster than SIMD. It is very difficult to benchmark most ARM boards because of throttling. Always use the options -I# and -J# with a large #, for example 15, 31 or 63. It will take a very long time, but the results are more accurate. Try using clang instead of gcc.

ilyakurdyukov commented 2 years ago

Seems like a GCC 9.3.0 issue; it compiles your Neon code very badly.

I tried Clang 11.0.1 which gives much better results on Allwinner H616:

$ ./tb64app 
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
   384.38       1336 133.60%   482.34 tb64s        1000
   713.74       1336 133.60%   559.83 tb64x        1000
  1395.00       1336 133.60%  1290.84 tb64sse        1000
  1387.20       1336 133.60%  1286.73 tb64auto        1000
  6807.11       1000 100.00%  6900.42 memcpy        1000
size=10Kb (Kb=1.000)
   393.05      13336 133.36%   489.76 tb64s       10000
   714.92      13336 133.36%   568.19 tb64x       10000
  1751.98      13336 133.36%  1510.76 tb64sse       10000
  1750.47      13336 133.36%  1510.31 tb64auto       10000
  7942.17      10000 100.00%  7940.57 memcpy       10000
size=100Kb (Kb=1.000)
   384.23     133336 133.34%   474.47 tb64s      100000
   664.04     133336 133.34%   545.02 tb64x      100000
  1765.05     133336 133.34%  1429.38 tb64sse      100000
  1765.10     133336 133.34%  1428.25 tb64auto      100000
  6542.89     100000 100.00%  6470.67 memcpy      100000
size=1Mb (Mb=1.000.000)
   381.20    1333336 133.33%   461.81 tb64s     1000000
   709.20    1333336 133.33%   460.76 tb64x     1000000
  1215.44    1333336 133.33%  1357.67 tb64sse     1000000
  1217.97    1333336 133.33%  1359.23 tb64auto     1000000
  1512.70    1000000 100.00%  1451.83 memcpy     1000000
size=10Mb (Mb=1.000.000)
   377.78   13333336 133.33%   457.07 tb64s    10000000
   707.43   13333336 133.33%   456.08 tb64x    10000000
  1211.69   13333336 133.33%  1352.01 tb64sse    10000000
  1209.93   13333336 133.33%  1351.42 tb64auto    10000000
  1519.48   10000000 100.00%  1544.09 memcpy    10000000
size=20Mb (Mb=1.000.000)
   370.07   26666668 133.33%   430.21 tb64s    20000000
   683.67   26666668 133.33%   411.12 tb64x    20000000
  1119.71   26666668 133.33%  1146.30 tb64sse    20000000
  1121.61   26666668 133.33%  1146.84 tb64auto    20000000
  1500.15   20000000 100.00%  1500.42 memcpy    20000000

Also results from Kunpeng-920 (with Clang 11.0.1), which is a top-end server CPU:

$ ./tb64app -I15 -J15
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
  1553.94       1336 133.60%  1832.45 tb64s        1000
  2413.25       1336 133.60%  2233.39 tb64x        1000
  4182.25       1336 133.60%  4419.85 tb64sse        1000
  4185.82       1336 133.60%  4408.30 tb64auto        1000
 20121.04       1000 100.00% 20596.66 memcpy        1000
size=10Kb (Kb=1.000)
  1687.68      13336 133.36%  1868.10 tb64s       10000
  2642.42      13336 133.36%  2265.32 tb64x       10000
  5284.68      13336 133.36%  4803.35 tb64sse       10000
  5280.65      13336 133.36%  4793.96 tb64auto       10000
 39486.98      10000 100.00% 39312.89 memcpy       10000
size=100Kb (Kb=1.000)
  1668.29     133336 133.34%  1824.14 tb64s      100000
  2613.15     133336 133.34%  2212.57 tb64x      100000
  5111.60     133336 133.34%  4680.91 tb64sse      100000
  5116.96     133336 133.34%  4685.42 tb64auto      100000
 24545.18     100000 100.00% 24653.34 memcpy      100000
size=1Mb (Mb=1.000.000)
  1572.49    1333336 133.33%  1785.52 tb64s     1000000
  2512.57    1333336 133.33%  2045.56 tb64x     1000000
  4582.90    1333336 133.33%  4561.65 tb64sse     1000000
  4707.62    1333336 133.33%  4599.14 tb64auto     1000000
 17789.63    1000000 100.00% 15538.28 memcpy     1000000
size=10Mb (Mb=1.000.000)
  1516.72   13333336 133.33%  1770.81 tb64s    10000000
  2363.12   13333336 133.33%  1966.43 tb64x    10000000
  4506.41   13333336 133.33%  4506.90 tb64sse    10000000
  4511.00   13333336 133.33%  4489.11 tb64auto    10000000
 14919.41   10000000 100.00% 15055.00 memcpy    10000000
size=20Mb (Mb=1.000.000)
  1424.68   26666668 133.33%  1427.27 tb64s    20000000
  2202.20   26666668 133.33%  1832.59 tb64x    20000000
  3624.79   26666668 133.33%  3529.17 tb64sse    20000000
  3632.39   26666668 133.33%  3565.70 tb64auto    20000000
  5938.97   20000000 100.00%  5709.97 memcpy    20000000

And a slightly newer GCC 10.3.1 on the Kunpeng-920:

$ ./tb64app -I15 -J15
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
  1270.10       1336 133.60%  1796.14 tb64s        1000
  2348.15       1336 133.60%  2289.20 tb64x        1000
  2858.96       1336 133.60%  2287.53 tb64sse        1000
  2874.36       1336 133.60%  2284.12 tb64auto        1000
 20129.93       1000 100.00% 20607.96 memcpy        1000
size=10Kb (Kb=1.000)
  1318.80      13336 133.36%  1820.97 tb64s       10000
  2546.63      13336 133.36%  2330.87 tb64x       10000
  3301.96      13336 133.36%  2344.52 tb64sse       10000
  3301.13      13336 133.36%  2344.38 tb64auto       10000
 39201.95      10000 100.00% 39153.98 memcpy       10000
size=100Kb (Kb=1.000)
  1311.00     133336 133.34%  1799.43 tb64s      100000
  2532.02     133336 133.34%  2276.98 tb64x      100000
  3193.05     133336 133.34%  2328.40 tb64sse      100000
  3192.59     133336 133.34%  2328.44 tb64auto      100000
 24467.35     100000 100.00% 24779.43 memcpy      100000
size=1Mb (Mb=1.000.000)
  1290.23    1333336 133.33%  1744.71 tb64s     1000000
  2353.24    1333336 133.33%  2263.10 tb64x     1000000
  2955.80    1333336 133.33%  2245.15 tb64sse     1000000
  2958.51    1333336 133.33%  2245.27 tb64auto     1000000
 17373.09    1000000 100.00% 16517.68 memcpy     1000000
size=10Mb (Mb=1.000.000)
  1222.32   13333336 133.33%  1716.83 tb64s    10000000
  2330.96   13333336 133.33%  2117.21 tb64x    10000000
  2250.78   13333336 133.33%  2124.52 tb64sse    10000000
  2250.93   13333336 133.33%  2125.33 tb64auto    10000000
 14434.34   10000000 100.00% 13271.02 memcpy    10000000
size=20Mb (Mb=1.000.000)
   613.56   26666668 133.33%  1465.64 tb64s    20000000
  2231.74   26666668 133.33%  1873.39 tb64x    20000000
  2505.94   26666668 133.33%  1798.04 tb64sse    20000000
  2338.66   26666668 133.33%  1874.06 tb64auto    20000000
  5413.85   20000000 100.00%  5046.94 memcpy    20000000

This just proves that if your Neon code is compiled with GCC, its performance is very low. I guess you need to warn about this somewhere, or you can try to figure out what is causing GCC to generate inefficient code and fix it.

I read the assembly listing from GCC, so... This can be the result of greedy unrolling. I know of situations where the compiler does not cope well when you inline too much code, even for Clang (I've seen this with Hexagon HVX code), so it's better not to write such code. Or, if you are counting on Clang, hide it under #ifdef __clang__ and give GCC simpler code. The problem is that the compiler creates a lot of variables and starts saving and popping them off the stack, which completely kills performance.

ilyakurdyukov commented 2 years ago

There is also a case you are overlooking: when the buffers are not aligned.

I just add one byte to the pointer I got from malloc (which is aligned to at least 8 bytes).
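
For reference, a sketch of that setup (hypothetical names; tb64_codec stands for whichever (in, inlen, out) function is being timed):

    #include <stdlib.h>
    #include <string.h>

    /* malloc returns at least 8-byte-aligned memory, so offsetting every
       pointer by one byte forces the codec onto misaligned input and output. */
    extern size_t tb64_codec(const unsigned char *in, size_t inlen, unsigned char *out);

    size_t run_unaligned(const unsigned char *src, size_t n, size_t outcap) {
        unsigned char *in  = malloc(n + 1);
        unsigned char *out = malloc(outcap + 1);
        memcpy(in + 1, src, n);                       /* misaligned copy of input */
        size_t r = tb64_codec(in + 1, n, out + 1);    /* both pointers off by 1   */
        free(in);
        free(out);
        return r;
    }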

So TB64 Neon code compiled with Clang gives this (Allwinner H616):

memcpy: 80.783ms (1237.88 MB/s)
encode: 87.864ms (1138.12 MB/s)
decode: 107.139ms (933.37 MB/s)
encode_unaligned: 94.242ms (1061.10 MB/s)
decode_unaligned: 119.974ms (833.51 MB/s)

And my results:

memcpy: 80.611ms (1240.53 MB/s)
encode: 220.581ms (453.35 MB/s)
decode: 103.892ms (962.54 MB/s)
encode_unaligned: 229.204ms (436.29 MB/s)
decode_unaligned: 107.047ms (934.17 MB/s)

Though on Kunpeng-920, for TB64:

memcpy: 13.578ms (7364.85 MB/s)
encode: 28.358ms (3526.34 MB/s)
decode: 42.583ms (2348.35 MB/s)
encode_unaligned: 33.757ms (2962.35 MB/s)
decode_unaligned: 39.603ms (2525.06 MB/s)

And my results:

memcpy: 13.611ms (7347.00 MB/s)
encode: 109.945ms (909.55 MB/s)
decode: 37.836ms (2642.99 MB/s)
encode_unaligned: 106.490ms (939.06 MB/s)
decode_unaligned: 45.155ms (2214.59 MB/s)

That's for encoding/decoding about 100 MB (your test defaults don't cover that either).

(I don't care much about the encoder performance, I don't need it for my task.)

powturbo commented 2 years ago

Very interesting findings. I've tested tb64 with clang 6.0 and gcc 7.0 on ARM. As you can see, I'm publishing only the clang results, because gcc was considerably slower for the NEON functions. On Intel CPUs the gcc-generated code was always faster. You can try changing the unrolling ND define from 256 to 128 and maybe disable the checking. I'll consider your valuable suggestions in the next update. Thank you!

ilyakurdyukov commented 2 years ago

You can try changing the unrolling ND define from 256 to 128 and maybe disable the checking.

Tried it already; it doesn't help. Checking is disabled. It doesn't even help if I turn off the unrolled ND loop completely and leave only the loop that processes 64 bytes at a time.

But GCC is slow for this code specifically, not for any Neon code: GCC gives 5% better results than Clang for my own Neon code.

ilyakurdyukov commented 2 years ago

This is from the GCC assembly listing, from a loop that does 64 bytes at a time. Notice these lines with x2:

"add x2, sp, 96" "st1 {v0.16b - v3.16b}, [x2]" "ld1 {v9.16b - v12.16b}, [x2]"

These are stores to the stack and loads from the stack, which shouldn't be there if the code is to run fast.

.L3:
    adrp    x2, .LC0
    add x2, x2, :lo12:.LC0
    ld4 {v0.16b - v3.16b}, [x3], 64
    ld1 {v8.16b - v11.16b}, [x2]
    add x2, sp, 96
    mov v30.16b, v22.16b
    cmp x3, x0
    mov v28.16b, v20.16b
    mov v31.16b, v11.16b
    st1 {v0.16b - v3.16b}, [x2]
    adrp    x2, .LC1
    mov v29.16b, v21.16b
    add x2, x2, :lo12:.LC1
    mov v19.16b, v11.16b
    mov v16.16b, v24.16b
    ld1 {v1.16b - v3.16b}, [x2]
    add x2, sp, 160
    mov v17.16b, v25.16b
    mov v18.16b, v26.16b
    mov v9.16b, v21.16b
    st1 {v1.16b - v3.16b}, [x2]
    adrp    x2, .LC0
    mov v10.16b, v22.16b
    add x2, x2, :lo12:.LC0
    mov v11.16b, v23.16b
    mov v8.16b, v20.16b
    ld1 {v0.16b - v3.16b}, [x2]
    add x2, sp, 336
    mov v0.16b, v24.16b
    mov v1.16b, v25.16b
    mov v2.16b, v26.16b
    st1 {v28.16b - v31.16b}, [x2]
    add x2, sp, 272
    movi    v28.16b, 0x40
    str q23, [sp, 384]
    ldr q29, [sp, 96]
    st1 {v16.16b - v19.16b}, [x2]
    add x2, sp, 96
    movi    v14.16b, 0x40
    str q27, [sp, 320]
    eor v28.16b, v28.16b, v29.16b
    mov v4.16b, v24.16b
    mov v5.16b, v25.16b
    mov v6.16b, v26.16b
    tbl v8.16b, {v8.16b - v11.16b}, v28.16b
    ld1 {v9.16b - v12.16b}, [x2]
    add x2, sp, 208
    mov v7.16b, v27.16b
    mov v28.16b, v20.16b
    mov v29.16b, v21.16b
    st1 {v0.16b - v3.16b}, [x2]
    add x2, sp, 336
    tbx v8.16b, {v4.16b - v7.16b}, v9.16b
    str q27, [sp, 256]
    eor v4.16b, v14.16b, v10.16b
    ld1 {v0.16b - v3.16b}, [x2]
    add x2, sp, 272
    shl v8.16b, v8.16b, 2
    mov v31.16b, v23.16b
    tbl v1.16b, {v0.16b - v3.16b}, v4.16b
    ld1 {v10.16b - v13.16b}, [x2]
    add x2, sp, 96
    ldr q0, [sp, 112]
    mov v16.16b, v20.16b
    mov v17.16b, v21.16b
    tbx v1.16b, {v10.16b - v13.16b}, v0.16b
    ld1 {v9.16b - v12.16b}, [x2]
    add x2, sp, 160
    mov v18.16b, v22.16b
    ushr    v2.16b, v1.16b, 4
    eor v0.16b, v14.16b, v11.16b
    mov v10.16b, v11.16b
    mov v19.16b, v23.16b
    orr v2.16b, v2.16b, v8.16b
    tbl v0.16b, {v28.16b - v31.16b}, v0.16b
    mov v11.16b, v12.16b
    mov v4.16b, v24.16b
    shl v1.16b, v1.16b, 4
    str q2, [sp, 160]
    ld1 {v28.16b - v30.16b}, [x2]
    add x2, sp, 208
    eor v2.16b, v14.16b, v12.16b
    ld1 {v12.16b - v15.16b}, [x2]
    tbl v16.16b, {v16.16b - v19.16b}, v2.16b
    tbx v0.16b, {v12.16b - v15.16b}, v10.16b
    tbx v16.16b, {v4.16b - v7.16b}, v11.16b
    ushr    v2.16b, v0.16b, 2
    shl v0.16b, v0.16b, 6
    orr v29.16b, v2.16b, v1.16b
    orr v30.16b, v0.16b, v16.16b
    st3 {v28.16b - v30.16b}, [x4], 48
    bne .L3

ilyakurdyukov commented 2 years ago

The insane part of this is that GCC loads the LUT array on every iteration of the loop, and then immediately saves it to the stack!

And in the unrolled loop, the lut array is loaded several times.

powturbo commented 2 years ago

It seems there is no way to convince gcc not to use the st1/ld1 instructions. I've tested this, and gcc can generate code similar to clang's only with the upcoming gcc 12 (tested "gcc trunk" on godbolt). It's also better to replace the #ifndef vld1q_u8_x4 with:

    #if defined(__GNUC__) && !defined(__clang__) && \
      ((__GNUC__ == 10 && (__GNUC_MINOR__ <= 1)) || \
       (__GNUC__ ==  9 && (__GNUC_MINOR__ <= 3)) || \
       (__GNUC__ ==  8 && (__GNUC_MINOR__ <= 4)) || __GNUC__ <= 7)
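
For context, a minimal sketch of the kind of fallback such an #ifndef guard typically wraps when the toolchain lacks the vld1q_u8_x4 intrinsic (illustrative; the repo's actual helper in turbob64sse.c may differ in detail):

    #include <arm_neon.h>

    /* Build the 4-register structure from four single vld1q_u8 loads. */
    static inline uint8x16x4_t vld1q_u8_x4_fallback(const uint8_t *p) {
        uint8x16x4_t v;
        v.val[0] = vld1q_u8(p);
        v.val[1] = vld1q_u8(p + 16);
        v.val[2] = vld1q_u8(p + 32);
        v.val[3] = vld1q_u8(p + 48);
        return v;
    }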

ilyakurdyukov commented 2 years ago

It's also better to replace the #ifndef vld1q_u8_x4

Why not just read like this *(uint8x16x4_t*)(lut) ?

It seems there is no way to convince gcc not to use the st1/ld1 instructions.

I also tried different approaches, but nothing works. I suggest you create a bug report on the GCC Bugzilla.

powturbo commented 2 years ago

Why not just read like this *(uint8x16x4_t*)(lut) ?

This is not the same; this generates 2 instructions, while vld1q_u8_x4 generates only 1. In general it is better to use the SIMD intrinsic instead of pointer variables.

I suggest you create a bug report on the GCC Bugzilla.

As I said before, gcc will generate the perfect code in the next release. If you are familiar with inline assembly, you can insert the code from "gcc trunk" on godbolt into the C code.

ilyakurdyukov commented 2 years ago

Try this patch:

diff --git a/turbob64sse.c b/turbob64sse.c
index 551da41..147ba9f 100644
--- a/turbob64sse.c
+++ b/turbob64sse.c
@@ -86,11 +86,29 @@ static inline uint8x16x4_t vld1q_u8_x4(const uint8_t *lut) {
 }
   #endif

+#ifdef __GNUC__
+#define B64D_GCC_FIX
+#endif
+
+#ifdef B64D_GCC_FIX
+#define B64D_TBL(x) { \
+    uint8x16_t t = veorq_u8(x, cv40); \
+    asm ("tbl %0.16b, {v4.16b - v7.16b}, %0.16b" : "+w"(t) : \
+   "w"(v4), "w"(v5), "w"(v6), "w"(v7)); \
+    asm ("tbx %0.16b, {v0.16b - v3.16b}, %1.16b" : "+w"(t) : \
+   "w"(x), "w"(v0), "w"(v1), "w"(v2), "w"(v3)); \
+    x = t; \
+}
+#else
+#define B64D_TBL(x) \
+    x = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(x, cv40)), vlut0, x);
+#endif
+
 #define B64D(iv, ov) {\
-    iv.val[0] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[0], cv40)), vlut0, iv.val[0]);\
-    iv.val[1] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[1], cv40)), vlut0, iv.val[1]);\
-    iv.val[2] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[2], cv40)), vlut0, iv.val[2]);\
-    iv.val[3] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[3], cv40)), vlut0, iv.val[3]);\
+    B64D_TBL(iv.val[0]);\
+    B64D_TBL(iv.val[1]);\
+    B64D_TBL(iv.val[2]);\
+    B64D_TBL(iv.val[3]);\
 \
    ov.val[0] = vorrq_u8(vshlq_n_u8(iv.val[0], 2), vshrq_n_u8(iv.val[1], 4));\
    ov.val[1] = vorrq_u8(vshlq_n_u8(iv.val[1], 4), vshrq_n_u8(iv.val[2], 2));\
@@ -106,6 +124,18 @@ size_t tb64ssedec(const unsigned char *in, size_t inlen, unsigned char *out) {
                      vlut1 = vld1q_u8_x4(&lut[64]);
   const uint8x16_t    cv40 = vdupq_n_u8(0x40);
         uint8x16_t      xv = vdupq_n_u8(0);
+
+#ifdef B64D_GCC_FIX
+  register uint8x16_t v0 asm("v0") = vlut0.val[0];
+  register uint8x16_t v1 asm("v1") = vlut0.val[1];
+  register uint8x16_t v2 asm("v2") = vlut0.val[2];
+  register uint8x16_t v3 asm("v3") = vlut0.val[3];
+  register uint8x16_t v4 asm("v4") = vlut1.val[0];
+  register uint8x16_t v5 asm("v5") = vlut1.val[1];
+  register uint8x16_t v6 asm("v6") = vlut1.val[2];
+  register uint8x16_t v7 asm("v7") = vlut1.val[3];
+#endif
+
   #define ND 256
   for(ip = in, op = out; ip != in+(inlen&~(ND-1)); ip += ND, op += (ND/4)*3) { PREFETCH(ip,256,0); 
     uint8x16x4_t iv0 = vld4q_u8(ip),

Partially fixes the problem with GCC (results from the Allwinner H616; the encoder numbers are still without the fix, since I haven't patched the encoder):

$ ./tb64app 
detected simd (id=38->'arm_neon')

  E MB/s    size     ratio%   D MB/s   function
size=1Kb (Kb=1.000)
   443.63       1336 133.60%   502.15 tb64s        1000
   788.13       1336 133.60%   758.41 tb64x        1000
   611.84       1336 133.60%   966.15 tb64sse        1000
   609.68       1336 133.60%   962.11 tb64auto        1000
  6807.23       1000 100.00%  6900.32 memcpy        1000
size=10Kb (Kb=1.000)
   447.53      13336 133.36%   506.22 tb64s       10000
   770.65      13336 133.36%   785.93 tb64x       10000
   658.11      13336 133.36%  1048.94 tb64sse       10000
   657.84      13336 133.36%  1047.68 tb64auto       10000
  7805.69      10000 100.00%  7940.31 memcpy       10000
size=100Kb (Kb=1.000)
   438.02     133336 133.34%   492.89 tb64s      100000
   744.99     133336 133.34%   741.89 tb64x      100000
   659.72     133336 133.34%  1006.12 tb64sse      100000
   659.53     133336 133.34%  1006.12 tb64auto      100000
  6537.56     100000 100.00%  6506.95 memcpy      100000
size=1Mb (Mb=1.000.000)
   443.27    1333336 133.33%   433.48 tb64s     1000000
   803.59    1333336 133.33%   487.48 tb64x     1000000
   609.31    1333336 133.33%  1018.82 tb64sse     1000000
   604.48    1333336 133.33%  1018.57 tb64auto     1000000
  1518.18    1000000 100.00%  1477.22 memcpy     1000000
size=10Mb (Mb=1.000.000)
   442.54   13333336 133.33%   431.26 tb64s    10000000
   800.12   13333336 133.33%   485.46 tb64x    10000000
   600.23   13333336 133.33%  1015.55 tb64sse    10000000
   601.00   13333336 133.33%  1015.56 tb64auto    10000000
  1517.18   10000000 100.00%  1539.37 memcpy    10000000
size=20Mb (Mb=1.000.000)
   441.87   26666668 133.33%   409.92 tb64s    20000000
   765.68   26666668 133.33%   480.25 tb64x    20000000
   406.66   26666668 133.33%   870.99 tb64sse    20000000
   406.58   26666668 133.33%   871.35 tb64auto    20000000
  1511.86   20000000 100.00%  1497.83 memcpy    20000000

powturbo commented 2 years ago

This is very impressive. I'll try it in the next few days and compare it to clang-10 on my ODROID N2. The numbers will be interesting on the Apple M1 or on your server CPU, for comparison with x86_64. The ARM instructions are more powerful for base64 than Intel/AMD's (except maybe AVX512).

Better use:

#if defined(__GNUC__) && !defined(__clang__)

to activate the inline assembly only for gcc, because __GNUC__ is also defined by clang.
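
Applied to the patch above, the guard could look like this (a sketch of the suggestion, reusing the patch's B64D_GCC_FIX name):

    /* Only take the inline-assembly path under real GCC: clang defines
       __GNUC__ too, but already generates good code from the intrinsics. */
    #if defined(__GNUC__) && !defined(__clang__)
    #define B64D_GCC_FIX
    #endif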

ilyakurdyukov commented 2 years ago

As you can see, this patch saves code compiled with older versions of GCC from serious degradation, but Clang is still faster. (And the patch needs to be extended to the encoder.) You also need to know which versions of GCC don't need this patch.

GCC 9.3.0:

size=10Kb (Kb=1.000)
   457.21      13336 133.36%   511.54 tb64s       10000
   830.79      13336 133.36%   795.15 tb64x       10000
   662.40      13336 133.36%   475.43 tb64sse       10000
   662.19      13336 133.36%   475.12 tb64auto       10000
  7936.60      10000 100.00%  7937.80 memcpy       10000

GCC 9.3.0 (with patch):

size=10Kb (Kb=1.000)
   447.53      13336 133.36%   506.22 tb64s       10000
   770.65      13336 133.36%   785.93 tb64x       10000
   658.11      13336 133.36%  1048.94 tb64sse       10000
   657.84      13336 133.36%  1047.68 tb64auto       10000
  7805.69      10000 100.00%  7940.31 memcpy       10000

Clang 11.0.1:

size=10Kb (Kb=1.000)
   393.05      13336 133.36%   489.76 tb64s       10000
   714.92      13336 133.36%   568.19 tb64x       10000
  1751.98      13336 133.36%  1510.76 tb64sse       10000
  1750.47      13336 133.36%  1510.31 tb64auto       10000
  7942.17      10000 100.00%  7940.57 memcpy       10000

ilyakurdyukov commented 2 years ago

I beat TB64 in decoding performance on Kunpeng-920 (Aarch64): TB64 is compiled with Clang, my code with GCC.

100MB run:

memcpy: 13.572ms (7368.11 MB/s)
tb64___encode: 27.128ms (3686.23 MB/s)
crzy64_encode: 41.365ms (2417.50 MB/s)
tb64___decode: 42.446ms (2355.93 MB/s)
crzy64_decode: 29.313ms (3411.46 MB/s)
tb64___encode_unaligned: 33.976ms (2943.25 MB/s)
crzy64_encode_unaligned: 46.765ms (2138.35 MB/s)
tb64___decode_unaligned: 40.160ms (2490.04 MB/s)
crzy64_decode_unaligned: 38.098ms (2624.81 MB/s)

block repeat (cached):

memcpy (1): 190.51 MB/s
tb64___encode (1): 179.50 MB/s
crzy64_encode (1): 152.88 MB/s
tb64___decode (1): 190.55 MB/s
crzy64_decode (1): 225.23 MB/s

memcpy (2): 381.10 MB/s
tb64___encode (2): 309.60 MB/s
crzy64_encode (2): 305.86 MB/s
tb64___decode (2): 380.95 MB/s
crzy64_decode (2): 430.85 MB/s

memcpy (4): 825.76 MB/s
tb64___encode (4): 566.09 MB/s
crzy64_encode (4): 347.64 MB/s
tb64___decode (4): 762.05 MB/s
crzy64_decode (4): 808.90 MB/s

memcpy (8): 1801.80 MB/s
tb64___encode (8): 620.83 MB/s
crzy64_encode (8): 628.58 MB/s
tb64___decode (8): 1524.39 MB/s
crzy64_decode (8): 1085.78 MB/s

memcpy (16): 3963.34 MB/s
tb64___encode (16): 1213.59 MB/s
crzy64_encode (16): 740.40 MB/s
tb64___decode (16): 3048.20 MB/s
crzy64_decode (16): 2186.99 MB/s

memcpy (32): 7926.68 MB/s
tb64___encode (32): 1485.47 MB/s
crzy64_encode (32): 1219.98 MB/s
tb64___decode (32): 6096.40 MB/s
crzy64_decode (32): 2642.01 MB/s

memcpy (64): 14411.17 MB/s
tb64___encode (64): 3705.63 MB/s
crzy64_encode (64): 1964.40 MB/s
tb64___decode (64): 2789.64 MB/s
crzy64_decode (64): 3708.43 MB/s

memcpy (128): 28826.75 MB/s
tb64___encode (128): 3262.46 MB/s
crzy64_encode (128): 2171.37 MB/s
tb64___decode (128): 3483.60 MB/s
crzy64_decode (128): 3889.97 MB/s

memcpy (256): 32520.33 MB/s
tb64___encode (256): 4497.54 MB/s
crzy64_encode (256): 2673.23 MB/s
tb64___decode (256): 3269.72 MB/s
crzy64_decode (256): 4159.41 MB/s

memcpy (512): 36258.16 MB/s
tb64___encode (512): 4436.95 MB/s
crzy64_encode (512): 2788.97 MB/s
tb64___decode (512): 3472.34 MB/s
crzy64_decode (512): 4328.01 MB/s

memcpy (1K): 37721.61 MB/s
tb64___encode (1K): 4936.59 MB/s
crzy64_encode (1K): 2938.53 MB/s
tb64___decode (1K): 3417.48 MB/s
crzy64_decode (1K): 4357.26 MB/s

memcpy (2K): 37993.92 MB/s
tb64___encode (2K): 4912.61 MB/s
crzy64_encode (2K): 3002.59 MB/s
tb64___decode (2K): 3455.91 MB/s
crzy64_decode (2K): 4440.48 MB/s

memcpy (4K): 38240.92 MB/s
tb64___encode (4K): 5039.03 MB/s
crzy64_encode (4K): 3010.85 MB/s
tb64___decode (4K): 3445.12 MB/s
crzy64_decode (4K): 4417.05 MB/s

memcpy (8K): 38402.46 MB/s
tb64___encode (8K): 5008.12 MB/s
crzy64_encode (8K): 3002.86 MB/s
tb64___decode (8K): 3462.33 MB/s
crzy64_decode (8K): 4546.75 MB/s

memcpy (16K): 37993.92 MB/s
tb64___encode (16K): 5040.81 MB/s
crzy64_encode (16K): 3039.96 MB/s
tb64___decode (16K): 3458.11 MB/s
crzy64_decode (16K): 4496.31 MB/s

memcpy (32K): 37864.45 MB/s
tb64___encode (32K): 5015.74 MB/s
crzy64_encode (32K): 3033.68 MB/s
tb64___decode (32K): 3431.81 MB/s
crzy64_decode (32K): 4368.31 MB/s

memcpy (64K): 29265.44 MB/s
tb64___encode (64K): 5020.72 MB/s
crzy64_encode (64K): 3033.56 MB/s
tb64___decode (64K): 3330.25 MB/s
crzy64_decode (64K): 4368.85 MB/s

memcpy (128K): 22727.27 MB/s
tb64___encode (128K): 4872.19 MB/s
crzy64_encode (128K): 3017.10 MB/s
tb64___decode (128K): 3325.51 MB/s
crzy64_decode (128K): 4364.40 MB/s

memcpy (256K): 22527.60 MB/s
tb64___encode (256K): 4905.09 MB/s
crzy64_encode (256K): 3001.35 MB/s
tb64___decode (256K): 3337.12 MB/s
crzy64_decode (256K): 4311.09 MB/s

memcpy (512K): 16342.54 MB/s
tb64___encode (512K): 4847.04 MB/s
crzy64_encode (512K): 3002.05 MB/s
tb64___decode (512K): 3141.77 MB/s
crzy64_decode (512K): 4214.32 MB/s

memcpy (1M): 12637.43 MB/s
tb64___encode (1M): 4812.60 MB/s
crzy64_encode (1M): 3063.69 MB/s
tb64___decode (1M): 3072.82 MB/s
crzy64_decode (1M): 4179.15 MB/s

My code is also better when decoding medium sizes (32 .. 64K) using AVX2 on Skylake; both were compiled with GCC.


100MB run:

memcpy: 7.928ms (12613.05 MB/s)
tb64___encode: 13.503ms (7405.60 MB/s)
crzy64_encode: 14.053ms (7115.78 MB/s)
tb64___decode: 12.434ms (8042.75 MB/s)
crzy64_decode: 12.772ms (7829.57 MB/s)
tb64___encode_unaligned: 13.505ms (7404.56 MB/s)
crzy64_encode_unaligned: 14.047ms (7118.98 MB/s)
tb64___decode_unaligned: 12.876ms (7766.60 MB/s)
crzy64_decode_unaligned: 12.767ms (7832.91 MB/s)

block repeat (cached):

memcpy (1): 452.90 MB/s
tb64___encode (1): 201.23 MB/s
crzy64_encode (1): 161.82 MB/s
tb64___decode (1): 241.46 MB/s
crzy64_decode (1): 205.83 MB/s

memcpy (2): 724.64 MB/s
tb64___encode (2): 378.18 MB/s
crzy64_encode (2): 323.60 MB/s
tb64___decode (2): 426.13 MB/s
crzy64_decode (2): 376.60 MB/s

memcpy (4): 1610.29 MB/s
tb64___encode (4): 628.96 MB/s
crzy64_encode (4): 381.08 MB/s
tb64___decode (4): 689.79 MB/s
crzy64_decode (4): 720.59 MB/s

memcpy (8): 3220.59 MB/s
tb64___encode (8): 1024.53 MB/s
crzy64_encode (8): 789.95 MB/s
tb64___decode (8): 1025.38 MB/s
crzy64_decode (8): 983.56 MB/s

memcpy (16): 5797.09 MB/s
tb64___encode (16): 1588.46 MB/s
crzy64_encode (16): 915.46 MB/s
tb64___decode (16): 1556.15 MB/s
crzy64_decode (16): 1506.75 MB/s

memcpy (32): 13465.80 MB/s
tb64___encode (32): 3255.20 MB/s
crzy64_encode (32): 2408.02 MB/s
tb64___decode (32): 2393.46 MB/s
crzy64_decode (32): 3541.96 MB/s

memcpy (64): 33126.04 MB/s
tb64___encode (64): 4313.02 MB/s
crzy64_encode (64): 2848.97 MB/s
tb64___decode (64): 3715.40 MB/s
crzy64_decode (64): 6290.44 MB/s

memcpy (128): 23438.94 MB/s
tb64___encode (128): 7545.40 MB/s
crzy64_encode (128): 4900.30 MB/s
tb64___decode (128): 7231.12 MB/s
crzy64_decode (128): 9867.02 MB/s

memcpy (256): 65763.15 MB/s
tb64___encode (256): 8993.35 MB/s
crzy64_encode (256): 4985.46 MB/s
tb64___decode (256): 9760.69 MB/s
crzy64_decode (256): 15139.05 MB/s

memcpy (512): 71007.01 MB/s
tb64___encode (512): 11373.18 MB/s
crzy64_encode (512): 6671.36 MB/s
tb64___decode (512): 13468.98 MB/s
crzy64_decode (512): 18713.91 MB/s

memcpy (1K): 89399.94 MB/s
tb64___encode (1K): 12560.04 MB/s
crzy64_encode (1K): 7324.90 MB/s
tb64___decode (1K): 15516.87 MB/s
crzy64_decode (1K): 21664.02 MB/s

memcpy (2K): 104233.76 MB/s
tb64___encode (2K): 14416.69 MB/s
crzy64_encode (2K): 8108.54 MB/s
tb64___decode (2K): 17416.27 MB/s
crzy64_decode (2K): 23365.08 MB/s

memcpy (4K): 82908.21 MB/s
tb64___encode (4K): 15344.78 MB/s
crzy64_encode (4K): 8453.27 MB/s
tb64___decode (4K): 18175.85 MB/s
crzy64_decode (4K): 23836.40 MB/s

memcpy (8K): 64805.65 MB/s
tb64___encode (8K): 15612.13 MB/s
crzy64_encode (8K): 8926.90 MB/s
tb64___decode (8K): 18671.60 MB/s
crzy64_decode (8K): 24132.40 MB/s

memcpy (16K): 64115.15 MB/s
tb64___encode (16K): 15851.67 MB/s
crzy64_encode (16K): 8811.55 MB/s
tb64___decode (16K): 18446.79 MB/s
crzy64_decode (16K): 19460.10 MB/s

memcpy (32K): 42401.74 MB/s
tb64___encode (32K): 15906.48 MB/s
crzy64_encode (32K): 9010.06 MB/s
tb64___decode (32K): 18791.77 MB/s
crzy64_decode (32K): 19638.86 MB/s

memcpy (64K): 42752.67 MB/s
tb64___encode (64K): 16080.01 MB/s
crzy64_encode (64K): 9017.60 MB/s
tb64___decode (64K): 18720.88 MB/s
crzy64_decode (64K): 19115.18 MB/s

memcpy (128K): 40450.44 MB/s
tb64___encode (128K): 16097.22 MB/s
crzy64_encode (128K): 8869.10 MB/s
tb64___decode (128K): 18825.18 MB/s
crzy64_decode (128K): 16981.21 MB/s

memcpy (256K): 34023.79 MB/s
tb64___encode (256K): 15952.50 MB/s
crzy64_encode (256K): 8903.92 MB/s
tb64___decode (256K): 18855.88 MB/s
crzy64_decode (256K): 16417.96 MB/s

memcpy (512K): 31248.06 MB/s
tb64___encode (512K): 15867.42 MB/s
crzy64_encode (512K): 9036.64 MB/s
tb64___decode (512K): 18907.26 MB/s
crzy64_decode (512K): 16394.81 MB/s

memcpy (1M): 31193.80 MB/s
tb64___encode (1M): 15764.65 MB/s
crzy64_encode (1M): 8896.05 MB/s
tb64___decode (1M): 18756.73 MB/s
crzy64_decode (1M): 16308.18 MB/s

And although in my case it is a non-standard base64, I think there are cases where compliance with the standard does not matter. It's also without checks, but your code was also compiled without checks (NCHECK=1).

The SVG writer is still not ready because I was working on optimizing the code.