If you read the README carefully, you will also notice a benchmark with short strings. The benchmark is not trying to cover every scenario or the worst case, but to compare Turbo-Base64 to other base64 libraries. Each user can run the benchmark with their own data. The ClickHouse people have evaluated several base64 libraries and have chosen Turbo-Base64. I'm also using very large I/O buffers to avoid the cached scenario shown in other benchmarks.
The problem, again, is the comparison to memcpy, because the standards for testing memcpy are much higher than the cases your benchmark measures. Just search for memcpy comparison images. With more thorough testing you can find weaknesses, and I think your code can be improved to handle these cases. But I don't see that you care at all. You have already claimed victory, that you are the best of the best, and most people believed you. So because they already believe it and none of the contenders have better benchmarks, you are happy with it and don't make any improvements against possible weaknesses in your code.
I wouldn't criticize you if you weren't comparing to memcpy and claiming superiority, but you do that.
In contrast to most of the software on GitHub, as you can see from all my repositories, I'm still improving and updating the software. I also have ideas for improving Turbo-Base64. Yes, there is actually no other faster base64 library, and that's the point. I'm only interpreting the benchmark numbers, and I'm providing benchmark programs to directly compare Turbo-Base64 with other libraries, so everyone can run them. The comparison to memcpy is provided as a reference point and only for orientation; the use case of base64 is different, because memcpy is not doing base64. You rarely do base64 with very small lengths like 4 or 5 characters. You can't find other benchmarks comparing against Turbo-Base64; they simply ignore its existence.
You can't find other benchmarks comparing against Turbo-Base64.
I'll make my own to check these cases.
Your benchmark was recommended to me as a reference. And although your base64 SIMD core is very good, the benchmark is no good at all, because it only checks the cases that are beneficial to you, where your code gives excellent results. That's what I see.
And I'll say it again: I think that mis-prefetching data (that "s + 1024") is a bad thing.
Prefetching is not used in the short-string versions _tb64avx2enc/__tb64avx2dec. I'll check the other functions in the next update; it is possible that with modern hardware it's no longer needed at all.
Prefetching gives a slight speedup (I checked it).
I think you can do something like Prefetch(min(src + 1024, src_end - 1)); that shouldn't add extra delay, but it will not prefetch unneeded data. The compiler may use a cmov here.
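Roughly like this minimal sketch (using a plain __builtin_prefetch wrapper for illustration, not your actual PREFETCH macro; the helper name and the 1024-byte distance are just placeholders):

#include <stdint.h>

/* Clamped prefetch: the prefetch address never goes past the last input byte,
   so no data outside the buffer is pulled into the cache. */
static inline void prefetch_clamped(const uint8_t *src, const uint8_t *src_end) {
    const uint8_t *p = src + 1024;
    if (p > src_end - 1)          /* clamp; the compiler is free to emit a cmov here */
        p = src_end - 1;
    __builtin_prefetch(p, 0, 0);  /* read, low temporal locality */
}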
Also, I think the leftover bytes at the end:
size_t rc = _tb64xd(ip, inlen-(ip-in), op);
if(!rc || cx) return 0;
return (op-out)+rc;
can be read into a vector and processed without falling back to _tb64xd(), especially since you have _mm256_maskload_epi32 in AVX2. (The remaining 2-3 bytes are a bit of a problem, but it can be solved.)
And I want to try to implement it (but not for your project).
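Roughly along these lines (just a sketch of the idea, not code from either project; the helper name is made up):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Load the last `len` (< 32) input bytes into a vector without reading past
   the buffer: 32-bit lane i is loaded iff i < len / 4, other lanes read as 0.
   The final len % 4 bytes (the "2-3 bytes" problem) still need scalar handling. */
static inline __m256i b64_load_tail(const uint8_t *p, size_t len) {
    __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32((int)(len / 4)), idx);
    return _mm256_maskload_epi32((const int *)p, mask);
}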
If you provided results for small strings, then why is there no option for running the benchmark in this mode? I don't see this feature. (And that's what I asked for.)
But I have already written my own benchmark, which checks block sizes in powers of two from 1 to 1 << 20, and also in different modes (same block, consecutive blocks, or random order).
And I did beat your results at some sizes/modes with my base64 modification. I haven't used loop unrolling yet; it might give my code some boost if I do. And the idea of handling leftovers in a vector really helps on x86 CPUs for small inputs (it makes things worse on AArch64).
Congratulations. The short-string mode was included because of a request from the ClickHouse people. I've not spent much time on this. I also have ideas for improving the processing for several representative sizes. I'll do that in the next update when I find some spare time and can find a recent CPU for testing. I also have a not-yet-published Python interface.
So if you need to compete with someone to improve your code, I'll give it to you.
And the users of your code will benefit if you make it better. So this is not a useless competition.
I still want to experiment more with my code. And also I plan to add a .svg image writer to my benchmark so that results can be viewed in a more convenient form. I want it to look like those memcpy comparisons I talked about.
Or maybe I didn't win at that many sizes, because I may have mixed up the AVX2 TB64 string/regular functions.
However, when trying to test TB64 on AArch64, I found that its performance is really poor.
What is going on here?
$ ./tb64app
detected simd (id=38->'arm_neon')
E MB/s size ratio% D MB/s function
size=1Kb (Kb=1.000)
443.50 1336 133.60% 501.97 tb64s 1000
787.83 1336 133.60% 758.23 tb64x 1000
611.15 1336 133.60% 460.90 tb64sse 1000
609.07 1336 133.60% 459.97 tb64auto 1000
6803.48 1000 100.00% 6899.79 memcpy 1000
size=10Kb (Kb=1.000)
457.21 13336 133.36% 511.54 tb64s 10000
830.79 13336 133.36% 795.15 tb64x 10000
662.40 13336 133.36% 475.43 tb64sse 10000
662.19 13336 133.36% 475.12 tb64auto 10000
7936.60 10000 100.00% 7937.80 memcpy 10000
size=100Kb (Kb=1.000)
442.34 133336 133.34% 482.84 tb64s 100000
774.62 133336 133.34% 707.64 tb64x 100000
658.56 133336 133.34% 447.41 tb64sse 100000
658.55 133336 133.34% 447.37 tb64auto 100000
6512.73 100000 100.00% 6558.52 memcpy 100000
size=1Mb (Mb=1.000.000)
443.07 1333336 133.33% 433.75 tb64s 1000000
803.85 1333336 133.33% 488.02 tb64x 1000000
609.80 1333336 133.33% 412.00 tb64sse 1000000
606.65 1333336 133.33% 411.73 tb64auto 1000000
1527.66 1000000 100.00% 1474.18 memcpy 1000000
size=10Mb (Mb=1.000.000)
442.00 13333336 133.33% 430.60 tb64s 10000000
798.73 13333336 133.33% 486.04 tb64x 10000000
608.90 13333336 133.33% 411.39 tb64sse 10000000
603.82 13333336 133.33% 411.56 tb64auto 10000000
1518.03 10000000 100.00% 1543.03 memcpy 10000000
size=20Mb (Mb=1.000.000)
441.81 26666668 133.33% 409.75 tb64s 20000000
765.84 26666668 133.33% 492.27 tb64x 20000000
406.05 26666668 133.33% 375.99 tb64sse 20000000
405.88 26666668 133.33% 376.18 tb64auto 20000000
1508.51 20000000 100.00% 1497.61 memcpy 20000000
The __ARM_NEON macro appears to be defined, and "detected simd (id=38->'arm_neon')" is shown.
But how can tb64x be faster than tb64sse, which is where the NEON vector code should be?
And why is the encoder faster even if I compile with make NCHECK=1?
This is an Allwinner H616, and my own NEON vector code gets a good speedup on it.
Yes, depending on the CPU, the hardware and the instructions used, scalar code can be faster than SIMD. It is very difficult to benchmark most ARM boards because of throttling. Always use the options -I# and -J# with a large #, for example 15, 31 or 63. It will take a very long time, but the results are more accurate. Try using clang instead of gcc.
It seems to be a GCC 9.3.0 issue; it compiles your NEON code very badly.
I tried Clang 11.0.1, which gives much better results on the Allwinner H616:
$ ./tb64app
detected simd (id=38->'arm_neon')
E MB/s size ratio% D MB/s function
size=1Kb (Kb=1.000)
384.38 1336 133.60% 482.34 tb64s 1000
713.74 1336 133.60% 559.83 tb64x 1000
1395.00 1336 133.60% 1290.84 tb64sse 1000
1387.20 1336 133.60% 1286.73 tb64auto 1000
6807.11 1000 100.00% 6900.42 memcpy 1000
size=10Kb (Kb=1.000)
393.05 13336 133.36% 489.76 tb64s 10000
714.92 13336 133.36% 568.19 tb64x 10000
1751.98 13336 133.36% 1510.76 tb64sse 10000
1750.47 13336 133.36% 1510.31 tb64auto 10000
7942.17 10000 100.00% 7940.57 memcpy 10000
size=100Kb (Kb=1.000)
384.23 133336 133.34% 474.47 tb64s 100000
664.04 133336 133.34% 545.02 tb64x 100000
1765.05 133336 133.34% 1429.38 tb64sse 100000
1765.10 133336 133.34% 1428.25 tb64auto 100000
6542.89 100000 100.00% 6470.67 memcpy 100000
size=1Mb (Mb=1.000.000)
381.20 1333336 133.33% 461.81 tb64s 1000000
709.20 1333336 133.33% 460.76 tb64x 1000000
1215.44 1333336 133.33% 1357.67 tb64sse 1000000
1217.97 1333336 133.33% 1359.23 tb64auto 1000000
1512.70 1000000 100.00% 1451.83 memcpy 1000000
size=10Mb (Mb=1.000.000)
377.78 13333336 133.33% 457.07 tb64s 10000000
707.43 13333336 133.33% 456.08 tb64x 10000000
1211.69 13333336 133.33% 1352.01 tb64sse 10000000
1209.93 13333336 133.33% 1351.42 tb64auto 10000000
1519.48 10000000 100.00% 1544.09 memcpy 10000000
size=20Mb (Mb=1.000.000)
370.07 26666668 133.33% 430.21 tb64s 20000000
683.67 26666668 133.33% 411.12 tb64x 20000000
1119.71 26666668 133.33% 1146.30 tb64sse 20000000
1121.61 26666668 133.33% 1146.84 tb64auto 20000000
1500.15 20000000 100.00% 1500.42 memcpy 20000000
Also results from Kunpeng-920 (with Clang 11.0.1), which is a top-end server CPU:
$ ./tb64app -I15 -J15
detected simd (id=38->'arm_neon')
E MB/s size ratio% D MB/s function
size=1Kb (Kb=1.000)
1553.94 1336 133.60% 1832.45 tb64s 1000
2413.25 1336 133.60% 2233.39 tb64x 1000
4182.25 1336 133.60% 4419.85 tb64sse 1000
4185.82 1336 133.60% 4408.30 tb64auto 1000
20121.04 1000 100.00% 20596.66 memcpy 1000
size=10Kb (Kb=1.000)
1687.68 13336 133.36% 1868.10 tb64s 10000
2642.42 13336 133.36% 2265.32 tb64x 10000
5284.68 13336 133.36% 4803.35 tb64sse 10000
5280.65 13336 133.36% 4793.96 tb64auto 10000
39486.98 10000 100.00% 39312.89 memcpy 10000
size=100Kb (Kb=1.000)
1668.29 133336 133.34% 1824.14 tb64s 100000
2613.15 133336 133.34% 2212.57 tb64x 100000
5111.60 133336 133.34% 4680.91 tb64sse 100000
5116.96 133336 133.34% 4685.42 tb64auto 100000
24545.18 100000 100.00% 24653.34 memcpy 100000
size=1Mb (Mb=1.000.000)
1572.49 1333336 133.33% 1785.52 tb64s 1000000
2512.57 1333336 133.33% 2045.56 tb64x 1000000
4582.90 1333336 133.33% 4561.65 tb64sse 1000000
4707.62 1333336 133.33% 4599.14 tb64auto 1000000
17789.63 1000000 100.00% 15538.28 memcpy 1000000
size=10Mb (Mb=1.000.000)
1516.72 13333336 133.33% 1770.81 tb64s 10000000
2363.12 13333336 133.33% 1966.43 tb64x 10000000
4506.41 13333336 133.33% 4506.90 tb64sse 10000000
4511.00 13333336 133.33% 4489.11 tb64auto 10000000
14919.41 10000000 100.00% 15055.00 memcpy 10000000
size=20Mb (Mb=1.000.000)
1424.68 26666668 133.33% 1427.27 tb64s 20000000
2202.20 26666668 133.33% 1832.59 tb64x 20000000
3624.79 26666668 133.33% 3529.17 tb64sse 20000000
3632.39 26666668 133.33% 3565.70 tb64auto 20000000
5938.97 20000000 100.00% 5709.97 memcpy 20000000
And a slightly newer GCC 10.3.1 on the Kunpeng-920:
$ ./tb64app -I15 -J15
detected simd (id=38->'arm_neon')
E MB/s size ratio% D MB/s function
size=1Kb (Kb=1.000)
1270.10 1336 133.60% 1796.14 tb64s 1000
2348.15 1336 133.60% 2289.20 tb64x 1000
2858.96 1336 133.60% 2287.53 tb64sse 1000
2874.36 1336 133.60% 2284.12 tb64auto 1000
20129.93 1000 100.00% 20607.96 memcpy 1000
size=10Kb (Kb=1.000)
1318.80 13336 133.36% 1820.97 tb64s 10000
2546.63 13336 133.36% 2330.87 tb64x 10000
3301.96 13336 133.36% 2344.52 tb64sse 10000
3301.13 13336 133.36% 2344.38 tb64auto 10000
39201.95 10000 100.00% 39153.98 memcpy 10000
size=100Kb (Kb=1.000)
1311.00 133336 133.34% 1799.43 tb64s 100000
2532.02 133336 133.34% 2276.98 tb64x 100000
3193.05 133336 133.34% 2328.40 tb64sse 100000
3192.59 133336 133.34% 2328.44 tb64auto 100000
24467.35 100000 100.00% 24779.43 memcpy 100000
size=1Mb (Mb=1.000.000)
1290.23 1333336 133.33% 1744.71 tb64s 1000000
2353.24 1333336 133.33% 2263.10 tb64x 1000000
2955.80 1333336 133.33% 2245.15 tb64sse 1000000
2958.51 1333336 133.33% 2245.27 tb64auto 1000000
17373.09 1000000 100.00% 16517.68 memcpy 1000000
size=10Mb (Mb=1.000.000)
1222.32 13333336 133.33% 1716.83 tb64s 10000000
2330.96 13333336 133.33% 2117.21 tb64x 10000000
2250.78 13333336 133.33% 2124.52 tb64sse 10000000
2250.93 13333336 133.33% 2125.33 tb64auto 10000000
14434.34 10000000 100.00% 13271.02 memcpy 10000000
size=20Mb (Mb=1.000.000)
613.56 26666668 133.33% 1465.64 tb64s 20000000
2231.74 26666668 133.33% 1873.39 tb64x 20000000
2505.94 26666668 133.33% 1798.04 tb64sse 20000000
2338.66 26666668 133.33% 1874.06 tb64auto 20000000
5413.85 20000000 100.00% 5046.94 memcpy 20000000
Which just proves that if your NEON code is compiled with GCC, its performance is very low. I guess you need to warn about this somewhere, or you can try to figure out what causes GCC to generate inefficient code and fix it.
I read the assembly listing from GCC, so...
This can be the result of greedy unrolling. I know of situations where the compiler does not cope well when you inline too much code, even Clang (I've seen this with Hexagon HVX code), so it's better not to write such code. Or, if you are counting on Clang, hide that code under #ifdef __clang__ and give GCC simpler code. The problem is that the compiler creates a lot of variables and starts saving and restoring them on the stack, which completely kills performance.
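Just as a trivial illustration of the shape I mean (the decode_blocks_* helpers are hypothetical, not functions from the library):

#include <stddef.h>

static void decode_blocks(const unsigned char *ip, unsigned char *op, size_t n) {
#ifdef __clang__
    decode_blocks_unrolled(ip, op, n);  /* wide, heavily unrolled NEON path that Clang register-allocates well */
#else
    decode_blocks_simple(ip, op, n);    /* smaller loop that doesn't force GCC to spill vectors to the stack */
#endif
}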
There is also a case you are overlooking: when the buffers are not aligned.
I just add one byte to the pointer I get from malloc (which is aligned to at least 8 bytes).
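Roughly like this sketch (the encode function pointer stands for whichever base64 function is under test; everything here is illustrative, not code from my benchmark):

#include <stdlib.h>

/* Run one encode pass over deliberately misaligned buffers. malloc returns
   pointers aligned to at least 8 (usually 16) bytes, so adding one byte
   guarantees misaligned input and output. Input contents are left unspecified,
   which is fine for encoding. */
static size_t bench_unaligned(size_t insize, size_t outsize,
                              size_t (*encode)(const unsigned char *, size_t, unsigned char *)) {
    unsigned char *in  = malloc(insize  + 1);
    unsigned char *out = malloc(outsize + 1);
    size_t n = encode(in + 1, insize, out + 1);  /* unaligned source and destination */
    free(in); free(out);
    return n;
}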
So TB64 Neon code compiled with Clang gives this (Allwinner H616):
memcpy: 80.783ms (1237.88 MB/s)
encode: 87.864ms (1138.12 MB/s)
decode: 107.139ms (933.37 MB/s)
encode_unaligned: 94.242ms (1061.10 MB/s)
decode_unaligned: 119.974ms (833.51 MB/s)
And my results:
memcpy: 80.611ms (1240.53 MB/s)
encode: 220.581ms (453.35 MB/s)
decode: 103.892ms (962.54 MB/s)
encode_unaligned: 229.204ms (436.29 MB/s)
decode_unaligned: 107.047ms (934.17 MB/s)
Though on the Kunpeng-920, TB64 gives:
memcpy: 13.578ms (7364.85 MB/s)
encode: 28.358ms (3526.34 MB/s)
decode: 42.583ms (2348.35 MB/s)
encode_unaligned: 33.757ms (2962.35 MB/s)
decode_unaligned: 39.603ms (2525.06 MB/s)
And my results:
memcpy: 13.611ms (7347.00 MB/s)
encode: 109.945ms (909.55 MB/s)
decode: 37.836ms (2642.99 MB/s)
encode_unaligned: 106.490ms (939.06 MB/s)
decode_unaligned: 45.155ms (2214.59 MB/s)
That's for encoding/decoding about 100MB (your test defaults miss that too).
(I don't care much about the encoder performance, I don't need it for my task.)
Very interesting findings. I've tested tb64 with clang 6.0 and gcc 7.0 on ARM. As you can see, I'm publishing only the clang results, because gcc was considerably slower for the NEON functions. On Intel CPUs the gcc-generated code was always faster. You can try to change the unrolling ND define from 256 to 128 and maybe disable the checking. I'll consider your valuable suggestions in the next update. Thank you!
You can try to change the unrolling ND define from 256 to 128 and maybe disable the checking.
Tried it already; it doesn't help. Checking is disabled. It doesn't even help if I turn off the unrolled ND loop completely and leave only the loop that processes 64 bytes at a time.
But GCC is slow for this particular code, not for NEON code in general: GCC gives 5% better results than Clang for my own NEON code.
This is from GCC's assembly listing, from a loop that processes 64 bytes at a time. Notice these lines with x2:
"add x2, sp, 96" "st1 {v0.16b - v3.16b}, [x2]" "ld1 {v9.16b - v12.16b}, [x2]"
These store to the stack and load from the stack, which shouldn't be there if the code is to run fast.
.L3:
adrp x2, .LC0
add x2, x2, :lo12:.LC0
ld4 {v0.16b - v3.16b}, [x3], 64
ld1 {v8.16b - v11.16b}, [x2]
add x2, sp, 96
mov v30.16b, v22.16b
cmp x3, x0
mov v28.16b, v20.16b
mov v31.16b, v11.16b
st1 {v0.16b - v3.16b}, [x2]
adrp x2, .LC1
mov v29.16b, v21.16b
add x2, x2, :lo12:.LC1
mov v19.16b, v11.16b
mov v16.16b, v24.16b
ld1 {v1.16b - v3.16b}, [x2]
add x2, sp, 160
mov v17.16b, v25.16b
mov v18.16b, v26.16b
mov v9.16b, v21.16b
st1 {v1.16b - v3.16b}, [x2]
adrp x2, .LC0
mov v10.16b, v22.16b
add x2, x2, :lo12:.LC0
mov v11.16b, v23.16b
mov v8.16b, v20.16b
ld1 {v0.16b - v3.16b}, [x2]
add x2, sp, 336
mov v0.16b, v24.16b
mov v1.16b, v25.16b
mov v2.16b, v26.16b
st1 {v28.16b - v31.16b}, [x2]
add x2, sp, 272
movi v28.16b, 0x40
str q23, [sp, 384]
ldr q29, [sp, 96]
st1 {v16.16b - v19.16b}, [x2]
add x2, sp, 96
movi v14.16b, 0x40
str q27, [sp, 320]
eor v28.16b, v28.16b, v29.16b
mov v4.16b, v24.16b
mov v5.16b, v25.16b
mov v6.16b, v26.16b
tbl v8.16b, {v8.16b - v11.16b}, v28.16b
ld1 {v9.16b - v12.16b}, [x2]
add x2, sp, 208
mov v7.16b, v27.16b
mov v28.16b, v20.16b
mov v29.16b, v21.16b
st1 {v0.16b - v3.16b}, [x2]
add x2, sp, 336
tbx v8.16b, {v4.16b - v7.16b}, v9.16b
str q27, [sp, 256]
eor v4.16b, v14.16b, v10.16b
ld1 {v0.16b - v3.16b}, [x2]
add x2, sp, 272
shl v8.16b, v8.16b, 2
mov v31.16b, v23.16b
tbl v1.16b, {v0.16b - v3.16b}, v4.16b
ld1 {v10.16b - v13.16b}, [x2]
add x2, sp, 96
ldr q0, [sp, 112]
mov v16.16b, v20.16b
mov v17.16b, v21.16b
tbx v1.16b, {v10.16b - v13.16b}, v0.16b
ld1 {v9.16b - v12.16b}, [x2]
add x2, sp, 160
mov v18.16b, v22.16b
ushr v2.16b, v1.16b, 4
eor v0.16b, v14.16b, v11.16b
mov v10.16b, v11.16b
mov v19.16b, v23.16b
orr v2.16b, v2.16b, v8.16b
tbl v0.16b, {v28.16b - v31.16b}, v0.16b
mov v11.16b, v12.16b
mov v4.16b, v24.16b
shl v1.16b, v1.16b, 4
str q2, [sp, 160]
ld1 {v28.16b - v30.16b}, [x2]
add x2, sp, 208
eor v2.16b, v14.16b, v12.16b
ld1 {v12.16b - v15.16b}, [x2]
tbl v16.16b, {v16.16b - v19.16b}, v2.16b
tbx v0.16b, {v12.16b - v15.16b}, v10.16b
tbx v16.16b, {v4.16b - v7.16b}, v11.16b
ushr v2.16b, v0.16b, 2
shl v0.16b, v0.16b, 6
orr v29.16b, v2.16b, v1.16b
orr v30.16b, v0.16b, v16.16b
st3 {v28.16b - v30.16b}, [x4], 48
bne .L3
The insane part of this is that GCC loads the lut array on every iteration of the loop, and then immediately stores it to the stack!
And in the unrolled loop, the lut array is loaded several times.
It seems there is no way to convince gcc not to use the st1/ld1 instructions. I've tested this, and gcc generates code similar to clang only with the upcoming gcc 12 (tested "gcc trunk" on Godbolt). It's also better to replace the ifndef vld1q_u8_x4 with:
((__GNUC__ == 10 && (__GNUC_MINOR__ <= 1)) || \
(__GNUC__ == 9 && (__GNUC_MINOR__ <= 3)) || \
(__GNUC__ == 8 && (__GNUC_MINOR__ <= 4)) || __GNUC__ <= 7)
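That is, the fallback would then be guarded roughly like this (the body is only a sketch of the usual 4-load fallback, not necessarily the exact code in turbob64sse.c):

#include <arm_neon.h>

/* Older GCC versions lack the vld1q_u8_x4 intrinsic, so provide a fallback
   built from four vld1q_u8 loads only for those versions. */
#if defined(__GNUC__) && !defined(__clang__) && \
   ((__GNUC__ == 10 && (__GNUC_MINOR__ <= 1)) || \
    (__GNUC__ ==  9 && (__GNUC_MINOR__ <= 3)) || \
    (__GNUC__ ==  8 && (__GNUC_MINOR__ <= 4)) || __GNUC__ <= 7)
static inline uint8x16x4_t vld1q_u8_x4(const uint8_t *lut) {
  uint8x16x4_t v;
  v.val[0] = vld1q_u8(lut);
  v.val[1] = vld1q_u8(lut + 16);
  v.val[2] = vld1q_u8(lut + 32);
  v.val[3] = vld1q_u8(lut + 48);
  return v;
}
#endif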
It's also better to replace the ifndef vld1q_u8_x4
Why not just read it like this: *(uint8x16x4_t*)(lut)?
It seems there is no way to convince gcc not to use the st1/ld1 instructions.
I also tried different methods, but nothing works. I suggest you create a bug report on GCC Bugzilla.
Why not just read it like this: *(uint8x16x4_t*)(lut)?
This is not the same: the cast generates 2 instructions, while vld1q_u8_x4 generates only 1. In general it is better to use the SIMD intrinsic instead of going through pointer variables.
I suggest you create a bug report on GCC Bugzilla.
As I said before, gcc will generate perfect code in the next release. If you are familiar with inline assembly, you can insert the code from "gcc trunk" on Godbolt into the C code.
Try this patch:
diff --git a/turbob64sse.c b/turbob64sse.c
index 551da41..147ba9f 100644
--- a/turbob64sse.c
+++ b/turbob64sse.c
@@ -86,11 +86,29 @@ static inline uint8x16x4_t vld1q_u8_x4(const uint8_t *lut) {
}
#endif
+#ifdef __GNUC__
+#define B64D_GCC_FIX
+#endif
+
+#ifdef B64D_GCC_FIX
+#define B64D_TBL(x) { \
+ uint8x16_t t = veorq_u8(x, cv40); \
+ asm ("tbl %0.16b, {v4.16b - v7.16b}, %0.16b" : "+w"(t) : \
+ "w"(v4), "w"(v5), "w"(v6), "w"(v7)); \
+ asm ("tbx %0.16b, {v0.16b - v3.16b}, %1.16b" : "+w"(t) : \
+ "w"(x), "w"(v0), "w"(v1), "w"(v2), "w"(v3)); \
+ x = t; \
+}
+#else
+#define B64D_TBL(x) \
+ x = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(x, cv40)), vlut0, x);
+#endif
+
#define B64D(iv, ov) {\
- iv.val[0] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[0], cv40)), vlut0, iv.val[0]);\
- iv.val[1] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[1], cv40)), vlut0, iv.val[1]);\
- iv.val[2] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[2], cv40)), vlut0, iv.val[2]);\
- iv.val[3] = vqtbx4q_u8(vqtbl4q_u8(vlut1, veorq_u8(iv.val[3], cv40)), vlut0, iv.val[3]);\
+ B64D_TBL(iv.val[0]);\
+ B64D_TBL(iv.val[1]);\
+ B64D_TBL(iv.val[2]);\
+ B64D_TBL(iv.val[3]);\
\
ov.val[0] = vorrq_u8(vshlq_n_u8(iv.val[0], 2), vshrq_n_u8(iv.val[1], 4));\
ov.val[1] = vorrq_u8(vshlq_n_u8(iv.val[1], 4), vshrq_n_u8(iv.val[2], 2));\
@@ -106,6 +124,18 @@ size_t tb64ssedec(const unsigned char *in, size_t inlen, unsigned char *out) {
vlut1 = vld1q_u8_x4(&lut[64]);
const uint8x16_t cv40 = vdupq_n_u8(0x40);
uint8x16_t xv = vdupq_n_u8(0);
+
+#ifdef B64D_GCC_FIX
+ register uint8x16_t v0 asm("v0") = vlut0.val[0];
+ register uint8x16_t v1 asm("v1") = vlut0.val[1];
+ register uint8x16_t v2 asm("v2") = vlut0.val[2];
+ register uint8x16_t v3 asm("v3") = vlut0.val[3];
+ register uint8x16_t v4 asm("v4") = vlut1.val[0];
+ register uint8x16_t v5 asm("v5") = vlut1.val[1];
+ register uint8x16_t v6 asm("v6") = vlut1.val[2];
+ register uint8x16_t v7 asm("v7") = vlut1.val[3];
+#endif
+
#define ND 256
for(ip = in, op = out; ip != in+(inlen&~(ND-1)); ip += ND, op += (ND/4)*3) { PREFETCH(ip,256,0);
uint8x16x4_t iv0 = vld4q_u8(ip),
This partially fixes the problem with GCC (results from the Allwinner H616; the encoder numbers are still without the fix, because I haven't patched the encoder):
$ ./tb64app
detected simd (id=38->'arm_neon')
E MB/s size ratio% D MB/s function
size=1Kb (Kb=1.000)
443.63 1336 133.60% 502.15 tb64s 1000
788.13 1336 133.60% 758.41 tb64x 1000
611.84 1336 133.60% 966.15 tb64sse 1000
609.68 1336 133.60% 962.11 tb64auto 1000
6807.23 1000 100.00% 6900.32 memcpy 1000
size=10Kb (Kb=1.000)
447.53 13336 133.36% 506.22 tb64s 10000
770.65 13336 133.36% 785.93 tb64x 10000
658.11 13336 133.36% 1048.94 tb64sse 10000
657.84 13336 133.36% 1047.68 tb64auto 10000
7805.69 10000 100.00% 7940.31 memcpy 10000
size=100Kb (Kb=1.000)
438.02 133336 133.34% 492.89 tb64s 100000
744.99 133336 133.34% 741.89 tb64x 100000
659.72 133336 133.34% 1006.12 tb64sse 100000
659.53 133336 133.34% 1006.12 tb64auto 100000
6537.56 100000 100.00% 6506.95 memcpy 100000
size=1Mb (Mb=1.000.000)
443.27 1333336 133.33% 433.48 tb64s 1000000
803.59 1333336 133.33% 487.48 tb64x 1000000
609.31 1333336 133.33% 1018.82 tb64sse 1000000
604.48 1333336 133.33% 1018.57 tb64auto 1000000
1518.18 1000000 100.00% 1477.22 memcpy 1000000
size=10Mb (Mb=1.000.000)
442.54 13333336 133.33% 431.26 tb64s 10000000
800.12 13333336 133.33% 485.46 tb64x 10000000
600.23 13333336 133.33% 1015.55 tb64sse 10000000
601.00 13333336 133.33% 1015.56 tb64auto 10000000
1517.18 10000000 100.00% 1539.37 memcpy 10000000
size=20Mb (Mb=1.000.000)
441.87 26666668 133.33% 409.92 tb64s 20000000
765.68 26666668 133.33% 480.25 tb64x 20000000
406.66 26666668 133.33% 870.99 tb64sse 20000000
406.58 26666668 133.33% 871.35 tb64auto 20000000
1511.86 20000000 100.00% 1497.83 memcpy 20000000
This is very impressive. I'll try it in the coming days and compare it to clang-10 on my ODROID N2. The numbers will be interesting on an Apple M1 or on your server CPU for comparison with x86_64. The ARM instructions are more powerful for base64 than Intel/AMD (except maybe AVX512).
Better to use:
#if defined(__GNUC__) && !defined(__clang__)
to activate the inline assembly only for gcc, because __GNUC__ is also defined in clang.
As you can see, this patch saves code compiled with older versions of GCC from serious degradation, but Clang is still faster. (And this patch needs to be extended to the encoder.) And you need to know which versions of GCC don't need this patch.
GCC 9.3.0:
size=10Kb (Kb=1.000)
457.21 13336 133.36% 511.54 tb64s 10000
830.79 13336 133.36% 795.15 tb64x 10000
662.40 13336 133.36% 475.43 tb64sse 10000
662.19 13336 133.36% 475.12 tb64auto 10000
7936.60 10000 100.00% 7937.80 memcpy 10000
GCC 9.3.0 (with patch):
size=10Kb (Kb=1.000)
447.53 13336 133.36% 506.22 tb64s 10000
770.65 13336 133.36% 785.93 tb64x 10000
658.11 13336 133.36% 1048.94 tb64sse 10000
657.84 13336 133.36% 1047.68 tb64auto 10000
7805.69 10000 100.00% 7940.31 memcpy 10000
Clang 11.0.1:
size=10Kb (Kb=1.000)
393.05 13336 133.36% 489.76 tb64s 10000
714.92 13336 133.36% 568.19 tb64x 10000
1751.98 13336 133.36% 1510.76 tb64sse 10000
1750.47 13336 133.36% 1510.31 tb64auto 10000
7942.17 10000 100.00% 7940.57 memcpy 10000
I beat TB64 in decoding performance on the Kunpeng-920 (AArch64): TB64 is compiled with Clang, my code with GCC.
I think you can somehow improve the code by using fewer table lookups for NEON; at least that's my impression.
MB/s is measured against the raw (unencoded) data.
100MB run:
memcpy: 13.572ms (7368.11 MB/s)
tb64___encode: 27.128ms (3686.23 MB/s)
crzy64_encode: 41.365ms (2417.50 MB/s)
tb64___decode: 42.446ms (2355.93 MB/s)
crzy64_decode: 29.313ms (3411.46 MB/s)
tb64___encode_unaligned: 33.976ms (2943.25 MB/s)
crzy64_encode_unaligned: 46.765ms (2138.35 MB/s)
tb64___decode_unaligned: 40.160ms (2490.04 MB/s)
crzy64_decode_unaligned: 38.098ms (2624.81 MB/s)
block repeat (cached):
memcpy (1): 190.51 MB/s
tb64___encode (1): 179.50 MB/s
crzy64_encode (1): 152.88 MB/s
tb64___decode (1): 190.55 MB/s
crzy64_decode (1): 225.23 MB/s
memcpy (2): 381.10 MB/s
tb64___encode (2): 309.60 MB/s
crzy64_encode (2): 305.86 MB/s
tb64___decode (2): 380.95 MB/s
crzy64_decode (2): 430.85 MB/s
memcpy (4): 825.76 MB/s
tb64___encode (4): 566.09 MB/s
crzy64_encode (4): 347.64 MB/s
tb64___decode (4): 762.05 MB/s
crzy64_decode (4): 808.90 MB/s
memcpy (8): 1801.80 MB/s
tb64___encode (8): 620.83 MB/s
crzy64_encode (8): 628.58 MB/s
tb64___decode (8): 1524.39 MB/s
crzy64_decode (8): 1085.78 MB/s
memcpy (16): 3963.34 MB/s
tb64___encode (16): 1213.59 MB/s
crzy64_encode (16): 740.40 MB/s
tb64___decode (16): 3048.20 MB/s
crzy64_decode (16): 2186.99 MB/s
memcpy (32): 7926.68 MB/s
tb64___encode (32): 1485.47 MB/s
crzy64_encode (32): 1219.98 MB/s
tb64___decode (32): 6096.40 MB/s
crzy64_decode (32): 2642.01 MB/s
memcpy (64): 14411.17 MB/s
tb64___encode (64): 3705.63 MB/s
crzy64_encode (64): 1964.40 MB/s
tb64___decode (64): 2789.64 MB/s
crzy64_decode (64): 3708.43 MB/s
memcpy (128): 28826.75 MB/s
tb64___encode (128): 3262.46 MB/s
crzy64_encode (128): 2171.37 MB/s
tb64___decode (128): 3483.60 MB/s
crzy64_decode (128): 3889.97 MB/s
memcpy (256): 32520.33 MB/s
tb64___encode (256): 4497.54 MB/s
crzy64_encode (256): 2673.23 MB/s
tb64___decode (256): 3269.72 MB/s
crzy64_decode (256): 4159.41 MB/s
memcpy (512): 36258.16 MB/s
tb64___encode (512): 4436.95 MB/s
crzy64_encode (512): 2788.97 MB/s
tb64___decode (512): 3472.34 MB/s
crzy64_decode (512): 4328.01 MB/s
memcpy (1K): 37721.61 MB/s
tb64___encode (1K): 4936.59 MB/s
crzy64_encode (1K): 2938.53 MB/s
tb64___decode (1K): 3417.48 MB/s
crzy64_decode (1K): 4357.26 MB/s
memcpy (2K): 37993.92 MB/s
tb64___encode (2K): 4912.61 MB/s
crzy64_encode (2K): 3002.59 MB/s
tb64___decode (2K): 3455.91 MB/s
crzy64_decode (2K): 4440.48 MB/s
memcpy (4K): 38240.92 MB/s
tb64___encode (4K): 5039.03 MB/s
crzy64_encode (4K): 3010.85 MB/s
tb64___decode (4K): 3445.12 MB/s
crzy64_decode (4K): 4417.05 MB/s
memcpy (8K): 38402.46 MB/s
tb64___encode (8K): 5008.12 MB/s
crzy64_encode (8K): 3002.86 MB/s
tb64___decode (8K): 3462.33 MB/s
crzy64_decode (8K): 4546.75 MB/s
memcpy (16K): 37993.92 MB/s
tb64___encode (16K): 5040.81 MB/s
crzy64_encode (16K): 3039.96 MB/s
tb64___decode (16K): 3458.11 MB/s
crzy64_decode (16K): 4496.31 MB/s
memcpy (32K): 37864.45 MB/s
tb64___encode (32K): 5015.74 MB/s
crzy64_encode (32K): 3033.68 MB/s
tb64___decode (32K): 3431.81 MB/s
crzy64_decode (32K): 4368.31 MB/s
memcpy (64K): 29265.44 MB/s
tb64___encode (64K): 5020.72 MB/s
crzy64_encode (64K): 3033.56 MB/s
tb64___decode (64K): 3330.25 MB/s
crzy64_decode (64K): 4368.85 MB/s
memcpy (128K): 22727.27 MB/s
tb64___encode (128K): 4872.19 MB/s
crzy64_encode (128K): 3017.10 MB/s
tb64___decode (128K): 3325.51 MB/s
crzy64_decode (128K): 4364.40 MB/s
memcpy (256K): 22527.60 MB/s
tb64___encode (256K): 4905.09 MB/s
crzy64_encode (256K): 3001.35 MB/s
tb64___decode (256K): 3337.12 MB/s
crzy64_decode (256K): 4311.09 MB/s
memcpy (512K): 16342.54 MB/s
tb64___encode (512K): 4847.04 MB/s
crzy64_encode (512K): 3002.05 MB/s
tb64___decode (512K): 3141.77 MB/s
crzy64_decode (512K): 4214.32 MB/s
memcpy (1M): 12637.43 MB/s
tb64___encode (1M): 4812.60 MB/s
crzy64_encode (1M): 3063.69 MB/s
tb64___decode (1M): 3072.82 MB/s
crzy64_decode (1M): 4179.15 MB/s
My code is also better when decoding medium sizes (32 .. 64K) using AVX2 on Skylake; both are compiled with GCC.
100MB run:
memcpy: 7.928ms (12613.05 MB/s)
tb64___encode: 13.503ms (7405.60 MB/s)
crzy64_encode: 14.053ms (7115.78 MB/s)
tb64___decode: 12.434ms (8042.75 MB/s)
crzy64_decode: 12.772ms (7829.57 MB/s)
tb64___encode_unaligned: 13.505ms (7404.56 MB/s)
crzy64_encode_unaligned: 14.047ms (7118.98 MB/s)
tb64___decode_unaligned: 12.876ms (7766.60 MB/s)
crzy64_decode_unaligned: 12.767ms (7832.91 MB/s)
block repeat (cached):
memcpy (1): 452.90 MB/s
tb64___encode (1): 201.23 MB/s
crzy64_encode (1): 161.82 MB/s
tb64___decode (1): 241.46 MB/s
crzy64_decode (1): 205.83 MB/s
memcpy (2): 724.64 MB/s
tb64___encode (2): 378.18 MB/s
crzy64_encode (2): 323.60 MB/s
tb64___decode (2): 426.13 MB/s
crzy64_decode (2): 376.60 MB/s
memcpy (4): 1610.29 MB/s
tb64___encode (4): 628.96 MB/s
crzy64_encode (4): 381.08 MB/s
tb64___decode (4): 689.79 MB/s
crzy64_decode (4): 720.59 MB/s
memcpy (8): 3220.59 MB/s
tb64___encode (8): 1024.53 MB/s
crzy64_encode (8): 789.95 MB/s
tb64___decode (8): 1025.38 MB/s
crzy64_decode (8): 983.56 MB/s
memcpy (16): 5797.09 MB/s
tb64___encode (16): 1588.46 MB/s
crzy64_encode (16): 915.46 MB/s
tb64___decode (16): 1556.15 MB/s
crzy64_decode (16): 1506.75 MB/s
memcpy (32): 13465.80 MB/s
tb64___encode (32): 3255.20 MB/s
crzy64_encode (32): 2408.02 MB/s
tb64___decode (32): 2393.46 MB/s
crzy64_decode (32): 3541.96 MB/s
memcpy (64): 33126.04 MB/s
tb64___encode (64): 4313.02 MB/s
crzy64_encode (64): 2848.97 MB/s
tb64___decode (64): 3715.40 MB/s
crzy64_decode (64): 6290.44 MB/s
memcpy (128): 23438.94 MB/s
tb64___encode (128): 7545.40 MB/s
crzy64_encode (128): 4900.30 MB/s
tb64___decode (128): 7231.12 MB/s
crzy64_decode (128): 9867.02 MB/s
memcpy (256): 65763.15 MB/s
tb64___encode (256): 8993.35 MB/s
crzy64_encode (256): 4985.46 MB/s
tb64___decode (256): 9760.69 MB/s
crzy64_decode (256): 15139.05 MB/s
memcpy (512): 71007.01 MB/s
tb64___encode (512): 11373.18 MB/s
crzy64_encode (512): 6671.36 MB/s
tb64___decode (512): 13468.98 MB/s
crzy64_decode (512): 18713.91 MB/s
memcpy (1K): 89399.94 MB/s
tb64___encode (1K): 12560.04 MB/s
crzy64_encode (1K): 7324.90 MB/s
tb64___decode (1K): 15516.87 MB/s
crzy64_decode (1K): 21664.02 MB/s
memcpy (2K): 104233.76 MB/s
tb64___encode (2K): 14416.69 MB/s
crzy64_encode (2K): 8108.54 MB/s
tb64___decode (2K): 17416.27 MB/s
crzy64_decode (2K): 23365.08 MB/s
memcpy (4K): 82908.21 MB/s
tb64___encode (4K): 15344.78 MB/s
crzy64_encode (4K): 8453.27 MB/s
tb64___decode (4K): 18175.85 MB/s
crzy64_decode (4K): 23836.40 MB/s
memcpy (8K): 64805.65 MB/s
tb64___encode (8K): 15612.13 MB/s
crzy64_encode (8K): 8926.90 MB/s
tb64___decode (8K): 18671.60 MB/s
crzy64_decode (8K): 24132.40 MB/s
memcpy (16K): 64115.15 MB/s
tb64___encode (16K): 15851.67 MB/s
crzy64_encode (16K): 8811.55 MB/s
tb64___decode (16K): 18446.79 MB/s
crzy64_decode (16K): 19460.10 MB/s
memcpy (32K): 42401.74 MB/s
tb64___encode (32K): 15906.48 MB/s
crzy64_encode (32K): 9010.06 MB/s
tb64___decode (32K): 18791.77 MB/s
crzy64_decode (32K): 19638.86 MB/s
memcpy (64K): 42752.67 MB/s
tb64___encode (64K): 16080.01 MB/s
crzy64_encode (64K): 9017.60 MB/s
tb64___decode (64K): 18720.88 MB/s
crzy64_decode (64K): 19115.18 MB/s
memcpy (128K): 40450.44 MB/s
tb64___encode (128K): 16097.22 MB/s
crzy64_encode (128K): 8869.10 MB/s
tb64___decode (128K): 18825.18 MB/s
crzy64_decode (128K): 16981.21 MB/s
memcpy (256K): 34023.79 MB/s
tb64___encode (256K): 15952.50 MB/s
crzy64_encode (256K): 8903.92 MB/s
tb64___decode (256K): 18855.88 MB/s
crzy64_decode (256K): 16417.96 MB/s
memcpy (512K): 31248.06 MB/s
tb64___encode (512K): 15867.42 MB/s
crzy64_encode (512K): 9036.64 MB/s
tb64___decode (512K): 18907.26 MB/s
crzy64_decode (512K): 16394.81 MB/s
memcpy (1M): 31193.80 MB/s
tb64___encode (1M): 15764.65 MB/s
crzy64_encode (1M): 8896.05 MB/s
tb64___decode (1M): 18756.73 MB/s
crzy64_decode (1M): 16308.18 MB/s
And although in my case it is a non-standard base64, I think there are use cases for which compliance with the standard does not matter. And my numbers are without validity checks, but your code was also compiled without checks (NCHECK=1).
The SVG writer is still not ready because I was working on optimizing the code.
It's great that you can convert gigabytes of data per second, but only if all those gigabytes are in one long string!
This is a rare scenario. The usual scenario is processing a large set of small strings (and 1 KB isn't very small). Some people might think that processing a large set of strings takes the same amount of time as processing one long string, but the difference is huge. Try processing random sizes from, say, 1 to 64 bytes until N megabytes have been processed, as in the sketch below.
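A minimal sketch of what I mean (the helper and its parameters are hypothetical, not my actual benchmark code):

#include <stddef.h>
#include <stdlib.h>

/* Encode random-length inputs of 1..64 bytes, taken consecutively from a big
   source buffer, until `total` bytes of raw data have been processed.
   `src` must hold at least total + 64 bytes; `dst` is reused for each call. */
static size_t bench_small_strings(const unsigned char *src, unsigned char *dst, size_t total,
                                  size_t (*encode)(const unsigned char *, size_t, unsigned char *)) {
    size_t done = 0, out = 0;
    while (done < total) {
        size_t len = 1 + (size_t)(rand() % 64);   /* random length 1..64 */
        out  += encode(src + done, len, dst);
        done += len;
    }
    return out;  /* keep the result live so the loop isn't optimized away */
}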
Although prefetching speeds things up on big data, for small sizes it fills the cache with data the application might not need. And worst of all, if the prefetch lands on an unallocated page (which the next page might be), an exception can be raised, and although the OS handles and ignores it, this takes time.
So there should also be a test where the input addresses are not consecutive but randomized (so mis-prefetching can only make things worse).
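For example (again just a sketch with a made-up helper), the block offsets can be pre-shuffled and fed to the decoder in that order, so neither the hardware prefetcher nor a software prefetch gets any help from consecutive addresses:

#include <stddef.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of the block offsets used by the randomized-order run. */
static void shuffle_offsets(size_t *off, size_t n) {
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = off[i - 1]; off[i - 1] = off[j]; off[j] = t;
    }
}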
And the best thing would be a graph showing the cost of each size, starting at 1 and going up until the graph turns into a straight line, the way people show memcpy performance. Like this.