Open hg opened 2 years ago
I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on Neoverse-N1, so iperf3 speeds are comparable.
If anyone has an old Raspberry Pi and wants to test this on a slower CPU — be my guest. The code is in pretty bad shape since it's work in progress, but it does work.
Otherwise, I see no reason to merge NEON support.
32: 2846064.09 op/s
256: 1424432.86 op/s
512: 850478.15 op/s
1024: 473508.42 op/s
16384: 33032.01 op/s
131072: 4146.80 op/s
1048576: 517.55 op/s
[ 5] 55.00-56.00 sec 100 MBytes 839 Mbits/sec 0 658 KBytes
[ 5] 56.00-57.00 sec 101 MBytes 849 Mbits/sec 8 571 KBytes
[ 5] 57.00-58.00 sec 102 MBytes 860 Mbits/sec 2 484 KBytes
[ 5] 58.00-59.00 sec 101 MBytes 849 Mbits/sec 0 622 KBytes
[ 5] 59.00-60.00 sec 101 MBytes 849 Mbits/sec 1 532 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.67 GBytes 812 Mbits/sec 2875 sender
[ 5] 0.00-60.00 sec 5.67 GBytes 812 Mbits/sec receiver
32: 2933899.98 op/s (+3%)
256: 1943603.43 op/s (+36%)
512: 1242317.12 op/s (+46%)
1024: 724495.40 op/s (+53%)
16384: 53621.92 op/s (+62%)
131072: 6754.49 op/s (+62%)
1048576: 844.45 op/s (+63%)
[ 5] 55.00-56.00 sec 101 MBytes 849 Mbits/sec 1 536 KBytes
[ 5] 56.00-57.00 sec 100 MBytes 839 Mbits/sec 0 666 KBytes
[ 5] 57.00-58.00 sec 101 MBytes 849 Mbits/sec 3 584 KBytes
[ 5] 58.00-59.00 sec 100 MBytes 839 Mbits/sec 0 704 KBytes
[ 5] 59.00-60.00 sec 102 MBytes 860 Mbits/sec 2 626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 5.81 GBytes 832 Mbits/sec 1660 sender
[ 5] 0.00-60.00 sec 5.81 GBytes 832 Mbits/sec receiver
The benchmark code you wrote is not really measuring the usage pattern tinc has. The `sptps_speed` program (which you might have to explicitly build using `ninja -C build src/sptps_speed`) tests the SPTPS protocol more thoroughly, using typical MTU-sized packets.
Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with `-Dbuildtype=debugoptimized`?
I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.
Oh yes. If you're willing to reintroduce a dependency on libssl, that would of course be best. We get highly optimized code for all possible architectures for free.
TL;DR:
| flavor | throughput |
|---|---|
| generic | 0.974 Gbit/s |
| avx2 | 1.19 Gbit/s |
| libssl1.1 | 1.28 Gbit/s |
| libssl3.1 | 1.23 Gbit/s* |

*: probably a debug build
I'll leave it here for now until after #360 is merged.
With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in debug build (optimization makes perf output unreadable):
tincd is using the lowest common denominator implementation of this function. Let's add a couple of optimized ones based on compiler intrinsics.
All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.
Compatibility
x86 / amd64
We'll be shipping three versions of the function (or two, with old compilers without avx2 support):
The right one is picked at runtime depending on current CPU capabilities.
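A minimal sketch of how such runtime selection can look, assuming GCC or Clang on x86 (where `__builtin_cpu_supports` is available). The names `chacha_blocks_fn`, `chacha_blocks_generic`, `chacha_blocks_ssse3`, `chacha_blocks_avx2`, `chacha_resolve` and `chacha_blocks` are hypothetical and are not taken from this PR. The three implementations are stand-ins that only copy their input, so the example shows the dispatch and nothing about the cipher itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature; the real ChaCha functions in tinc may differ. */
typedef void (*chacha_blocks_fn)(uint8_t *out, const uint8_t *in, size_t len);

/* Stand-in for the portable C implementation: copies input to output. */
static void chacha_blocks_generic(uint8_t *out, const uint8_t *in, size_t len) {
	memcpy(out, in, len);
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#define CHACHA_X86_DISPATCH 1

/* Stand-ins for the intrinsics-based versions (assumption: same signature). */
static void chacha_blocks_ssse3(uint8_t *out, const uint8_t *in, size_t len) {
	memcpy(out, in, len);
}

static void chacha_blocks_avx2(uint8_t *out, const uint8_t *in, size_t len) {
	memcpy(out, in, len);
}
#endif

/* Return the widest implementation the running CPU supports. */
static chacha_blocks_fn chacha_resolve(void) {
#ifdef CHACHA_X86_DISPATCH
	__builtin_cpu_init();

	if(__builtin_cpu_supports("avx2")) {
		return chacha_blocks_avx2;
	}

	if(__builtin_cpu_supports("ssse3")) {
		return chacha_blocks_ssse3;
	}
#endif
	return chacha_blocks_generic;
}

/* Cached result of chacha_resolve(); NULL until the first call. */
static chacha_blocks_fn chacha_impl = NULL;

/* Public entry point: resolve once, then call through the stored pointer.
 * Note: this lazy initialization is not thread-safe as written. */
void chacha_blocks(uint8_t *out, const uint8_t *in, size_t len) {
	if(!chacha_impl) {
		chacha_impl = chacha_resolve();
	}

	assert(chacha_impl != NULL);
	chacha_impl(out, in, len);
}
```

Caching the resolved pointer keeps the CPU feature check out of the per-packet path. On other architectures the sketch falls back to the generic function.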
Other architectures
Only the old C implementation is used. ARM NEON could be added later.
Benchmarks
`performance` CPU governor, as few processes as possible, all the basic benchmarking stuff.

TL;DR: 20-22% increase in throughput.
bench_chacha.c
Percentage is relative to generic C implementation.

## C

```
32: 2790793.08 op/s
256: 1261587.31 op/s
512: 728262.56 op/s
1024: 390193.12 op/s
16384: 26361.08 op/s
131072: 3320.87 op/s
1048576: 415.73 op/s
```

## SSE

```
32: 3112408.34 op/s (+11%)
256: 2441758.81 op/s (+93%)
512: 1627719.13 op/s (+123%)
1024: 972969.81 op/s (+149%)
16384: 74304.47 op/s (+181%)
131072: 9427.75 op/s (+183%)
1048576: 1182.82 op/s (+184%)
```

## AVX2

```
32: 3159181.11 op/s (+13%)
256: 2449003.64 op/s (+94%)
512: 2450859.66 op/s (+236%)
1024: 1628639.74 op/s (+317%)
16384: 145438.38 op/s (+451%)
131072: 18729.81 op/s (+464%)
1048576: 2330.21 op/s (+460%)
```

Always resolving the correct function (instead of doing it once and storing in a pointer) is a bit slower:

```
32: 3126362.45 op/s
256: 2395000.08 op/s
512: 2399900.36 op/s
1024: 1600087.45 op/s
16384: 144505.38 op/s
131072: 18464.47 op/s
1048576: 2295.46 op/s
```

iperf3:
buildtype=release
### C

[ 5] 55.00-56.00 sec 115 MBytes 965 Mbits/sec 2 557 KBytes
[ 5] 56.00-57.00 sec 115 MBytes 965 Mbits/sec 1 491 KBytes
[ 5] 57.00-58.00 sec 116 MBytes 975 Mbits/sec 0 648 KBytes
[ 5] 58.00-59.00 sec 115 MBytes 965 Mbits/sec 1 588 KBytes
[ 5] 59.00-60.00 sec 115 MBytes 965 Mbits/sec 2 522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec 136 sender
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec receiver

### SSSE3

[ 5] 55.00-56.00 sec 130 MBytes 1.09 Gbits/sec 25 600 KBytes
[ 5] 56.00-57.00 sec 131 MBytes 1.10 Gbits/sec 2 560 KBytes
[ 5] 57.00-58.00 sec 130 MBytes 1.09 Gbits/sec 2 515 KBytes
[ 5] 58.00-59.00 sec 132 MBytes 1.11 Gbits/sec 0 683 KBytes
[ 5] 59.00-60.00 sec 131 MBytes 1.10 Gbits/sec 2 649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec 2659 sender
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec receiver

### AVX2

[ 5] 55.00-56.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 56.00-57.00 sec 141 MBytes 1.19 Gbits/sec 2 574 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 1 550 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 2 520 KBytes
[ 5] 59.00-60.00 sec 142 MBytes 1.20 Gbits/sec 1 494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.29 GBytes 1.19 Gbits/sec 1126 sender
[ 5] 0.00-60.00 sec 8.28 GBytes 1.19 Gbits/sec receiver

I thought that optimizing for a specific CPU, or the auto-vectorization performed at `-O3`, might remove the need to write assembly:

buildtype=release + -march=native -mtune=native
### C

[ 5] 55.00-56.00 sec 110 MBytes 923 Mbits/sec 1 498 KBytes
[ 5] 56.00-57.00 sec 110 MBytes 923 Mbits/sec 0 646 KBytes
[ 5] 57.00-58.00 sec 110 MBytes 923 Mbits/sec 3 581 KBytes
[ 5] 58.00-59.00 sec 110 MBytes 923 Mbits/sec 2 506 KBytes
[ 5] 59.00-60.00 sec 110 MBytes 923 Mbits/sec 0 650 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.40 GBytes 916 Mbits/sec 2579 sender
[ 5] 0.00-60.01 sec 6.40 GBytes 916 Mbits/sec receiver

### AVX2

[ 5] 55.00-56.00 sec 141 MBytes 1.18 Gbits/sec 4 649 KBytes
[ 5] 56.00-57.00 sec 142 MBytes 1.20 Gbits/sec 1 626 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 1 571 KBytes
[ 5] 59.00-60.00 sec 141 MBytes 1.18 Gbits/sec 3 539 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec 981 sender
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec receiver

-O3 + -march=native -mtune=native
### C

[ 5] 55.00-56.00 sec 111 MBytes 933 Mbits/sec 0 680 KBytes
[ 5] 56.00-57.00 sec 111 MBytes 933 Mbits/sec 3 619 KBytes
[ 5] 57.00-58.00 sec 112 MBytes 944 Mbits/sec 1 554 KBytes
[ 5] 58.00-59.00 sec 111 MBytes 933 Mbits/sec 0 691 KBytes
[ 5] 59.00-60.00 sec 111 MBytes 933 Mbits/sec 2 634 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.48 GBytes 927 Mbits/sec 3267 sender
[ 5] 0.00-60.00 sec 6.47 GBytes 926 Mbits/sec receiver

### AVX2

[ 5] 55.00-56.00 sec 139 MBytes 1.16 Gbits/sec 1 639 KBytes
[ 5] 56.00-57.00 sec 139 MBytes 1.16 Gbits/sec 1 607 KBytes
[ 5] 57.00-58.00 sec 138 MBytes 1.15 Gbits/sec 1 578 KBytes
[ 5] 58.00-59.00 sec 139 MBytes 1.16 Gbits/sec 2 546 KBytes
[ 5] 59.00-60.00 sec 138 MBytes 1.15 Gbits/sec 1 510 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.01 GBytes 1.15 Gbits/sec 312 sender
[ 5] 0.00-60.00 sec 8.00 GBytes 1.15 Gbits/sec receiver

Not really.