gsliepen / tinc

a VPN daemon
http://tinc-vpn.org/

ChaCha20: add optimized versions for amd64 (SSSE3 & AVX2) #392

Open hg opened 2 years ago

hg commented 2 years ago

With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in a debug build (optimization makes the perf output unreadable):

Children      Self  Command  Shared Object         Symbol
-   99.72%     0.00%  tincd    libc.so.6             [.] __libc_init_first
     __libc_init_first
   - main
      - 99.71% main_loop
         - 90.08% event_loop
            - 69.62% handle_incoming_vpn_data
               - 69.20% handle_incoming_vpn_packet
                  - 68.06% receive_udppacket
                     - 67.88% sptps_receive_data
                        - 67.68% sptps_receive_data_datagram
                           - 49.15% chacha_poly1305_decrypt
                                39.06% chacha_encrypt_bytes
                                9.83% poly1305_auth
                           + 17.80% receive_sptps_record
                  + 0.67% lookup_node_udp
            + 13.34% handle_device_data
            + 4.69% recvmmsg
            + 1.37% handle_meta_io
         + 9.50% epoll_wait

tincd is using the lowest-common-denominator implementation of this function. Let's add a couple of optimized versions based on compiler intrinsics.

All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.
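
As a rough illustration of the kind of intrinsics involved, here is a sketch of the common SSSE3 approach (not the code taken from Dolbeau): the quarter-round is applied to four state words per register, and the 16- and 8-bit rotations become single byte shuffles.

```c
#include <tmmintrin.h> /* SSE2 + SSSE3 intrinsics */

/* Rotate each 32-bit lane left by n bits using plain shifts (for n = 12 and 7). */
#define ROTL32(v, n) _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* One ChaCha quarter-round applied to four columns of the 4x4 state in parallel.
   a, b, c, d each hold one row of the state (four 32-bit words). */
static inline void quarter_round_ssse3(__m128i *a, __m128i *b, __m128i *c, __m128i *d) {
	/* Rotations by 16 and 8 bits move whole bytes, so a single pshufb handles each. */
	const __m128i rot16 = _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, 10, 5, 4, 7, 6, 1, 0, 3, 2);
	const __m128i rot8 = _mm_set_epi8(14, 13, 12, 15, 10, 9, 8, 11, 6, 5, 4, 7, 2, 1, 0, 3);

	*a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = _mm_shuffle_epi8(*d, rot16);
	*c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 12);
	*a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = _mm_shuffle_epi8(*d, rot8);
	*c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 7);
}
```

The AVX2 variant plays the same game with 256-bit registers (`_mm256_*`), processing twice as much state per pass.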

Compatibility

x86 / amd64

We'll be shipping three versions of the function (or two, when built with older compilers that lack AVX2 support): the generic C implementation, an SSSE3 version, and an AVX2 version.

The right one is picked at runtime depending on current CPU capabilities.
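
A minimal sketch of what such runtime dispatch can look like; the function names and the `HAVE_AVX2` guard are placeholders, not necessarily what this PR uses. GCC and Clang provide `__builtin_cpu_supports()` for the feature check:

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder signature; the real one in the PR may differ. */
typedef void (*chacha_blocks_t)(const uint32_t state[16], const uint8_t *in, uint8_t *out, size_t len);

void chacha_blocks_c(const uint32_t state[16], const uint8_t *in, uint8_t *out, size_t len);
void chacha_blocks_ssse3(const uint32_t state[16], const uint8_t *in, uint8_t *out, size_t len);
#ifdef HAVE_AVX2 /* hypothetical build-system flag for compilers without AVX2 support */
void chacha_blocks_avx2(const uint32_t state[16], const uint8_t *in, uint8_t *out, size_t len);
#endif

/* Pick the best implementation once and cache the result in a pointer;
   re-resolving on every call measurably costs throughput (see benchmarks below). */
static chacha_blocks_t resolve_chacha_blocks(void) {
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
#ifdef HAVE_AVX2
	if(__builtin_cpu_supports("avx2"))
		return chacha_blocks_avx2;
#endif
	if(__builtin_cpu_supports("ssse3"))
		return chacha_blocks_ssse3;
#endif
	return chacha_blocks_c;
}
```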

Other architectures

Only the old C implementation is used. ARM NEON support could be added later.

Benchmarks

TL;DR: 20-22% increase in throughput.

bench_chacha.c

Percentage is relative to generic C implementation.

## C

```
      32:     2790793.08 op/s
     256:     1261587.31 op/s
     512:      728262.56 op/s
    1024:      390193.12 op/s
   16384:       26361.08 op/s
  131072:        3320.87 op/s
 1048576:         415.73 op/s
```

## SSE

```
      32:     3112408.34 op/s (+11%)
     256:     2441758.81 op/s (+93%)
     512:     1627719.13 op/s (+123%)
    1024:      972969.81 op/s (+149%)
   16384:       74304.47 op/s (+181%)
  131072:        9427.75 op/s (+183%)
 1048576:        1182.82 op/s (+184%)
```

## AVX2

```
      32:     3159181.11 op/s (+13%)
     256:     2449003.64 op/s (+94%)
     512:     2450859.66 op/s (+236%)
    1024:     1628639.74 op/s (+317%)
   16384:      145438.38 op/s (+451%)
  131072:       18729.81 op/s (+464%)
 1048576:        2330.21 op/s (+460%)
```

Always resolving the correct function (instead of doing it once and storing in a pointer) is a bit slower:

```
      32:     3126362.45 op/s
     256:     2395000.08 op/s
     512:     2399900.36 op/s
    1024:     1600087.45 op/s
   16384:      144505.38 op/s
  131072:       18464.47 op/s
 1048576:        2295.46 op/s
```

iperf3:

buildtype=release

### C

```
[ 5] 55.00-56.00 sec 115 MBytes 965 Mbits/sec 2 557 KBytes
[ 5] 56.00-57.00 sec 115 MBytes 965 Mbits/sec 1 491 KBytes
[ 5] 57.00-58.00 sec 116 MBytes 975 Mbits/sec 0 648 KBytes
[ 5] 58.00-59.00 sec 115 MBytes 965 Mbits/sec 1 588 KBytes
[ 5] 59.00-60.00 sec 115 MBytes 965 Mbits/sec 2 522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec 136 sender
[ 5] 0.00-60.00 sec 6.73 GBytes 963 Mbits/sec receiver
```

### SSSE3

```
[ 5] 55.00-56.00 sec 130 MBytes 1.09 Gbits/sec 25 600 KBytes
[ 5] 56.00-57.00 sec 131 MBytes 1.10 Gbits/sec 2 560 KBytes
[ 5] 57.00-58.00 sec 130 MBytes 1.09 Gbits/sec 2 515 KBytes
[ 5] 58.00-59.00 sec 132 MBytes 1.11 Gbits/sec 0 683 KBytes
[ 5] 59.00-60.00 sec 131 MBytes 1.10 Gbits/sec 2 649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec 2659 sender
[ 5] 0.00-60.00 sec 7.64 GBytes 1.09 Gbits/sec receiver
```

### AVX2

```
[ 5] 55.00-56.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 56.00-57.00 sec 141 MBytes 1.19 Gbits/sec 2 574 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 1 550 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 2 520 KBytes
[ 5] 59.00-60.00 sec 142 MBytes 1.20 Gbits/sec 1 494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.29 GBytes 1.19 Gbits/sec 1126 sender
[ 5] 0.00-60.00 sec 8.28 GBytes 1.19 Gbits/sec receiver
```

I thought it might be possible that optimizing for a specific CPU, or the auto-vectorization performed at -O3, would remove the need to write assembly:

buildtype=release + -march=native -mtune=native

### C

```
[ 5] 55.00-56.00 sec 110 MBytes 923 Mbits/sec 1 498 KBytes
[ 5] 56.00-57.00 sec 110 MBytes 923 Mbits/sec 0 646 KBytes
[ 5] 57.00-58.00 sec 110 MBytes 923 Mbits/sec 3 581 KBytes
[ 5] 58.00-59.00 sec 110 MBytes 923 Mbits/sec 2 506 KBytes
[ 5] 59.00-60.00 sec 110 MBytes 923 Mbits/sec 0 650 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.40 GBytes 916 Mbits/sec 2579 sender
[ 5] 0.00-60.01 sec 6.40 GBytes 916 Mbits/sec receiver
```

### AVX2

```
[ 5] 55.00-56.00 sec 141 MBytes 1.18 Gbits/sec 4 649 KBytes
[ 5] 56.00-57.00 sec 142 MBytes 1.20 Gbits/sec 1 626 KBytes
[ 5] 57.00-58.00 sec 142 MBytes 1.20 Gbits/sec 2 602 KBytes
[ 5] 58.00-59.00 sec 142 MBytes 1.20 Gbits/sec 1 571 KBytes
[ 5] 59.00-60.00 sec 141 MBytes 1.18 Gbits/sec 3 539 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec 981 sender
[ 5] 0.00-60.00 sec 8.30 GBytes 1.19 Gbits/sec receiver
```

-O3 + -march=native -mtune=native

### C

```
[ 5] 55.00-56.00 sec 111 MBytes 933 Mbits/sec 0 680 KBytes
[ 5] 56.00-57.00 sec 111 MBytes 933 Mbits/sec 3 619 KBytes
[ 5] 57.00-58.00 sec 112 MBytes 944 Mbits/sec 1 554 KBytes
[ 5] 58.00-59.00 sec 111 MBytes 933 Mbits/sec 0 691 KBytes
[ 5] 59.00-60.00 sec 111 MBytes 933 Mbits/sec 2 634 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.48 GBytes 927 Mbits/sec 3267 sender
[ 5] 0.00-60.00 sec 6.47 GBytes 926 Mbits/sec receiver
```

### AVX2

```
[ 5] 55.00-56.00 sec 139 MBytes 1.16 Gbits/sec 1 639 KBytes
[ 5] 56.00-57.00 sec 139 MBytes 1.16 Gbits/sec 1 607 KBytes
[ 5] 57.00-58.00 sec 138 MBytes 1.15 Gbits/sec 1 578 KBytes
[ 5] 58.00-59.00 sec 139 MBytes 1.16 Gbits/sec 2 546 KBytes
[ 5] 59.00-60.00 sec 138 MBytes 1.15 Gbits/sec 1 510 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 8.01 GBytes 1.15 Gbits/sec 312 sender
[ 5] 0.00-60.00 sec 8.00 GBytes 1.15 Gbits/sec receiver
```

Not really.

hg commented 2 years ago

I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on Neoverse-N1, so iperf3 speeds are comparable.

If anyone has an old Raspberry Pi and wants to test this on a slower CPU, be my guest. The code is in pretty bad shape since it's a work in progress, but it does work.

Otherwise, I see no reason to merge NEON support.

generic

      32:     2846064.09 op/s
     256:     1424432.86 op/s
     512:      850478.15 op/s
    1024:      473508.42 op/s
   16384:       33032.01 op/s
  131072:        4146.80 op/s
 1048576:         517.55 op/s
[  5]  55.00-56.00  sec   100 MBytes   839 Mbits/sec    0    658 KBytes
[  5]  56.00-57.00  sec   101 MBytes   849 Mbits/sec    8    571 KBytes
[  5]  57.00-58.00  sec   102 MBytes   860 Mbits/sec    2    484 KBytes
[  5]  58.00-59.00  sec   101 MBytes   849 Mbits/sec    0    622 KBytes
[  5]  59.00-60.00  sec   101 MBytes   849 Mbits/sec    1    532 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec  2875             sender
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec                  receiver

NEON

       32:     2933899.98 op/s (+3%)
      256:     1943603.43 op/s (+36%)
      512:     1242317.12 op/s (+46%)
     1024:      724495.40 op/s (+53%)
    16384:       53621.92 op/s (+62%)
   131072:        6754.49 op/s (+62%)
  1048576:         844.45 op/s (+63%)
[  5]  55.00-56.00  sec   101 MBytes   849 Mbits/sec    1    536 KBytes
[  5]  56.00-57.00  sec   100 MBytes   839 Mbits/sec    0    666 KBytes
[  5]  57.00-58.00  sec   101 MBytes   849 Mbits/sec    3    584 KBytes
[  5]  58.00-59.00  sec   100 MBytes   839 Mbits/sec    0    704 KBytes
[  5]  59.00-60.00  sec   102 MBytes   860 Mbits/sec    2    626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec  1660             sender
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec                  receiver
Badly modified `ns_ping.py` for running iperf3:

```py
#!/usr/bin/env python3

"""Create two network namespaces and run ping between them."""

import os
import signal
import subprocess as subp
import typing as T

from testlib import external as ext, util, template, cmd
from testlib.log import log
from testlib.proc import Tinc, Script
from testlib.test import Test

util.require_root()
util.require_command("ip", "netns", "list")
util.require_path("/dev/net/tun")

IP_FOO = "192.168.1.1"
IP_BAR = "192.168.1.2"
MASK = 24


def init(ctx: Test) -> T.Tuple[Tinc, Tinc]:
    """Initialize new test nodes."""
    foo, bar = ctx.node(), ctx.node()

    log.info("create network namespaces")
    assert ext.netns_add(foo.name)
    assert ext.netns_add(bar.name)

    log.info("initialize two nodes")
    stdin = f"""
        init {foo}
        set Port 0
        set Subnet {IP_FOO}
        set Interface {foo}
        set Address localhost
        set AutoConnect no
    """
    foo.cmd(stdin=stdin)
    foo.add_script(Script.TINC_UP, template.make_netns_config(foo.name, IP_FOO, MASK))
    foo.start()

    stdin = f"""
        init {bar}
        set Port 0
        set Subnet {IP_BAR}
        set Interface {bar}
        set Address localhost
        set AutoConnect no
    """
    bar.cmd(stdin=stdin)
    bar.add_script(Script.TINC_UP, template.make_netns_config(bar.name, IP_BAR, MASK))

    cmd.exchange(foo, bar)

    return foo, bar


def ping(namespace: str, ip_addr: str) -> int:
    """Send pings between two network namespaces."""
    log.info("pinging node from netns %s at %s", namespace, ip_addr)
    proc = subp.run(
        ["ip", "netns", "exec", namespace, "iperf3", "-t", "60", "-c", ip_addr],
        check=False,
    )
    log.info("ping finished with code %d", proc.returncode)
    return proc.returncode


with Test("ns-ping") as context:
    foo_node, bar_node = init(context)
    bar_node.cmd("start")

    log.info("waiting for nodes to come up")
    bar_node[Script.TINC_UP].wait()

    log.info("add script foo/host-up")
    bar_node.add_script(foo_node.script_up)

    log.info("add ConnectTo clause")
    bar_node.cmd("add", "ConnectTo", foo_node.name)

    log.info("bar waits for foo")
    bar_node[foo_node.script_up].wait()

    subp.Popen(
        ["ip", "netns", "exec", bar_node.name, "iperf3", "-s"],
        stdout=subp.DEVNULL,
        stderr=subp.DEVNULL,
    )

    log.info("ping must work after connection is up")
    assert not ping(foo_node.name, IP_BAR)
```
gsliepen commented 2 years ago

The benchmark code you wrote is not really measuring the usage pattern tinc has. The `sptps_speed` program (which you might have to explicitly build using `ninja -C build src/sptps_speed`) tests the SPTPS protocol more thoroughly, using typical MTU-sized packets.

Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with `-Dbuildtype=debugoptimized`?

I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.

hg commented 2 years ago

Oh yes. If you're willing to reintroduce a dependency on libssl, that would of course be best. We get highly optimized code for all possible architectures for free.

TL;DR:

| flavor    | throughput   |
|-----------|--------------|
| generic   | 0.974 Gbit/s |
| avx2      | 1.19 Gbit/s  |
| libssl1.1 | 1.28 Gbit/s  |
| libssl3.1 | 1.23 Gbit/s* |

*: probably a debug build

detailed results

## generic

```
Generating keys for 10 seconds: 27182.71 op/s
Ed25519 sign for 10 seconds: 25932.53 op/s
Ed25519 verify for 10 seconds: 8360.30 op/s
ECDH for 10 seconds: 6783.36 op/s
SPTPS/TCP authenticate for 10 seconds: 3173.35 op/s
SPTPS/TCP transmit for 10 seconds: 2.47 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3167.00 op/s
SPTPS/UDP transmit for 10 seconds: 2.46 Gbit/s
```

```
[ 5] 55.00-56.00 sec 116 MBytes 975 Mbits/sec 0 686 KBytes
[ 5] 56.00-57.00 sec 119 MBytes 996 Mbits/sec 1 634 KBytes
[ 5] 57.00-58.00 sec 118 MBytes 986 Mbits/sec 2 573 KBytes
[ 5] 58.00-59.00 sec 118 MBytes 986 Mbits/sec 1 506 KBytes
[ 5] 59.00-60.00 sec 118 MBytes 986 Mbits/sec 0 662 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 6.80 GBytes 974 Mbits/sec 5725 sender
[ 5] 0.00-60.01 sec 6.80 GBytes 973 Mbits/sec receiver
```

## avx2

```
Generating keys for 10 seconds: 27313.96 op/s
Ed25519 sign for 10 seconds: 26049.17 op/s
Ed25519 verify for 10 seconds: 8411.54 op/s
ECDH for 10 seconds: 6792.54 op/s
SPTPS/TCP authenticate for 10 seconds: 3156.87 op/s
SPTPS/TCP transmit for 10 seconds: 4.00 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3160.10 op/s
SPTPS/UDP transmit for 10 seconds: 4.13 Gbit/s
```

```
[ 5] 25.00-26.00 sec 142 MBytes 1.20 Gbits/sec 1 509 KBytes
[ 5] 26.00-27.00 sec 142 MBytes 1.20 Gbits/sec 0 691 KBytes
[ 5] 27.00-28.00 sec 142 MBytes 1.20 Gbits/sec 1 672 KBytes
[ 5] 28.00-29.00 sec 142 MBytes 1.20 Gbits/sec 1 649 KBytes
[ 5] 29.00-30.00 sec 142 MBytes 1.20 Gbits/sec 2 619 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.12 GBytes 1.19 Gbits/sec 273 sender
[ 5] 0.00-30.00 sec 4.12 GBytes 1.19 Gbits/sec receiver
```

## alt (OpenSSL 1.1.1o)

```
Generating keys for 10 seconds: 27294.88 op/s
Ed25519 sign for 10 seconds: 26004.59 op/s
Ed25519 verify for 10 seconds: 8409.03 op/s
ECDH for 10 seconds: 6786.60 op/s
Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds: 3182.28 op/s
SPTPS/TCP transmit for 10 seconds: 6.24 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3178.83 op/s
SPTPS/UDP transmit for 10 seconds: 6.16 Gbit/s
Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds: 3177.70 op/s
SPTPS/TCP transmit for 10 seconds: 8.89 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3169.44 op/s
SPTPS/UDP transmit for 10 seconds: 8.73 Gbit/s
```

```
[ 5] 25.00-26.00 sec 154 MBytes 1.29 Gbits/sec 1 628 KBytes
[ 5] 26.00-27.00 sec 154 MBytes 1.29 Gbits/sec 4 619 KBytes
[ 5] 27.00-28.00 sec 155 MBytes 1.30 Gbits/sec 2 611 KBytes
[ 5] 28.00-29.00 sec 155 MBytes 1.30 Gbits/sec 2 595 KBytes
[ 5] 29.00-30.00 sec 154 MBytes 1.29 Gbits/sec 1 585 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.42 GBytes 1.28 Gbits/sec 3427 sender
[ 5] 0.00-30.00 sec 4.41 GBytes 1.28 Gbits/sec receiver
```

## alt (OpenSSL 3.1)

I don't remember what flags were used to build the library. It may be a debug build, so don't pay much attention to it.

```
Generating keys for 10 seconds: 27368.60 op/s
Ed25519 sign for 10 seconds: 26068.22 op/s
Ed25519 verify for 10 seconds: 8434.89 op/s
ECDH for 10 seconds: 6839.51 op/s
Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds: 3152.25 op/s
SPTPS/TCP transmit for 10 seconds: 5.83 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3149.35 op/s
SPTPS/UDP transmit for 10 seconds: 5.77 Gbit/s
Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds: 3151.33 op/s
SPTPS/TCP transmit for 10 seconds: 7.69 Gbit/s
SPTPS/UDP authenticate for 10 seconds: 3149.60 op/s
SPTPS/UDP transmit for 10 seconds: 7.88 Gbit/s
```

```
[ 5] 25.00-26.00 sec 148 MBytes 1.24 Gbits/sec 4 670 KBytes
[ 5] 26.00-27.00 sec 148 MBytes 1.24 Gbits/sec 2 649 KBytes
[ 5] 27.00-28.00 sec 148 MBytes 1.24 Gbits/sec 1 632 KBytes
[ 5] 28.00-29.00 sec 148 MBytes 1.24 Gbits/sec 2 609 KBytes
[ 5] 29.00-30.00 sec 148 MBytes 1.24 Gbits/sec 1 591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.24 GBytes 1.23 Gbits/sec 973 sender
[ 5] 0.00-30.00 sec 4.24 GBytes 1.22 Gbits/sec receiver
```
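
For reference, going through OpenSSL means using its EVP interface, which picks the optimized assembly for the current CPU at runtime. A minimal ChaCha20-Poly1305 encryption sketch of that EVP pattern (this is not the actual integration in #360, and the helper name is made up):

```c
#include <stdbool.h>
#include <openssl/evp.h>

/* Encrypt `len` bytes with ChaCha20-Poly1305 and emit a 16-byte tag.
   key is 32 bytes, nonce is 12 bytes. Returns false on any OpenSSL error. */
static bool chacha_poly_encrypt_evp(const unsigned char key[32], const unsigned char nonce[12],
                                    const unsigned char *in, int len,
                                    unsigned char *out, unsigned char tag[16]) {
	EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
	int outlen = 0, tmplen = 0;
	bool ok = ctx
	       && EVP_EncryptInit_ex(ctx, EVP_chacha20_poly1305(), NULL, key, nonce)
	       && EVP_EncryptUpdate(ctx, out, &outlen, in, len)
	       && EVP_EncryptFinal_ex(ctx, out + outlen, &tmplen)
	       && EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_GET_TAG, 16, tag);
	EVP_CIPHER_CTX_free(ctx);
	return ok;
}
```

Decryption is symmetrical: set the expected tag with `EVP_CTRL_AEAD_SET_TAG` before `EVP_DecryptFinal_ex` and treat a zero return as an authentication failure.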

I'll leave it here for now until after #360 is merged.