hyperboria / bugs

Peer-to-peer IPv6 networking, secure and near-zero-conf.

Question about performance and how CryptoAuth works #176

Open ccaapton opened 6 years ago

ccaapton commented 6 years ago

Hi, I did a ./cjdroute --bench on my laptop, and the result looks like this:

Benchmark salsa20/poly1305 in 1663ms. 721587 kilobits per second
Benchmark Switching in 960ms. 213333 packets per second

So, assuming each packet is 1300 bytes, the switch could only deliver about 277333 kilobits per second, a lot lower than the encrypt/decrypt process. This looks quite absurd to me, since switches can usually transfer data at line rate, and crypto modules are usually the bottleneck.

Also, I read the cjdns whitepaper, but got confused about how CryptoAuth works. For instance, Alice wants to send a packet to David through the intermediate nodes Bob and Charlie, so the traffic is A -> B -> C -> D. From my understanding of the whitepaper, the packet will go through 3 layers of CryptoAuth; does this mean the packets will be encrypted/decrypted three times, and will they carry 3 layers of crypto headers? What if it takes 20 hops to reach the final destination?

progval commented 6 years ago

So, assuming each packet is 1300 bytes, the switch could only deliver about 277333 kilobits per second, a lot lower than the encrypt/decrypt process. This looks quite absurd to me, since switches can usually transfer data at line rate, and crypto modules are usually the bottleneck.

In practice with cjdns, the bottleneck is the context switching between the userland process and the kernel, not the cryptography.

the packet will go through 3 layers of CryptoAuth; does this mean the packets will be encrypted/decrypted three times, and will they carry 3 layers of crypto headers?

No, there are only two layers: the point-to-point layer (A-B, B-C, and C-D) and the end-to-end layer (A-D). But that is still a total of 4 CryptoAuth sessions (3 p-t-p and 1 e-t-e).

What if it takes 20 hops to reach the final destination?

That would be 20 p-t-p sessions and 1 e-t-e session.
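
To make the layering concrete, below is a minimal sketch in C that uses libsodium's crypto_secretbox with random symmetric keys as stand-ins for the session keys CryptoAuth would negotiate (this is not cjdns code, and it omits the real switch/CryptoAuth headers). It illustrates the point above: A applies the e-t-e layer once, every intermediate node only strips and re-adds its own p-t-p layer without touching the inner ciphertext, and only D can open the e-t-e layer, so the per-packet overhead stays constant no matter how many hops the path has.

/* Sketch only: random symmetric keys stand in for negotiated CryptoAuth
 * session keys; real cjdns packets also carry switch and crypto headers. */
#include <sodium.h>
#include <stdio.h>

#define PAYLOAD_LEN 1300
#define SEALED(len) ((len) + crypto_secretbox_MACBYTES)

/* One intermediate hop: strip the incoming link's p-t-p layer and add the
 * outgoing link's layer. The inner e-t-e ciphertext is forwarded untouched. */
static int reseal_for_next_hop(unsigned char *wire, unsigned long long inner_len,
                               const unsigned char *n_in,  const unsigned char *k_in,
                               const unsigned char *n_out, const unsigned char *k_out)
{
    unsigned char inner[SEALED(PAYLOAD_LEN)];
    if (crypto_secretbox_open_easy(inner, wire, SEALED(inner_len), n_in, k_in) != 0)
        return -1;                                 /* p-t-p authentication failed */
    crypto_secretbox_easy(wire, inner, inner_len, n_out, k_out);
    return 0;
}

int main(void)
{
    if (sodium_init() < 0) return 1;

    /* one e-t-e key (A<->D) and one p-t-p key per link (A-B, B-C, C-D) */
    unsigned char k_ad[crypto_secretbox_KEYBYTES], k_ab[crypto_secretbox_KEYBYTES],
                  k_bc[crypto_secretbox_KEYBYTES], k_cd[crypto_secretbox_KEYBYTES];
    unsigned char n_ad[crypto_secretbox_NONCEBYTES], n_ab[crypto_secretbox_NONCEBYTES],
                  n_bc[crypto_secretbox_NONCEBYTES], n_cd[crypto_secretbox_NONCEBYTES];
    randombytes_buf(k_ad, sizeof k_ad); randombytes_buf(n_ad, sizeof n_ad);
    randombytes_buf(k_ab, sizeof k_ab); randombytes_buf(n_ab, sizeof n_ab);
    randombytes_buf(k_bc, sizeof k_bc); randombytes_buf(n_bc, sizeof n_bc);
    randombytes_buf(k_cd, sizeof k_cd); randombytes_buf(n_cd, sizeof n_cd);

    unsigned char payload[PAYLOAD_LEN] = "hello David";
    unsigned char ete[SEALED(PAYLOAD_LEN)];           /* e-t-e layer (A -> D)       */
    unsigned char wire[SEALED(SEALED(PAYLOAD_LEN))];  /* + exactly one p-t-p layer  */

    /* Alice: one e-t-e encryption, then the A-B p-t-p layer on top */
    crypto_secretbox_easy(ete, payload, sizeof payload, n_ad, k_ad);
    crypto_secretbox_easy(wire, ete, sizeof ete, n_ab, k_ab);

    /* Bob and Charlie each handle only their own p-t-p sessions */
    reseal_for_next_hop(wire, sizeof ete, n_ab, k_ab, n_bc, k_bc);  /* B: A-B -> B-C */
    reseal_for_next_hop(wire, sizeof ete, n_bc, k_bc, n_cd, k_cd);  /* C: B-C -> C-D */

    /* David: strip the C-D p-t-p layer, then open the e-t-e layer */
    unsigned char inner[SEALED(PAYLOAD_LEN)], out[PAYLOAD_LEN];
    if (crypto_secretbox_open_easy(inner, wire, sizeof wire, n_cd, k_cd) != 0 ||
        crypto_secretbox_open_easy(out, inner, sizeof inner, n_ad, k_ad) != 0)
        return 1;

    printf("David reads \"%s\"; on-wire size was %zu bytes on every link\n",
           (const char *)out, sizeof wire);
    return 0;
}

Build with cc layers.c -lsodium; the printed on-wire size is the same on every link because exactly one p-t-p layer is present at a time.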

ccaapton commented 6 years ago

the bottleneck is the context switching between the userland process and the kernel

Thanks for the prompt reply. ZeroTier published a benchmark here: https://www.zerotier.com/blog/2017-04-20-benchmarks.shtml. In that test ZeroTier is on par with IPsec, so their context switching seems to cost nothing. I guess there is some room for optimization; for example, Linux 3.8 added multi-queue tun/tap support, so you can have multiple threads polling the same tun device.
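
For reference, a minimal sketch of what using that kernel feature looks like (not cjdns code; "cjdns0" is just an example interface name, and opening the device needs CAP_NET_ADMIN): each open() of /dev/net/tun with IFF_MULTI_QUEUE and the same interface name attaches one more queue, and each resulting file descriptor could then be served by its own thread.

/* Minimal sketch: open N queues of one multi-queue tun device (Linux >= 3.8).
 * Requires CAP_NET_ADMIN; "cjdns0" is only an example interface name. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

#define NQUEUES 4

int main(void)
{
    int fds[NQUEUES];
    struct ifreq ifr;
    memset(&ifr, 0, sizeof ifr);
    ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_MULTI_QUEUE;
    strncpy(ifr.ifr_name, "cjdns0", IFNAMSIZ - 1);

    for (int i = 0; i < NQUEUES; i++) {
        fds[i] = open("/dev/net/tun", O_RDWR);
        if (fds[i] < 0 || ioctl(fds[i], TUNSETIFF, &ifr) < 0) {
            perror("tun queue");
            return 1;
        }
        /* each fds[i] is one queue; a worker thread could read()/write() it */
    }
    printf("opened %d queues on %s\n", NQUEUES, ifr.ifr_name);
    /* ... spawn one polling thread per queue here ... */
    for (int i = 0; i < NQUEUES; i++) close(fds[i]);
    return 0;
}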

Regarding p-t-p and e-t-e: since there is already e-t-e encryption, would authentication without encryption be enough for p-t-p?

progval commented 6 years ago

Do you know what MTU Zerotier used for this test?

p-t-p encryption is used to prevent middleboxes from accessing or tampering with the switch header, mainly the route header, which they may want to do either maliciously or with good intentions.

It also prevents protocol ossification, i.e. once middleboxes start making assumptions about what a packet (or protocol) looks like (which they will inevitably do if they can), we can't change the protocol anymore. Encrypting all fields of a packet solves this issue.

ccaapton commented 6 years ago

I don't know their MTU size, but I guess assuming 1500 should be reasonable. The performance difference between IPsec and OpenVPN is similar to the benchmarks released by WireGuard, so I wouldn't worry too much about ZeroTier's methodology.

Regarding p-t-p, I think HMAC-SHA256-style authentication should be enough to defend against tampering by middleboxes. I don't know the exact performance numbers compared to the salsa20-poly1305 combo. Could you give some insight on AEAD vs. authentication-only?
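
For what it's worth, a rough micro-benchmark along these lines could be put together with libsodium (just a sketch, not cjdns code: crypto_auth_hmacsha256 for the authenticate-only case, and crypto_secretbox_easy, i.e. xsalsa20-poly1305, as a stand-in for the salsa20/poly1305 combo, on 1300-byte packets):

/* Rough micro-benchmark sketch (libsodium assumed; not cjdns code):
 * compare authenticate-only (HMAC-SHA256) against encrypt+authenticate
 * (xsalsa20-poly1305 secretbox) on 1300-byte packets. */
#include <sodium.h>
#include <stdio.h>
#include <time.h>

#define PKT   1300
#define ITERS 200000

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    if (sodium_init() < 0) return 1;

    unsigned char msg[PKT], boxed[PKT + crypto_secretbox_MACBYTES];
    unsigned char mac[crypto_auth_hmacsha256_BYTES];
    unsigned char k_auth[crypto_auth_hmacsha256_KEYBYTES];
    unsigned char k_box[crypto_secretbox_KEYBYTES];
    unsigned char nonce[crypto_secretbox_NONCEBYTES];
    randombytes_buf(msg, sizeof msg);
    randombytes_buf(k_auth, sizeof k_auth);
    randombytes_buf(k_box, sizeof k_box);
    randombytes_buf(nonce, sizeof nonce);

    double t0 = now();
    for (int i = 0; i < ITERS; i++)
        crypto_auth_hmacsha256(mac, msg, PKT, k_auth);     /* auth-only  */
    double t1 = now();
    for (int i = 0; i < ITERS; i++)
        crypto_secretbox_easy(boxed, msg, PKT, nonce, k_box); /* AEAD-style */
    double t2 = now();

    printf("hmac-sha256 auth-only    : %.0f pkt/s\n", ITERS / (t1 - t0));
    printf("xsalsa20-poly1305 sealed : %.0f pkt/s\n", ITERS / (t2 - t1));
    return 0;
}

Build with cc aead_vs_auth.c -lsodium; the printed packets-per-second figures would at least give a rough answer for a given CPU.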

progval commented 6 years ago

Regarding p-t-p, I think HMAC-SHA256-style authentication should be enough to defend against tampering by middleboxes.

Yeah, but it can't hurt to hide things from the middleboxes, because they may want to do things like QoS based on some fields.

I don't know the exact performance numbers compared to the salsa20-poly1305 combo. Could you give some insight on AEAD vs. authentication-only?

I have no idea

ccaapton commented 6 years ago

I did a strace -c cjdroute --bench, and the result shows that very few syscalls are performed; most of the time is spent on the crypto and switching logic:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        10           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        10           open
  0.00    0.000000           0        10           close
  0.00    0.000000           0         2           mprotect
  0.00    0.000000           0        11           brk
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0        46           writev
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           clock_getres
  0.00    0.000000           0         2           eventfd2
  0.00    0.000000           0         2           epoll_create1
  0.00    0.000000           0         3           pipe2
  0.00    0.000000           0         2         2 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   104         2 total

I also looked at the source code in net/Benchmark.c, and there is no I/O in the benchmark part of the code, only in-RAM manipulation of buffers. It is disappointing to see that the packets per second are so low even without the context-switch cost...

progval commented 6 years ago

Did you try with the -f option of strace?

ccaapton commented 6 years ago

Just tried '-c -f' and '-c -F'; the results are the same.

progval commented 6 years ago

Could you try benchmarking ZeroTier on the same computer?

ccaapton commented 6 years ago

Sure, but ZeroTier does not have a facility quite like "--bench". Could you suggest how to proceed?

progval commented 6 years ago

It looks like they used iperf between two different computers.

ccaapton commented 6 years ago

Test environment: Intel 2955U laptop, 4 GB RAM. Host OS: Linux 3.8.11. Guest OS: user-mode Linux 4.16.0, connected to the host via TAP. Because of UML, the kernel->user context-switch cost is extremely high, and the upload/download speed is asymmetric. More specifically, uml->host is extremely slow, even for a native TAP connection, so I only tested the host->uml direction. Below are the numbers:

              Speed       Ping(min/avg/max)
NativeTap     1.5Gbit     0.218/0.443/0.805 ms
CJDNS         27.5Mbit    0.484/1.329/8.691 ms
ZeroTier      23.6Mbit    2.081/10.817/96.039 ms

Remark: not much of a conclusion can be drawn from my test due to the environment constraints.

progval commented 6 years ago

You can disable cjdns' eth peering in its config.

ccaapton commented 6 years ago

New benchmark! Same physical machine, but the two instances are now both Linux Docker containers, connected via a veth pair, so the UML overhead is gone. Kernel version 3.8.11. Ethernet peering is disabled for CJDNS.

              Speed           Ping(min/avg/max/mdev)
NativeVeth    1.46Gbit/sec    0.057/0.076/0.093/0.015 ms
CJDNS         161Mbit/sec     0.269/0.652/3.961/0.768 ms
Zerotier      194Mbit/sec     0.288/0.944/7.631/1.640 ms

From the results, CJDNS is slower in transmission than ZeroTier even without p-t-p encryption in play. I am still not able to estimate the cost of p-t-p encryption in this environment.

progval commented 6 years ago

Nice, thanks!

Could you try with iperf now?

progval commented 6 years ago

Do you have scripts available somewhere to run the same tests?

ccaapton commented 6 years ago

Both tests were already performed with iperf and ping. I was running the commands by hand in an ad-hoc chroot on a Chromebook, so there is no script, sorry.