nh2 opened 2 years ago
I'm guessing this is CPU bottlenecked. What is the output of `cat /proc/cpuinfo`?
The servers are SX133 and SX134 servers from Hetzner (linked in the issue description). They have a Xeon W-2145 and a Ryzen 7 3700X CPU respectively.
`nebula` CPU usage in `htop` is around 150% while `iperf3` is running; is that expected, or should it go higher? The machines have 4 and 8 physical cores respectively.
Hey, just a short follow-up on whether anything can be done to achieve proper 10 Gbit/s throughput, or how to investigate when it doesn't happen.
Had the same problem with the Hetzner CX cloud servers. Without Nebula, iperf3 would report around 7 Gbit/s between two servers. With Nebula it wouldn't go above 1 Gbit/s. I think it has something to do with Nebula being TCP over UDP traffic, and that UDP traffic on Hetzner is either rate limited or the routers can't handle the UDP traffic. TCP over TCP would be the solution IMO, but Nebula does not support that at the moment.
Here's the link to the thread on the NebulaOSS slack channel: https://nebulaoss.slack.com/archives/CS01XE0KZ/p1619532900073100
that UDP traffic on Hetzner is either rate limited or the routers can't handle the UDP traffic
@HenkVanMaanen I cannot confirm that.
What speed does `iperf3` show in UDP mode between your Hetzner servers?
For me it's as fast as TCP mode between 2 dedicated 10 Gbit/s servers (Hetzner SX133 and SX134):
`iperf3 -c otherserver`: 9.41 Gbits/sec on the other side
`iperf3 -c otherserver --udp -b 10G`:
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[ 5] 0.00-10.00 sec 4.76 GBytes 4.09 Gbits/sec 0.000 ms 0/3529420 (0%) sender
[ 5] 0.00-10.00 sec 4.76 GBytes 4.09 Gbits/sec 0.001 ms 96/3529419 (0.0027%) receiver
`iperf3 -c otherserver --udp -b 2G -P5` (the idea that multiple flows are needed comes from this AWS article on 100 Gbit/s networking):
[ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams
[SUM] 0.00-10.00 sec 11.1 GBytes 9.56 Gbits/sec 0.000 ms 0/8254495 (0%) sender
[SUM] 0.00-10.00 sec 10.9 GBytes 9.35 Gbits/sec 0.004 ms 184417/8254338 (2.2%) receiver
So 10 Gbit/s works on UDP between these 2 machines on the Hetzner network.
Nebula-based iperf3 tops out at ~3.5 Gbit/s between the same machines, whether via TCP or UDP, regardless of the number of flows.
I also re-measured with `nuttcp` to confirm I'm not hitting `iperf3`-specific limitations, using e.g. the following for UDP:
nuttcp -S -P5200 -p5201 # this backgrounds itself
nuttcp -P5200 -p5201 -u -R9g -w2m otherserver
Here's the link to the thread on the NebulaOSS slack channel: https://nebulaoss.slack.com/archives/CS01XE0KZ/p1619532900073100
Replying to some more topics I read on that thread:
The fact that your cpu graphs in the first bit don't show a single core maxed out is unexpected. It means something is holding nebula back from running at full clip.
The same is true for me: Nebula uses only ~125% CPU, evenly spread across cores, not maxing out a single core. `htop` screenshot:
Interface MTUs:
internet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
nebula: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST> mtu 1300
There is only 1 physical link active on my servers, so confusing different links is impossible.
- Ensure you aren't dropping packets at tun. This shows up in `ifconfig` with `dropped <number>` under the `nebula1` interface. Raise `tun.tx_queue` until drops are not increasing.
I did see some packet drops at tun. I changed `tun.tx_queue` from the default (500) to 2000 and the tun drops disappeared. But this did not improve the throughput.
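For reference, this is roughly the `tun` section I ended up with (a sketch; only `tx_queue` is changed from the defaults):

```yaml
tun:
  dev: nebula
  mtu: 1300
  # raised from the default 500 until `ifconfig` stopped showing an
  # increasing `dropped` counter on the nebula interface
  tx_queue: 2000
```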
- Ensure you aren't seeing packets dropped by the UDP buffers (`listen.read_buffer` and `listen.write_buffer` should be increased until `ss -numpile` shows no dropped packets; it's the last field, `d<number>`). Generally the read buffer is the problem.
I **do** see drops in `ss -numpile` on the receiving side.
Increasing `listen.*_buffer` didn't help though; I tried values between 10 MiB and 640 MiB, and I continue to see e.g. `d403244` increasing in `watch -n1 ss -numpile`.
I tried with the default `listen.batch` setting, and with it set to 256.
**Is it possible to verify that the `listen.*_buffer` settings are really in effect?**
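For completeness, the `listen` section in these test runs looked roughly like this (a sketch; the buffer values shown are just one of the sizes I tried):

```yaml
listen:
  host: 0.0.0.0
  port: 4242
  # one of the values I tried, between 10 MiB and 640 MiB
  read_buffer: 104857600
  write_buffer: 104857600
  # also tested with the default batch size
  batch: 256
```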
Similarly, in `netstat -suna`, these fields keep increasing during the iperf transmission over Nebula:
35188361 packet receive errors
35188361 receive buffer errors
Following this post I used `dropwatch` to get details of the drops as they happen. Output:
# dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at unix_stream_connect+800 (0xffffffff9b586bc0) [software]
67498 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
3 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
5 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
3 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
2 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
69804 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
1 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
2 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
3 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
68186 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
... more of that ...
Increasing `listen.*_buffer` didn't help though; I tried values between 10 MiB and 640 MiB, and I continue to see e.g. `d403244` increasing in `watch -n1 ss -numpile`.
I found that changing the sysctl `net.core.rmem_default` from its default 212992 to 100x that value (21299200) gets rid of all those drops (in `ss` and `netstat -suna`), and `dropwatch` now looks like:
# dropwatch -l kas
Initializing kallsyms db
dropwatch>
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
4 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at unix_stream_connect+800 (0xffffffff9b586bc0) [software]
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
The fact that I had to set `net.core.rmem_default` suggests to me that Nebula's own buffer adjustments (`listen.read_buffer`) aren't working, as suspected above.
But even with all drops fixed, Nebula's throughput does not improve.
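For reference, I applied the sysctl change like this (the value is simply 100x the default; the file name is just an example):

```sh
# temporary, until reboot
sysctl -w net.core.rmem_default=21299200

# or persistently
echo 'net.core.rmem_default = 21299200' > /etc/sysctl.d/90-udp-buffers.conf
sysctl --system
```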
These are the results between two CX servers, direct tunnel:
- `iperf3 -c otherserver -b 10G -b 5G -P2` = 5 Gbit/s
- `iperf3 -c otherserver --udp -b 10G -b 5G -P2` = 1 Gbit/s
-b 10G -b 5G -P2
@HenkVanMaanen You're giving `-b` twice -- I realise that this is because I typoed that in my summary above, and I also swapped my values of `-b` and `-P` (which I just fixed), sorry for that.
My run was with `-b 2G -P5`. Could you try with that, just for completeness (perhaps also with smaller values of `-b`, e.g. `-b 1G -P10`)?
Some more info:
In the thread view in `htop`, I can see that there are generally 2 threads that use CPU:
Interestingly, if I `taskset -c 1` on both sides to pin nebula onto a single core, we get 60% and 10% on the receiver. So now in sum it takes less than 100%.
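Roughly how I pinned it (a sketch; the config path is just an example):

```sh
# pin an already-running nebula process (all its threads) to core 1
taskset -acp 1 "$(pidof nebula)"

# or start it pinned from the beginning
taskset -c 1 nebula -config /etc/nebula/config.yml
```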
As for where the process spends its time, `htop` shows the fractions are 25% user, 75% sys.
On the receiver side, `timeout 10 strace -fyp "$(pidof nebula)" -c` gives (strace started while iperf3 is transmitting over the single-threaded Nebula):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
54.81 12.410432 439 28210 30 futex
20.97 4.747149 38 122748 nanosleep
9.42 2.132681 65 32333 recvmmsg
7.19 1.628890 3 485574 write
3.33 0.754306 3 242637 sendto
2.25 0.510448 2 242631 read
2.02 0.457547 490 932 epoll_pwait
0.00 0.000162 3 43 sched_yield
0.00 0.000088 1 56 getsockname
0.00 0.000067 3 18 recvfrom
0.00 0.000042 42 1 restart_syscall
0.00 0.000020 3 6 socket
0.00 0.000013 2 6 close
0.00 0.000013 2 6 bind
0.00 0.000011 1 6 getsockopt
0.00 0.000009 9 1 tgkill
0.00 0.000006 6 1 getpid
0.00 0.000003 3 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 22.641887 19 1155210 30 total
Not sure how accurate that is, as the throughput over the single-threaded setup drops from 1.4 Gbit/s to 0.44 Gbit/s while strace is active.
`futex` here is shown as the main bottleneck, but it may be that the futex overhead would vanish if the process wasn't being `ptrace()`d by strace.
I wonder what the `futex`es are though; there aren't even that many of them (2800 per second).
- `--udp -b 2G -P5`
- `--udp -b 1G -P10`
- `--udp -b 5G -P2`

All around 1 Gbit/s. Via TCP I get 4 Gbit/s.
@nh2 just curious, for the encryption method in your config, are you using AES?
@sfxworks Yes, AES.
@HenkVanMaanen is using CX servers (Hetzner Cloud virtual servers), I'm using SX servers (dedicated bare-metal). This might explain why I can get up to 10 Gbit/s outside of Nebula.
The content of this comment is the most telling for me https://github.com/slackhq/nebula/issues/637#issuecomment-1086643211
When you are testing your underlay network with multiple flows directly (5 in that run) you see maximum throughput of about 9.5Gbit/s, a single flow gets about 4Gbit/s. When you run with nebula you see nearly the same throughput as the single flow underlay network test at 3.5 Gbit/s.
Nebula will (currently) only be 1 flow on the underlay network between two hosts. The throughput limitation is likely to be anything between and/or including the two NICs in the network since it looks like you have already ruled out cpu on the host directly.
The folks at Slack have run into similar situations with AWS and this PR may be of interest to you https://github.com/slackhq/nebula/pull/768
https://github.com/slackhq/nebula/issues/637#issuecomment-1086671441
I do not see the output for `ss -numpile`, but I do see the output for the system-wide drop counters. It looks like you are doing a number of performance tests using UDP on the overlay, and it is very possible the `nuttcp` or `iperf3` UDP buffers are overflowing while the `nebula` buffers are not.
`ss -numpile` will output the kernel `skmem` struct per socket for all sockets on the system. I usually do `sudo ss -numpile | grep -A1 nebula` to ensure I am only looking at nebula sockets when tuning (`-A1` is assuming you are configured to run with a single routine).
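For reference, the `skmem` fields in that output are laid out as follows (per the `ss` man page):

```
skmem:(r<rmem_alloc>,rb<rcv_buf>,t<wmem_alloc>,tb<snd_buf>,f<fwd_alloc>,
       w<wmem_queued>,o<opt_mem>,bl<back_log>,d<sock_drop>)
```

`d<sock_drop>` is the per-socket drop counter referred to above, and `rb<rcv_buf>` should reflect the configured `listen.read_buffer`.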
Closing this for inactivity. Please see also the discussion at https://github.com/slackhq/nebula/discussions/911
Reopened by request from @nh2.
An update from my side:
I have tried for a long time now, and failed to get 10 Gbit/s speed out of Nebula in any setting I tried.
If anybody has a reproducible setup where this works, it would be great to post it. (I saw the linked https://github.com/slackhq/nebula/discussions/911, but there I can also only find claims like "Nebula is used to do many gigabits per second in production on hundreds of thousands of hosts", not basic evidence such as "here's how I set up these 2 servers with Nebula, look at my iperf showing 10 Gbit/s".)
In other words: Instead of finding out why 10 Gbit/s doesn't work in this case, it seems better to first find anybody for whom 10 Gbit/s throughput reliably works.
I also observed that when putting a big data pusher such as Ceph inside Nebula, it would make Nebula cap out at 1-2 Gbit/s and 100% CPU, and Nebula would start dropping packets. As a result, important small-data services inside Nebula would also get their packets dropped, for example Consul consensus. This would then destabilise my entire cluster.
My only solution so far was to remove big data pushers such as Ceph from Nebula, defeating the point of running everything inside the VPN.
Overall the "many gigabits per second" relates to exactly what @nbrownus mentions above. This cited number is in aggregate.
At Slack, we didn't encounter workloads that try to push 10 Gbit/s over a single-path host-to-host tunnel with a small-ish MTU. Nebula allows you to configure MTUs for different network segments, and Slack uses this internally across production. I do understand that in your case, Hetzner does not allow a higher MTU, which contributes to this bottleneck.
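For reference, per-segment MTU overrides are configured under `tun.routes`, roughly like this (a sketch following the example config; values are illustrative):

```yaml
tun:
  mtu: 1300          # default MTU for destinations not matched below
  routes:
    # larger MTU for a segment known to support jumbo frames
    - mtu: 8800
      route: 10.0.0.0/16
```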
More broadly, Nebula's default division of work is per-tunnel. If you have 4+ hosts talking to a single host over Nebula, and you turn on multi-routine processing, Nebula will quickly match the maximum line rate of a single 10 Gbit interface.
In the case of Ceph, are you often sending many gbit/s between individual hosts?
We are certainly open to enhancing this if more people ask for a bump when using individual tunnels with small MTUs. We will also be sharing our research here in a future blog post for people to validate, which will include tips for optimizing performance.
Hi @nh2 - We've identified a bug in Nebula, beginning with v1.6.0 (released June 2022), where Nebula nodes configured with a listen port of `0` (random) would not properly utilize multiple routines when the `routines` config option was configured.
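For illustration, a sketch of an affected configuration (not anyone's actual config):

```yaml
# affected by the v1.6.0+ bug: random listen port plus multiple routines
listen:
  host: 0.0.0.0
  port: 0        # 0 = random port
routines: 2      # these extra routines were not properly utilized

# a fixed port (e.g. 4242) together with routines > 1 is not affected
```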
I understand that you opened this issue in February 2022, prior to the bug, but have continued debugging since v1.6.0. Given that this is the case, I will humbly request that you re-test your configuration.
Additionally, in December 2022, prior to closing this issue, @nbrownus asked you to run a few commands to collect some extra debugging information. We believe that the output of `ss -numpile` would've identified the recently-fixed bug, had you been affected by it. Is it possible to please collect that debug information now?
Thank you!
nodes configured with a listen port of `0` (random) would not properly utilize multiple routines
@johnmaguire Thanks! I'm using a fixed listen port of `4242` for all my nodes.
`ss -numpile` shows:
UNCONN 0 0 [::ffff:0.0.0.0]:4242 *:* users:(("nebula",pid=2102,fd=7)) uid:991 ino:13155 sk:b cgroup:/system.slice/system-nebula.slice/nebula@servers.service v6only:0 <->
Hi @nh2, I just wanted to make note of the blog post we recently published about performance here: https://www.defined.net/blog/nebula-is-not-the-fastest-mesh-vpn/
I hope that answers some of your questions here, and I'm happy to clarify any of the points. I'll close this issue in a week, unless there is something further to discuss that isn't covered there. Thanks!
@rawdigits The blog post looks great and is very useful.
But I believe it is still about aggregate throughput, whereas my issue report is about the point-to-point connection between single hosts.
I can get 10 Gbit/s between 2 Hetzner servers via WireGuard and via iperf3 UDP (5 Gbit/s with single flow, full 10 Gbit/s with multiple flows, as mentioned in https://github.com/slackhq/nebula/issues/637#issuecomment-1086643211).
But I cannot get this with Nebula.
In the case of Ceph, are you often sending many gbit/s between individual hosts?
Yes, that is the standard workflow. When you write a file to CephFS, the client that does the `write()` syscall sends the data to one of the Ceph servers, which then distributes the write to the replicas before the `write()` returns.
So for example, you write a 10 GB file. With Ceph-on-Nebula it takes ~100 seconds (capped at ~1 Gbit/s); with Ceph outside of the VPN it takes ~10 seconds (at close to the 10 Gbit/s line rate).
This factor makes a big difference for what workloads/apps you can handle.
A tangentially related issue is that in my tests, Nebula starts dropping packets when large transfer rates occur.
Concretely, when I had both Ceph and Consul (the consensus server) running on Nebula, and Ceph did some large transfer, Nebula would drop packets, including those of Consul. This caused instability (consensus being lost). The issue disappears when running the same over a normal link instead of Nebula, apparently even when the normal link is 1 Gbit/s instead of 10 Gbit/s. My guess is that Nebula gets CPU-bottlenecked, leading to UDP packet loss that would not happen the same way on a real link.
But I still don't fully understand why that causes such big instabilities: Both Ceph and Consul use TCP, so theoretically a CPU-bottlenecked Nebula on a 10 Gbit/s interface should not lose more Consul-related packets than a physical 1 Gbit/s interface; but somehow it does.
I think we should probably rename the issue to make clear it's about point-to-point performance, not aggregate.
I understand the blog post says
If you are using a mesh VPN, you probably have more than two hosts communicating at any given time. Honestly, if you only care about a point-to-point connection, use whatever you like. Wireguard is great. IPsec exists. OpenVPN isn't even that bad these days.
but there are still good reasons to use Nebula even when point-to-point is the main use case:
@nh2 what is the upper limit you're able to achieve using Nebula? Also would it be possible for you to share your tweaks to default config values?
I'm facing a similar issue, but cannot saturate even a 1 Gbps link (`iperf` shows something like 500-550 Mbps in TCP mode), though I'm certainly running it under worse conditions, since I'm running everything in VMs.
@rawdigits I did read the blog, and I do understand the limitations, but I was hoping (looking at the "performance per core" graphs) that Nebula would be able to give me 1 Gbps. I can get up to 5 Gbps in multi-threaded `iperf` (with `-P 5`) without a VPN, and I can reach 1 Gbps with single-threaded `iperf` and when using Tailscale as the VPN.
Also, when I run `iperf` in UDP mode with bandwidth limited to 1G, its server reports unusually high packet loss (~36%), which, if you subtract it from the 1G, would again yield about 500-600 Mbps, as reported in the TCP case.
When I do this, `ss` reports no packet drops at all, though (the first socket is for the Lighthouse container, and the second and third are for the two routines of my main receiver container):
$ sudo ss -numpile|grep nebula -A1
UNCONN 0 0 [::ffff:0.0.0.0]:4242 *:* users:(("nebula",pid=314026,fd=7)) ino:1167283 sk:1004 cgroup:/docker/7bc62b5a8601c9d6e2129fd456bd462687e3794a15f3c3adf601c443720915a8 v6only:0 <->
skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d0)
UNCONN 0 0 [::ffff:0.0.0.0]:4243 *:* users:(("nebula",pid=314102,fd=7)) ino:1168468 sk:1005 cgroup:/docker/7dfa00a7fbe11ab75c283dcfa394e4ede11f5aad9a32c5f0fcd5990e88ba348f v6only:0 <->
skmem:(r0,rb209715200,t0,tb209715200,f4096,w0,o0,bl0,d0)
UNCONN 0 0 [::ffff:0.0.0.0]:4243 *:* users:(("nebula",pid=314102,fd=8)) ino:1168469 sk:1006 cgroup:/docker/7dfa00a7fbe11ab75c283dcfa394e4ede11f5aad9a32c5f0fcd5990e88ba348f v6only:0 <->
skmem:(r0,rb209715200,t0,tb209715200,f4096,w0,o0,bl0,d0)
@nh2 what is the upper limit you're able to achieve using Nebula?
@vnlitvinov
- `iperf3 -c` on dedicated servers, 10 Gbit/s link, 0.3 ms ping: 9.35 Gbits/sec
- the same over Nebula: 2.32 Gbits/sec
The config I'm using in production currently has no tuning, only non-performance-relevant settings, as I have not managed to significantly boost performance with any settings:
Same here as @nh2 ... just tested today on Hetzner dedicated cloud servers ... tried tuning multiple parameters and nothing helped significantly.
Just tested Tailscale (following their getting started guide) and got basically the same results, ~2.32 Gbit; WireGuard also reports 2.4 Gbit.
I think only https://github.com/slackhq/nebula/pull/768 could improve the situation - is there anything I can do to help get it merged (even as an experimental feature), @rawdigits?
Maybe some of the ideas from https://toonk.io/sending-network-packets-in-go/ could be useful?
I think only #768 could improve the situation - is there anything I can do to help get it merged (even as an experimental feature)
@ondrej-smola I made a v1.8.2-multiport release that is just v1.8.2 with this PR merged in if you want to test with it, binaries here: https://github.com/wadey/nebula/releases/tag/v1.8.2-multiport
Hey @ondrej-smola - I was just wondering if you had a chance to test the build @wadey provided. If so, how did it go?
@wadey @johnmaguire thank you for creating the release - I am on parental leave but should be back in June
I've noticed a fairly drastic drop using Nebula over Hetzner's cloud networks.
Hetzner private network (no Nebula)
iperf -c 10.0.0.5
------------------------------------------------------------
Client connecting to 10.0.0.5, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.0.0.4 port 37204 connected with 10.0.0.5 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.01 sec 5.91 GBytes 5.07 Gbits/sec
with Nebula 1.9.3
iperf -c 10.11.3.1
------------------------------------------------------------
Client connecting to 10.11.3.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.11.3.3 port 42506 connected with 10.11.3.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.02 sec 732 MBytes 613 Mbits/sec
with the above build (1.8.2-multiport)
iperf -c 10.11.3.1
------------------------------------------------------------
Client connecting to 10.11.3.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.11.3.3 port 40704 connected with 10.11.3.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.03 sec 854 MBytes 714 Mbits/sec
Servers are both Hetzner Ampere servers (hardware AES is enabled).
I'm benchmarking Nebula with storage servers from dedicated server provider Hetzner where 10 Gbit/s links are cheap.
Unless you ask them to connect your servers by a dedicated switch, the MTU cannot be changed, so jumbo frames are not possible.
In this setup, I have not been able to achieve more than 2 Gbit/s with `iperf3` over Nebula, no matter how I tune `read_buffer`/`write_buffer`/`batch`/`routines`.
In https://theorangeone.net/posts/nebula-intro/ it was said
and on https://i.reddit.com/r/networking/comments/iksyuu/overlay_network_mesh_options_nebula_wireguard/
but that's evidently not the case for me.
Did all those setups use jumbo frames?
Is there anything that can be done to achieve 10 Gbit/s throughput without jumbo frames?