Unable to achieve 10 Gbit/s throughput on Hetzner server

nh2 commented 2 years ago

I'm benchmarking Nebula with storage servers from dedicated server provider Hetzner where 10 Gbit/s links are cheap.

Unless you ask them to connect your servers by a dedicated switch, the MTU cannot be changed, so jumbo frames are not possible.

In this setup, I have not been able to achieve more than 2 Gbit/s with iperf3 over Nebula, no matter how I tune read_buffer/write_buffer/batch/routines.

In https://theorangeone.net/posts/nebula-intro/ it was said

Slack have seen Nebula networks fully saturate 5 / 10 gigabit links without sweat

and on https://i.reddit.com/r/networking/comments/iksyuu/overlay_network_mesh_options_nebula_wireguard/

Slack regularly does many gigabits per second over nebula on individual hosts.

but that's evidently not the case for me.

Did all those setups use jumbo frames?

Is there anything that can be done to achieve 10 Gbit/s throughput without jumbo frames?

rawdigits commented 2 years ago

I'm guessing this is cpu bottlenecked. What is the output of cat /proc/cpuinfo ?

nh2 commented 2 years ago

The servers are SX133 and SX134 servers from Hetzner (linked in issue description).

They have Xeon W-2145 CPU and Ryzen 7 3700X respectively.

Click to expand `/proc/cpuinfo` details:

``` processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 85 model name : Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz stepping : 4 microcode : 0x2006906 cpu MHz : 1200.027 cache size : 11264 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 22 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16crdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit bogomips : 7399.70 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: ``` ``` processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor stepping : 0 microcode : 0x8701021 cpu MHz : 2251.193 cache size : 512 KB physical id : 0 siblings : 16 core id : 0 cpu cores : 8 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 16 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass bogomips : 7186.10 TLB size : 3072 4K pages clflush size : 64 cache_alignment : 64 address sizes : 43 bits physical, 48 bits virtual power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14] ```

nh2 commented 2 years ago

nebula CPU usage in htop is around 150% when the iperf3 is running; is that expected, or should it go higher?

The machines have 4 and 8 physical cores respectively.

nh2 commented 2 years ago

Hey, just a short followup on whether anything can be done to achieve proper 10 Gbit/s throughput, or how to investigate when it doesn't happen.

HenkVanMaanen commented 2 years ago

Had the same problem with the Hetzner CX cloud servers. Without nebula iperf3 would report around 7 Gbit/s between two servers. With nebula it wouldn't go above 1 Gbit/s. I think it has something to do with nebula being tcp over udp traffic and that udp traffic on Hetzner is either rate limited or the routers can't handle the udp traffic. Tcp over tcp would be the solution IMO, but nebula does not support that at the moment.

HenkVanMaanen commented 2 years ago

Here's the link to the thread on the NebulaOSS slack channel: https://nebulaoss.slack.com/archives/CS01XE0KZ/p1619532900073100

nh2 commented 2 years ago

that udp traffic on Hetzner is either rate limited or the routers can't handle the udp traffic

@HenkVanMaanen I cannot confirm that.

What speed does iperf3 show in UDP mode between your Hetzner servers?

For me it's as fast as TCP mode between 2 dedicated 10 Gbit/s servers (Hetzner SX133 and SX134):

TCP via iperf3 -c otherserver: 9.41 Gbits/sec on on the other side

UDP single-flow via iperf3 -c otherserver --udp -b 10G:

[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  5]   0.00-10.00  sec  4.76 GBytes  4.09 Gbits/sec  0.000 ms  0/3529420 (0%)  sender
[  5]   0.00-10.00  sec  4.76 GBytes  4.09 Gbits/sec  0.001 ms  96/3529419 (0.0027%)  receiver

UDP multi-flow (5 parallel flows, each limited to 2 Gbit/s) via iperf3 -c otherserver --udp -b 2G -P5 (the idea to need multiple flows comes from this AWS article on 100 Gbit/s networking):

[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[SUM]   0.00-10.00  sec  11.1 GBytes  9.56 Gbits/sec  0.000 ms  0/8254495 (0%)  sender
[SUM]   0.00-10.00  sec  10.9 GBytes  9.35 Gbits/sec  0.004 ms  184417/8254338 (2.2%)  receiver

So 10 Gbit/s works on UDP between these 2 machines on the Hetzner network.

Nebula-based iperf3 tops out on ~3.5 Gbit/s between the same machines, no matter if via TCP or UDP (no matter the flows).

I also re-measured with nuttcp to confirm I'm not hitting iperf3-specific limitations, using e.g. for UDP:

nuttcp -S -P5200 -p5201  # this backgrounds itself
nuttcp -P5200 -p5201 -u -R9g -w2m otherserver

nh2 commented 2 years ago

Here's the link to the thread on the NebulaOSS slack channel: https://nebulaoss.slack.com/archives/CS01XE0KZ/p1619532900073100

Replying to some more topics I read on that thread:

The fact that your cpu graphs in the first bit don't show a single core maxed out is unexpected. It means something is holding nebula back from running at full clip.

The same appears for me: Nebula uses only ~125% CPU usage which is evenly spread out across cores, not peaking out a single core. htop screenshot:

Screenshot from 2022-04-02 15-45-56

Interface MTUs:

internet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500

nebula: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1300

There is only 1 physical link active on my servers, so confusing different links is impossible.

Ensure you aren't dropping packets at tun. This shows up in ifconfig with dropped <number> under the nebula1 interface. Raise tun.tx_queue until drops are not increasing.

I did see some packet drops at tun. I changed tun.tx_queue from the default (500) to 2000 and the tun drops disappeared. But this did not improve the throughput.

Ensure you aren't seeing packets dropped by the udp buffers (listen.read_buffer and listen.write_buffer should be increased until ss -numpile shows no dropped packets, it's the last field, d<number>). Generally the read buffer is the problem.

I **do see drops in ss -numpile on the receiving side.

Increasing listen.*_buffer didn't help though; I tried values between 10 MiB and 640 MiB, and I continue to see e.g. d403244 increasing in watch -n1 ss -numpile.

I tried with the default listen.batch setting, and setting it to 256.

*Is it possible to verify that the `listen._buffer` settings are really in effect?**

Similarly, in netstat -suna, these fields keep increasing during the iperf transmission over Nebula:

    35188361 packet receive errors
    35188361 receive buffer errors

Following this post I used dropwatch to get details of the drops as they happen. Output:

# dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at unix_stream_connect+800 (0xffffffff9b586bc0) [software]
67498 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
3 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
5 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
3 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
2 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
69804 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
1 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
2 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
3 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
68186 drops at udp_queue_rcv_one_skb+396 (0xffffffff9b537f86) [software]
... more of that ...

nh2 commented 2 years ago

Increasing listen.*_buffer didn't help though; I tried values between 10 MiB and 640 MiB, and I continue to see e.g. d403244 increasing in watch -n1 ss -numpile.

I found that changing the sysctl net.core.rmem_default from its default 212992 to 100x that value (21299200) gets rid of all those drops (in ss and netstat -suna), and dropwatch now looks like:

# dropwatch -l kas                        
Initializing kallsyms db
dropwatch> 
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
4 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]
1 drops at unix_stream_connect+800 (0xffffffff9b586bc0) [software]
5 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at sk_stream_kill_queues+50 (0xffffffff9b458bf0) [software]
1 drops at nf_hook_slow+8f (0xffffffff9b4f1e0f) [software]

The fact that I had to set net.core.rmem_default suggests to me that Nebula's own buffer adjustments (listen.read_buffer) aren't working, as suspected above.

But even with all drops being fixed the throughput of Nebula does not improve.

HenkVanMaanen commented 2 years ago

These are the results between two CX servers, direct tunnel:

iperf3 -c otherserver -b 10G -b 5G -P2 = 5 Gbit/s

iperf3 -c otherserver --udp -b 10G -b 5G -P2 = 1 Gbit/s

nh2 commented 2 years ago

-b 10G -b 5G -P2

@HenkVanMaanen You're giving -b twice -- I realise that this is because I typoed that in my summary above, and I also swapped my values of -b and -P (which I just fixed), sorry for that.

My run was with -b 2G -P5. Could you try with that, just for completeness (perhaps also with smaller values of -b, e.g. `-b 1G -P10)?

nh2 commented 2 years ago

Some more info:

In the thread view in htop, I can see that there are generally 2 threads that use CPU:

on the sender side: 60% and 40%
on the receiver side: 98% and 60%

Interestingly, if I taskset -c 1 on both sides to pin nebula onto a single core, we get on the receiver: 60% and 10%.

So now in sum it takes less than 100%.

Then, for the time the process is spending, I checked in htop that the fractions are 25% user, 75% sys.

On the receiver side, timeout 10 strace -fyp "$(pidof nebula)" -c gives (strace started while iperf3 is transmitting over the single-threaded Nebula):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 54.81   12.410432         439     28210        30 futex
 20.97    4.747149          38    122748           nanosleep
  9.42    2.132681          65     32333           recvmmsg
  7.19    1.628890           3    485574           write
  3.33    0.754306           3    242637           sendto
  2.25    0.510448           2    242631           read
  2.02    0.457547         490       932           epoll_pwait
  0.00    0.000162           3        43           sched_yield
  0.00    0.000088           1        56           getsockname
  0.00    0.000067           3        18           recvfrom
  0.00    0.000042          42         1           restart_syscall
  0.00    0.000020           3         6           socket
  0.00    0.000013           2         6           close
  0.00    0.000013           2         6           bind
  0.00    0.000011           1         6           getsockopt
  0.00    0.000009           9         1           tgkill
  0.00    0.000006           6         1           getpid
  0.00    0.000003           3         1           rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00   22.641887          19   1155210        30 total

Not sure how accurate that is, as the throughput over the single-threaded one breaks down from 1.4 Gbit/s to 0.44 Gbit/s when strace is active.

futex here is shown as the main bottleneck, but it may be that futex would vanish if it wasn't ptrace()d by strace.

I wonder what the futexes are though; there aren't even that many of them (2800 per second).

HenkVanMaanen commented 2 years ago

--udp -b 2G -P5
--udp -b 1G -P10
--udp -b 5G -P2

All around 1 Gbit/s

Via TCP I get 4 Gbit/s

sfxworks commented 2 years ago

@nh2 just curious, for the encryption method in your config are you using aes?

nh2 commented 2 years ago

@sfxworks Yes, AES.

nh2 commented 2 years ago

@HenkVanMaanen is using CX servers (Hetzner Cloud virtual servers), I'm using SX servers (dedicated bare-metal). This might explain why I can get up to 10 Gbit/s outside of Nebula.

nbrownus commented 1 year ago

The content of this comment is the most telling for me https://github.com/slackhq/nebula/issues/637#issuecomment-1086643211

When you are testing your underlay network with multiple flows directly (5 in that run) you see maximum throughput of about 9.5Gbit/s, a single flow gets about 4Gbit/s. When you run with nebula you see nearly the same throughput as the single flow underlay network test at 3.5 Gbit/s.

Nebula will (currently) only be 1 flow on the underlay network between two hosts. The throughput limitation is likely to be anything between and/or including the two NICs in the network since it looks like you have already ruled out cpu on the host directly.

The folks at Slack have run into similar situations with AWS and this PR may be of interest to you https://github.com/slackhq/nebula/pull/768

https://github.com/slackhq/nebula/issues/637#issuecomment-1086671441

I do not see the output for ss -numpile but I do see the output for the system wide drop counters. It looks like you are doing a number of performance tests using UDP on the overlay and it is very possible the nettcp or iperf3 udp buffers are overflowing while nebula buffers are not.

ss -numpile will output the kernel skmem struct per socket for all sockets on the system. I usually do sudo ss -numpile | grep -A1 nebula to ensure I am only looking at nebula sockets when tuning (-A1 is assuming you are configured to run with a single routine).

johnmaguire commented 1 year ago

Closing this for inactivity. Please see also the discussion at https://github.com/slackhq/nebula/discussions/911

johnmaguire commented 1 year ago

Reopened by request from @nh2.

nh2 commented 1 year ago

An update from my side:

I have tried for a long time now, and failed to get 10 Gbit/s speed out of Nebula in any setting I tried.

If anybody has a reproducible setup where this works, it would be great to post it (I saw the linked https://github.com/slackhq/nebula/discussions/911 but in there I can also only find claims like "Nebula is used to do many gigabits per second in production on hundreds of thousands of hosts", but not basic evidence such as "here's how I set up these 2 servers with Nebula, look at my iperf showing 10 Gbit/s).

In other words: Instead of finding out why 10 Gbit/s doesn't work in this case, it seems better to first find anybody for whom 10 Gbit/s throughput reliably works.

I also observed that when putting a big data pusher such as Ceph inside Nebula, it would make Nebula cap out at 1-2 GBit/s and 100% CPU, and Nebula would start dropping packets. As a result, important small-data services inside Nebula would also get their packets dropped; for example Consul consensus. This would then destabilise my entire cluster.

My only solution so far was to remove big data pushers such as Ceph from Nebula, defeating the point of running everything inside the VPN.

rawdigits commented 1 year ago

Overall the "many gigabits per second" relates to exactly what @nbrownus mentions above. This cited number is in aggregate.

At Slack, we didn't encounter workloads that have single path host-to-host tunnels trying to do 10gbit/s, but with a small-ish MTU. Nebula allows you to configure MTUs for different network segments, and Slack uses this internally across production. I do understand that in your case, Hetzner does not allow a higher MTU, which contributes to this bottleneck.

More broadly, Nebula's default division of work is per-tunnel. If you have 4+ hosts talking to a single host over Nebula, and you turn on muiltiroutine processing, Nebula will quickly match the maximum line rate of a single 10gbit interface.

In the case of Ceph, are you often sending many gbit/s between individual hosts?

We are certainly open to enhancing this if more people ask for a bump when using individual tunnels with small MTUs. We will also be sharing our research here in a future blog post for people to validate, and which will have tips for optimizing performance.

johnmaguire commented 10 months ago

Hi @nh2 - We've identified a bug in Nebula, beginning with v1.6.0, released June 2022 where Nebula nodes configured with a listen port of 0 (random) would not properly utilize multiple routines when the routines config option was configured.

I understand that you opened this issue in February 2022, prior to the bug, but have continued debugging since v1.6.0. Given that this is the case, I will humbly request that you re-test your configuration.

Additionally, in December 2022, prior to closing this issue, @nbrownus asked you to run a few commands to collect some extra debugging information. We believe that the output of ss -numpile would've identified the recently-fixed bug, had you been affected by it. Is it possible to please collect that debug information now?

Thank you!

nh2 commented 10 months ago

nodes configured with a listen port of 0 (random) would not properly utilize multiple routines

@johnmaguire Thanks! I'm using a fixed listen port of 4242 for all my nodes.

ss -numpile shows:

UNCONN 0      0      [::ffff:0.0.0.0]:4242            *:*    users:(("nebula",pid=2102,fd=7)) uid:991 ino:13155 sk:b cgroup:/system.slice/system-nebula.slice/nebula@servers.service v6only:0 <->

rawdigits commented 9 months ago

Hi @nh2, I just wanted to make note of the blog post we recently published about performance here: https://www.defined.net/blog/nebula-is-not-the-fastest-mesh-vpn/

I hope that answers some of your questions here, and I'm happy to clarify any of the points. I'll close this issue in a week, unless there is something further to discuss that isn't covered there. Thanks!

nh2 commented 9 months ago

@rawdigits The blog post looks great and is very useful.

But I believe it is still about aggregrate throughput, when indeed my issue report is for the point-to-point connection between single hosts.

I can get 10 Gbit/s between 2 Hetzner servers via WireGuard and via iperf3 UDP (5 Gbit/s with single flow, full 10 Gbit/s with multiple flows, as mentioned in https://github.com/slackhq/nebula/issues/637#issuecomment-1086643211).

But I cannot get this with Nebula.

In the case of Ceph, are you often sending many gbit/s between individual hosts?

Yes, that is the standard workflow. When you write a file to CephFS, the client that does the write() syscall sends the data to one of the Ceph servers, which then distributes the write to the replicas before the write() returns.

So for example, you write a 10 GB file. With Ceph-on-Nebula it takes ~100 seconds (capped at ~1 Gbit/s), with Ceph outside of the VPN it takes ~10 secons (capped at ~1 Gbit/s).

This factor makes a big difference for what workloads/apps you can handle.

A tangentially related issue issue is that in my tests, Nebula starts dropping packets when large transfer rates occur.

Concretely, when I had both Ceph and Consul (the consensus server) running on Nebula, and Ceph would do some large transfer, Nebula would drop packets, including those of Consul. This caused instability (consensus being lost). The issue disappears when running the same over a normal link instead of Nebula, apparently even when the normal link is 1 Gbit/s instead of 10 Gbit/s. My guess is that Nebula gets CPU-bottlenecked and thus leading to UDP packet loss that would happen differently on a real link.

But I still don't fully understand why that causes such big instabilities: Both Ceph and Consul use TCP, so theoretically a CPU-bottlenecked Nebula on a 10 Gbit/s interface should not lose more Consul-related packets than physical 1 Gbit/s interface; but it somehow does.

I think we should probably rename the issue to make clear it's about point-to-point performance, not aggregate.

I understand the blog post says

If you are using a mesh VPN, you probably have more than two hosts communicating at any given time. Honestly, if you only care a point-to-point connection, use whatever you like. Wireguard is great. IPsec exists. OpenVPN isn’t even that bad these days.

but there are still good reasons to use Nebula even when point-to-point is the main use case:

It is nice to have the mesh ability for the case that some link goes down.
Some machines (e.g. developer machines) may need mesh connectivity. It is easier to run just Nebula instead of multiple VPNs.
Nebula is easier to use and maintain than e.g. plain WireGuard: Wireguard has mutable state in the kernel, which makes it more difficult to configure fully declaratively. With Nebula, it's simpler: If the Nebula process is running, the VPN is up, otherwise it's down.

vnlitvinov commented 8 months ago

@nh2 what is the upper limit you're able to achieve using Nebula? Also would it be possible for you to share your tweaks to default config values?

I'm facing similar issue, but cannot saturate even 1Gbps link (iperf shows something like 500-550 Mbps in TCP mode), though I'm certainly running it in some worse conditions, as I'm surely running the stuff in VMs.

@rawdigits I did read the blog, and I do understand the limitations, but I was hoping (looking at the "performance per core" graphs) that Nebula would be able to give me 1Gbps speed. I can get up to 5Gbps in multi-threaded iperf (with -P 5) without VPN, and I can reach 1Gbps on single-threaded iperf and when using Tailscale for VPN.

Also, when I run iperf in UDP mode with bandwidth limited to 1G, its server reports unusually high packet loss (~36%) which, if you subtract it from the 1G, would again yield something about 500-600 Mbps as reported in TCP case.

When I do this, ss reports no packet drops at all, though (first socket is for Lighthouse container, and second and third are for two routines of my main receiver container):

$ sudo ss -numpile|grep nebula -A1
UNCONN 0      0      [::ffff:0.0.0.0]:4242            *:*    users:(("nebula",pid=314026,fd=7)) ino:1167283 sk:1004 cgroup:/docker/7bc62b5a8601c9d6e2129fd456bd462687e3794a15f3c3adf601c443720915a8 v6only:0 <->
         skmem:(r0,rb212992,t0,tb212992,f4096,w0,o0,bl0,d0)      
UNCONN 0      0      [::ffff:0.0.0.0]:4243            *:*    users:(("nebula",pid=314102,fd=7)) ino:1168468 sk:1005 cgroup:/docker/7dfa00a7fbe11ab75c283dcfa394e4ede11f5aad9a32c5f0fcd5990e88ba348f v6only:0 <->
         skmem:(r0,rb209715200,t0,tb209715200,f4096,w0,o0,bl0,d0)
UNCONN 0      0      [::ffff:0.0.0.0]:4243            *:*    users:(("nebula",pid=314102,fd=8)) ino:1168469 sk:1006 cgroup:/docker/7dfa00a7fbe11ab75c283dcfa394e4ede11f5aad9a32c5f0fcd5990e88ba348f v6only:0 <->
         skmem:(r0,rb209715200,t0,tb209715200,f4096,w0,o0,bl0,d0)

nh2 commented 7 months ago

@nh2 what is the upper limit you're able to achieve using Nebula?

@vnlitvinov

Plain iperf3 -c on dedicated, 10 Gbit/s, 0.3 ms ping: 9.35 Gbits/sec
Same over Nebula 1.7.1: 2.32 Gbits/sec

The config I'm using in production currently has no tuning, only non-performance relevant settings, as I have not managed to boost the performance significantly with any settings:

``` firewall: inbound: - host: any port: any proto: any outbound: - host: any port: any proto: any lighthouse: am_lighthouse: false hosts: [] listen: host: 0.0.0.0 port: 4242 pki: ca: /nix/store/wsm806wfrrhmz2ac2gzvkzpkbkinaan0-ca.crt cert: /var/run/nebula-corp/nebula.crt key: /var/run/nebula-corp/nebula.key relay: am_relay: false relays: [] use_relays: true static_host_map: 10.1.3.1: - 192.168.0.1:4242 10.1.3.2: - 192.168.0.2:4242 10.1.3.3: - 192.168.0.3:4242 10.1.5.1: - 192.168.0.4:4242 10.1.5.2: - 192.168.0.5:4242 10.1.5.3: - 192.168.0.6:4242 10.1.6.1: - 192.168.0.7:4242 tun: dev: nebula.corp disabled: false ```

ondrej-smola commented 7 months ago

same here as @nh2 ... just tested today on Hetzner dedicated cloud servers ... tried tunning multiple parameters and nothing helped significantly

ondrej-smola commented 7 months ago

just tested tailscale (following their getting started) and got basically same results ~ 2.32Gbit, wireguard also reports 2.4Gbit

ondrej-smola commented 7 months ago

I think only https://github.com/slackhq/nebula/pull/768 could improve situation - is there anything I can do to make it merged (even as experimental feature) - @rawdigits ?

dropwhile commented 7 months ago

Maybe some of the ideas from https://toonk.io/sending-network-packets-in-go/ could be useful?

wadey commented 7 months ago

I think only #768 could improve situation - is there anything I can do to make it merged (even as experimental feature)

@ondrej-smola I made a v1.8.2-multiport release that is just v1.8.2 with this PR merged in if you want to test with it, binaries here: https://github.com/wadey/nebula/releases/tag/v1.8.2-multiport

johnmaguire commented 6 months ago

Hey @ondrej-smola - I was just wondering if you had a chance to test the build @wadey provided. If so, how did it go?

ondrej-smola commented 6 months ago

@wadey @johnmaguire thank you for creating release - I am on parental leave but should be back in June

p1u3o commented 4 months ago

I've noticed a fairly drastic drop using Nebula over Hetzners cloud networks

Hetzner private network (no Nebula)

iperf -c 10.0.0.5
------------------------------------------------------------
Client connecting to 10.0.0.5, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.0.0.4 port 37204 connected with 10.0.0.5 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.01 sec  5.91 GBytes  5.07 Gbits/sec

with 1.9.3

iperf -c 10.11.3.1
------------------------------------------------------------
Client connecting to 10.11.3.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  1] local 10.11.3.3 port 42506 connected with 10.11.3.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.02 sec   732 MBytes   613 Mbits/sec

with the above build (1.8.2-multiport)

iperf -c 10.11.3.1
------------------------------------------------------------
Client connecting to 10.11.3.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  1] local 10.11.3.3 port 40704 connected with 10.11.3.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.03 sec   854 MBytes   714 Mbits/sec

Servers are both Hetzners Ampere servers (hardware AES is enabled)

slackhq / nebula

Unable to achieve 10 Gbit/s throughput on Hetzner server #637