Slow/stuck transfer files in nebula network between America/Europe/Asia nodes.

asyslinux commented 2 years ago

Hello, I can't figure out what's wrong and why the transfer rate drops significantly, to KB/s, and almost freezes. I tried setting a lower MTU on two nodes (but not on all network nodes), but this didn't help. I tried disabling the Europe lighthouse nodes across the whole network, but the result was the same. I had this problem on Nebula 1.5.2 too. Can anyone advise or check? Thank you.

Asia / USA through Nebula:

root@sg:/dev# rsync -av --progress america.vpn.ip:/tmp/50M.file /tmp/
root@america.vpn.ip's password:
receiving incremental file list
50M.file
      7,634,944  14%   67.03kB/s    0:11:08
      1,179,648   2%   71.06kB/s    0:12:01
      1,277,952   2%   58.15kB/s    0:14:39
      1,310,720   2%    9.79kB/s    1:27:00
      1,966,080   3%   45.57kB/s    0:18:27
      1,998,848   3%   39.42kB/s    0:21:19
      2,064,384   3%   46.14kB/s    0:18:11
      2,097,152   4%   49.54kB/s    0:16:56

Asia / USA through Internet:

root@sg:/dev# rsync -av --progress america.real.ip:/tmp/50M.file /tmp/
root@america.real.ip's password:
receiving incremental file list
50M.file
     31,817,728  60%    6.58MB/s    0:00:03

I have the same problem with file transfers from any continent to any continent, in any direction.

I have 14 lighthouse nodes: 4 in Europe, 10 in America

Lighthouse configuration:

static_host_map:
  "10.10.0.1": ["1.2.3.1:12345"] #Europe
  "10.10.0.2": ["1.2.3.2:12345"] #Europe
  "10.10.0.3": ["1.2.3.3:12345"] #Europe
  "10.10.0.4": ["1.2.3.4:12345"] #Europe
  "10.10.0.5": ["1.2.3.5:12345"] #America
  "10.10.0.6": ["1.2.3.6:12345"] #America
  "10.10.0.7": ["1.2.3.7:12345"] #America
  "10.10.0.8": ["1.2.3.8:12345"] #America
  "10.10.0.9": ["1.2.3.9:12345"] #America
  "10.10.0.10": ["1.2.3.10:12345"] #America
  "10.10.0.11": ["1.2.3.11:12345"] #America
  "10.10.0.12": ["1.2.3.12:12345"] #America
  "10.10.0.13": ["1.2.3.13:12345"] #America
  "10.10.0.14": ["1.2.3.14:12345"] #America
lighthouse:
  am_lighthouse: true
  interval: 30
  hosts:
listen:
  host: 0.0.0.0
  port: 12345
punchy:
  punch: true
relay:
  am_relay: true
  use_relays: false
tun:
  disabled: false
  dev: n2n0
  drop_local_broadcast: true
  drop_multicast: true
  tx_queue: 500
  mtu: 1290
  routes:
  unsafe_routes: #Additionally i have some unsafe routes
    - route: 192.168.24.0/24
      via: 10.10.0.79
      mtu: 1290
      metric: 100
    - route: 192.168.32.0/24
      via: 10.10.0.87
      mtu: 1290
      metric: 100
logging:
  level: warning
  format: text
firewall:
  conntrack:
    tcp_timeout: 15m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      group: any

Other nodes' configuration:

static_host_map:
  "10.10.0.1": ["1.2.3.1:12345"] #Europe
  "10.10.0.2": ["1.2.3.2:12345"] #Europe
  "10.10.0.3": ["1.2.3.3:12345"] #Europe
  "10.10.0.4": ["1.2.3.4:12345"] #Europe
  "10.10.0.5": ["1.2.3.5:12345"] #America
  "10.10.0.6": ["1.2.3.6:12345"] #America
  "10.10.0.7": ["1.2.3.7:12345"] #America
  "10.10.0.8": ["1.2.3.8:12345"] #America
  "10.10.0.9": ["1.2.3.9:12345"] #America
  "10.10.0.10": ["1.2.3.10:12345"] #America
  "10.10.0.11": ["1.2.3.11:12345"] #America
  "10.10.0.12": ["1.2.3.12:12345"] #America
  "10.10.0.13": ["1.2.3.13:12345"] #America
  "10.10.0.14": ["1.2.3.14:12345"] #America
lighthouse:
  am_lighthouse: false
  interval: 30
  hosts:
    - "10.10.0.1"
    - "10.10.0.2"
    - "10.10.0.3"
    - "10.10.0.4"
    - "10.10.0.5"
    - "10.10.0.6"
    - "10.10.0.7"
    - "10.10.0.8"
    - "10.10.0.9"
    - "10.10.0.10"
    - "10.10.0.11"
    - "10.10.0.12"
    - "10.10.0.13"
    - "10.10.0.14"
listen:
  host: 0.0.0.0
  port: 12345
punchy:
  punch: true
relay:
  relays:
    - "10.10.0.1"
    - "10.10.0.2"
    - "10.10.0.3"
    - "10.10.0.4"
    - "10.10.0.5"
    - "10.10.0.6"
    - "10.10.0.7"
    - "10.10.0.8"
    - "10.10.0.9"
    - "10.10.0.10"
    - "10.10.0.11"
    - "10.10.0.12"
    - "10.10.0.13"
    - "10.10.0.14"
  am_relay: false
  use_relays: true
tun:
  disabled: false
  dev: n2n0
  drop_local_broadcast: true
  drop_multicast: true
  tx_queue: 500
  mtu: 1290
  routes:
  unsafe_routes: #Additionally i have some unsafe routes
    - route: 192.168.24.0/24
      via: 10.10.0.79
      mtu: 1290
      metric: 100
    - route: 192.168.32.0/24
      via: 10.10.0.87
      mtu: 1290
      metric: 100
logging:
  level: warning
  format: text
firewall:
  conntrack:
    tcp_timeout: 15m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      group: any

brad-defined commented 2 years ago

That's a lot of lighthouses. Why does your network have so many?

A few quick thoughts that might help:

(1) In your host's relay.relays section, only list relays that are close to that host, in terms of ping time. Meaning, I expect European hosts would only list the European relays, and American hosts would only list American relays. Your American relays could even be further segmented, if they're in different geographic regions - so American east-coast hosts would only list relays on the east coast, and vice versa for west-coast hosts. I expect those geographic realities to result in lower latency, and therefore faster ping times.

(2) In each host's config, specify

listen:
  read_buffer: 10485760
  write_buffer: 10485760

(these values come out of the commented-out values in the example Nebula config file here: https://github.com/slackhq/nebula/blob/master/examples/config.yml#L106)

If you hop into the OSS Nebula slack channel, you can get support there, too.
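
To illustrate point (1), here is a minimal sketch of the relay section for a host located in Europe. It assumes the #Europe comments in the static_host_map above reflect where 10.10.0.1 through 10.10.0.4 actually sit; which relays really are "close" would need to be confirmed with ping times. American hosts would list only the American relays in the same way.

```yaml
# Sketch for a non-lighthouse host located in Europe.
# Assumption: 10.10.0.1-10.10.0.4 are the European relays, as suggested by
# the "#Europe" comments in the posted static_host_map.
relay:
  relays:
    - "10.10.0.1"
    - "10.10.0.2"
    - "10.10.0.3"
    - "10.10.0.4"
  am_relay: false
  use_relays: true
```

Only the relays list changes compared with the posted config; the other keys are kept as they were.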

asyslinux commented 2 years ago

Hello, thanks for the reply.

I uncommented the read/write buffers on all network hosts, but this didn't help. The result is the same: the file transfer sometimes starts fast, then gets stuck and continues at 56k modem speed, then can sometimes speed up again.

I also tried setting routines: 8. This didn't help.

I tried leaving only the 4 lighthouse nodes in the USA on all network hosts; this did not help. And on the previous version of Nebula, 1.5.2, which had no relays, the result was the same.

Is there any way to find out what could be the problem? Maybe change the mtu from 1290 to 1127, put it even lower? Or increase tx_queue from 500 to 3000?

What I know for sure is that the internet is fast between hosts from Asia and the US or Europe and the US.

Thanks.


50M.file
        950,272   1%  791.81kB/s    0:01:05
        983,040   1%  133.50kB/s    0:06:25
      1,015,808   1%   73.94kB/s    0:11:35
      1,048,576   2%   52.77kB/s    0:16:13
      1,081,344   2%    6.10kB/s    2:20:24
      1,343,488   2%   17.97kB/s    0:47:23
      1,376,256   2%   18.18kB/s    0:46:48
      3,080,192   5%  104.83kB/s    0:07:50
      3,309,568   6%  104.76kB/s    0:07:48
      3,342,336   6%   87.21kB/s    0:09:22
      3,375,104   6%   86.32kB/s    0:09:28
      3,407,872   6%   13.74kB/s    0:59:26
      3,440,640   6%    5.14kB/s    2:38:47
      3,473,408   6%    5.14kB/s    2:38:42
      3,506,176   6%    5.14kB/s    2:38:34
      3,538,944   6%    5.14kB/s    2:38:29
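
On the question above of how to find out what the problem could be: the posted configs log only at warning level. A minimal sketch, assuming more verbose logs would be useful here (what they would actually reveal is not known), is to raise the level temporarily on the two hosts involved in a slow transfer and then repeat the rsync test.

```yaml
# Sketch: temporarily raise log verbosity on the two hosts being tested.
# "debug" is noisy; revert to "warning" once the test is done.
logging:
  level: debug
  format: text
```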

asyslinux commented 2 years ago

Additionally, I attach my sysctl.conf (the same on most servers in the network). Maybe something in it interferes with the normal operation of the tunnels? Although there are no such problems between servers where ping is good, so I'm not sure if something is interfering.


#IP Forward

net/ipv4/ip_forward=1

#High Load Systems

net/ipv4/tcp_tw_reuse=1

#Disable ipv6

net/ipv6/conf/all/disable_ipv6=1
net/ipv6/conf/default/disable_ipv6=1
net/ipv6/conf/lo/disable_ipv6=1

#Max Concurrent Connections

net/core/somaxconn=262144

#Disable Accept Source Routing

net/ipv4/conf/all/accept_source_route=0

#Disable Accept Redirects

net/ipv4/conf/all/accept_redirects=0

#Enable Anti Spoofing

net/ipv4/conf/all/rp_filter=1

#Enable Ignore Broadcast Packets

net/ipv4/icmp_echo_ignore_broadcasts=1

#Enable Logging Bad Error Message Protection

net/ipv4/icmp_ignore_bogus_error_responses=1

#Disable Logging Spoofed Packets, Source Routed Packets, Redirect Packets

net/ipv4/conf/all/log_martians=0

#Optimal Network Parameters

net/ipv4/tcp_congestion_control=yeah

net/core/netdev_max_backlog=262144

net/ipv4/tcp_no_metrics_save=1
net/ipv4/tcp_low_latency=1
net/ipv4/tcp_max_syn_backlog=262144
net/ipv4/tcp_mtu_probing=1

net/core/optmem_max=67108864

net/core/rmem_default=212992
net/core/wmem_default=212992

net/core/rmem_max=67108864
net/core/wmem_max=67108864

net/ipv4/tcp_rmem=4096 87380 33554432
net/ipv4/tcp_wmem=4096 65536 33554432

#Decrease TCP FIN TimeOut

net/ipv4/tcp_fin_timeout=3

#Decrease TCP KeepAlive Connections Interval

net/ipv4/tcp_keepalive_time=300

#Decrease TCP KeepAlive Probes

net/ipv4/tcp_keepalive_probes=3

#Disable SACK

net/ipv4/tcp_sack=0

#Time Orphan Retries

net/ipv4/tcp_orphan_retries=1

#Swap On 10% of Memory

vm/swappiness=10

#Core Pids

kernel/core_uses_pid=1

#Increase Inotify Settings

fs/inotify/max_user_watches=524288
fs/inotify/max_queued_events=65536

#Virtual Memory Settings

vm/overcommit_memory=1
vm/max_map_count=262144

#Auto-Reboot on Kernel Panic

kernel/panic=60

#Auto-Log on Kernel Panic

kernel/panic_on_oops=1

johnmaguire commented 6 months ago

Hi @asyslinux - I realize this ticket is a bit stale, but I wanted to know if you made any progress in solving your issues.

One thing that was pointed out earlier in the thread is that relays can certainly act as a bottleneck, and you have quite a few configured in your host's configuration. Have you verified whether this issue exists when relays are taken out of the equation?
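
One way to run that check, sketched here under the assumption that the test hosts use the non-lighthouse configuration posted earlier in the thread: turn off relay use on both endpoints of a slow transfer, restart Nebula, and repeat the rsync test. Only keys already present in the posted config are used.

```yaml
# Sketch: disable relay use on a regular (non-lighthouse) host for a test.
# With use_relays set to false the host should not attempt connections
# through relays, so a transfer that still stalls would point away from them.
relay:
  am_relay: false
  use_relays: false
```

Applying this to both ends of the transfer matters, since either side could otherwise still try to reach the other through a relay.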

asyslinux commented 6 months ago

Hi @johnmaguire - I no longer use Nebula. The problem was never solved. My infrastructure has no bottleneck: over a direct connection, all files transfer without getting stuck.

You can close this issue; it is in the past. I do not know whether this was only my problem or not.

slackhq / nebula

Slow/stuck transfer files in nebula network between America/Europe/Asia nodes. #723