slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License
14.01k stars 939 forks source link

🐛 BUG: Unable to reconnect after server crash #1125

Open Cyberes opened 2 months ago

Cyberes commented 2 months ago

What version of nebula are you using? (nebula -version)

1.8.2

What operating system are you using?

Linux

Describe the Bug

One of my servers has been unstable and will sometimes become unresponsive, requiring a hard reset (this is host A, 172.0.3.109). When it comes back up, it's unable to establish a connection to my node that is a nginx reverse proxy (this is host B, 172.0.2.120). If I manually restart the Nebula service on host A, it fixes the issue and host A can connect to host B.

Logs from affected hosts

Lighthouse ``` Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:32 cloud-proxy nebula[689]: time="2024-04-22T12:13:32-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:32 cloud-proxy nebula[689]: time="2024-04-22T12:13:32-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:32 cloud-proxy nebula[689]: time="2024-04-22T12:13:32-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:32 cloud-proxy nebula[689]: time="2024-04-22T12:13:32-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:33 cloud-proxy nebula[689]: time="2024-04-22T12:13:33-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:33 cloud-proxy nebula[689]: time="2024-04-22T12:13:33-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:34 cloud-proxy nebula[689]: time="2024-04-22T12:13:34-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" Apr 22 12:13:34 cloud-proxy nebula[689]: time="2024-04-22T12:13:34-00:00" level=debug msg="Recv error sent" index=0 udpAddr=":5694" ```
Host A ``` Apr 22 12:12:25 host-a nebula[811]: time="2024-04-22T12:12:25-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:26 host-a nebula[811]: time="2024-04-22T12:12:26-00:00" level=info msg="Tunnel status" certName=host-b localIndex=804429705 remoteIndex=3094684430 tunnelCheck="map[method:active state:dead]" v> Apr 22 12:12:26 host-a nebula[811]: time="2024-04-22T12:12:26-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:26 host-a nebula[811]: time="2024-04-22T12:12:26-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:26 host-a nebula[811]: time="2024-04-22T12:12:26-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:26 host-a nebula[811]: time="2024-04-22T12:12:26-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:27 host-a nebula[811]: time="2024-04-22T12:12:27-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:27 host-a nebula[811]: time="2024-04-22T12:12:27-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:27 host-a nebula[811]: time="2024-04-22T12:12:27-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:27 host-a nebula[811]: time="2024-04-22T12:12:27-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:28 host-a nebula[811]: time="2024-04-22T12:12:28-00:00" level=info msg="sendNoMetrics failed to find HostInfo" certName=host-b error="unable to find host with relay" localIndex=2480024129 rela> Apr 22 12:12:28 host-a nebula[811]: time="2024-04-22T12:12:28-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:28 host-a nebula[811]: time="2024-04-22T12:12:28-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:28 host-a nebula[811]: time="2024-04-22T12:12:28-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:28 host-a nebula[811]: time="2024-04-22T12:12:28-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Taking new handshake" certName=host-b localIndex=2480024129 remoteIndex=108412485 vpnIp=172.0.2.120 Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message sent" certName=host-b fingerprint=...> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="sendNoMetrics failed to find HostInfo" certName=host-b error="unable to find host with relay" localIndex=817659693 relay> Apr 22 12:12:31 host-a nebula[811]: time="2024-04-22T12:12:31-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:23 host-a nebula[811]: time="2024-04-22T12:10:23-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:23 host-a nebula[811]: time="2024-04-22T12:10:23-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:23 host-a nebula[811]: time="2024-04-22T12:10:23-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:24 host-a nebula[811]: time="2024-04-22T12:10:24-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Tunnel status" certName=host-b localIndex=1362300847 remoteIndex=2069131475 tunnelCheck="map[method:active state:dead]" > Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:25 host-a nebula[811]: time="2024-04-22T12:10:25-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:26 host-a nebula[811]: time="2024-04-22T12:10:26-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:26 host-a nebula[811]: time="2024-04-22T12:10:26-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:26 host-a nebula[811]: time="2024-04-22T12:10:26-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:26 host-a nebula[811]: time="2024-04-22T12:10:26-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:27 host-a nebula[811]: time="2024-04-22T12:10:27-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:27 host-a nebula[811]: time="2024-04-22T12:10:27-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:27 host-a nebula[811]: time="2024-04-22T12:10:27-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:27 host-a nebula[811]: time="2024-04-22T12:10:27-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:28 host-a nebula[811]: time="2024-04-22T12:10:28-00:00" level=info msg="sendNoMetrics failed to find HostInfo" certName=host-b error="unable to find host with relay" localIndex=2903481744 rela> Apr 22 12:10:28 host-a nebula[811]: time="2024-04-22T12:10:28-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:28 host-a nebula[811]: time="2024-04-22T12:10:28-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:28 host-a nebula[811]: time="2024-04-22T12:10:28-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:28 host-a nebula[811]: time="2024-04-22T12:10:28-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Taking new handshake" certName=host-b localIndex=2903481744 remoteIndex=3824716788 vpnIp=172.0.2.120 Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message sent" certName=host-b fingerprint=...> Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message received" certName=host-b fingerprint=...> Apr 22 12:10:30 host-a nebula[811]: time="2024-04-22T12:10:30-00:00" level=info msg="Handshake message sent" cached=true handshake="map[stage:2 style:ix_psk0]" relay=172.0.0.2 vpnIp=172.0.2.120 ``` This pattern repeats.
Host B ``` Apr 22 12:06:25 host-b nebula[111]: time="2024-04-22T12:06:25-00:00" level=info msg="Attempt to relay through hosts" localIndex=1813799298 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=1> Apr 22 12:06:25 host-b nebula[111]: time="2024-04-22T12:06:25-00:00" level=info msg="Send handshake via relay" localIndex=1813799298 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.109 Apr 22 12:06:25 host-b nebula[111]: time="2024-04-22T12:06:25-00:00" level=info msg="Send handshake via relay" localIndex=1813799298 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.109 Apr 22 12:06:25 host-b nebula[111]: time="2024-04-22T12:06:25-00:00" level=info msg="Send handshake via relay" localIndex=1813799298 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.109 Apr 22 12:06:25 host-b nebula[111]: time="2024-04-22T12:06:25-00:00" level=info msg="Send handshake via relay" localIndex=1813799298 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.109 Apr 22 12:06:26 host-b nebula[111]: time="2024-04-22T12:06:26-00:00" level=info msg="Handshake timed out" durationNs=6152868332 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1813799298 localIndex=1813799> Apr 22 12:06:30 host-b nebula[111]: time="2024-04-22T12:06:30-00:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3495069400 localIndex=3495069400 remoteIndex=0 u> Apr 22 12:06:30 host-b nebula[111]: time="2024-04-22T12:06:30-00:00" level=info msg="Attempt to relay through hosts" localIndex=3495069400 relays="[172.0.0.2 172.0.0.3 172.0.0.2 172.0.0.3]" remoteIndex=0 vpnIp=1> Apr 22 12:06:30 host-b nebula[111]: time="2024-04-22T12:06:30-00:00" level=info msg="Send handshake via relay" localIndex=3495069400 relay=172.0.0.2 remoteIndex=0 vpnIp=172.0.3.109 Apr 22 12:06:30 host-b nebula[111]: time="2024-04-22T12:06:30-00:00" level=info msg="Send handshake via relay" localIndex=3495069400 relay=172.0.0.3 remoteIndex=0 vpnIp=172.0.3.109 ``` This pattern repeats

Config files from affected hosts

Host A ```yaml firewall: conntrack: default_timeout: 10m max_connections: 100000 tcp_timeout: 12m udp_timeout: 3m inbound: - host: any port: any proto: icmp - host: any port: any proto: any outbound: - host: any port: any proto: any lighthouse: am_lighthouse: false hosts: - 172.0.0.2 - 172.0.0.3 interval: 60 listen: batch: 128 host: 0.0.0.0 port: 0 read_buffer: 10485760 write_buffer: 10485760 logging: format: text level: info pki: ca: | cert: | key: | punchy: delay: 1s punch: true punch_back: true respond: true relay: am_relay: false relays: - 172.0.0.2 - 172.0.0.3 use_relays: true static_host_map: 172.0.0.2: - ....:4242 172.0.0.3: - ....:4242 tun: dev: nebula1 disabled: false drop_local_broadcast: false drop_multicast: false routes: null tx_queue: 5000 unsafe_routes: null ```
Host B ```yaml firewall: conntrack: default_timeout: 10m max_connections: 100000 tcp_timeout: 12m udp_timeout: 3m inbound: - host: any port: any proto: icmp - host: any port: any proto: any outbound: - host: any port: any proto: any lighthouse: am_lighthouse: false hosts: - 172.0.0.2 - 172.0.0.3 interval: 60 listen: batch: 128 host: 0.0.0.0 port: 0 read_buffer: 10485760 write_buffer: 10485760 logging: format: text level: info pki: ca: | cert: | key: | preferred_ranges: - 10.0.0.0/24 punchy: delay: 1s punch: true punch_back: true respond: true relay: am_relay: false relays: - 172.0.0.2 - 172.0.0.3 use_relays: true static_host_map: 172.0.0.2: - ....:4242 172.0.0.3: - ....:4242 tun: dev: nebula1 disabled: false drop_local_broadcast: false drop_multicast: false routes: null tx_queue: 5000 unsafe_routes: null ```
Lighthouse ```yaml firewall: conntrack: default_timeout: 10m max_connections: 100000 tcp_timeout: 12m udp_timeout: 3m inbound: - host: any port: any proto: icmp - host: any port: any proto: any outbound: - host: any port: any proto: any lighthouse: am_lighthouse: true interval: 60 listen: batch: 128 host: 0.0.0.0 port: 4242 logging: format: text level: debug pki: ca: | cert: | key: | punchy: delay: 1s punch: true punch_back: true respond: true relay: am_relay: true use_relays: false stats: host: ...:2003 interval: 10s lighthouse_metrics: true message_metrics: true prefix: nebula.cloud_proxy protocol: tcp type: graphite tun: dev: nebula1 disabled: false drop_local_broadcast: false drop_multicast: false routes: null tx_queue: 5000 unsafe_routes: null ```
Cyberes commented 2 months ago

Also worth noting that this issue doesn't occur after a normal reboot.

nbrownus commented 2 months ago

On the surface this looks like a nebula IP reuse issue but there's so much redacted that it's hard to understand exactly what is happening.

Apr 22 12:13:31 cloud-proxy nebula[689]: time="2024-04-22T12:13:31-00:00" level=debug msg="Recv error sent" index=0 udpAddr="<host's external IP>:5694"

Both Host A and Host B appear to be attempting to initiate a tunnel with the other but neither logs show a matching handshake despite Host A responding. This means we should see logs on Host A responding to handshakes initiated by Host B.

To help can you do the following:

Cyberes commented 1 month ago

Great, I've sent you an email.

And there aren't multiple processes running. It's controlled by systemd.

Cyberes commented 2 weeks ago

I was wondering if you found anything noteworthy in the logs? This isn't a big problem but I did want to bring this problem to your attention if it was worth looking at.