slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License
14.44k stars · 975 forks

🐛 BUG: unable to connect to lighthouse from the node with handshake timing out #1234

Open coralislands opened 1 week ago

coralislands commented 1 week ago

What version of nebula are you using? (nebula -version)

1.9.4

What operating system are you using?

Mac

Describe the Bug

I have set up two Macs: one as a lighthouse and the other as a node. When I run the config on the lighthouse, it starts without any errors and the logs say it is listening on port 4244, but when I check with "lsof -i :4244", the output is empty.

I think this is leading to the handshake failure between the node and the lighthouse, and hence the inability to connect.

  1. On both Macs, I have disabled the firewall.
  2. I am able to ping the lighthouse's public IP from the node.
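Worth noting: ping only proves ICMP reachability, while nebula handshakes travel over UDP, so a successful ping does not show that UDP traffic to the lighthouse port gets through. A minimal probe sketch (a hypothetical helper, not part of nebula; the address and port below are the ones from the logs):

```python
import socket

def udp_send(host, port, payload=b"nebula-probe"):
    """Fire one UDP datagram at host:port (UDP gives no delivery confirmation)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, (host, port))

def udp_listen(port, timeout=30.0):
    """Bind the port and wait for a single datagram; return its payload or None."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("0.0.0.0", port))
        s.settimeout(timeout)
        try:
            data, addr = s.recvfrom(2048)
            print(f"received {data!r} from {addr}")
            return data
        except socket.timeout:
            print("no datagram received; UDP is likely blocked or dropped")
            return None
```

Run udp_listen(4244) on the lighthouse (with nebula stopped, so the port is free) and udp_send("12.248.207.98", 4244) on the node; if nothing arrives, a firewall or NAT is dropping the UDP traffic before nebula ever sees it.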

Logs from affected hosts

lighthouse logs

sudo ./nebula -config /Users/svc_tcm_int/Downloads/nebula/config.yml
DEBU[0000] Client nebula certificate                     cert="NebulaCertificate {\n\tDetails {\n\t\tName: lighthouse\n\t\tIps: [\n\t\t\t192.168.100.1/24\n\t\t]\n\t\tSubnets: []\n\t\tGroups: []\n\t\tNot before: 2024-10-03 23:14:40 -0700 PDT\n\t\tNot After: 2025-10-03 23:14:18 -0700 PDT\n\t\tIs CA: false\n\t\tIssuer: 2a68b7216071384af321ebfa265fce83cdc9d394c21c2b222b4afbab1de50b62\n\t\tPublic key: 86d9bf1fd5d9b49a59f8996e2786a351c2f4baef696cfa0d686f8651a3a06174\n\t\tCurve: CURVE25519\n\t}\n\tFingerprint: 453ff239a86dc027dec796d9be65894bf02b3f7718ec0d1e226caaf5f12c5aa1\n\tSignature: 0d900209a5bd612d23a6c761bd6e5fb29d05d15e0553485a38db5bf963375f32aa090ab2f95f81374c6d7661f3e27b33c5585f4dfbbddc13ca1f198c8b644b0d\n}"
DEBU[0000] Trusted CA fingerprints                       fingerprints="[2a68b7216071384af321ebfa265fce83cdc9d394c21c2b222b4afbab1de50b62]"
INFO[0000] Firewall rule added                           firewallRule="map[caName: caSha: direction:outgoing endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
INFO[0000] Firewall rule added                           firewallRule="map[caName: caSha: direction:incoming endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
INFO[0000] Firewall started                              firewallHashes="SHA:498215dec4e5687a2353f51c10838c113bd1af35ef72b8e8c9f536986ada5417,FNV:2782948616"
WARN[0000] interface name must be utun[0-9]+ on Darwin, ignoring 
INFO[0000] listening on 0.0.0.0:4244                    
INFO[0000] Main HostMap created                          network=192.168.100.1/24 preferredRanges="[]"
INFO[0000] punchy enabled                               
INFO[0000] Loaded send_recv_error config                 sendRecvError=always
INFO[0000] Nebula interface is active                    boringcrypto=false build=1.9.4 interface=utun5 network=192.168.100.1/24 udpAddr="[::]:4244"

lsof output is empty.


nebula % lsof -i :4244  
nebula % 

node logs


sudo ./nebula -config config.yml 
Password:
INFO[0000] Firewall rule added                           firewallRule="map[caName: caSha: direction:outgoing endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
INFO[0000] Firewall rule added                           firewallRule="map[caName: caSha: direction:incoming endPort:0 groups:[] host:any ip: localIp: proto:0 startPort:0]"
INFO[0000] Firewall started                              firewallHashes="SHA:498215dec4e5687a2353f51c10838c113bd1af35ef72b8e8c9f536986ada5417,FNV:2782948616"
INFO[0000] listening on 0.0.0.0:0                       
INFO[0000] Main HostMap created                          network=192.168.100.101/24 preferredRanges="[]"
INFO[0000] punchy enabled                               
INFO[0000] Loaded send_recv_error config                 sendRecvError=always
INFO[0000] Nebula interface is active                    boringcrypto=false build=1.9.4 interface=utun9 network=192.168.100.101/24 udpAddr="[::]:50644"
INFO[0000] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2502649511 localIndex=2502649511 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0006] Handshake timed out                           durationNs=6600818625 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2502649511 localIndex=2502649511 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0010] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3762089884 localIndex=3762089884 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0016] Handshake timed out                           durationNs=6599866209 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3762089884 localIndex=3762089884 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0020] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3859964258 localIndex=3859964258 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0026] Handshake timed out                           durationNs=6501049417 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3859964258 localIndex=3859964258 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0030] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1997989733 localIndex=1997989733 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0036] Handshake timed out                           durationNs=6499934542 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1997989733 localIndex=1997989733 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0040] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=864138480 localIndex=864138480 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0046] Handshake timed out                           durationNs=6500255708 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=864138480 localIndex=864138480 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0050] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3445599736 localIndex=3445599736 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0056] Handshake timed out                           durationNs=6599338667 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3445599736 localIndex=3445599736 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0060] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3461595144 localIndex=3461595144 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0066] Handshake timed out                           durationNs=6500565084 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3461595144 localIndex=3461595144 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0070] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2886350399 localIndex=2886350399 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0076] Handshake timed out                           durationNs=6499197000 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2886350399 localIndex=2886350399 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0080] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1816099952 localIndex=1816099952 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1
INFO[0086] Handshake timed out                           durationNs=6499589000 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1816099952 localIndex=1816099952 remoteIndex=0 udpAddrs="[12.248.207.98:4243]" vpnIp=192.168.100.1

Config files from affected hosts

lighthouse config file

pki:
  ca: /Users/xxx/Downloads/nebula/ca.crt
  cert: /Users/xxx/Downloads/nebula/lighthouse.crt
  key: /Users/xxx/Downloads/nebula/lighthouse.key

lighthouse:
  am_lighthouse: True
  # serve_dns optionally starts a dns listener that responds to various queries and can even be
  # delegated to for resolution
  #serve_dns: false
  #dns:
    # The DNS host defines the IP to bind the dns listener to. This also allows binding to the nebula node IP.
    #host: 0.0.0.0
    #port: 53
  # interval is the number of seconds between updates from this node to a lighthouse.
  # during updates, a node sends information about its current IP addresses to each node.
  interval: 60
  # hosts is a list of lighthouse hosts this node should report to and query from
  # IMPORTANT: THIS SHOULD BE EMPTY ON LIGHTHOUSE NODES
  hosts:

# Port Nebula will be listening on. The default here is 4242. For a lighthouse node, the port should be defined,
# however using port 0 will dynamically assign a port and is recommended for roaming nodes.
listen:
  host: 0.0.0.0
  port: 4244
  # Sets the max number of packets to pull from the kernel for each syscall (under systems that support recvmmsg)
  # default is 64, does not support reload
  #batch: 64
  # Configure socket buffers for the udp side (outside), leave unset to use the system defaults. Values will be doubled by the kernel
  # Default is net.core.rmem_default and net.core.wmem_default (/proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_default)
  # Maximum is limited by memory in the system, SO_RCVBUFFORCE and SO_SNDBUFFORCE is used to avoid having to raise the system wide
  # max, net.core.rmem_max and net.core.wmem_max
  #read_buffer: 10485760
  #write_buffer: 10485760

# Punchy continues to punch inbound/outbound at a regular interval to avoid expiration of firewall nat mappings
punchy:
  punch: true
  respond: true
# punch_back means that a node you are trying to reach will connect back out to you if your hole punching fails
# this is extremely useful if one node is behind a difficult nat, such as symmetric
#punch_back: true

# Cipher allows you to choose between the available ciphers for your network.
# IMPORTANT: this value must be identical on ALL NODES/LIGHTHOUSES. We do not/will not support use of different ciphers simultaneously!
#cipher: chachapoly

# Local range is used to define a hint about the local network range, which speeds up discovering the fastest
# path to a network adjacent nebula node.
#local_range: "172.16.0.0/24"

# sshd can expose informational and administrative functions via ssh
#sshd:
  # Toggles the feature
  #enabled: true
  # Host and port to listen on, port 22 is not allowed for your safety
  #listen: 127.0.0.1:2222
  # A file containing the ssh host private key to use
  # A decent way to generate one: ssh-keygen -t ed25519 -f ssh_host_ed25519_key -N "" < /dev/null
  #host_key: ./ssh_host_ed25519_key
  # A file containing a list of authorized public keys
  #authorized_users:
    #- user: steeeeve
      # keys can be an array of strings or single string
      #keys:
        #- "ssh public key string"

# Configure the private interface. Note: addr is baked into the nebula certificate
tun:
  # Name of the device
  dev: nebula1
  # Toggles forwarding of local broadcast packets, the address of which depends on the ip/mask encoded in pki.cert
  drop_local_broadcast: false
  # Toggles forwarding of multicast packets
  drop_multicast: false
  # Sets the transmit queue length, if you notice lots of transmit drops on the tun it may help to raise this number. Default is 500
  tx_queue: 500
  # Default MTU for every packet, safe setting is (and the default) 1300 for internet based traffic
  mtu: 1300
  # Route based MTU overrides, you have known vpn ip paths that can support larger MTUs you can increase/decrease them here
  routes:
    #- mtu: 8800
    #  route: 10.0.0.0/16
  # Unsafe routes allows you to route traffic over nebula to non-nebula nodes
  # Unsafe routes should be avoided unless you have hosts/services that cannot run nebula
  # NOTE: The nebula certificate of the "via" node *MUST* have the "route" defined as a subnet in its certificate
  unsafe_routes:
    #- route: 172.16.1.0/24
    #  via: 192.168.100.99
    #  mtu: 1300 #mtu will default to tun mtu if this option is not specified

# TODO
# Configure logging level
logging:
  # panic, fatal, error, warning, info, or debug. Default is info
  level: debug
  # json or text formats currently available. Default is text
  format: text

#stats:
  #type: graphite
  #prefix: nebula
  #protocol: tcp
  #host: 127.0.0.1:9999
  #interval: 10s

  #type: prometheus
  #listen: 127.0.0.1:8080
  #path: /metrics
  #namespace: prometheusns
  #subsystem: nebula
  #interval: 10s

# Nebula security group configuration
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  # The firewall is default deny. There is no way to write a deny rule.
  # Rules are comprised of a protocol, port, and one or more of host, group, or CIDR
  # Logical evaluation is roughly: port AND proto AND ca_sha AND ca_name AND (host OR group OR groups OR cidr)
  # - port: Takes `0` or `any` as any, a single number `80`, a range `200-901`, or `fragment` to match second and further fragments of fragmented packets (since there is no port available).
  #   code: same as port but makes more sense when talking about ICMP, TODO: this is not currently implemented in a way that works, use `any`
  #   proto: `any`, `tcp`, `udp`, or `icmp`
  #   host: `any` or a literal hostname, ie `test-host`
  #   group: `any` or a literal group name, ie `default-group`
  #   groups: Same as group but accepts a list of values. Multiple values are AND'd together and a certificate would have to contain all groups to pass
  #   cidr: a CIDR, `0.0.0.0/0` is any.
  #   ca_name: An issuing CA name
  #   ca_sha: An issuing CA shasum

  outbound:
    # Allow all outbound traffic from this node
    - port: any
      proto: any
      host: any

  inbound:
    # Allow icmp between any nebula hosts
    - port: any
      proto: any
      host: any

    # - port: any
    #   proto: any
    #   host: any
    # # Allow tcp/443 from any host with BOTH laptop and home group
    # - port: 443
    #   proto: tcp
    #   groups:
    #     - laptop
    #     - home

node config

pki:
  ca: /Users/vijaykrishnan/Downloads/nebula/ca.crt
  cert: /Users/vijaykrishnan/Downloads/nebula/hostA.crt
  key: /Users/vijaykrishnan/Downloads/nebula/hostA.key

static_host_map:
  "192.168.100.1": ["12.248.207.98:4243"]

lighthouse:
  hosts:
    - "192.168.100.1"

punchy:
  punch: true

firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      host: any
coralislands commented 1 week ago

Not sure if this is useful, but it looks like ifconfig shows the utun interface is assigned and running here:

utun5: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1300
	inet 192.168.100.1 --> 0.0.0.0 netmask 0xffffff00

brad-defined commented 1 week ago
static_host_map:
  "192.168.100.1": ["12.248.207.98:4243"]

Your static host map identifies the lighthouse's listening port as 4243.

Your lighthouse config, however, is set to listen on 4244:

listen:
  host: 0.0.0.0
  port: 4244

Does it work if you set both of those ports to the same number and restart?
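For illustration, a sketch of the two fragments made consistent (using 4244, the port the lighthouse logs show; any port works as long as both sides agree):

```yaml
# lighthouse config: the UDP port nebula binds on the lighthouse
listen:
  host: 0.0.0.0
  port: 4244

# node config: the lighthouse's public address must carry the same port
static_host_map:
  "192.168.100.1": ["12.248.207.98:4244"]
```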

coralislands commented 1 week ago

Sorry, my bad. I was trying out different ports and did not update the node config on the last attempt. But even with the updated port in the config, I am running into the same issue.

brad-defined commented 1 week ago

The next thing to check is that there is no host or network firewall blocking the traffic.

Hole punching can't work without lighthouse coordination, so that first connection from a peer to the lighthouse must be permitted by the network.

You could try tcpdump on the lighthouse to see if the peer's UDP traffic is arriving or not.

johnmaguire commented 1 week ago

Regarding lsof, please try running it again with sudo lsof -i :4244 to verify the Lighthouse is listening on the specified port.

Since you have updated the port on the node from 4243 to 4244, please provide updated logs - the node logs you shared still show the erroneous 4243 port. I am curious whether the error has changed since correcting the port.

Please also verify that any firewalls on the Lighthouse are allowing UDP traffic on 4244 (and if you have a router in front of the Lighthouse, ensure port forwarding is set up correctly). As Brad mentioned, you can use tcpdump to verify whether the packets are making it to the destination.

On both the node and Lighthouse you can run: sudo tcpdump 'dst port 4244' and then restart Nebula to ensure handshakes are flowing.
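As a quick cross-check of the lsof suggestion above, a bind probe answers "is anything holding this UDP port?" without extra tools. A hypothetical helper (not part of nebula); run it on the lighthouse while nebula is up:

```python
import errno
import socket

def udp_port_in_use(port, host="0.0.0.0"):
    """Return True if another socket already owns host:port for UDP.

    A plain bind (no SO_REUSEADDR/SO_REUSEPORT) fails with EADDRINUSE
    when the port is held, e.g. by a running nebula lighthouse.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.bind((host, port))
        return False
    except OSError as e:
        if e.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        s.close()
```

If udp_port_in_use(4244) is True while nebula runs and False after stopping it, the lighthouse really is binding the port, and attention shifts to the network path. One caveat: the logs show nebula binding the wildcard IPv6 address ([::]:4244); on most dual-stack systems that also claims the IPv4 port, so probing "0.0.0.0" collides as expected, but if in doubt probe "::" as well.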