slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License

Node outside of LAN can only talk to lighthouse #71

Closed lamarios closed 4 years ago

lamarios commented 4 years ago

I have a bunch of computers on my LAN with one lighthouse that is accessible from the outside world. Using 192.168.42.0 addresses for the nebula IPs:

Lighthouse: 192.168.42.99 (mydomain.com:4242)
LAN machine 1 (A): 192.168.42.200
LAN machine 2 (B): 192.168.42.203
Outside-LAN machine (C): 192.168.42.10

Lighthouse config:

# This is the nebula example configuration file. You must edit, at a minimum, the static_host_map, lighthouse, and firewall sections
# Some options in this file are HUPable, including the pki section. (A HUP will reload credentials from disk without affecting existing tunnels)

# PKI defines the location of credentials for this node. Each of these can also be inlined by using the yaml ": |" syntax.
pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/pihole.crt
  key: /etc/nebula/pihole.key
  #blacklist is a list of certificate fingerprints that we will refuse to talk to
  #blacklist:
  #  - c99d4e650533b92061b09918e838a5a0a6aaee21eed1d12fd937682865936c72

# The static host map defines a set of hosts with fixed IP addresses on the internet (or any network).
# A host can have multiple fixed IP addresses defined here, and nebula will try each when establishing a tunnel.
# The syntax is:
#   "{nebula ip}": ["{routable ip/dns name}:{routable port}"]
# Example, if your lighthouse has the nebula IP of 192.168.100.1 and has the real ip address of 100.64.22.11 and runs on port 4242:
static_host_map:
  "192.168.42.99": ["mydomain.com:4242"]

lighthouse:
  # am_lighthouse is used to enable lighthouse functionality for a node. This should ONLY be true on nodes
  # you have configured to be lighthouses in your network
  am_lighthouse: true
  # serve_dns optionally starts a dns listener that responds to various queries and can even be
  # delegated to for resolution
  # serve_dns: true
  # interval is the number of seconds between updates from this node to a lighthouse.
  # during updates, a node sends information about its current IP addresses to each node.
  interval: 60
  # hosts is a list of lighthouse hosts this node should report to and query from
  # IMPORTANT: THIS SHOULD BE EMPTY ON LIGHTHOUSE NODES
  hosts:
          #  - "192.168.42.1"

# Port Nebula will be listening on. The default here is 4242. For a lighthouse node, the port should be defined,
# however using port 0 will dynamically assign a port and is recommended for roaming nodes.
listen:
  host: 0.0.0.0
  port: 4242
  # Sets the max number of packets to pull from the kernel for each syscall (under systems that support recvmmsg)
  # default is 64, does not support reload
  #batch: 64
  # Configure socket buffers for the udp side (outside), leave unset to use the system defaults. Values will be doubled by the kernel
  # Default is net.core.rmem_default and net.core.wmem_default (/proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default)
  # Maximum is limited by memory in the system, SO_RCVBUFFORCE and SO_SNDBUFFORCE is used to avoid having to raise the system wide
  # max, net.core.rmem_max and net.core.wmem_max
  #read_buffer: 10485760
  #write_buffer: 10485760

# Punchy continues to punch inbound/outbound at a regular interval to avoid expiration of firewall nat mappings
punchy: true
# punch_back means that a node you are trying to reach will connect back out to you if your hole punching fails
# this is extremely useful if one node is behind a difficult nat, such as symmetric
punch_back: true

# Cipher allows you to choose between the available ciphers for your network.
# IMPORTANT: this value must be identical on ALL NODES/LIGHTHOUSES. We do not/will not support use of different ciphers simultaneously!
#cipher: chachapoly

# Local range is used to define a hint about the local network range, which speeds up discovering the fastest
# path to a network adjacent nebula node.
#local_range: "172.16.0.0/24"

# sshd can expose informational and administrative functions via ssh
#sshd:
  # Toggles the feature
  #enabled: true
  # Host and port to listen on, port 22 is not allowed for your safety
  #listen: 127.0.0.1:2222
  # A file containing the ssh host private key to use
  # A decent way to generate one: ssh-keygen -t ed25519 -f ssh_host_ed25519_key -N "" < /dev/null
  #host_key: ./ssh_host_ed25519_key
  # A file containing a list of authorized public keys
  #authorized_users:
    #- user: steeeeve
      # keys can be an array of strings or single string
      #keys:
        #- "ssh public key string"

# Configure the private interface. Note: addr is baked into the nebula certificate
tun:
  # Name of the device
  dev: nebula1
  # Toggles forwarding of local broadcast packets, the address of which depends on the ip/mask encoded in pki.cert
  drop_local_broadcast: false
  # Toggles forwarding of multicast packets
  drop_multicast: false
  # Sets the transmit queue length, if you notice lots of transmit drops on the tun it may help to raise this number. Default is 500
  tx_queue: 500
  # Default MTU for every packet, safe setting is (and the default) 1300 for internet based traffic
  mtu: 1300
  # Route based MTU overrides, if you have known vpn ip paths that can support larger MTUs you can increase/decrease them here
  routes:
    #- mtu: 8800
    #  route: 10.0.0.0/16

# TODO
# Configure logging level
logging:
  # panic, fatal, error, warning, info, or debug. Default is info
  level: info
  # json or text formats currently available. Default is text
  format: text

#stats:
  #type: graphite
  #prefix: nebula
  #protocol: tcp
  #host: 127.0.0.1:9999
  #interval: 10s

  #type: prometheus
  #listen: 127.0.0.1:8080
  #path: /metrics
  #namespace: prometheusns
  #subsystem: nebula
  #interval: 10s

# Nebula security group configuration
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  # The firewall is default deny. There is no way to write a deny rule.
  # Rules are comprised of a protocol, port, and one or more of host, group, or CIDR
  # Logical evaluation is roughly: port AND proto AND ca_sha AND ca_name AND (host OR group OR groups OR cidr)
  # - port: Takes `0` or `any` as any, a single number `80`, a range `200-901`, or `fragment` to match second and further fragments of fragmented packets (since there is no port available).
  #   code: same as port but makes more sense when talking about ICMP, TODO: this is not currently implemented in a way that works, use `any`
  #   proto: `any`, `tcp`, `udp`, or `icmp`
  #   host: `any` or a literal hostname, ie `test-host`
  #   group: `any` or a literal group name, ie `default-group`
  #   groups: Same as group but accepts a list of values. Multiple values are AND'd together and a certificate would have to contain all groups to pass
  #   cidr: a CIDR, `0.0.0.0/0` is any.
  #   ca_name: An issuing CA name
  #   ca_sha: An issuing CA shasum

  outbound:
    # Allow all outbound traffic from this node
    - port: any
      proto: any
      host: any

  inbound:
    # Allow any traffic between any nebula hosts
    - port: any
      proto: any
      host: any

C config:

# This is the nebula example configuration file. You must edit, at a minimum, the static_host_map, lighthouse, and firewall sections
# Some options in this file are HUPable, including the pki section. (A HUP will reload credentials from disk without affecting existing tunnels)

# PKI defines the location of credentials for this node. Each of these can also be inlined by using the yaml ": |" syntax.
pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/work.crt
  key: /etc/nebula/work.key
  #blacklist is a list of certificate fingerprints that we will refuse to talk to
  #blacklist:
  #  - c99d4e650533b92061b09918e838a5a0a6aaee21eed1d12fd937682865936c72

# The static host map defines a set of hosts with fixed IP addresses on the internet (or any network).
# A host can have multiple fixed IP addresses defined here, and nebula will try each when establishing a tunnel.
# The syntax is:
#   "{nebula ip}": ["{routable ip/dns name}:{routable port}"]
# Example, if your lighthouse has the nebula IP of 192.168.100.1 and has the real ip address of 100.64.22.11 and runs on port 4242:
static_host_map:
  "192.168.42.99": ["ftpix.com:4242"]

lighthouse:
  # am_lighthouse is used to enable lighthouse functionality for a node. This should ONLY be true on nodes
  # you have configured to be lighthouses in your network
  am_lighthouse: false
  # serve_dns optionally starts a dns listener that responds to various queries and can even be
  # delegated to for resolution
  #serve_dns: false
  # interval is the number of seconds between updates from this node to a lighthouse.
  # during updates, a node sends information about its current IP addresses to each node.
  interval: 60
  # hosts is a list of lighthouse hosts this node should report to and query from
  # IMPORTANT: THIS SHOULD BE EMPTY ON LIGHTHOUSE NODES
  hosts:
    - "192.168.42.99"

# Port Nebula will be listening on. The default here is 4242. For a lighthouse node, the port should be defined,
# however using port 0 will dynamically assign a port and is recommended for roaming nodes.
listen:
  host: 0.0.0.0
  port: 0
  # Sets the max number of packets to pull from the kernel for each syscall (under systems that support recvmmsg)
  # default is 64, does not support reload
  #batch: 64
  # Configure socket buffers for the udp side (outside), leave unset to use the system defaults. Values will be doubled by the kernel
  # Default is net.core.rmem_default and net.core.wmem_default (/proc/sys/net/core/rmem_default and /proc/sys/net/core/wmem_default)
  # Maximum is limited by memory in the system, SO_RCVBUFFORCE and SO_SNDBUFFORCE is used to avoid having to raise the system wide
  # max, net.core.rmem_max and net.core.wmem_max
  #read_buffer: 10485760
  #write_buffer: 10485760

# Punchy continues to punch inbound/outbound at a regular interval to avoid expiration of firewall nat mappings
punchy: true
# punch_back means that a node you are trying to reach will connect back out to you if your hole punching fails
# this is extremely useful if one node is behind a difficult nat, such as symmetric
punch_back: true

# Cipher allows you to choose between the available ciphers for your network.
# IMPORTANT: this value must be identical on ALL NODES/LIGHTHOUSES. We do not/will not support use of different ciphers simultaneously!
#cipher: chachapoly

# Local range is used to define a hint about the local network range, which speeds up discovering the fastest
# path to a network adjacent nebula node.
#local_range: "172.16.0.0/24"

# sshd can expose informational and administrative functions via ssh
#sshd:
  # Toggles the feature
  #enabled: true
  # Host and port to listen on, port 22 is not allowed for your safety
  #listen: 127.0.0.1:2222
  # A file containing the ssh host private key to use
  # A decent way to generate one: ssh-keygen -t ed25519 -f ssh_host_ed25519_key -N "" < /dev/null
  #host_key: ./ssh_host_ed25519_key
  # A file containing a list of authorized public keys
  #authorized_users:
    #- user: steeeeve
      # keys can be an array of strings or single string
      #keys:
        #- "ssh public key string"

# Configure the private interface. Note: addr is baked into the nebula certificate
tun:
  # Name of the device
  dev: nebula1
  # Toggles forwarding of local broadcast packets, the address of which depends on the ip/mask encoded in pki.cert
  drop_local_broadcast: false
  # Toggles forwarding of multicast packets
  drop_multicast: false
  # Sets the transmit queue length, if you notice lots of transmit drops on the tun it may help to raise this number. Default is 500
  tx_queue: 500
  # Default MTU for every packet, safe setting is (and the default) 1300 for internet based traffic
  mtu: 1300
  # Route based MTU overrides, if you have known vpn ip paths that can support larger MTUs you can increase/decrease them here
  routes:
    #- mtu: 8800
    #  route: 10.0.0.0/16

# TODO
# Configure logging level
logging:
  # panic, fatal, error, warning, info, or debug. Default is info
  level: info
  # json or text formats currently available. Default is text
  format: text

#stats:
  #type: graphite
  #prefix: nebula
  #protocol: tcp
  #host: 127.0.0.1:9999
  #interval: 10s

  #type: prometheus
  #listen: 127.0.0.1:8080
  #path: /metrics
  #namespace: prometheusns
  #subsystem: nebula
  #interval: 10s

# Nebula security group configuration
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  # The firewall is default deny. There is no way to write a deny rule.
  # Rules are comprised of a protocol, port, and one or more of host, group, or CIDR
  # Logical evaluation is roughly: port AND proto AND ca_sha AND ca_name AND (host OR group OR groups OR cidr)
  # - port: Takes `0` or `any` as any, a single number `80`, a range `200-901`, or `fragment` to match second and further fragments of fragmented packets (since there is no port available).
  #   code: same as port but makes more sense when talking about ICMP, TODO: this is not currently implemented in a way that works, use `any`
  #   proto: `any`, `tcp`, `udp`, or `icmp`
  #   host: `any` or a literal hostname, ie `test-host`
  #   group: `any` or a literal group name, ie `default-group`
  #   groups: Same as group but accepts a list of values. Multiple values are AND'd together and a certificate would have to contain all groups to pass
  #   cidr: a CIDR, `0.0.0.0/0` is any.
  #   ca_name: An issuing CA name
  #   ca_sha: An issuing CA shasum

  outbound:
    # Allow all outbound traffic from this node
    - port: any
      proto: any
      host: any

  inbound:
    # Allow icmp between any nebula hosts
    - port: any
      proto: icmp
      host: any

    # Allow tcp from any nebula host
    - port: any
      proto: tcp
      host: any

    - port: any
      proto: udp
      host: any

Logs from C:

Dec 05 15:55:20 gz-t480 nebula[32698]: time="2019-12-05T15:55:20+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="192.168.1.1:52803" vpnIp=192.168.42.198
Dec 05 15:55:22 gz-t480 nebula[32698]: time="2019-12-05T15:55:22+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="192.168.1.198:52803" vpnIp=192.168.42.198
Dec 05 15:55:23 gz-t480 nebula[32698]: time="2019-12-05T15:55:23+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="192.168.200.198:52803" vpnIp=192.168.42.198
Dec 05 15:55:25 gz-t480 nebula[32698]: time="2019-12-05T15:55:25+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.21.0.1:52803" vpnIp=192.168.42.198
Dec 05 15:55:27 gz-t480 nebula[32698]: time="2019-12-05T15:55:27+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.19.0.1:52803" vpnIp=192.168.42.198
Dec 05 15:55:29 gz-t480 nebula[32698]: time="2019-12-05T15:55:29+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.17.0.1:52803" vpnIp=192.168.42.198
Dec 05 15:55:31 gz-t480 nebula[32698]: time="2019-12-05T15:55:31+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.20.0.1:52803" vpnIp=192.168.42.198
Dec 05 15:55:33 gz-t480 nebula[32698]: time="2019-12-05T15:55:33+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.21.0.1:58904" vpnIp=192.168.42.198
Dec 05 15:55:35 gz-t480 nebula[32698]: time="2019-12-05T15:55:35+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.19.0.1:58904" vpnIp=192.168.42.198
Dec 05 15:55:38 gz-t480 nebula[32698]: time="2019-12-05T15:55:38+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.17.0.1:58904" vpnIp=192.168.42.198
Dec 05 15:55:40 gz-t480 nebula[32698]: time="2019-12-05T15:55:40+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=89620360 remoteIndex=0 udpAddr="172.20.0.1:58904" vpnIp=192.168.42.198
poosterl commented 4 years ago

IMHO it's the LAN nodes (behind a firewall? NAT?) that should have punch_back set to true. I have a similar setup and initially had similar problems. Setting punch_back to true on the LAN nodes solved the problem for me.
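
For reference, a minimal sketch of the relevant keys as they appear in the configs in this thread (top-level punchy/punch_back syntax; newer nebula releases may organize these differently):

# keep punching at a regular interval so NAT/firewall mappings don't expire
punchy: true
# ask the remote node to connect back out to us if our hole punch fails
punch_back: true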

lamarios commented 4 years ago

All the nodes have the same config (except the lighthouse and the certificates), so punch_back is set to true.

poosterl commented 4 years ago

I noticed the following in your C node config:

# IMPORTANT: THIS SHOULD BE EMPTY ON LIGHTHOUSE NODES
hosts:
  - "192.168.42.99"

Since node C is not a lighthouse, you should comment out these lines (as it says in the comment above). At least... that's what I did on all non-lighthouse nodes and my setup works.

lamarios commented 4 years ago

It says it should be empty on lighthouse nodes. C is not a lighthouse.

I'll give it a try anyway.

poosterl commented 4 years ago

Please disregard my last comment, I was mistaken. You DID comment out the hosts entry on your lighthouse node and left it in on the other nodes. This is similar to what I have. My apologies.

Kerwood commented 4 years ago

I have the same issue. The lighthouse has a public IP at DigitalOcean. I have two nodes at different locations, both behind NAT. If I try to ping the nebula private address of one from the other, I can see in the logs that they are both trying to create a tunnel, and they are both trying to send handshake messages to each other's internal Docker addresses.

level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="192.168.200.2:4242" vpnIp=10.22.0.21 <-- LAN Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="172.17.0.1:4242" vpnIp=10.22.0.21 <-- Docker Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="172.18.0.1:4242" vpnIp=10.22.0.21 <-- Docker Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="<external-ip-here>:60892" vpnIp=10.22.0.21 <-- External Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="192.168.200.2:60892" vpnIp=10.22.0.21 <-- LAN Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="172.17.0.1:60892" vpnIp=10.22.0.21 <-- Docker Interface
level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3364254650 remoteIndex=0 udpAddr="172.18.0.1:60892" vpnIp=10.22.0.21 <-- Docker Interface

And that is probably what you see in your logs too, @lamarios: the 172.x.x.x addresses.

nfam commented 4 years ago

If you have a lot of Docker containers running, setting local_range and punch_back is a must. Otherwise it can take close to forever for one machine to find a path to another.
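
As a rough sketch, assuming a node whose physical LAN is 192.168.1.0/24 (the range here is only illustrative), this hints nebula toward the real LAN interface instead of the Docker bridge addresses:

# prefer the physical LAN when choosing a local path to a nebula peer
local_range: "192.168.1.0/24"
# and let the other side dial back out if our hole punch fails
punchy: true
punch_back: true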

Kerwood commented 4 years ago

@nfam local_range has to be set to the overlay subnet, right?

lamarios commented 4 years ago

I don't run nebula in a container. I tried setting local_range to 192.168.1.0/24 anyway, but that didn't help. On which node should I set the local range? Only the lighthouse?

Kerwood commented 4 years ago

@lamarios That was not what I meant. Do you run other containers on the nebula node?

lamarios commented 4 years ago

Only on node A. The lighthouse and B (the other nodes on the LAN) are not running any, and the node outside the LAN (C) is not running any either. C can only reach the lighthouse, not A or B.

lamarios commented 4 years ago

I managed to "fix" the issue by setting a fixed port on the node I want to connect to often, opening that port on my router, and adding it as a known host in node C's config.
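
Roughly, that workaround looks like the sketch below; the port 4243, the hostname, and A's nebula IP are assumptions for illustration. On node A, pin the listen port and forward it on the home router; on node C, add A to the static host map so C already knows a routable address for A:

# on node A (home LAN): fixed UDP port, forwarded on the home router
listen:
  host: 0.0.0.0
  port: 4243

# on node C (outside): map A's nebula IP to the home network's public name and forwarded port
static_host_map:
  "192.168.42.99": ["mydomain.com:4242"]    # lighthouse, as before
  "192.168.42.198": ["mydomain.com:4243"]   # node A via the forwarded port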

Could it be an issue if the network of C is using the same IP range as my home LAN? (192.168.1.0/24)

nfam commented 4 years ago

@Kerwood local_range is the real (usually physical) network that you want your nebula traffic to run through.

@lamarios

Could it be an issue if the network of C is using the same IP range as my home LAN? (192.168.1.0/24)

The nebula network cannot use the same IP range as your home LAN.

lamarios commented 4 years ago

It is not. Maybe I was not clear, sorry.

Nebula IPs: A - 192.168.42.198, B - 192.168.42.200, C - 192.168.42.10, lighthouse - 192.168.42.99

Physical LAN range for C: 192.168.1.0/24 (office network). Physical LAN range for A, B, and the lighthouse: 192.168.1.0/24 (home network).

Kerwood commented 4 years ago

@nfam Setting local_range on a rather static server is easy, but setting it on a more dynamic machine like a laptop, where the subnet changes from place to place, is not optimal.

nfam commented 4 years ago

@Kerwood That's why whitelisting/blacklisting network interfaces (#52) is superior to local_range. From the network interface, nebula can easily get the IP range.

Kerwood commented 4 years ago

@nfam So whitelisting/blacklisting interfaces is an upcoming feature?

slimm609 commented 4 years ago

How does nebula handle local_ranges that overlap private IP space but are not the same network?

If I have 3 nodes on network A (192.168.1.0/24) and 2 nodes on network B (192.168.1.0/24), how does it handle discovery?

I would think you would want a unique network ID as well as a CIDR address, something like local_range: "7:192.168.1.0/24" and local_range: "4:192.168.1.0/24", where the 7 and 4 would be network IDs, so that two distinct but overlapping private ranges can be used.

That would allow nodes on network 4 to discover each other without having to try to discover network 7 addresses, since they are not on the same LAN.

nbrownus commented 4 years ago

Nebula will notice and attempt the best-looking local path first; if it fails to stand up a tunnel, it will begin handshaking with the other known/learned IP addresses. This is why having a lighthouse on the internet is effectively a requirement, unless you use static_host_map.

Xachman commented 4 years ago

@nbrownus Why would nebula fail to set up a tunnel? I can see my nodes attempting a lot of handshaking on some public and private IP addresses, but they never connect. They only connect to the lighthouse.

hans-fischer commented 4 years ago

Is nebula meant to be used to connect NodeA (LAN-A with Internet) and NodeB (LAN-B with Internet) to each other?

radcool commented 4 years ago

@lamarios Your issue is potentially two-fold:

The way I've troubleshot my Nebula setup is by looking at each node's log for handshake messages sent vs handshake messages received. If A and B are sending handshake messages and C doesn't receive them (and vice versa), then that specific Nebula flow is not going to work.

MikePadge commented 4 years ago

@lamarios Set your C config section from

static_host_map:
  "192.168.42.99": ["ftpix.com:4242"]
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.42.99"

to

static_host_map:
  "192.168.42.99": ["ftpix.com:4242"]
lighthouse:  
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.42.99"

lamarios commented 4 years ago

Hmm, not sure why my original post looks like that; I must have messed up my copy/paste, but my config is actually set the way you're telling me to.

radcool commented 4 years ago

@lamarios Have you taken a look at the A, B, and C logs for handshake messages? The ones going to the lighthouse should show success. But what about messages to the other nodes?

lamarios commented 4 years ago

Ping from C (outside the home network) to B. Logs on C:

Jan 06 10:39:44 gz-t480 nebula[6656]: time="2020-01-06T10:39:44+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3435170973 remoteIndex=0 udpAddr="172.17.0.1:51930" vpnIp=192.168.42.203
Jan 06 10:39:45 gz-t480 nebula[6656]: time="2020-01-06T10:39:45+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3435170973 remoteIndex=0 udpAddr="10.244.1.0:51930" vpnIp=192.168.42.203
Jan 06 10:39:46 gz-t480 nebula[6656]: time="2020-01-06T10:39:46+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3435170973 remoteIndex=0 udpAddr="10.244.1.1:51930" vpnIp=192.168.42.203
Jan 06 10:39:48 gz-t480 nebula[6656]: time="2020-01-06T10:39:48+08:00" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3435170973 remoteIndex=0 udpAddr="192.168.1.203:4242" vpnIp=192.168.42.203

Logs on B:

Jan 06 02:39:36 k8-node-3 nebula[7026]: time="2020-01-06T02:39:36Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1608714499 remoteIndex=0 udpAddr="183.171.67.174:40536" vpnIp=192.168.42.10
Jan 06 02:39:38 k8-node-3 nebula[7026]: time="2020-01-06T02:39:38Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1608714499 remoteIndex=0 udpAddr="172.20.10.2:4242" vpnIp=192.168.42.10
Jan 06 02:39:41 k8-node-3 nebula[7026]: time="2020-01-06T02:39:41Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1608714499 remoteIndex=0 udpAddr="172.18.0.1:4242" vpnIp=192.168.42.10
Jan 06 02:39:45 k8-node-3 nebula[7026]: time="2020-01-06T02:39:45Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2437276409 remoteIndex=0 udpAddr="202.187.183.124:4242" vpnIp=192.168.42.10
Jan 06 02:39:45 k8-node-3 nebula[7026]: time="2020-01-06T02:39:45Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2437276409 remoteIndex=0 udpAddr="202.187.183.124:4242" vpnIp=192.168.42.10
Jan 06 02:39:46 k8-node-3 nebula[7026]: time="2020-01-06T02:39:46Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=2437276409 remoteIndex=0 udpAddr="202.187.183.124:4242" vpnIp=192.168.42.10

202.187.183.124 is C's public address.

radcool commented 4 years ago

I initially prepared a few questions, but then went back up the history of the issue and noticed you mentioned that the lighthouse is in the same LAN as nodes A and B. I take this to mean that A and B don't have to traverse NAT to reach the lighthouse. This could be problematic; here's why:

Based on my observations and testing I think that as nodes connect to the lighthouse, they send information about their local IP addresses to the lighthouse. If they traverse a NAT to reach the lighthouse the lighthouse also keeps track of the public IP and port the handshake came from. Then, when nodes want to connect to other nodes outside their LAN they can get the reachability info for other nodes from the lighthouse. They'll then try all IPs in succession, hoping that one of them will be reachable and respond to a handshake request. However, if you've got some nodes on the same LAN as the lighthouse they will never use NAT to reach the lighthouse, and therefore the lighthouse can never learn what NAT'ted IP those nodes might be behind in order to relay it to other nodes outside the LAN. I think this is why they suggest putting the lighthouse out onto the Internet, so that every node that needs to talk to it goes through a NAT.

That being said, I think that only one of the nodes in a pair that want to communicate with each other needs to be reachable; if one can reach the other successfully, they should be able to set up a bidirectional tunnel. Therefore, we can focus on B trying to reach C, as we know the reverse is likely not going to work based on the explanation above.

Is C's public IP of 202.187.183.124 configured directly on C, or is it a NAT'ted IP address that C uses when it goes out onto the Internet? You can see in B's logs that it is trying to reach C at IP address and port 202.187.183.124:4242. Is this an address:port combo that you know for a fact will lead directly to C? Have you tcpdump'ed on C to determine if you actually receive UDP handshake traffic from B?

lamarios commented 4 years ago

I see; explained like this, it makes a lot of sense.

For C public address, it is a NAT'ted IP address that C uses when it goes out on the internet.

I just ran tcpdump and I don't receive anything on C from B. So C's network probably doesn't let C receive traffic on port 4242.

Thanks for the help, it was very informative

radcool commented 4 years ago

From the lighthouse's point of view C connected to it using source IP 202.187.183.124 and source port 4242, so it basically tells any node asking how to reach C: "You can try reaching C at IP address 202.187.183.124 and UDP port 4242, maybe that'll work for you". So that's what B is trying to do. Now, depending on the type of NAT/firewall behind which sits C, that may or may not work out (it doesn't in your case). In the case of a Full-cone NAT and a permissive firewall that might work. But with other types of NAT that's not likely (see https://en.wikipedia.org/wiki/Network_address_translation#Methods_of_translation for various NAT types).

One thing that would likely work, if you have control of the NAT/firewall box behind which C sits, is to set up port forwarding so that incoming traffic destined to external IP 202.187.183.124 and port 4242 redirects to C's internal IP and port 4242. However, if you've got more than one Nebula node behind that NAT/firewall you'll have to set up additional port forwards, and you obviously won't be able to re-use port 4242, so you'll have to configure Nebula on those other nodes to bind to another port and hope that the NAT you're behind will also keep that source port intact once the packet goes out on the Internet.
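
A sketch of what that could look like for a second nebula node behind the same NAT as C (port 4243 is purely illustrative): forward external UDP 4243 on the NAT/firewall to that node's LAN IP, and have nebula bind to the same fixed port so the mapping stays predictable:

# second node behind the same NAT/firewall as C
listen:
  host: 0.0.0.0
  port: 4243   # the router forwards external UDP 4243 to this node's LAN IP on 4243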

lamarios commented 4 years ago

Yeah, that's more or less what I've done for the nodes I need to access the most. For the ones I can't access, I only need to SSH into them, so I use proxy jumps. I don't control the network for C, but I do for the network with the lighthouse, A, and B.

So I opened a different port on the router for A, which I need to access often, and use it as a known host in C's config.

Thanks for your help!