slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License

Question: NAT Setup #33

Closed · jatsrt closed this 1 year ago

jatsrt commented 4 years ago

I seem to be missing something important. If I set up a mesh of hosts that all have direct public IP addresses, it works fine. However, if I have a network with a lighthouse (public IP) and all other nodes behind NAT, the nodes will not connect to each other. The lighthouse is able to communicate with all hosts, but the hosts are not able to communicate with each other.

Watching the logs, I see connection attempts being made to both the NAT public IP and the private IPs.

I have enabled punchy and punch_back, but it does not seem to help.

Hope it is something simple?

jatsrt commented 4 years ago

Also, note that in this setup all nodes are behind different NATs on different networks: hub and spoke, with the hub being the lighthouse and the spokes going to hosts on different networks.

rawdigits commented 4 years ago

My best guess (because I just messed this up in a live demo), is that am_lighthouse may be set to "true" on the individual nodes.

Either way, can you post your lighthouse config and one of your node configs?

(feel free to replace any sensitive IP/config bits, just put consistent placeholders in their place)

nfam commented 4 years ago

Hi, I have the same issue. My lighthouse is on a DigitalOcean droplet with a public IP. My MacBook and Linux laptop at home are on the same network, both connected to the lighthouse. I can ping the lighthouse from both laptops, but I cannot ping from one laptop to the other.

Lighthouse config

pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/lighthouse.crt
  key: /data/cert/nebula/lighthouse.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: true
  interval: 60
  hosts:
listen:
  host: 0.0.0.0
  port: 4242
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop

Macbook config

pki:
  ca: /Volumes/code/cert/nebula/ca.crt
  cert: /Volumes/code/cert/nebula/mba.crt
  key: /Volumes/code/cert/nebula/mba.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: debug
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop

Linux laptop config

pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/server.crt
  key: /data/cert/nebula/server.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
listen:
  host: 0.0.0.0
  port: 4242
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop
rawdigits commented 4 years ago

@nfam thanks for sharing the config. My next best guess is that NAT isn't reflecting and for some reason the nodes also aren't finding each other locally.

Try setting the local_range config setting on the two laptops, which can give them a hint about the local network range to use for establishing the direct tunnel.
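
For reference, a minimal sketch of what that might look like in the laptops' configs (the CIDR is a placeholder; use whatever range your LAN actually uses):

local_range: "192.168.1.0/24"   # placeholder: the private range the two laptops share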

jatsrt commented 4 years ago

@nfam similar setup. Public lighthouse on DigitalOcean, laptop behind home NAT, and server in AWS behind a NAT. Local and AWS are using different private ranges (though overlap should be handled).

nfam commented 4 years ago

@rawdigits setting local_range does not help. I stopped nebula on both laptops, set the lighthouse log level to debug, cleared the log, and restarted the lighthouse (with no nodes connected yet). The following is the log I got.

time="2019-11-23T20:05:18Z" level=info msg="Main HostMap created" network=192.168.100.1/24 preferredRanges="[]" time="2019-11-23T20:05:18Z" level=info msg="UDP hole punching enabled" time="2019-11-23T20:05:18Z" level=info msg="Nebula interface is active" build=1.0.0 interface=neb0 network=192.168.100.1/24 time="2019-11-23T20:05:18Z" level=debug msg="Error while validating outbound packet: packet is not ipv4, type: 6" packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 183 226 137 252 10 196 21 15 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 27 133 0 0 0 0]"

jatsrt commented 4 years ago

My Config:

nebula-cert sign -name "lighthouse" -ip "192.168.100.1/24"
nebula-cert sign -name "laptop" -ip "192.168.100.101/24" -groups "laptop"
nebula-cert sign -name "server" -ip "192.168.100.201/24" -groups "server"

Lighthouse:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: true
  interval: 60

listen:
  host: 0.0.0.0
  port: 4242

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

Laptop:

pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/laptop.crt
  key: /etc/nebula/laptop.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

Server:

pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/server.crt
  key: /etc/nebula/server.key

static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"

listen:
  host: 0.0.0.0
  port: 0

punchy: true

tun:
  dev: nebula1
  mtu: 1300

logging:
  level: info
  format: text

firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

With this setup, both the server and the laptop can ping the lighthouse, and the lighthouse can ping the server and the laptop, but the laptop cannot ping the server and the server cannot ping the laptop.

I get messages such as this as it's trying to make the connection:

INFO[0006] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0007] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0009] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0011] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0012] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0014] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0016] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
jatsrt commented 4 years ago

@nfam I see a similar error, but I'm not sure it's the problem:

Error while validating outbound packet: packet is not ipv4, type: 6 packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 139 176 20 9 146 65 14 250 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 60 66 0 0 0 0]"
DEBU[0066] Error while validating outbound packet: packet is not ipv4, type: 6 packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 139 176 20 9 146 65 14 250 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 60 66 0 0 0 0]"

rawdigits commented 4 years ago

@jatsrt

The "Error while validating outbound packet" message can mostly be ignored. It just means some packet types nebula doesn't support (in this case IPv6, since the header's version field is 6) are hitting the tun device.

As far as the handshakes, for some reason hole punching isn't working. A few things to try:

  1. Add punch_back: true on the "server" and "laptop" nodes (see the sketch below).
  2. Explicitly allow all UDP in to the "server" node from the internet (via AWS security groups, just as a test).
  3. Verify iptables isn't blocking anything.
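
For item 1, a minimal sketch of the relevant bits of a node config, using the flat punchy syntax already shown in the configs above:

punchy: true       # keep punching periodically so NAT/firewall mappings don't expire
punch_back: true   # ask the remote node to connect back out to us if our punch fails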

Also, it appears the logs with the handshake messages are from the laptop? If so, can you also share the nebula logs from the server as it tries to reach the laptop?

Thanks!

rawdigits commented 4 years ago

Aha, @nfam I think I spotted the config problem.

instead of

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"

it should be

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "192.168.100.1"
rawdigits commented 4 years ago

Adding #40 to cover the accidental misconfiguration noted above.

nfam commented 4 years ago

@rawdigits yes, it is. Now both laptops can ping each other. Thanks!

jatsrt commented 4 years ago

@rawdigits

  1. added punch back on "server" and "laptop"
  2. security group for that node is currently wide open for all protocols
  3. No iptables on any of these nodes, base ubuntu server for testing

Server log:

time="2019-11-24T00:25:21Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:23Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:25Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:26Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:28Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:30Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
jatsrt commented 4 years ago

So, I tried a few more setups, and it comes down to this: if the two hosts trying to communicate with each other are on different networks and both behind NAT, it will not work.
If the lighthouse does not facilitate the communication/tunneling, this would make sense, but is it meant to be a limitation?

nbrownus commented 4 years ago

The dual-NAT scenario is a bit tricky; there is possibly room for improvement from nebula's perspective there. Do you have details on the types of NAT you are dealing with?

jatsrt commented 4 years ago

@nbrownus nothing crazy, I've done multiple AWS VPC NAT gateways with hosts behind them and they cannot connect. I've also tried "home" NAT (a Google WiFi router based NAT), with no success.

From a networking perspective, I get why it's "tricky"; I was hoping there was some trick nebula was doing.

nbrownus commented 4 years ago

@rawdigits can speak to the punching better than I can. If you are having problems in AWS then we can get a test running and sort out the issues.

jatsrt commented 4 years ago

Yeah, so all my tests have had at least one host behind an AWS NAT Gateway

rawdigits commented 4 years ago

Long shot, but one more thing to try until I set up an AWS NAT GW: set the UDP port on all nodes to 4242 and let NAT remap it. One ISP I've dealt with blocks random ephemeral UDP ports above 32,000, presumably because they think every high UDP port is BitTorrent.

Probably won't work, but easy to test..
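
In other words, a sketch of the listen section for this test on every node, instead of the port: 0 used above:

listen:
  host: 0.0.0.0
  port: 4242   # same fixed port on every node; the NAT may still remap it externally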

jatsrt commented 4 years ago

@rawdigits same issue

Network combination:

  • Lighthouse - DigitalOcean NYC3 - public IP
  • Server - AWS Oregon - private VPC with AWS NAT Gateway (172.31.0.0/16)
  • Laptop - Verizon FIOS with Google WiFi router NAT (192.168.1.0/24)
  • Server2 (added later to test) - AWS Ohio - private VPC with AWS NAT Gateway (10.200.200.0/24)

I added in a second server in a different VPC on AWS to remove the FIOS variable, and had the same results, with server and server2 trying to communicate

INFO[0065] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0066] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0067] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0069] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0071] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0072] Handshake message sent                        handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
rawdigits commented 4 years ago

@jatsrt I'll stand up a testbed this week to explore what may be the cause of the issue. Thanks!

iamid0 commented 4 years ago

> [Quoting @jatsrt's comment above in full: the nebula-cert commands, the lighthouse/laptop/server configs, and the repeating "Handshake message sent" log lines.]

I have got the same situation:

  • node_A <----> lighthouse: OK
  • node_B <----> lighthouse: OK
  • node_A <----> node_B: does not work, they cannot ping each other

But I found that node_A and node_B can communicate with each other ONLY if both are connected to the same router, such as the same WiFi router.

PS: punch_back: true is set on both node_A and node_B.

No firewall on node_A, node_B and lighthouse.

fireapp commented 4 years ago

Hole punching is very difficult and random.

spencerryan commented 4 years ago

I also can't get nebula to work properly when both nodes are behind a typical NAT (technically PAT), regardless of any port pinning I do in the config. They happily connect to the lighthouse I have in AWS, but it seems like something isn't working properly. I've got punchy and punch_back enabled on everything and it doesn't seem to help. I've tried setting the port on the nodes to 0, and also tried the same port that the lighthouse is listening on.

The nodes have no issues connecting to each other over the MPLS, but we don't want that (performance reasons)

Edit: To add a bit more detail, even Meraki's AutoVPN can't deal with this. In their situation the "hub" needs to be told its public IP and a fixed port that is open inbound. I'd be fine with that as an option, and it may be the only reliable one if both nodes are behind different NATs.

Another option I had considered: what if we could use the lighthouses to hairpin traffic? I'd much rather pay AWS for the bandwidth than have to deal with unfriendly NATs everywhere.

rawdigits commented 4 years ago

I did a bit more research, and it appears that the AWS NAT Gateway uses symmetric NAT, which isn't friendly to hole punching of any kind. NAT Gateways also don't appear to support any type of port forwarding, so fixing this by statically assigning and forwarding a port doesn't appear to be an option.

A NAT instance would probably work, but I realize that's probably not a great option. One thing I recommend considering would be to give instances a routable IP address, but disallow all inbound traffic. This wouldn't greatly change the security of your network, since you still aren't allowing any unsolicited packets to reach the hosts, but would allow hole punching to work properly.

spencerryan commented 4 years ago

I don't think NAT as such is the issue, but PAT (port translation). With that you can't predict what your public port will be, and hole punching becomes impossible if both ends are behind a similar PAT. I'm going to do some testing, but I think that as long as 1 of the 2 nodes has a 1:1 NAT (no port translation), a public IP directly on the node isn't a concern.

If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box.

wadey commented 4 years ago

If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box

I've thought about this before. You need at least 2 lighthouses, and I think it's best to implement it as a flag on the non-lighthouses (when you query the lighthouses for a host, if you get results with the same IP but different ports, then you know the remote is problematic).

spencerryan commented 4 years ago

I haven't dug into the handshake code but if you include the source port in the handshake the lighthouse can compare that to what it sees. If they differ you know something in the middle is doing port translation.

jocull commented 4 years ago

Aha, @nfam I think I spotted the config problem.

instead of

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "LIGHTHOUSE_PUBLIC_IP"

it should be

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
  - "192.168.100.1"

I bet this is also my issue... will test it soon. That section is confusing 😕

jocull commented 4 years ago

That was not a fix - I had it configured like this already. After more testing, I think what I have is a hole punching issue with my NAT.

zfwjs commented 4 years ago

I had a similar issue with a DO lighthouse and two Windows PC's on the same LAN.

I could ping between the lighthouse and PC's, but not between the PC's.

Adding a windows defender firewall rule worked for me as well, even though there were already rules added by nebula.

I didn't add a port rule; instead I added a custom rule allowing the network 192.168.100.0/24. I'm using 0 for the port on the nodes.

gebi commented 4 years ago

We had similar problems getting nebula to work. It seems nebula just can't work with "normal" consumer setups (both sides behind NAT).

It's not only me; 3 colleagues have also tried it without success. The common error pattern was that all boxes can reach the lighthouse via nebula, but unless they are on the same network, NO nebula node was able to reach any other nebula node (except the lighthouse). I've tested it for over 2 weeks from various different networks with my laptop and could not get a connection working to any nebula node other than the lighthouse a single time.

Maybe it would be a good idea to adapt the readme to say that nebula is more for the server use case, because for consumers it seems not to work for the main use case.

Btw... I had the interesting problem that most of the machines nebula runs on share a common network (e.g. a docker or k8s network), which also shows up in the lighthouse tables; since nebula runs on the host, there is also a nebula at the other end of that address, just the wrong one (it's talking to itself). With the config problems mentioned in this thread, which I also debugged through, I can't say whether this was related to the initial connection problems.
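
One thing that might help with the docker/k8s addresses being reported is the lighthouse local_allow_list option (it appears in some configs later in this thread), which filters which local addresses a node advertises. A sketch, with an example interface pattern that may not match your setup:

lighthouse:
  local_allow_list:
    # don't advertise addresses from docker bridge interfaces to the lighthouses
    interfaces:
      'docker.*': false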

rawdigits commented 4 years ago

I share the frustration of dealing with connections that are NAT'd on both sides. Folks could do IP routing or proxying via other nodes, but it defeats the simplicity that nebula brings, and is not a true solution.

Nebula was created as a server-to-server mesh network, but now that we have ported it to every platform (not all released yet, but it works on ios/android), we absolutely need to handle use cases that involve clients behind any kind of NAT or more complex networking scenario, and thus relaying is our only viable option.

Note: relay nodes can be any node on a network, and don't have to be devoted to relaying. The ones you choose to use as relays should, however, have a direct internet connection for them to be useful.

There is a bit of discussion happening in the nebulaoss slack group, but just to make it available here as well (my words reposted):

There are some NATs we just don’t handle well yet. I have a personal interest in doing relaying and am actively working on it again, so hopefully something to share soon. The current experiments i’m doing involve allowing individual nodes to advertise relay node ips/ports as a way to reach them, which would transparently work around NAT for any node that advertises itself as having a relay.

[...]

I’m envisioning it being a configuration option on nodes and clients, with two separate purposes. On a relay node it would be something like am_relay: true to signal that the node allows other nodes to use it as a relay (more accurately, a bouncer). On endpoints, especially behind NAT, there would be an option that looks similar to the lighthouse config, something like:

relays:
  {relay_nebula_ip}
  {relay2_nebula_ip}
gebi commented 4 years ago

Thx for the feedback! (i've put the whining at the end, sorry)

Yes, ultimately relays are necessary, e.g. as tailscale puts it:

https://github.com/tailscale/tailscale/blob/master/derp/derp.go#L9

// DERP is used by Tailscale nodes to proxy encrypted WireGuard
// packets through the Tailscale cloud servers when a direct path
// cannot be found or opened. DERP is a last resort. Both sides
// between very aggressive NATs, firewalls, no IPv6, etc? Well, DERP.

But relays should not be used unnecessarily, they are just a last resort.

STUN or ICE do a whole lot to get through NATs, but an additional idea would be to use UPnP or NAT-PMP when configured.


I really appreciate your honest answer, though I'm inclined to say that "There are some NATs we just don't handle well yet" might not quite cut it; in my experience it's "not at all". Our home servers were behind some consumer stuff, but on every other network I tested as well (corporate / hackerspaces / ...), nothing worked except the connection to the lighthouse (so the connection should have been working in principle).

spencerryan commented 4 years ago

I would agree that while some NAT combos are nearly impossible, there are many situations that should work but do not. Cisco has figured it out with Meraki's AutoVPN.

I do think having the relay as an option is a good thing, but it shouldn't be necessary in some configs. Per my comment above, the lighthouses should be able to detect if PAT is in use. If it is, you can still make it work without a relay as long as 1 of the 2 ends of a connection is not using PAT (NAT is fine). If both ends are using PAT, a relay will be required.

rawdigits commented 4 years ago

Unfortunately I don't have any stats or insight into how well NAT is working across the userbase. The only thing I can say with confidence is that I'm successfully using it myself to connect from home to devices at various locations around the world that are behind NATs themselves, but I'm sure we can do better.

I was going to bring up uPNP/nat-pmp in my original reply but decided against it. Since you've mentioned it, my thoughts for now are: We should do that, too, but the number of people who will benefit from relaying is much higher than the number who will benefit from router-allowed NAT traversal at the moment. It certainly has the upshot of making direct non-relayed tunnels, so is also worth doing, but I'd like to have relays done first.

Out of curiosity have you tested software that uses STUN/ICE/RFCn on those networks where nebula doesn't create a tunnel? I would love to debug why ours wouldn't work if another would, but I don't have any good test setups to reproduce these issues at the moment. I'd also be happy to replicate your setup hardware/software-wise so I can find what we're doing incorrectly with hole punching, if other solutions are doing it without issue.

rawdigits commented 4 years ago

To clarify the above: I totally believe it isn't working for some folks in situations where it should. I just don't have detail on their setups yet, so I haven't been able to replicate and find a root cause.

gebi commented 4 years ago

IMHO currently the best example of nat traversal is tailscale, they use a combination of STUN and ICE together with their encrypted relay (DERP).

Awesome... I will re-do the nebula setup, get everything up and running again, and help you debug if you want :). Even though we're under quarantine at the moment, I'm sure I can get it to fail between two nodes.

btw... one additional nice feature of a relay would be possible support for http proxy (as many corps still use a proxy for internet access).

PS: should I create an issue for the IP collision problem I found, with the docker network present on multiple nebula nodes and nebula listening on the "same" address on both nodes? I've "fixed" it partly through firewall rules and different nebula ports on each node, which might be an uncommon config for newcomers.

rawdigits commented 4 years ago

Last issue first: I recommend trying a random port (instead of choosing a numbered port that is identical on every node) by using port: 0 in the config. That's how it is used in kubernetes in a few places, to avoid reusing a single port number. This is also how I run it on devices behind NAT, to improve the chances they don't overlap and have to be reassigned a new NAT'd port. (Perhaps a thing for you to try as well.)

(TBH, I'd just make every non-lighthouse node on any nebula network port: 0 unless you have a restrictive network)
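
A sketch of that listen section on a non-lighthouse node (the same form already used by the laptop and server configs earlier in this thread):

listen:
  host: 0.0.0.0
  port: 0   # let the OS pick a random source port, which reduces port collisions behind NAT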

I agree that STUN and ICE plus relaying is a good solution (really the only solution), but it would be useful to know whether tailscale is successfully doing NAT traversal in places we don't, or whether they fail in the exact same situations and their relaying makes things work. I say this because either 1) they are falling back to relaying because their hole punching encounters the same issues as ours, or 2) their hole punching is succeeding where ours is failing in some cases.

If point 1 is true, then doing relaying is the only thing for us to do. If point 2 is true, we need to do relaying, but we also need to look at our hole punching code.

breisig commented 4 years ago

That's one major reason why I stuck with Tinc VPN. It's a mesh VPN that will route traffic through other Tinc nodes if it can't do it directly. Once Nebula has that feature, then I would switch completely.

rawdigits commented 4 years ago

@breisig I used Tinc for many years and still think it is great. It definitely inspired some of Nebula. Now that I'm a full time indoors person, I'm typing code as fast as I can, so we'll have something to test soon. :)

breisig commented 4 years ago

@rawdigits Once you have something ready for testing that would allow Nebula to route traffic through other nodes [like Tinc], please let me know. I would be willing to test. I would drop Tinc right away for Nebula.

rawdigits commented 4 years ago

Sooo, it turns out our hole punching may have been too efficient and was triggering race conditions in various connection tracking implementations. We have now nerfed it (slowed it down slightly) and the problems I was having have mostly vanished. Once #210 is merged, I recommend building from source and testing on various NAT setups again, because I believe this exists in a lot of routers/etc and was causing problems for people.

rakundig commented 4 years ago

I downloaded the latest master, compiled it, and it worked. Only 1 test so far so needs validation and repeated tests, but looks good so far.

Test scenario (as-is, v1.1.0):

  • LH: VM, Ubuntu 18.04.4 (amd64)
  • Node A: Ubuntu 18.04.4 on metal (arm64)
  • Node B: macOS on metal (10.14.6)
  • LH on public IP with UDP 4242 allowed in
  • Node A behind consumer ATT DSL router in KC MO
  • Node B behind tethered iPhone
  • Node A --> ping --> LH == OK
  • Node B --> ping --> LH == OK
  • Node A --> ping --> Node B == Nope
  • Node B --> ping --> Node A == Nope

Test scenario (new bins):

  • Create new LH VM with public IP and UDP 4242 allowed in
  • New LH VM: Ubuntu 18.04.4 (also amd64)
  • Node A: Ubuntu 18.04.4 on metal (same as above)
  • Node B: macOS on metal (same as above)
  • Compile new nebula code to create new bins for each
  • Create new CA
  • Create new config.yaml (test-config.yaml)
  • Create new signed certs for nodes and LH (test-*.crt/key)
  • Fire up the new config and certs, using the new LH, on LH, Node A, Node B
  • Node A --> ping --> LH == OK
  • Node B --> ping --> LH == OK
  • Node A --> ping --> Node B == OK
  • Node B --> ping --> Node A == OK

Validate:

  • Stop all nebula services on all nodes
  • Restart with the original config and original bins (v1.1.0)
  • Node A --> ping --> LH == OK
  • Node B --> ping --> LH == OK
  • Node A --> ping --> Node B == Nope
  • Node B --> ping --> Node A == Nope

So, it seems that the updates have resolved the issue/race condition preventing nodes from finding each other and punching through NAT. I have notified some of my team about my findings so they can validate more thoroughly.

ETA: In the "Validate" scenario I used the new bins on Node A and B, and v1.1.0 bin on LH and it didn't work.

Therefore, all nodes need new bins. Makes sense of course, but I am adding this comment to add that extra test detail for anyone else.

Not a perfect test, but good enough for this AM. As I said, needs more testing, but is looking good so far.

gebi commented 4 years ago

Awesome, i'll also test as soon as we are allowed to go out again.

btw... as it seems now viable to use nebula i've polished up my debian package building and sent a pull request :) #211

rakundig commented 4 years ago

Update: I haven't had time to test further. However, I wanted to point out that my laptop behind an iPhone tether isn't behind symmetric NAT, so that little test doesn't prove or disprove whether the fix defeats that issue. That said, it is an overall improvement, as it did fix the scenario described above.

Good luck, and I hope that further testing proves this is a fix and moves the whole project forward.

@gebi why can't you test without "going out?"

mismacku commented 4 years ago

First off... Nebula is awesome and I appreciate having the privilege of using it. Thank you!

I'm experiencing this same issue using v1.2 and built from source commit 363c836422627db8593b4ecebb271d1dfdef05a8, but would like to note that I don't see the same issue on OS X.

  • lighthouse running on Linode
  • linux server A behind Google WiFi NAT
  • MacBook behind Google WiFi NAT
  • linux server B behind unknown NAT

Connectivity:

  • lighthouse can reach all machines
  • all machines can reach lighthouse
  • server A <-> MacBook = OK (same LAN)
  • server B <-> MacBook = OK (not on the same LAN)
  • server A <-> server B = FAIL (not on the same LAN)

When I ping server A <-> server B nebula logs nebula[31848]: time="2020-04-21T15:33:48-07:00" level=info msg="Handshake message sent" endlessly, but traffic never arrives.
I've turned numerous knobs, but can't seem to get it to work. Any help is appreciated!

Macbook config:

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "172.16.0.1"

listen:
  host: 0.0.0.0
  port: 0

punchy:
  punch: true

tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:

logging:
  level: debug
  format: text

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: icmp
      host: any

    - port: 443
      proto: tcp
      groups:
        - laptop
        - home

Server A and server B config:

lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "172.16.0.1"

  local_allow_list:

listen:
  host: 10.137.124.217
  port: 0

punchy:
  punch: true
  respond: true
  delay: 1s

tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:

logging:
  level: debug
  format: text

handshakes:
  try_interval: 100ms
  retries: 20
  wait_rotation: 5

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: any
      host: any

lighthouse config:

lighthouse:
  am_lighthouse: true
  interval: 60
  hosts:

listen:
  host: 0.0.0.0
  port: 4242

punchy:
  punch: true
  respond: true

tun:
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes:
  unsafe_routes:

logging:
  level: debug
  format: text

firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000

  outbound:
    - port: any
      proto: any
      host: any

  inbound:
    - port: any
      proto: any
      host: any
iamid0 commented 4 years ago

> [Quoting @mismacku's comment above in full: the setup and connectivity details and the MacBook / server / lighthouse configs.]

I have got a similar issue. Nebula may fail if there is a NAT, or multiple NATs, to be punched through.

hasturo commented 4 years ago

Hi,

I'm trying to find out if there is any support for routing traffic through the lighthouse host between nodes which can't reach each other. From my understanding, a gateway doing port-based address translation only ever has working session table entries that reflect the communication between the nebula host and a lighthouse.

Even if you're able to fake a session between both gateways by using the lighthouse for signaling, you don't get the right source/destination ports, simply because they are random.

-hasturo

windwalker78 commented 4 years ago

> [Quoting @rawdigits' relaying plans from above in full.]

This is a really cool feature and we need it. We have about 10 clients with 1 lighthouse, and sometimes some of the clients cannot talk to each other, which still forces us to use traditional VPN solutions. Is there any information on whether this will be implemented and, the hottest question, when?