jatsrt closed this issue 1 year ago
Also, to note in this setup all nodes are behind different NATs on different networks. Hub and spoke with the hub being the lighthouse and spokes going to hosts on different networks.
My best guess (because I just messed this up in a live demo), is that am_lighthouse may be set to "true" on the individual nodes.
Either way, can you post your lighthouse config and one of your node configs?
(feel free to replace any sensitive IP/config bits, just put consistent placeholders in their place)
Hi, I have the same issue. My lighthouse is on a DigitalOcean droplet with a public IP. My MacBook and Linux laptop at home are on the same network, both connected to the lighthouse. I can ping the lighthouse from both laptops, but I cannot ping from one laptop to the other.
Lighthouse config
pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/lighthouse.crt
  key: /data/cert/nebula/lighthouse.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: true
  interval: 60
  hosts:
listen:
  host: 0.0.0.0
  port: 4242
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop
Macbook config
pki:
  ca: /Volumes/code/cert/nebula/ca.crt
  cert: /Volumes/code/cert/nebula/mba.crt
  key: /Volumes/code/cert/nebula/mba.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: debug
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop
Linux laptop config
pki:
  ca: /data/cert/nebula/ca.crt
  cert: /data/cert/nebula/server.crt
  key: /data/cert/nebula/server.key
static_host_map:
  "192.168.100.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"
punchy: true
listen:
  host: 0.0.0.0
  port: 4242
tun:
  dev: neb0
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 120h
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
    - port: 443
      proto: tcp
      groups:
        - laptop
@nfam thanks for sharing the config. My next best guess is that the NAT isn't reflecting and, for some reason, the nodes also aren't finding each other locally.
Try setting the local_range config option on the two laptops, which can give them a hint about the local network range to use for establishing the direct tunnel.
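For reference, a minimal sketch of what that hint could look like in a node config (the 192.168.1.0/24 subnet here is an assumed home LAN; substitute the network the two laptops actually share):

```yaml
# Assumed example subnet; use the real local network range of the two laptops.
local_range: "192.168.1.0/24"
```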
@nfam similar setup. Public lighthouse on Digital Ocean, laptop on home NAT, and server in AWS behind a NAT. Local and AWS are using different private ranges (though overlap should be handled).
@rawdigits setting local_range does not help.
I stopped nebula on both laptops, set the lighthouse log level to debug, cleared the log, and restarted the lighthouse (with no node connected to it). Following is the log I got.
time="2019-11-23T20:05:18Z" level=info msg="Main HostMap created" network=192.168.100.1/24 preferredRanges="[]"
time="2019-11-23T20:05:18Z" level=info msg="UDP hole punching enabled"
time="2019-11-23T20:05:18Z" level=info msg="Nebula interface is active" build=1.0.0 interface=neb0 network=192.168.100.1/24
time="2019-11-23T20:05:18Z" level=debug msg="Error while validating outbound packet: packet is not ipv4, type: 6" packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 183 226 137 252 10 196 21 15 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 27 133 0 0 0 0]"
My Config:
nebula-cert sign -name "lighthouse" -ip "192.168.100.1/24"
nebula-cert sign -name "laptop" -ip "192.168.100.101/24" -groups "laptop"
nebula-cert sign -name "server" -ip "192.168.100.201/24" -groups "server"
Lighthouse:
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/lighthouse.crt
  key: /etc/nebula/lighthouse.key
static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]
lighthouse:
  am_lighthouse: true
  interval: 60
listen:
  host: 0.0.0.0
  port: 4242
punchy: true
tun:
  dev: nebula1
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
Laptop:
pki:
  # The CAs that are accepted by this node. Must contain one or more certificates created by 'nebula-cert ca'
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/laptop.crt
  key: /etc/nebula/laptop.key
static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"
listen:
  host: 0.0.0.0
  port: 0
punchy: true
tun:
  dev: nebula1
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
Server:
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/server.crt
  key: /etc/nebula/server.key
static_host_map:
  "192.168.100.1": ["167.71.175.250:4242"]
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"
listen:
  host: 0.0.0.0
  port: 0
punchy: true
tun:
  dev: nebula1
  mtu: 1300
logging:
  level: info
  format: text
firewall:
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
    max_connections: 100000
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: icmp
      host: any
With this setup, both the server and the laptop can ping the lighthouse, and the lighthouse can ping the server and the laptop, but the laptop cannot ping the server and the server cannot ping the laptop.
I get messages such as this as it's trying to make the connection:
INFO[0006] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0007] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0009] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0011] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0012] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
INFO[0014] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="172.31.106.61:37058" vpnIp=192.168.100.201
INFO[0016] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3339283633 remoteIndex=0 udpAddr="18.232.11.42:4726" vpnIp=192.168.100.201
@nfam similar error, not sure it's the problem
DEBU[0066] Error while validating outbound packet: packet is not ipv4, type: 6 packet="[96 0 0 0 0 8 58 255 254 128 0 0 0 0 0 0 139 176 20 9 146 65 14 250 255 2 0 0 0 0 0 0 0 0 0 0 0 0 0 2 133 0 60 66 0 0 0 0]"
@jatsrt
The Error while validating outbound packet messages can mostly be ignored; those are just some packet types nebula doesn't support bouncing off.
As far as the handshakes, for some reason hole punching isn't working. A few things to try:
1) Add punch_back: true on the "server" and "laptop" nodes.
2) Explicitly allow all UDP in to the "server" node from the internet (via AWS security groups, just as a test).
3) Verify iptables isn't blocking anything.
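For step 1, a sketch of what the node config might look like in the nebula version used in this thread, where the punch settings were top-level booleans (later releases moved them under a punchy: block with punch:/respond: keys, as seen further down this thread):

```yaml
# Top-level form used by early releases (matching the configs above):
punchy: true       # periodically send packets to keep the NAT mapping open
punch_back: true   # respond to punch requests from the other side
```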
Also, it appears the logs with the handshake messages are from the laptop? If so, can you also share nebula logs from the server as it tries to reach the laptop?
Thanks!
Aha, @nfam I think I spotted the config problem.
instead of
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"
it should be
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"
Adding #40 to cover the accidental misconfiguration noted above.
@rawdigits yes, it is. Now both laptops can ping to each other. Thanks!
@rawdigits
Server log:
time="2019-11-24T00:25:21Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:22Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:23Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:24Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:25Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:26Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:27Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:28Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="192.168.0.22:51176" vpnIp=192.168.100.101
time="2019-11-24T00:25:30Z" level=info msg="Handshake message sent" handshake="map[stage:1 style:ix_psk0]" initiatorIndex=1689969496 remoteIndex=0 udpAddr="96.252.12.10:51176" vpnIp=192.168.100.101
So, I tried a few more setups, and it seems to come down to this: if the two hosts trying to communicate with each other are on different networks and both behind NAT, it will not work.
If the lighthouse does not facilitate the communication/tunneling, this would make sense, but is it meant to be a limitation?
The dual-NAT scenario is a bit tricky; there is possibly room for improvement on nebula's side there. Do you have details on the type of NATs you are dealing with?
@nbrownus nothing crazy. I've tried multiple AWS VPC NAT gateways with hosts behind them, and they cannot connect. I've also tried "home" NAT (a Google WiFi router based NAT), with no success.
From a networking perspective I get why it's "tricky"; I was hoping there was some trick nebula was doing.
@rawdigits can speak to the punching better than I can. If you are having problems in AWS then we can get a test running and sort out the issues.
Yeah, so all my tests have had at least one host behind an AWS NAT Gateway
Long shot, but one more thing to try until I set up an AWS NAT GW: set the UDP port on all nodes to 4242 and let NAT remap it. One ISP I've dealt with blocks the random ephemeral UDP ports above 32,000, presumably because they think every high UDP port is bittorrent.
Probably won't work, but easy to test..
@rawdigits same issue
Network combination:
Lighthouse: Digital Ocean NYC3, public IP
Server: AWS Oregon, private VPC with AWS NAT Gateway (172.31.0.0/16)
Laptop: Verizon FiOS with Google WiFi router NAT (192.168.1.0/24)
Server2 (added later to test): AWS Ohio, private VPC with AWS NAT Gateway (10.200.200.0/24)
I added in a second server in a different VPC on AWS to remove the FIOS variable, and had the same results, with server and server2 trying to communicate
INFO[0065] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0066] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0067] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0069] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
INFO[0071] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="172.31.106.61:4242" vpnIp=192.168.100.201
INFO[0072] Handshake message sent handshake="map[stage:1 style:ix_psk0]" initiatorIndex=760525141 remoteIndex=0 udpAddr="18.232.11.42:42005" vpnIp=192.168.100.201
@jatsrt I'll stand up a testbed this week to explore what may be the cause of the issue. Thanks!
I have got the same situation.
node_A <----> lighthouse: OK
node_B <----> lighthouse: OK
node_A <----> node_B: does not work, they cannot ping each other.
But I found, node_A and node_B can communicate with each other ONLY if both are connected to the same router, such as the same WiFi router.
PS: punch_back: true is set on both node_A and node_B.
No firewall on node_A, node_B and lighthouse.
Hole punching is very difficult and random.
I also can't get nebula to work properly when both nodes are behind a typical NAT (technically PAT), regardless of any port pinning I do in the config. They happily connect to the lighthouse I have in AWS, but it seems like something isn't working properly. I've got punchy and punch_back enabled on everything and it doesn't seem to help. I've tried setting the port on the nodes to 0, and also tried the same port the lighthouse is listening on.
The nodes have no issues connecting to each other over the MPLS, but we don't want that (performance reasons)
Edit: To add a bit more detail, even Meraki's AutoVPN can't deal with this. In their situation the "hub" needs to be told its public IP and a fixed port that is open inbound. I'd be fine with that as an option, and it may be the only reliable one if both nodes are behind different NATs.
Another option I had considered, what if we could use the lighthouses to hairpin traffic? I'd much rather pay AWS for the bandwidth than have to deal with unfriendly NATs everywhere.
I did a bit more research, and it appears that the "AWS Nat Gateway" uses Symmetric NAT, which isn't friendly to hole punching of any kind. NAT gateways also don't appear to support any type of port forwarding, so fixing this by statically assigning and forwarding a port doesn't appear to be an option.
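The difference between the NAT types can be illustrated with a toy model (plain Python, not nebula code): hole punching relies on the NAT reusing the same external port when a node starts talking to a new peer, which a full-cone NAT does and a symmetric NAT does not.

```python
# Toy model of NAT mapping behaviour. A cone NAT keeps one external port
# per internal socket; a symmetric NAT allocates a fresh external port
# per (socket, destination) pair, which defeats hole punching.
import itertools

class ConeNAT:
    def __init__(self):
        self.ports = {}                       # internal socket -> external port
        self.next_port = itertools.count(40000)

    def external_port(self, internal, dest):
        # Same external port regardless of destination.
        return self.ports.setdefault(internal, next(self.next_port))

class SymmetricNAT:
    def __init__(self):
        self.ports = {}                       # (socket, destination) -> port
        self.next_port = itertools.count(40000)

    def external_port(self, internal, dest):
        # A new external port for every distinct destination.
        return self.ports.setdefault((internal, dest), next(self.next_port))

def punch_succeeds(nat):
    # The node talks to the lighthouse first; the lighthouse records the
    # external port it saw and hands that port to the peer.
    seen_by_lighthouse = nat.external_port("node", "lighthouse")
    # The node then punches toward the peer; the punch only lines up with
    # the peer's incoming packets if the NAT reuses the same mapping.
    used_toward_peer = nat.external_port("node", "peer")
    return seen_by_lighthouse == used_toward_peer

print(punch_succeeds(ConeNAT()))       # True
print(punch_succeeds(SymmetricNAT()))  # False
```

This is why the AWS NAT Gateway (symmetric) fails where many home routers (cone-style) succeed.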
A NAT instance would probably work, but I realize that's probably not a great option. One thing I recommend considering would be to give instances a routable IP address, but disallow all inbound traffic. This wouldn't greatly change the security of your network, since you still aren't allowing any unsolicited packets to reach the hosts, but would allow hole punching to work properly.
I don't think NAT as such is the issue, but PAT (port translation). Unfortunately with that you can't predict what your public port will be, and hole punching becomes impossible if both ends are behind a similar PAT. I'm going to do some testing, but I think that as long as one of the two nodes has a 1:1 NAT (no port translation), not having a public IP directly on the node isn't a concern.
If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box.
If I get particularly ambitious I may attempt to whip up some code in lighthouse to detect when one/both nodes are behind a PAT and throw a warning saying that this won't work out of the box
I've thought about this before. You need at least 2 lighthouses, and I think it's best to implement as a flag on the non-lighthouses (when you query the lighthouses for a host, if you get results with the same IP but different ports then you know the remote is problematic).
I haven't dug into the handshake code but if you include the source port in the handshake the lighthouse can compare that to what it sees. If they differ you know something in the middle is doing port translation.
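Both detection ideas reduce to simple port comparisons. A sketch (hypothetical helpers, not nebula's actual API) under the assumptions above:

```python
# Idea 1: compare the source port a node claims in its handshake against
# the port the lighthouse saw the packet arrive from. A mismatch means a
# middlebox rewrote the port (PAT) on the way.
def pat_in_path(claimed_port: int, observed_port: int) -> bool:
    """True when something between node and lighthouse rewrote the port."""
    return claimed_port != observed_port

# Idea 2 (needs two or more lighthouses): if different lighthouses report
# the same remote IP with different ports for one host, that host's NAT
# maps each destination to a new port, so punching toward it will fail.
def problematic_remote(reported):
    """reported: (ip, port) pairs for one host, one per lighthouse."""
    ports_by_ip = {}
    for ip, port in reported:
        ports_by_ip.setdefault(ip, set()).add(port)
    return any(len(ports) > 1 for ports in ports_by_ip.values())

# Node listens on 4242 but the lighthouse saw 51176 -> PAT in the path.
print(pat_in_path(4242, 51176))                      # True
# Two lighthouses saw the same IP on different ports -> symmetric NAT/PAT.
print(problematic_remote([("18.232.11.42", 4726),
                          ("18.232.11.42", 37058)])) # True
```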
Aha, @nfam I think I spotted the config problem.
instead of
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "LIGHTHOUSE_PUBLIC_IP"
it should be
lighthouse:
  am_lighthouse: false
  interval: 60
  hosts:
    - "192.168.100.1"
I bet this is also my issue... will test it soon. That section is confusing 😕
That was not the fix; I had it configured like this already. After more testing, I think what I have is a hole punching issue with my NAT.
I had a similar issue with a DO lighthouse and two Windows PCs on the same LAN.
I could ping between the lighthouse and the PCs, but not between the PCs.
Adding a Windows Defender firewall rule worked for me as well, even though there were already rules added by nebula.
I didn't add a port rule; instead I added a custom rule with an allow for the network 192.168.100.0/24. I'm using 0 for the port on the nodes.
We had similar problems getting nebula to work. It seems nebula just can't work with "normal" consumer setups (both sides behind NAT).
It's not only me: three colleagues have also tried it without success. The common error pattern was that all boxes could reach the lighthouse via nebula, but unless they were on the same network, no nebula node was able to reach any other nebula node (except the lighthouse). I've tested it for over two weeks from various different networks with my laptop and could not get a connection working to any nebula node other than the lighthouse a single time.
Maybe it would be a good idea to adapt the readme to say that nebula is more for a server use case, because for consumers it seems not to work for the main use case.
Btw, I ran into an interesting problem with nebula: most of the machines nebula runs on have the same network (e.g. a docker or k8s network), which also shows up in the lighthouse tables, and since nebula runs on the host there is also a nebula instance there, just the wrong one (it's talking to itself). With the config problems mentioned in this thread, which I also debugged through, I just can't say whether this was related to the initial connection problems.
I share the frustration of dealing with connections that are NAT'd on both sides. Folks could do IP routing or proxying via other nodes, but it defeats the simplicity that nebula brings, and is not a true solution.
Nebula was created as a server-to-server mesh network, but now that we have ported it to every platform (not all released yet, but it works on ios/android), we absolutely need to handle use cases that involve clients behind any kind of NAT or more complex networking scenario, and thus relaying is our only viable option.
Note: relay nodes can be any node on a network, and don't have to be devoted to relaying. The ones you choose to use as relays should, however, have a direct internet connection for them to be useful.
There is a bit of discussion happening in the nebulaoss slack group, but just to make it available here as well (my words reposted):
There are some NATs we just don't handle well yet. I have a personal interest in doing relaying and am actively working on it again, so hopefully something to share soon. The current experiments I'm doing involve allowing individual nodes to advertise relay node IPs/ports as a way to reach them, which would transparently work around NAT for any node that advertises itself as having a relay.
[...]
I'm envisioning it being a configuration option on nodes and clients, with two separate purposes.
On a relay node it would be something like am_relay: true to signal that the node allows other nodes to use it as a relay (more accurately, a bouncer).
On endpoints, especially behind NAT, there would be an option that looks similar to the lighthouse config, something like:
relays:
  - {relay_nebula_ip}
  - {relay2_nebula_ip}
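Filled in, the sketch might look like this on the two kinds of nodes (the key names follow the proposal in this thread, not a released config schema, and the relay IP is a placeholder):

```yaml
# On the relay node (directly reachable from the internet):
am_relay: true

# On a node behind NAT, advertising relays it can be reached through:
relays:
  - "192.168.100.5"   # nebula IP of a relay node (example placeholder)
```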
Thanks for the feedback! (I've put the whining at the end, sorry.)
Yes, ultimately relays are necessary, e.g. as tailscale puts it:
https://github.com/tailscale/tailscale/blob/master/derp/derp.go#L9
// DERP is used by Tailscale nodes to proxy encrypted WireGuard
// packets through the Tailscale cloud servers when a direct path
// cannot be found or opened. DERP is a last resort. Both sides
// between very aggressive NATs, firewalls, no IPv6, etc? Well, DERP.
But relays should not be used unnecessarily, they are just a last resort.
STUN and ICE do a whole lot to get through NATs, but an additional idea would be to use UPnP or NAT-PMP when configured.
<== snip
I really appreciate your honest answer, though I'm inclined to say that "There are some NATs we just don't handle well yet" might not quite cut it; in my experience it's "not at all". Our home servers were behind some consumer stuff, but on every other network I tested as well (corporate / hackerspaces / ...) nothing worked except the connection to the lighthouse (thus the connection should have been working in principle).
I would agree that while some NAT combos are nearly impossible, there are many situations that should work but do not. Cisco has figured it out with Meraki's AutoVPN.
I do think having the relay as an option is a good thing, but it shouldn't be necessary in some configs. Per my comment above, the lighthouses should be able to detect if PAT is in use. If it is, you can still make it work without a relay as long as one of the two ends of a connection is not using PAT (NAT is fine). If both ends are using PAT, a relay will be required.
Unfortunately I don't have any stats or insight into how well NAT is working across the userbase. The only thing I can say with confidence is that I'm successfully using it myself to connect from home to devices at various locations around the world that are behind NATs themselves, but I'm sure we can do better.
I was going to bring up UPnP/NAT-PMP in my original reply but decided against it. Since you've mentioned it, my thoughts for now are: we should do that too, but the number of people who will benefit from relaying is much higher than the number who will benefit from router-allowed NAT traversal at the moment. It certainly has the upshot of making direct non-relayed tunnels, so it is also worth doing, but I'd like to have relays done first.
Out of curiosity have you tested software that uses STUN/ICE/RFCn on those networks where nebula doesn't create a tunnel? I would love to debug why ours wouldn't work if another would, but I don't have any good test setups to reproduce these issues at the moment. I'd also be happy to replicate your setup hardware/software-wise so I can find what we're doing incorrectly with hole punching, if other solutions are doing it without issue.
To clarify the above: I totally believe it isn't working for some folks in situations where it should. I just don't have detail on their setups yet, so I haven't been able to replicate and find a root cause.
IMHO currently the best example of nat traversal is tailscale, they use a combination of STUN and ICE together with their encrypted relay (DERP).
Awesome... I will re-do the nebula setup, get everything up and running again, and help you debug if you want :). Even if we are in quarantine at the moment, I'm sure I will get it to not work between two nodes.
Btw, one additional nice feature for a relay would be possible support for an HTTP proxy (as many corps still use a proxy for internet access).
PS: should I create an issue for the IP collision problem I found, where the docker network is present on multiple nebula nodes and nebula ends up listening on the "same" address on both? I've partly "fixed" it through firewall rules and different nebula ports on each node, which might be an uncommon config for newcomers.
Last issue first: I recommend trying a random port (instead of choosing a numbered port that is identical on every node) by using port: 0 in the config. That's how it is used in kubernetes in a few places, to avoid reusing a single port number. This is also how I run it on devices behind NAT, to improve the chances they don't overlap and have to be reassigned a new NAT'd port. (Perhaps a thing for you to try as well.)
(TBH, I'd just use port: 0 on every non-lighthouse node on any nebula network unless you have a restrictive network.)
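Concretely, that suggestion is just the listen block with a zero port, so the OS assigns each node its own ephemeral port:

```yaml
listen:
  host: 0.0.0.0
  port: 0   # 0 = let the OS pick; avoids every node sharing one port number
```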
I agree that STUN and ICE plus relaying is a good solution (really the only solution), but it would be useful to know if tailscale is successfully doing NAT traversal in a place we don't or if they fail in the exact same situations, their relaying makes things work. I say this because it is either true that 1) They are falling back to relaying because their hole punching encounters the same issues as ours or 2) Their hole punching is succeeding when ours is failing in some cases.
If point 1 is true, then us doing relaying is the only thing to do. If point 2 is true, we need to do relaying, but also need to look at hole punching code.
That's one major reason why I stuck with Tinc VPN. It's a mesh VPN that will route traffic through other Tinc nodes if it can't do it directly. Once Nebula has that feature, then I would switch completely.
@breisig I used Tinc for many years and still think it is great. It definitely inspired some of Nebula. Now that I'm a full time indoors person, I'm typing code as fast as I can, so we'll have something to test soon. :)
@rawdigits Once you have something ready for testing that would allow Nebula to route traffic through other nodes [like Tinc], please let me know. I would be willing to test. I would drop Tinc right away for Nebula.
Sooo, it turns out our hole punching may have been too efficient and was triggering race conditions in various connection tracking implementations. We have now nerfed it (slowed it down slightly) and the problems I was having have mostly vanished. Once #210 is merged, I recommend building from source and testing on various NAT setups again, because I believe this exists in a lot of routers/etc and was causing problems for people.
I downloaded the latest master, compiled it, and it worked. Only 1 test so far so needs validation and repeated tests, but looks good so far.
Test scenario (as-is):
v1.1.0
LH: VM, Ubuntu 18.04.4 (amd64)
Node A: Ubuntu 18.04.4 on metal (arm64)
Node B: MacOS on metal (10.14.6)
LH on a public IP with UDP 4242 allowed in
Node A behind a consumer ATT DSL router in KC MO
Node B behind a tethered iPhone
Node A --> ping --> LH == OK
Node B --> ping --> LH == OK
Node A --> ping --> Node B == Nope
Node B --> ping --> Node A == Nope
Test scenario (new bins):
Create a new LH VM with a public IP and UDP 4242 allowed in
New LH VM: Ubuntu 18.04.4 (also amd64)
Node A: Ubuntu 18.04.4 on metal (same as above)
Node B: MacOS on metal (same as above)
Compile the new nebula code to create new bins for each
Create a new CA
Create a new config.yaml (test-config.yaml)
Create new signed certs for the nodes and LH (test-*.crt/key)
Fire up the new config and certs to use the new LH on LH, Node A, Node B
Node A --> ping --> LH == OK
Node B --> ping --> LH == OK
Node A --> ping --> Node B == OK
Node B --> ping --> Node A == OK
Validate:
Stop all nebula services on all nodes
Restart with the orig config and orig bins (v1.1.0)
Node A --> ping --> LH == OK
Node B --> ping --> LH == OK
Node A --> ping --> Node B == Nope
Node B --> ping --> Node A == Nope
So, it seems that the updates have resolved the issue/race condition preventing nodes from finding each other and punching through NAT. I have notified some of my team about my findings so they can validate more thoroughly.
ETA: In the "Validate" scenario I used the new bins on Node A and B and the v1.1.0 bin on the LH, and it didn't work.
Therefore, all nodes need the new bins. Makes sense of course, but I am adding this comment to capture that extra test detail for anyone else.
Not a perfect test, but good enough for this AM. As I said, needs more testing, but is looking good so far.
Awesome, I'll also test as soon as we are allowed to go out again.
Btw, as it now seems viable to use nebula, I've polished up my debian package building and sent a pull request :) #211
Update: I haven't had time to test further. However, I wanted to point out that my laptop behind an iPhone tether isn't symmetric NAT, so that little test doesn't prove or disprove anything about that issue. That said, it is an overall improvement, as it did fix the scenario described above.
Good luck, and I hope that further testing proves this is a fix and moves the whole project forward.
@gebi why can't you test without "going out?"
First off... Nebula is awesome and I appreciate having the privilege of using it. Thank you!
I'm experiencing this same issue using v1.2 and built from source commit 363c836422627db8593b4ecebb271d1dfdef05a8, but would like to note that I don't see the same issue on OS X.
Setup:
- lighthouse running on Linode
- Linux server A behind Google WiFi NAT
- MacBook behind Google WiFi NAT
- Linux server B behind unknown NAT

Connectivity:
- lighthouse can reach all machines
- all machines can reach the lighthouse
- server A <-> MacBook = OK (same LAN)
- server B <-> MacBook = OK (not on the same LAN)
- server A <-> server B = FAIL (not on the same LAN)

When I ping server A <-> server B, Nebula endlessly logs nebula[31848]: time="2020-04-21T15:33:48-07:00" level=info msg="Handshake message sent", but traffic never arrives. I've turned numerous knobs, but can't seem to get it to work. Any help is appreciated!
Macbook config:
lighthouse:
am_lighthouse: false
interval: 60
hosts:
- "172.16.0.1"
listen:
host: 0.0.0.0
port: 0
punchy:
punch: true
tun:
dev: nebula1
drop_local_broadcast: false
drop_multicast: false
tx_queue: 500
mtu: 1300
routes:
unsafe_routes:
logging:
level: debug
format: text
firewall:
conntrack:
tcp_timeout: 120h
udp_timeout: 3m
default_timeout: 10m
max_connections: 100000
outbound:
- port: any
proto: any
host: any
inbound:
- port: any
proto: icmp
host: any
- port: 443
proto: tcp
groups:
- laptop
- home
Server A and server B config:
lighthouse:
am_lighthouse: false
interval: 60
hosts:
- "172.16.0.1"
local_allow_list:
listen:
host: 10.137.124.217
port: 0
punchy:
punch: true
respond: true
delay: 1s
tun:
dev: nebula1
drop_local_broadcast: false
drop_multicast: false
tx_queue: 500
mtu: 1300
routes:
unsafe_routes:
logging:
level: debug
format: text
handshakes:
try_interval: 100ms
retries: 20
wait_rotation: 5
firewall:
conntrack:
tcp_timeout: 120h
udp_timeout: 3m
default_timeout: 10m
max_connections: 100000
outbound:
- port: any
proto: any
host: any
inbound:
- port: any
proto: any
host: any
lighthouse config:
lighthouse:
am_lighthouse: true
interval: 60
hosts:
listen:
host: 0.0.0.0
port: 4242
punchy:
punch: true
respond: true
tun:
dev: nebula1
drop_local_broadcast: false
drop_multicast: false
tx_queue: 500
mtu: 1300
routes:
unsafe_routes:
logging:
level: debug
format: text
firewall:
conntrack:
tcp_timeout: 120h
udp_timeout: 3m
default_timeout: 10m
max_connections: 100000
outbound:
- port: any
proto: any
host: any
inbound:
- port: any
proto: any
host: any
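For completeness: the configs above omit the pki and static_host_map sections, which a working config also needs. They would look roughly like the following sketch, assuming 172.16.0.1 is the lighthouse's nebula IP (as in the lighthouse.hosts entries above); the paths and the public address are placeholders.

```yaml
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/host.crt
  key: /etc/nebula/host.key

static_host_map:
  # lighthouse nebula IP -> its publicly routable address
  "172.16.0.1": ["LIGHTHOUSE_PUBLIC_IP:4242"]
```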
I have a similar issue. Nebula may fail when there is a NAT, or multiple layers of NAT, to punch through.
Hi,
I'm trying to find out if there is any support for routing traffic through the lighthouse host between nodes which can't reach each other. From my understanding, a gateway doing port-based address translation never has a session table entry that reflects the communication between a Nebula host and a lighthouse.
Even if you're able to fake a session between both gateways by using the lighthouse for signaling, you don't get the right source/destination ports, simply because they are random.
-hasturo
I share the frustration of dealing with connections that are NAT'd on both sides. Folks could do IP routing or proxying via other nodes, but that defeats the simplicity Nebula brings and is not a true solution.
Nebula was created as a server-to-server mesh network, but now that we have ported it to every platform (not all released yet, but it works on ios/android), we absolutely need to handle use cases that involve clients behind any kind of NAT or more complex networking scenario, and thus relaying is our only viable option.
Note: relay nodes can be any node on a network, and don't have to be devoted to relaying. The ones you choose to use as relays should, however, have a direct internet connection for them to be useful.
There is a bit of discussion happening in the nebulaoss slack group, but just to make it available here as well (my words reposted):
There are some NATs we just don't handle well yet. I have a personal interest in relaying and am actively working on it again, so hopefully I'll have something to share soon. The current experiments I'm doing involve allowing individual nodes to advertise relay node IPs/ports as a way to reach them, which would transparently work around NAT for any node that advertises itself as having a relay.
[...]
I'm envisioning it as a configuration option on nodes and clients, with two separate purposes. On a relay node it would be something like
am_relay: true
to signal that the node allows other nodes to use it as a relay (more accurately, a bouncer). On endpoints, especially those behind NAT, there would be an option that looks similar to the lighthouse config, something like:
relays:
  - {relay_nebula_ip}
  - {relay2_nebula_ip}
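Put together, the hypothetical options described in that comment might look like the sketch below. To be clear, these option names are illustrative, taken from the comment itself, and were not a released feature at the time; the nebula IPs are placeholders.

```yaml
# On a relay node (a node with a direct internet connection):
am_relay: true

# On an endpoint behind NAT, advertising relays by their nebula IPs:
relays:
  - "192.168.100.5"   # {relay_nebula_ip}
  - "192.168.100.6"   # {relay2_nebula_ip}
```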
This is a really cool feature and we need it. We have around 10 clients with 1 lighthouse, and sometimes some of the clients cannot talk to each other, which still forces us to use traditional VPN solutions. Is there any information on whether this will be implemented, and the hottest question: when?
I seem to be missing something important. If I set up a mesh of hosts that all have direct public IP addresses, it works fine. However, if I have a network with a lighthouse (public IP) and all nodes behind NAT, the nodes will not connect to each other. The lighthouse is able to communicate with all hosts, but the hosts are not able to communicate with each other.
Watching the logs, I see connection attempts to both the NAT'd public IPs and the private IPs.
I have enabled punchy and punch back, but it does not seem to help.
Hope it is something simple?
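For reference, "punchy and punch back" were booleans in older configs; the configs posted earlier in this thread express the same idea as a map. A sketch of the newer form, using exactly the keys seen in the server configs above:

```yaml
punchy:
  punch: true    # periodically punch outbound to keep NAT mappings open
  respond: true  # punch back when a handshake appears one-way (helps with harder NATs)
  delay: 1s      # delay before the punch-back response
```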