Two NAT and weird observation

thenitai commented 2 years ago

Hi,

I came to this project while testing ZeroTier and Tailscale. Both look great, but I really need something that is fast, is used in production (Slack), and doesn't involve some third party looking into our traffic.

That said, I had great success getting everything connected, except for hosts that don't have a public IP, i.e., hosts with only a private IP (either a fixed address or one assigned via DHCP).

The weird thing is that I can see traffic on the lighthouse, and when a host behind another NAT pings the host in question, I see a lot of traffic, handshakes, and responses. There are no errors, but also no replies in the shell that runs the ping.

I have pfSense running in both networks, but even with turning the firewall off, it does not work. Next, I've tried to set up a Linux Bridge and that also didn't resolve the issue. I've re-created all the certs, double-checked the IPs, made certain that the lighthouse has a public IP, etc.

Note: As soon as I give this host a public IP, it works.

Honestly, I'm not really sure what is going on, and I hope someone has a clue.

chirayu-patel commented 2 years ago

I have a similar observation. What has worked for me is giving a public IP to at least one end when both sides are NAT'ed. That surely works.

I have also tried port forwarding on the router in front, but it does not work consistently.

thenitai commented 2 years ago

Doesn't do it for me:

Lighthouse = public IP (not behind a firewall other than local)
Host 1 = public and private IP (pfSense DHCP and Firewall on network 1)
Host 2 = private IP (pfSense DHCP and Firewall on network 2) <--- No access from and to others
Host 3 = public and private IP (DigitalOcean with Firewall)

It works as soon as Host 2 gets a public IP (no other changes). No matter if the Firewall is on or off.

ZeroTier, Tailscale, and Netmaker all work with the above setup.

bouillon commented 2 years ago

I can confirm. See also #666.

em-winterschon commented 2 years ago

Confirmed on versions 1.3.0 and 1.5.2. I upgraded all of my hosts (18 running nebula across four datacenters) this weekend to see if that would fix the issue with the older version, but it did not. The setup has three lighthouses, one each in US-west, US-east, EU-west all with public IPs and correct firewall rules.

All hosts behind OPnsense firewalls on physically separate subnets (at two locations) can talk to each other without issue (these talk over an OpenVPN site to site link with no firewall restrictions). Any host traffic going outside of the firewalls can ping the public VMs at various cloud providers but those same hosts cannot get traffic back into the non-cloud DCs reliably. They will work sometimes and then cease traffic entirely.

Sometimes restarting firewalld on the cloud VMs resumes traffic, but only for a short period; at other times it doesn't help at all.

I've tried forcing the listening IP and port for nebula as well as using 0.0.0.0 and 0 respectively, but neither makes a difference.
I've enabled UDP NAT punching rules (static port map for UDP) at the OPNsense firewall level as mentioned in one issue/post, which did not make a difference.
I've enabled all punchy aspects (punch, return, delay, punch_back) but the issue persists.
I've enabled all the usual trusted rules on firewalld for the cloud VMs, still the issue persists.
I've cleared the OPNsense firewall state tables and rebooted them to try that out, doesn't help.

So right now I've got a great internal VPN for hosts that can already talk to each other over a private LAN to begin with, and several cloud VMs which cannot get their traffic back across the nebula VPN into the datacenters. So basically nebula isn't doing its job at all.

As the original post mentioned, I had no issues at all with ZeroTier or Tailscale, only with Nebula. For further comparison, this has not been an issue when spinning up Ansible-automated tunnels with WireGuard, OpenVPN, and IPsec.

thenitai commented 2 years ago

In the meantime, we switched to OPNSense (from pfSense) but that doesn't resolve the issue either. As we couldn't wait any longer we deployed Tailscale, well Headscale to be precise (https://github.com/juanfont/headscale).

While it works, it has its own quirks. Subnets don't really work either and even with local DERP installed it can take a few seconds to connect.

As soon as we get some feedback or a working version I would love to use Nebula because both ZeroTier and Tailscale have high CPU issues and network speed is subpar (Tailscale is much faster than ZeroTier).

DePingus commented 2 years ago

If you're using OPNsense or pfSense, have a look here: https://blog.ktz.me/punching-through-nat-with-nebula-mesh/

ZeroTier works because, when ZT fails to punch through UDP, it falls back to using the moon as a relay. Nebula doesn't have a relay mode.
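
To make the failure mode concrete, here is a toy Python model of a UDP hole punch through two NATs. It is only a sketch, not Nebula's code: the `Nat` class, the `punch` function, the addresses (documentation ranges), and the starting port numbers are all made up for illustration, and the model assumes each NAT only admits inbound packets that match a flow it has already seen going out. With a static port mapping both sides end up sending to the endpoint the lighthouse reported, so the punch succeeds; with per-destination port rewriting the lighthouse reports a port that is never reused for the peer, so both directions are dropped.

```python
import itertools

LIGHTHOUSE = ("203.0.113.10", 4242)   # hypothetical public lighthouse
NEBULA_PORT = 4242                    # local UDP port used by both hosts


class Nat:
    """Toy NAT with per-flow state (illustrative only)."""

    def __init__(self, public_ip, static_port, first_port=40000):
        self.public_ip = public_ip
        self.static_port = static_port          # True: keep the source port
        self._ports = itertools.count(first_port)
        self.mappings = {}                      # (src_port, dst) -> public port
        self.allowed = set()                    # (public port, remote endpoint)

    def outbound(self, src_port, dst):
        """Send from inside; return the public endpoint the destination sees."""
        if self.static_port:
            public_port = src_port
        else:
            # Port rewriting: a fresh public port for every new destination.
            if (src_port, dst) not in self.mappings:
                self.mappings[(src_port, dst)] = next(self._ports)
            public_port = self.mappings[(src_port, dst)]
        self.allowed.add((public_port, dst))
        return (self.public_ip, public_port)

    def inbound(self, src, dst_port):
        """Admit a packet only if a matching outbound flow exists."""
        return (dst_port, src) in self.allowed


def punch(nat_a, nat_b):
    """Return True if both hosts can reach each other after punching."""
    # 1. Each host talks to the lighthouse, which records what it observed.
    a_seen = nat_a.outbound(NEBULA_PORT, LIGHTHOUSE)
    b_seen = nat_b.outbound(NEBULA_PORT, LIGHTHOUSE)
    # 2. Each host sends to the endpoint the lighthouse reported for its peer.
    a_src = nat_a.outbound(NEBULA_PORT, b_seen)
    b_src = nat_b.outbound(NEBULA_PORT, a_seen)
    # 3. Check whether those packets get through the opposite NAT.
    return nat_b.inbound(a_src, b_seen[1]) and nat_a.inbound(b_src, a_seen[1])


if __name__ == "__main__":
    static = punch(Nat("198.51.100.1", True), Nat("198.51.100.2", True))
    rewriting = punch(Nat("198.51.100.1", False),
                      Nat("198.51.100.2", False, first_port=50000))
    print("static port mapping:", static)
    print("port rewriting:", rewriting)
```

Real NATs and Nebula's handshake logic are more nuanced than this, but the model matches what the thread describes: traffic shows up at the lighthouse, yet the two NAT'ed hosts never hear each other until port rewriting is turned off or one side gets a public IP.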

1MachineElf commented 2 years ago

Thanks @DePingus for that link. By following the directions there for one of my networks (not LAN, but a custom "Home" network), my devices are now able to connect via their Nebula IPs while behind OPNsense.

Would someone be so kind as to put into layman's terms any potential security impact of disabling UDP port rewriting for an entire network? I'm not a network expert but I assume OPNSense wasn't configured this way by default for a good reason.

brad-defined commented 2 years ago

@1MachineElf A NAT drops some IPv4 traffic when it doesn't have the internal state tracking to know who to forward the traffic to. This makes it act in some ways like a Firewall. However, a Firewall is different in a key way - a Firewall is intentionally designed and configured to drop traffic, while a NAT only happens to do so.

A NAT's packet dropping behavior looks like a security protection, but it is not. A Firewall, in contrast, is a good network security device. For network security, use Firewalls, not NATs.

OPNsense's port rewriting makes the NAT behavior less predictable, causing the NAT to drop traffic that Nebula's NAT traversal logic does not want dropped. Making the NAT drop more traffic makes it look like it is providing better network security. However, NATs, even ones that behave less predictably, are still not network security tools.

One case in point - NATs only apply to IPv4 traffic. IPv6 has no NAT, as there are more than enough IPv6 addresses to go around. Firewalls continue to provide traffic management on IPv6.

I think that's a good argument to say that OPNsense's default configuration was not made for a good reason.

brad-defined commented 2 years ago

Nebula 1.6.0 is released with a Relay feature, to cover cases like a Symmetric NAT. https://github.com/slackhq/nebula/pull/678

If you don't want to change the NAT behavior of your OPNsense system, you can use relays.

Check out the example config to see how to configure a Nebula node to act as a relay, and how to configure other nodes to identify which Relay can be used by peers for access. Also, take a look at https://github.com/slackhq/nebula/issues/33#issuecomment-1180569297 for more info on how to configure it.

johnmaguire commented 1 year ago

I'm closing this out for inactivity. If you continue to have issues after trying the relay feature, please feel free to open up a new issue or join us on Slack. Thanks!

slackhq / nebula

Two NAT and weird observation #663