slackhq / nebula

A scalable overlay networking tool with a focus on performance, simplicity and security
MIT License

🐛 BUG: Fallback lighthouse setup? #1220

Closed azukaar closed 3 weeks ago

azukaar commented 1 month ago

What version of nebula are you using? (nebula -version)

1.8.2

What operating system are you using?

*

Describe the Bug

When using 2 lighthouses with a static_host_map, the IP selected in this map seems to override all other routing logic, including the relay. I have a setup with 2 lighthouses: one for my local network, and one for remote access. It looks roughly like this:

lighthouse:
  am_lighthouse: false
  hosts:
  - 192.168.201.1
  - 192.168.201.2
  interval: 60
pki:
  ca: |
    -----BEGIN NEBULA CERTIFICATE-----
    ...
    -----END NEBULA CERTIFICATE-----
  cert: |
    -----BEGIN NEBULA CERTIFICATE-----
    ...
    -----END NEBULA CERTIFICATE-----
  key: |
    -----BEGIN NEBULA X25519 PRIVATE KEY-----
    ...
    -----END NEBULA X25519 PRIVATE KEY-----
relay:
  am_relay: false
  relays:
  - 192.168.201.2
  use_relays: true
static_host_map:
  192.168.201.1:
  - local-ip:4242
  192.168.201.2:
  - remote-ip:4242

In this configuration, if I remove the static_host_map and the lighthouse entry for 192.168.201.1 (leaving a single lighthouse), it works perfectly, and 192.168.201.1 (which then connects to the lighthouse too) is accessible.

On the other hand, with the 2 lighthouses, because 192.168.201.1 is local network only, it becomes inaccessible from outside. I am guessing that

Which brings me to the question / potential issue: how can I have a fallback lighthouse for a specific subnet? (E.g. in this case, a local network lighthouse in case the internet connection is lost and the public lighthouse becomes inaccessible.)

In that scenario, for some reason 192.168.201.1 does not ping 192.168.201.2 (even though it could), but 192.168.201.2 tries to ping 192.168.201.1 even though they are both lighthouses.

Thanks for your help

Logs from affected hosts

logs from 192.168.201.1

time="2024-09-18T15:41:07+01:00" level=info msg="listening \"0.0.0.0\" 4242"
time="2024-09-18T15:41:07+01:00" level=info msg="Main HostMap created" network=192.168.201.1/24 preferredRanges="[]"
time="2024-09-18T15:41:07+01:00" level=info msg="punchy enabled"
time="2024-09-18T15:41:07+01:00" level=warning msg="lighthouse.am_lighthouse enabled on node but upstream lighthouses exist in config"
time="2024-09-18T15:41:07+01:00" level=info msg="Read relay from config" relay=192.168.201.2
time="2024-09-18T15:41:07+01:00" level=info msg="Loaded send_recv_error config" sendRecvError=always
time="2024-09-18T15:41:07+01:00" level=info msg="Nebula interface is active" boringcrypto=false build=1.8.2 interface=nebula1 network=192.168.201.1/24 udpAddr="0.0.0.0:4242"
time="2024-09-18T15:41:07+01:00" level=info msg="DNS results changed for host list" newSet="map[ip:4242:{}]" origSet="&map[]"

logs from 192.168.201.2

time="2024-09-18T14:40:56Z" level=info msg="listening \"0.0.0.0\" 4242"
time="2024-09-18T14:40:56Z" level=info msg="Main HostMap created" network=192.168.201.7/24 preferredRanges="[]"
time="2024-09-18T14:40:56Z" level=info msg="punchy enabled"
time="2024-09-18T14:40:56Z" level=info msg="Loaded send_recv_error config" sendRecvError=always
time="2024-09-18T14:40:56Z" level=info msg="Nebula interface is active" boringcrypto=false build=1.8.2 interface=nebula1 network=192.168.201.7/24 udpAddr="0.0.0.0:4242"
time="2024-09-18T14:40:56Z" level=info msg="DNS results changed for host list" newSet="map[external-ip:4242:{}]" origSet="&map[]"
time="2024-09-18T14:41:04Z" level=info msg="Handshake timed out" durationNs=7129153927 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=664017120 localIndex=664017120 remoteIndex=0 udpAddrs="[]" vpnIp=192.168.201.8
time="2024-09-18T14:41:04Z" level=info msg="Handshake timed out" durationNs=6613558988 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=3107523284 localIndex=3107523284 remoteIndex=0 udpAddrs="[]" vpnIp=192.168.201.1
time="2024-09-18T14:41:15Z" level=info msg="Handshake timed out" durationNs=6809820135 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=901597904 localIndex=901597904 remoteIndex=0 udpAddrs="[]" vpnIp=192.168.201.1

Config files from affected hosts

lighthouse:
  am_lighthouse: false
  hosts:
  - 192.168.201.1
  - 192.168.201.2
  interval: 60
pki:
  ca: |
    -----BEGIN NEBULA CERTIFICATE-----
    ...
    -----END NEBULA CERTIFICATE-----
  cert: |
    -----BEGIN NEBULA CERTIFICATE-----
    ...
    -----END NEBULA CERTIFICATE-----
  key: |
    -----BEGIN NEBULA X25519 PRIVATE KEY-----
    ...
    -----END NEBULA X25519 PRIVATE KEY-----
listen:
  host: 0.0.0.0
  port: 4242
punchy:
  punch: true
  respond: true
relay:
  am_relay: false
  relays:
  - 192.168.201.2
  use_relays: true
static_host_map:
  192.168.201.1:
  - local-ip:4242
  192.168.201.2:
  - remote-ip:4242
tun:
  disabled: false
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 500
  mtu: 1300
  routes: []
  unsafe_routes: []
logging:
  level: info
  format: text
firewall:
  outbound_action: drop
  inbound_action: drop
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
  - port: any
    proto: any
    host: any
  inbound:
  - port: any
    proto: any
    host: any

johnmaguire commented 1 month ago

Hi @azukaar - I'm not totally sure I understand your issue. First, let's talk about your configuration:

lighthouse:
  am_lighthouse: false
  hosts:
  - 192.168.201.1
  - 192.168.201.2
  interval: 60
static_host_map:
  192.168.201.1:
  - local-ip:4242
  192.168.201.2:
  - remote-ip:4242

In this configuration, you've defined two Lighthouse hosts, each with a single underlay IP address. The host will attempt to connect to 192.168.201.2 using a public IPv4 address and will attempt to connect to 192.168.201.1 using its private (LAN) IPv4 address. A couple notes:

Because of the final point, we do not recommend that hosts talk to Lighthouses over LAN / private IPs. This is because the packets arriving at the Lighthouse will have a source field pointing to the private IP address of the node, which is only routable to other hosts on the same LAN.

When Lighthouses are instead configured to talk over the public Internet, the source field will contain the public IPv4 address of the host's network, giving hosts on other networks the best chance to handshake with it. Because hosts still self-report IP addresses assigned to them, the private IP address will still be available to the Lighthouse for nodes that are able to communicate directly over the LAN.

With that in mind, can you clarify exactly what you're trying to achieve?

azukaar commented 1 month ago

Thanks for the detailed response, I really appreciate it.

Lighthouses do not connect to other Lighthouses to learn underlay addresses of hosts. They only learn underlay addresses when hosts connect to them

I think that might be the issue. In this configuration:

image

The internal lighthouse can never be reached because it did not handshake with the public lighthouse to enable the support for the relay. That makes sense.

The issue is that, if you do it this way: image

It works perfectly fine, but in the scenario where both clients (e.g. your phone and your home server) are on the same local network and connectivity to the public lighthouse (a VPS) is lost, the phone cannot reach your home server even though they could theoretically just connect directly:

image

That's why I was trying to make them both lighthouses, so that the internal lighthouse can still be used for connectivity as a fallback, but that basically undermines everything else.

johnmaguire commented 1 month ago

@azukaar Yes, I think you've got it - relays cannot be used to connect to Lighthouses.

If I understand your goals correctly, you have a main Lighthouse that operates on the Internet. You have a secondary Lighthouse that is on a LAN with some of your hosts. If you lose Internet connection, you wish for the LAN devices to use the local Lighthouse. You also wish for external nodes to access the Lighthouse directly as a peer (i.e. non-Lighthouse traffic.)

One thing you could try is adding two IP addresses for your LAN Lighthouse to the rest of your nodes' configs: (a) a public and routable IP address/port for the LAN Lighthouse (i.e. set up port forwarding in your NAT, so it is also a Lighthouse on the Internet), and (b) a local IP address for the LAN Lighthouse.

In this case, nodes external to the local Lighthouse will access it over the public Internet. I believe nodes local to it will also attempt to route this way, preferentially. However, if the Internet route to the Lighthouse is lost, they should try to failover to the private IPv4 address: https://nebula.defined.net/docs/config/preferred-ranges/#how-nebula-orders-underlay-ip-addresses-it-learns-about
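
As a rough sketch of the two-address suggestion above, the static_host_map on the other nodes might look like the following. `public-ip` is a hypothetical placeholder for a port-forwarded, publicly routable address of the LAN Lighthouse (it does not appear in the original config); `local-ip` and `remote-ip` are the placeholders already used in this thread.

```yaml
static_host_map:
  # LAN Lighthouse: public (port-forwarded) address plus its LAN address
  192.168.201.1:
  - public-ip:4242   # hypothetical placeholder, not from the original config
  - local-ip:4242
  # Internet Lighthouse, unchanged
  192.168.201.2:
  - remote-ip:4242
```

This only applies where port forwarding is actually possible on the LAN Lighthouse's network.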

azukaar commented 1 month ago

Thanks. Yes, I thought about it, but it's not possible because of CGNAT. Also, it's a bit of a waste of bandwidth to relay a connection to a node that is local to your network.

Essentially the use case is, in a basic home server situation, I have

The phone uses the lighthouse to communicate with the home server (with relay). But when I'm home, it should be able to connect to the home server directly, without manual intervention

Ideally, what I think would be good would be

In that scenario, if the lighthouse is offline, you wouldn't lose connectivity completely

Something else that would be potentially useful is an option for fallback lighthouse

lighthouse:
  am_lighthouse: false
  am_fb_lighthouse: true
  interval: 60

And basically this would cause a client to act as a lighthouse until it is able to connect to a lighthouse itself.

With those 2 features, it would be possible to design much more reliable and resilient networks that react better to loss of connectivity.

EDIT: How about an option that allows lighthouses to ping other lighthouses? Isn't it odd that lighthouses and clients are considered to be 2 completely separate entities? Not every client is a lighthouse, but a lighthouse is still a client, so why doesn't it attempt to establish connectivity to the rest of the network? If I have a Nebula network with only 2 lighthouses, they should still be able to communicate with each other, but because they are lighthouses, they will ignore each other.

johnmaguire commented 1 month ago

If I add a static_host_map for an IP that is NOT a lighthouse (but a client), then the client should be able to attempt a direct P2P connection to it without lighthouses involved.

This is how Nebula already behaves today.
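
For illustration, a static entry for a peer that is not a Lighthouse could look like the fragment below on the phone. The overlay IP `192.168.201.10` and the `home-server-lan-ip` placeholder are hypothetical and do not come from the original config. The entry can only help when that underlay address is reachable, i.e. when both devices are on the same LAN.

```yaml
# Client (e.g. phone) config fragment; values marked hypothetical are illustrative only
lighthouse:
  am_lighthouse: false
  hosts:
  - 192.168.201.2            # only the Internet Lighthouse is listed as a Lighthouse
static_host_map:
  192.168.201.2:
  - remote-ip:4242
  192.168.201.10:            # hypothetical overlay IP of the home server (a regular node)
  - home-server-lan-ip:4242  # hypothetical placeholder for its LAN address
```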

If I have a Nebula network with only 2 lighthouses, they should still be able to communicate with each other, but because they are lighthouses, they will ignore each other.

Again, what you're asking for already exists in Nebula today. As long as Lighthouses list each other in their static_host_map, they are still able to handshake and talk to each other like any other nodes. They will not query each other for underlay IPs of other nodes however.
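
A minimal sketch of what that could look like on the LAN Lighthouse (192.168.201.1), assuming the addresses from this thread. This is an untested fragment, not a complete config. `hosts` is left empty here because the logs above warn when `am_lighthouse` is enabled while upstream Lighthouses are also configured.

```yaml
# Config fragment for the LAN Lighthouse (192.168.201.1); a sketch, not a tested setup
lighthouse:
  am_lighthouse: true
  hosts: []          # left empty; see the am_lighthouse warning in the logs
static_host_map:
  192.168.201.2:     # the other Lighthouse, listed so the two can handshake as peers
  - remote-ip:4242
```

The Internet Lighthouse would need a matching entry for 192.168.201.1, with an underlay address it can actually reach.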

Something else that would be potentially useful is an option for fallback lighthouse

I don't understand the fallback Lighthouse proposal. Maybe you can describe how you envision that working in more detail?

johnmaguire commented 3 weeks ago

Closed for inactivity. Please ping me directly if you'd like the ticket reopened. Thanks!