netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Nameserver being randomly unavailable #1869

Open Enailis opened 4 months ago

Enailis commented 4 months ago

Describe the problem

We're using the self-hosted version of Netbird and everything is set up according to the documentation. Sometimes the custom nameserver is available, sometimes it isn't, and that's without anyone ever touching the config in the web interface.

To give more context here is our network configuration:

When a user is connected to the Netbird VPN, they can ping every server and every other user without any problem. For example, users can ping Gitlab's Netbird IP:

> ping 100.73.149.194
PING 100.73.149.194 (100.73.149.194): 56 data bytes
64 bytes from 100.73.149.194: icmp_seq=0 ttl=64 time=35.938 ms
64 bytes from 100.73.149.194: icmp_seq=1 ttl=64 time=32.203 ms
64 bytes from 100.73.149.194: icmp_seq=2 ttl=64 time=32.427 ms

But users cannot ping pfSense's DNS Resolver IP:

> ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2

The netbird status -d command reports this problem:

[10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred:
    * read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout

Apart from this, we have no server-side logs about the problem, and the only related entries in /var/log/netbird/client.log are i/o timeouts for the pfSense DNS Resolver IP.
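To check whether the resolver answers at all over the tunnel, independently of the NetBird client, a single UDP query can be sent by hand. The following is a minimal sketch using only the Python standard library; the function names `build_query` and `probe` and the 2-second timeout are illustrative assumptions, and this is not how the NetBird client performs its own health check.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server: str, name: str, timeout: float = 2.0, port: int = 53) -> str:
    """Send one UDP query and report whether a reply arrived in time."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    # A reply carrying the same ID means the resolver was reachable
    return "reply" if reply[:2] == query[:2] else "unexpected reply"
```

Calling `probe("10.10.10.1", "gitlab.mycompany.com")` while connected should return "reply" when the route to the resolver works and "timeout" in the failing state described above.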

But sometimes, without changing anything, either client or server side, everything works just fine.

This issue appears on every OS: Windows 11, macOS 14.4.1 (23E224) and Ubuntu 22.04.

To Reproduce

Since the problem is random, we have no clue how to reproduce this problem.

Expected behavior

The nameserver is supposed to be consistently recognized by Netbird, without randomly becoming unavailable.

Are you using NetBird Cloud?

We're using Netbird self-hosted solution.

NetBird version

Every user is up to date: 0.27.3.

pascal-fischer commented 4 months ago

Hi @Enailis,

How exactly is the route to 10.10.10.1 set up? Are you sure the configured routing peer is online and successfully connected to the user's peer that tries to ping? Is that connection direct or relayed? And with netbird status -d, can you detect a difference in the connection between when it is working and when it is not?

Enailis commented 4 months ago

Hi @pascal-fischer,

We have a network route to 10.10.10.1/32 using our internal servers as peer group. All servers in this group have access to 10.10.10.1. This route is distributed to all users. We have 3 different peers in this group, they're all online. The servers can't ping the users with their Netbird's IP. The users can ping the servers using their real IP but not their Netbird's IP. The connection to the 3 servers in the peer group is relayed.

We actually can't detect any difference in netbird status -d between when it's working and when it's not. The current configuration gives me the result below for netbird status -d. There are other peers, but they all look the same as the one shown here:

 server.mycompany.com:
  NetBird IP: 100.73.252.226
  Public key: RupexIsExt4J2oKsN4avstKkjD03vlSq728BzT/uvB8=
  Status: Connected
  -- detail --
  Connection type: Relayed
  Direct: false
  ICE candidate (Local/Remote): relay/prflx
  ICE candidate endpoints (Local/Remote): 90.90.90.90:63293/80.80.80.80:63293
  Last connection update: 2024-04-22 13:58:48
  Last WireGuard handshake: 2024-04-22 14:11:34
  Transfer status (received/sent) 1.9 KiB/1.5 KiB
  Quantum resistance: false
  Routes: -
  Latency: 55.684815ms

Daemon version: 0.27.3
CLI version: 0.27.3
Management: Connected to https://vpn.mycompany.com:33073
Signal: Connected to http://vpn.mycompany.com:10000
Relays:
  [stun:vpn.mycompany.com:3478] is Available
  [turn:vpn.mycompany.com:3478?transport=udp] is Available
Nameservers:
  [10.10.10.1:53] for [gitlab.mycompany.com] is Available
FQDN: ena.mycompany.com
NetBird IP: 100.73.213.219/16
Interface type: Kernel
Quantum resistance: false
Routes: -
Peers count: 8/13 Connected

Even if everything looks fine, I cannot access gitlab.mycompany.com.
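Since the nameserver state has to be compared across many netbird up/netbird down cycles, it can help to extract it from the status output automatically. The sketch below parses the "Nameservers:" section of text shaped like the output above; the names `NameserverStatus` and `parse_nameservers` are hypothetical, and the line format is assumed from the samples in this thread rather than from any documented, stable output format.

```python
import re
from typing import NamedTuple


class NameserverStatus(NamedTuple):
    server: str
    domains: list
    available: bool


# Matches lines such as:
#   [10.10.10.1:53] for [gitlab.mycompany.com] is Available
_ENTRY = re.compile(
    r"^\s*\[([^\]]+)\]\s+for\s+\[([^\]]*)\]\s+is\s+(Available|Unavailable)"
)


def parse_nameservers(status_output: str) -> list:
    """Return one NameserverStatus per entry in the Nameservers section."""
    results = []
    in_section = False
    for line in status_output.splitlines():
        if line.strip() == "Nameservers:":
            in_section = True
            continue
        if not in_section:
            continue
        if line and not line[0].isspace():
            break  # a non-indented line starts the next section
        match = _ENTRY.match(line)
        if match:  # indented "reason" lines do not match and are skipped
            server, domains, state = match.groups()
            names = [d.strip() for d in domains.split(",") if d.strip()]
            results.append(NameserverStatus(server, names, state == "Available"))
    return results
```

Feeding it the captured output of netbird status -d after each reconnect gives a simple record of when the nameserver flips between Available and Unavailable. As the report above shows, an Available entry does not by itself prove that names resolve on the client.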

To add to my original issue: it now works for some Windows users. The client takes a long time to connect, and sometimes users have to run netbird up/netbird down multiple times before it actually works. It still doesn't work for Linux, macOS and some Windows users.

vincent-lg18 commented 4 months ago

Hello, I'm working on the same Netbird instance as @Enailis

Some corrections / additional information on the above post:

Here is some other additional information:

On Windows clients (our users connected with SSO), our nameserver (10.10.10.1:53) is unstable, and its availability can change from one netbird down & netbird up cycle to the next for no apparent reason. When it's available, we can access gitlab.mycompany.com and server.mycompany.vpn without any problem. However, when it's unavailable ([10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred: * read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout), we can no longer access gitlab.mycompany.com, but we can still access server.mycompany.vpn.

On our Linux clients (other users connected with SSO), the behavior is different. Our nameserver (10.10.10.1:53) is always marked as available in netbird status -d, yet it is impossible to access gitlab.mycompany.com or server.mycompany.vpn.

Here is a client's /etc/resolv.conf file:

# Generated by NetworkManager
nameserver 192.168.1.1

If I run dig gitlab.mycompany.com, I don't get an IP address back. However, if I run dig @10.10.10.1 gitlab.mycompany.com, its IP appears. So by adding the line nameserver 10.10.10.1 to the clients' /etc/resolv.conf files, we can access our gitlab, but we still can't access server.mycompany.vpn.

Note that we can still access our gitlab via its IP address (both the IP given by Netbird and its real IP). Our routes are therefore configured correctly, and the problem comes only from DNS resolution.

Finally, note that this problem never appears for Linux clients installed with a Setup Key (our servers). Here's their /etc/resolv.conf file:

...
nameserver 127.0.0.53
options edns0 trust-ad
search company.vpn company.com

We therefore believe that the problem only comes from Netbird clients, which cannot apply DNS configurations to our workstations (Linux and Windows).
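The difference between the two /etc/resolv.conf files can be checked mechanically across many workstations. The following sketch parses the file content and flags whether the first nameserver is a loopback address; `parse_resolv_conf` and `uses_local_stub` are hypothetical helper names. It rests on the assumption that a loopback first nameserver (such as 127.0.0.53, the systemd-resolved stub listener) means a local resolver sits in front of the upstream servers, whereas an entry like 192.168.1.1 means queries go straight to the LAN resolver.

```python
import ipaddress


def parse_resolv_conf(text: str) -> dict:
    """Collect nameserver, search and options entries from resolv.conf text."""
    config = {"nameserver": [], "search": [], "options": []}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (lines starting with '#' or ';')
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        keyword, values = fields[0], fields[1:]
        if keyword == "nameserver" and values:
            config["nameserver"].append(values[0])
        elif keyword == "search":
            config["search"] = values  # the last search line wins
        elif keyword == "options":
            config["options"].extend(values)
    return config


def uses_local_stub(config: dict) -> bool:
    """True if the first nameserver is a loopback address such as 127.0.0.53."""
    if not config["nameserver"]:
        return False
    try:
        return ipaddress.ip_address(config["nameserver"][0]).is_loopback
    except ValueError:
        return False  # not a plain IP literal
```

Applied to the two files quoted above, the NetworkManager-generated workstation file is not flagged as using a local stub, while the server file is.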

vincent-lg18 commented 4 months ago

Hello, here is some additional information about our Windows client errors.

Here are the lines in the client.log file when the error [10.10.10.1:53] for [gitlab.mycompany.com] is Unavailable, reason: 1 error occurred: * read udp 192.168.1.182:53408->10.10.10.1:53: i/o timeout appears on our Windows clients:

2024-04-25T11:58:42+02:00 ERRO util/net/dialer_generic.go:64: Failed to call dialer hooks: 1 error occurred:
        * executing dial hook: 1 error occurred:
        * adding route reference: failed to add route for prefix 90.90.90.90/32: add route to table: PowerShell add route: exit status 1

2024-04-25T11:58:43+02:00 ERRO util/net/listener_generic.go:128: Error executing listener write hook: adding route reference: failed to add route for prefix 90.90.90.90/32: add route to table: PowerShell add route: exit status 1

florian-obradovic commented 3 months ago

Similar issue here on macOS.

Non-authoritative answer:
Name: docker.my-localdomain.local
Address: 192.168.99.125

OS: darwin/arm64
Daemon version: 0.27.10
CLI version: 0.27.10
Management: Connected to https://netbird.mydomain.com:33073
Signal: Connected to http://netbird.mydomain.com:10000
Relays:
  [stun:netbird.mydomain.com:3478] is Available
  [turn:netbird.mydomain.com:3478?transport=udp] is Available
Nameservers:
  [192.168.99.1:53] for [my-localdomain.local, mydomain.com] is Unavailable, reason: 1 error occurred: