No route between hosts

jurgenhaas commented 2 years ago

I've installed netbird 0.8.2 on 3 hosts in 3 different locations.

[screenshot: 2022-07-13_10-26]

They appear to be online but netbird status on each of them doesn't seem to be seeing the other hosts:

Daemon status: Connected
Management: Connected
Signal:  Connected
NetBird IP: 100.64.94.238/16
Interface type: Kernel
Peers count: 0/2 Connected

On 2 of them I've enabled debugging and the log files are here: bslog1.log pcweb1.log

I can't ping between the hosts and when trying to ssh into them, I get the error Error: dial tcp 100.64.94.238:44338: connect: no route to host

Any idea what's wrong?

mlsmaycon commented 2 years ago

Hello, @jurgenhaas sorry to see that you are facing some connectivity issues.

Thanks for sending so much information. Judging by your logs, all 3 peers have unusual networking setups: all 3 have interfaces with 169 addresses, and 2 have many network ranges (Docker or VMs, maybe?). The peers are attempting to negotiate the best connection path, but they can't reach an agreement within 15s, so the ICE protocol agent marks the connection as failed.

Could you confirm a few things? Are there any outgoing filters blocking UDP traffic on any peer? Also, could you tell us if you have any other VPN running on these nodes?

If you think it's best, we can move this conversation to our community Slack.

jurgenhaas commented 2 years ago

Thanks @mlsmaycon for getting back to me on this. You're right, there are a lot of docker networks on each of our hosts. I have deleted most of them, so that it's down to 5-6 networks on each host. But that can't be persistent, as we have lots of docker based CI/CD processes, each of which sets up its own internal network dynamically.

Now I get pings between 2 of the 3 hosts. bslog1 is still not playing with us. There should not be any outbound filter being applied on any of the hosts though.

The 2 hosts that can ping each other, still can't ssh into each other although the ssh service is enabled in the netbird account. It responds with an i/o timeout on port 44338 on either of the 2 hosts. I guess I don't have to open those ports in the firewall as that connection should go through the VPN tunnel, right?

Last but not least: no, there is no other VPN service running on those hosts.

mlsmaycon commented 2 years ago

OK @jurgenhaas, regarding the interfaces: you can filter them all at once if they match a common prefix. To do that, edit the config file /etc/netbird/config.json and add the prefix to IFaceBlackList.
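
To illustrate what a prefix entry does, here is a minimal shell sketch of prefix matching. It assumes the agent compares each interface name against the list entries with a plain starts-with check, as the comment above suggests. `iface_is_blocked` is a hypothetical helper for illustration, not NetBird code, and the interface names are made up.

```shell
# Hypothetical helper: succeeds (exit 0) if the interface name in $1 starts
# with any of the remaining arguments, which are treated as literal prefixes.
iface_is_blocked() (
    name=$1
    shift
    for prefix in "$@"; do
        case $name in
            "$prefix"*) exit 0 ;;
        esac
    done
    exit 1
)

# Example: Docker bridges and veth pairs match a prefix, eth0 does not.
for iface in br-1a2b3c veth9f8e docker0 eth0; do
    if iface_is_blocked "$iface" veth docker br-; then
        echo "$iface: ignored"
    else
        echo "$iface: candidate"
    fi
done
```

With entries such as veth, docker and br-, dynamically created Docker networks would be covered without listing each one.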

Regarding the SSH issue, you can validate the firewall configuration with the following iptables rule:

iptables -I INPUT 1 -i wt0 -p tcp --dport 44338 -j ACCEPT

If that doesn't help, you can remove the rule with:

iptables -D INPUT 1
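
Note that deleting by position removes whichever rule is first in INPUT at that moment, which may no longer be the test rule if other rules were inserted in the meantime. A minimal sketch of a safer pairing builds the insert and the delete from the same rule specification. `ssh_rule_cmd` is a hypothetical helper that only prints the commands, so nothing changes until they are reviewed and run with sudo.

```shell
# Hypothetical helper: prints (does not run) the iptables command that adds
# or removes the temporary SSH test rule, using one shared rule specification.
ssh_rule_cmd() {
    spec="-i wt0 -p tcp --dport 44338 -j ACCEPT"
    case $1 in
        add) echo "iptables -I INPUT 1 $spec" ;;
        del) echo "iptables -D INPUT $spec" ;;
        *)   echo "usage: ssh_rule_cmd add|del" >&2; return 2 ;;
    esac
}

ssh_rule_cmd add
ssh_rule_cmd del
```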

After you apply these changes, could you please share new logs from bslog1 and another peer that it should connect to?

jurgenhaas commented 2 years ago

OK, my block list (not black list!) now looks like this:

    "WgIface": "wt0",
    "IFaceBlackList": [
        "wt0",
        "tun0",
        "zt",
        "ZeroTier",
        "utun",
        "wg",
        "ts",
        "Tailscale",
        "tailscale",
        "veth",
        "docker",
        "br-"
    ],

Is it correct that wt0 is blocked too?

Anyways, here are the new logs: bslog1.log pcweb1.log

I now have 4 hosts altogether, 3 of which connect to each other except bslog1.

The netbird ssh connection now also works, when I open that port on the target machine. Still confused why we need an open port when the ssh should be established through the vpn tunnel.

mlsmaycon commented 2 years ago

Hello @jurgenhaas, after checking the logs, it seems that bslog1 is restricting outgoing traffic for UDP; this is more or less clear when checking pcweb1 records and seeing the exchange of only a local address from bslog1.

Even though NetBird doesn't require open incoming ports, we still need outgoing UDP traffic to be allowed. We will work on that as this case may appear more in the restricted firewall configuration scenario, but it is still a good reason for us to introduce TCP relaying.

To fix the issue, you can test the following rule:

sudo iptables -I OUTPUT 1 -p udp -j ACCEPT

You can also check if your default OUTPUT chain policy is set to deny with the following command:

sudo iptables -n -L OUTPUT

You should see a line similar to Chain OUTPUT (policy ACCEPT) or Chain OUTPUT (policy DROP). This is the default policy of your OUTPUT chain, and if it is set to DROP, you may need to manage every outgoing connection explicitly.

Regarding the SSH access, the issue is very similar: even with the wt0 interface, traffic still follows the iptables rules. I am currently working on our router feature, which will introduce minimal firewall management for traffic coming and going through the interface, but it is still too early for me to give you more details on that.
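
As a small convenience, the policy can be extracted from that header line. This is a minimal sketch: `chain_policy` is a hypothetical helper that only parses text in the format quoted above and does not call iptables itself.

```shell
# Hypothetical helper: reads `iptables -n -L <chain>` output on stdin and
# prints the default policy of the first built-in chain header it finds.
chain_policy() {
    sed -n 's/^Chain [^ ]* (policy \([A-Z]*\).*/\1/p' | head -n 1
}

# Example with the header format quoted above:
printf 'Chain OUTPUT (policy DROP)\n' | chain_policy   # prints DROP
```

On a real host it would be used as sudo iptables -n -L OUTPUT | chain_policy.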

jurgenhaas commented 2 years ago

Very interesting, thanks for all your support on this.

On bslog1, we see this:

sudo iptables -n -L OUTPUT
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

As I recalled, there is no outbound filter on that host at all.

Then I recalled that we also installed nftables on that host for testing CrowdSec, another exciting open source tool which requires that firewall. I stopped nftables and voilà, bslog1 connected right away to all the other hosts and things seem to be working great.

I guess my next step now should be to try self-hosting the netbird dashboard.

mlsmaycon commented 2 years ago

Really great that you found out the issue. Just out of curiosity, was nftables running on the nodes you had to handle ssh access?

For self-hosting, we have a quick start guide at https://netbird.io/docs/getting-started/self-hosting

jurgenhaas commented 2 years ago

nftables was only running on bslog1, the one host that didn't connect to all the others.

What's interesting now is that I can SSH from my local host into all 3 others. But from the others I cannot SSH into any of the others, although they all show as connected to all other hosts in netbird status. The error message says dial tcp 100.64.94.238:44338: i/o timeout

That sounds like some outbound firewall issue, because inbound they should all be OK as I can SSH into them from one of the hosts, here locally. Any idea?

mlsmaycon commented 2 years ago

Besides a possible outgoing firewall issue, we can check routing configuration as well.

Could you share the output of ip route | grep 100 from both peers? Also, are they able to ping each other?

jurgenhaas commented 2 years ago

Sure, here are both the routing info and the iptables config for both hosts:

# bslog1:
ip route | grep 100
100.64.0.0/16 dev wt0 proto kernel scope link src 100.64.104.21 

iptables -n -L OUTPUT
Chain OUTPUT (policy ACCEPT)

# pcweb1:
ip route | grep 100
100.64.0.0/16 dev wt0 proto kernel scope link src 100.64.94.238

iptables -n -L OUTPUT
Chain OUTPUT (policy DROP)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED
ACCEPT     all  --  127.0.0.1            127.0.0.1           
DROP       all  --  0.0.0.0/0            0.0.0.0/0            state INVALID
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:53 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpts:67:68 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:123 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpts:60000:61000 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:22 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:2222 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:25 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:80 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:443 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:465 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:587 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:993 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:161 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:161 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:162 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:162 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9418 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:514 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:24284 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:24284 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:10000 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:33073 state NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:44338 state NEW
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpts:5555:65535 state NEW

For the second host I have added an outgoing accept rule for port 44338. Still no difference.

mlsmaycon commented 2 years ago

Hi @jurgenhaas, unless the peers aren't able to ping or you added the rule for 44338 associated with the wrong output interface, all seems ok.

The step-by-step troubleshooting I would do now to get to the bottom of this is:

After checking that basic ping works (you might need to allow it in iptables), you can use netcat to test the port without the netbird agent:

pcweb1 $ nc -vw 5 -z -G 5 100.64.104.21 44338

If that times out as well, we need to investigate whether the traffic leaves the wt0 interface or is being dropped on the other node. You can do that by opening 2 new terminal windows, one connected to pcweb1 and another connected to bslog1. Once you have done that, you can run tcpdump as follows:

On pcweb1:

pcweb1 $ sudo tcpdump -i wg0 -nn port 44338 and 100.64.104.21

and on bslog1:

bslog1 $ sudo tcpdump -i wg0 -nn port 44338 and 100.64.94.238

Then you can repeat the test with either netcat or the netbird agent bin.

You should see samples from both directions like this:

18:25:11.435274 IP 100.64.94.238.48362 > 100.64.104.21.44338: Flags ...
18:25:11.435284 IP 100.64.104.21.44338 > 100.64.94.238.48362: Flags ...

If you don't see the exchange between both nodes, it means that the packets are being dropped on one of the sides. At this point, if you don't see a packet on one of the peers, check its INPUT chain for the initial INPUT rule I sent you.

jurgenhaas commented 2 years ago

This is all somehow strange. When I tried ssh again this morning, it almost worked everywhere. Only bslog1 is still causing issues. Maybe connection negotiations took some time yesterday, so that it didn't work yet.

Now, here is the situation:

local host: can ping all 3 other hosts and can ssh into all of them
pcweb1: can ping and ssh all except bslog1
zkweb1: can ping and ssh all except bslog1
bslog1: can ping and ssh into my local host, but not into the other 2

The nc-command above has an issue with -G 5 which is not supported, I had to remove that. It times out from pcweb1 to bslog1, but it works from my local host to bslog1 and vice versa.

The tcpdump command has several issues. I guess wg0 needs to be replaced with wt0 but then I get the output tcpdump: 'port' modifier applied to ip host and it quits right away and doesn't do anything.
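
The tcpdump error most likely comes from the filter expression rather than the interface name. In pcap filter syntax, a value without a qualifier inherits the previous one, so port 44338 and 100.64.104.21 is read as two port conditions. A minimal sketch that spells out both qualifiers follows. `capture_filter` is a hypothetical helper, and wt0 is the interface name from the config shown earlier.

```shell
# Hypothetical helper: prints a pcap filter matching TCP traffic to or from
# peer address $1 on port $2, with an explicit qualifier for each value.
capture_filter() {
    printf 'host %s and tcp port %s\n' "$1" "$2"
}

# Usage on pcweb1 (needs root, so it is shown as a comment only):
#   sudo tcpdump -i wt0 -nn "$(capture_filter 100.64.104.21 44338)"
# The netcat test without the unsupported -G option:
#   nc -vw 5 -z 100.64.104.21 44338
capture_filter 100.64.104.21 44338
```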

What really confuses me:

SRC         DEST        status
local host  bslog1      OK
local host  pcweb1      OK
pcweb1      local host  OK
pcweb1      bslog1      FAIL
bslog1      local host  OK
bslog1      pcweb1      FAIL

That seems to indicate that the firewall is not the problem, because bslog1 can communicate in and out, at least with my local host. But it can't with any of the others. As ping doesn't work either, could it be a routing issue?

jurgenhaas commented 2 years ago

One more observation: bslog1 is hosted by Hetzner and it's the only one, where the upstream firewall from Hetzner is enabled with these settings:

[screenshot: 2022-07-15_09-12]

All other hosts at Hetzner use iptables and have their settings like this:

[screenshot: 2022-07-15_09-10]

I've just provisioned another host at Hetzner with the allow all setting and using iptables, and that works fine too.

So, the solution seems to be to not use the Hetzner firewall but only iptables. However, that leaves the question why my local host can communicate with bslog1 where all the others can't?

jurgenhaas commented 2 years ago

Maybe this helps to explain it:

On my local host, netbird status --detail shows that the connection type to all other hosts is P2P and that it's a direct connection. For bslog1 it shows that the connection type is relayed and the connection is NOT direct.

On all the other hosts, the connection type to bslog1 is also relayed, but the connection is direct.

Maybe that's the problem, that a relayed connection cannot be direct?

mlsmaycon commented 2 years ago

Hi @jurgenhaas, the Hetzner firewall explains the issue. It is a stateless firewall, which means that it doesn't track the relationship between incoming and outgoing packets.

In this case, you can add a rule to this server's firewall for the UDP port range returned by:

sudo cat /proc/sys/net/ipv4/ip_local_port_range

This range is used by the processes when negotiating connections. If 51820 is not part of this range, create a rule for it as well.

You can remove rule number 9 from your screenshot, because the NetBird SSH traffic going through this firewall will be encapsulated in WireGuard packets.
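
A minimal sketch of that check: the file holds two numbers, the low and high end of the range, and the question is whether 51820 lies between them. `port_in_local_range` is a hypothetical helper and takes the file content as a string, so it can be tried without touching the system.

```shell
# Hypothetical helper: succeeds (exit 0) if port $1 lies inside the range
# given in $2 as "LOW HIGH", the format of ip_local_port_range.
port_in_local_range() (
    port=$1
    set -- $2    # split "LOW HIGH" on whitespace into $1 and $2
    [ "$port" -ge "$1" ] && [ "$port" -le "$2" ]
)

# Usage on the server:
#   range=$(cat /proc/sys/net/ipv4/ip_local_port_range)
#   if port_in_local_range 51820 "$range"; then
#       echo "51820 is covered by the range rule"
#   else
#       echo "add a separate UDP rule for 51820"
#   fi
```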

jurgenhaas commented 2 years ago

That was it, thanks a lot. I had to restart the netbird service though in order to get connected.

What I still don't understand is why my local host was able to ssh into bslog1 even before that new firewall rule. Otherwise this is ready to be closed from my point of view.

Great support @mlsmaycon 🥇

mlsmaycon commented 2 years ago

That is great @jurgenhaas.

Regarding the issue, I fear that no proper connection was established between the peers in Hetzner and bslog1, because NetBird attempts to switch to the native interface without a proxy when it sees hosts with public IPs. I opened another issue to verify this logic: #393

netbirdio / netbird

No route between hosts #390