gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

Nobody can ping anybody, including netmaker #528

Closed ethanfowler closed 2 years ago

ethanfowler commented 2 years ago

Hi, this is a standard AWS setup as per the docs, on an EC2 micro instance running Ubuntu 20.04.2. DNS, the dashboard, etc. are working. Tunnels are up, but no one can ping anyone. Even on the Netmaker server:

$ ping 10.20.32.4
PING 10.20.32.4 (10.20.32.4) 56(84) bytes of data.
From 10.20.32.1 icmp_seq=1 Destination Host Unreachable
ping: sendmsg: Destination address required
From 10.20.32.1 icmp_seq=2 Destination Host Unreachable
ping: sendmsg: Destination address required
From 10.20.32.1 icmp_seq=3 Destination Host Unreachable
ping: sendmsg: Destination address required

My docker-compose.yml:

version: "3.4"

services:
  netmaker:
    container_name: netmaker
    image: gravitl/netmaker:v0.9.1
    volumes:
      - /var/run/dbus/system_bus_socket:/var/run/dbus/system_bus_socket
      - /run/systemd/system:/run/systemd/system
      - /etc/systemd/system:/etc/systemd/system
      - /sys/fs/cgroup:/sys/fs/cgroup
      - /usr/bin/wg:/usr/bin/wg
      - dnsconfig:/root/config/dnsconfig
      - sqldata:/root/data
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    restart: always
    network_mode: host
    privileged: true
    environment:
      SERVER_HOST: "<public IP>"
      SERVER_API_CONN_STRING: "api.netmaker.<domain>.com:443"
      SERVER_GRPC_CONN_STRING: "grpc.netmaker.<domain>.com:443"
      COREDNS_ADDR: "<public IP>"
      GRPC_SSL: "on"
      DNS_MODE: "on"
      SERVER_HTTP_HOST: "api.netmaker.<domain>.com"
      SERVER_GRPC_HOST: "grpc.netmaker.<domain>.com"
      API_PORT: "8081"
      GRPC_PORT: "50051"
      CLIENT_MODE: "on"
      MASTER_KEY: "<key>"
      SERVER_GRPC_WIREGUARD: "off"
      CORS_ALLOWED_ORIGIN: "*"
      DATABASE: "sqlite"
      NODE_ID: "netmaker-server-1"
      AUTH_PROVIDER: "google"
      CLIENT_ID: "<id>.apps.googleusercontent.com"
      CLIENT_SECRET: "<secret>"
      FRONTEND_URL: "https://dashboard.netmaker.<domain>.com"
  netmaker-ui:
    container_name: netmaker-ui
    depends_on:
      - netmaker
    image: gravitl/netmaker-ui:v0.9.1
    links:
      - "netmaker:api"
    ports:
      - "8082:80"
    environment:
      BACKEND_URL: "https://api.netmaker.<domain>.com"
    restart: always
  coredns:
    depends_on:
      - netmaker
    image: coredns/coredns
    command: -conf /root/dnsconfig/Corefile
    container_name: coredns
    restart: always
    ports:
      - "<EC2 private IP>:53/udp"
      - "<EC2 private IP>:53/tcp"
    volumes:
      - dnsconfig:/root/dnsconfig
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    network_mode: host # Wants ports 80 and 443!
    volumes:
      - /root/Caddyfile:/etc/caddy/Caddyfile
      # - $PWD/site:/srv # you could also serve a static site in site folder
      - caddy_data:/data
      - caddy_conf:/config
volumes:
  caddy_data: {}
  caddy_conf: {}
  sqldata: {}
  dnsconfig: {}

Caddyfile

{
    # LetsEncrypt account
    email software@<domain>.com
}

# Dashboard
https://dashboard.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8082
}

# API
https://api.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8081
}

# gRPC
https://grpc.netmaker.<domain>.com {
    reverse_proxy h2c://127.0.0.1:50051
}
afeiszli commented 2 years ago

Hi Ethan, I'm noticing a couple of things.

First of all, are your security groups open? You need to make sure none of the required ports are blocked by AWS.
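
For example, a quick way to double-check from the CLI (a sketch only; the group ID and hostname are placeholders, not values from your setup):

# List the inbound rules of the instance's security group
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 --query 'SecurityGroups[0].IpPermissions'

# Or probe reachability of the proxied gRPC endpoint from a host outside AWS
nc -vz grpc.netmaker.example.com 443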

Second, try using the docker-compose.contained.yml. That's what we're using for quick starts now.

ethanfowler commented 2 years ago

Hi, thanks. I'm pretty sure my security groups are good; inbound and outbound rules are attached as screenshots.

I think I originally used that docker-compose file, but it has since moved on. Anyway, I've started again from docker-compose.contained.yml as you suggest. Everything comes up, Google auth works, and I can create networks, etc. However, when clients try to connect, e.g. a Docker client started with:

docker run -d --network host  --privileged -e TOKEN=<token> -v /etc/netclient:/etc/netclient --name netclient gravitl/netclient:v0.9.1

No network adapter is added, and docker logs netclient shows:

[netclient] joining network
2021/12/02 10:27:21 running userspace WireGuard with wireguard-go
2021/12/02 10:27:21 [netclient] success
2021/12/02 10:27:21 [netclient] ALREADY_INSTALLED. Netclient appears to already be installed for software-ml. To re-install, please remove by executing 'sudo netclient leave -n software-ml'. Then re-run the install command.
[netclient] Starting netclient checkin
2021/12/02 10:27:21 running userspace WireGuard with wireguard-go
2021/12/02 10:27:21 [netclient] running checkin for all networks
2021/12/02 10:27:21 [netclient] error checking in for software-ml network: rpc error: code = Unauthenticated desc = Empty record

My new docker-compose.yml for reference:

version: "3.4"

services:
  netmaker:
    container_name: netmaker
    image: gravitl/netmaker:v0.9.1
    volumes:
      - dnsconfig:/root/config/dnsconfig
      - /usr/bin/wg:/usr/bin/wg
      - sqldata:/root/data
    cap_add: 
      - NET_ADMIN
    restart: always
    privileged: true
    environment:
      SERVER_HOST: "<public_ip>"
      SERVER_API_CONN_STRING: "api.netmaker.<domain>.com:443"
      SERVER_GRPC_CONN_STRING: "grpc.netmaker.<domain>.com:443"
      COREDNS_ADDR: "<public_ip>"
      GRPC_SSL: "on"
      DNS_MODE: "on"
      SERVER_HTTP_HOST: "api.netmaker.<domain>.com"
      SERVER_GRPC_HOST: "grpc.netmaker.<domain>.com"
      API_PORT: "8081"
      GRPC_PORT: "50051"
      CLIENT_MODE: "on"
      MASTER_KEY: "<key>"
      SERVER_GRPC_WIREGUARD: "off"
      CORS_ALLOWED_ORIGIN: "*"
      DISPLAY_KEYS: "on"
      DATABASE: "sqlite"
      NODE_ID: "netmaker-server-1"
      AUTH_PROVIDER: "google"
      CLIENT_ID: "<id>.apps.googleusercontent.com"
      CLIENT_SECRET: "<secret>"
      FRONTEND_URL: "https://dashboard.netmaker.<domain>.com"
    ports:
      - "51821-51921:51821-51921/udp"
      - "8081:8081"
      - "50051:50051"
  netmaker-ui:
    container_name: netmaker-ui
    depends_on:
      - netmaker
    image: gravitl/netmaker-ui:v0.9.1
    links:
      - "netmaker:api"
    ports:
      - "8082:80"
    environment:
      BACKEND_URL: "https://api.netmaker.<domain>.com"
    restart: always
  coredns:
    depends_on:
      - netmaker 
    image: coredns/coredns
    command: -conf /root/dnsconfig/Corefile
    container_name: coredns
    restart: always
    ports:
      - "<private-ip>:53:53/udp"
      - "<private-ip>:53:53/tcp"
    volumes:
      - dnsconfig:/root/dnsconfig
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    network_mode: host # Wants ports 80 and 443!
    volumes:
      - /root/Caddyfile:/etc/caddy/Caddyfile
      # - $PWD/site:/srv # you could also serve a static site in site folder
      - caddy_data:/data
      - caddy_conf:/config
volumes:
  caddy_data: {}
  caddy_conf: {}
  sqldata: {}
  dnsconfig: {}

And Caddyfile:

{
    # LetsEncrypt account
    email software@<domain>.com
}

# Dashboard
https://dashboard.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8082
}

# API
https://api.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8081
}

# gRPC
https://grpc.netmaker.<domain>.com {
    reverse_proxy h2c://127.0.0.1:50051
}
ethanfowler commented 2 years ago

In the meantime, out of desperation, I created a brand new EC2 instance and ran through the "quick install" instructions again. This led to Let's Encrypt rate limits being hit. To get around that, I created new subdomains and modified the URLs in my docker-compose.yml to point at the new subdomain. The server is reachable, and I can log in and create networks, but when I try to join a network using the Docker method I get:

2021/12/02 12:27:54 running userspace WireGuard with wireguard-go
2021/12/02 12:27:54 [netclient] running checkin for all networks
2021/12/02 12:27:54 [netclient] error checking in for software-ml network: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"
2021/12/02 12:28:04 running userspace WireGuard with wireguard-go
2021/12/02 12:28:04 [netclient] running checkin for all networks
2021/12/02 12:28:05 [netclient] error checking in for software-ml network: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"

I'm at my wits' end here. This must be my 10th attempt to get Netmaker up and running stably. I used to be a systems engineer designing IP cryptos and am now a software engineer, so I (broadly) know what I'm doing.

ethanfowler commented 2 years ago

In case it sheds any light: 1 out of 3 clients can join the network; the other two give the TLS error above.

afeiszli commented 2 years ago

It still looks like you are not using the "contained" docker-compose (docker-compose.contained.yml).

The TLS issue likely occurs from re-using the same domain with Caddy. If you run the install multiple times with the same subdomain, Caddy will start failing to automatically generate certs.
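
If that is what happened, one possible (untested here) way to reset is to clear Caddy's data volume so it re-issues certificates, keeping the Let's Encrypt rate limits in mind; the volume name below is a guess and depends on the compose project prefix:

docker-compose stop caddy && docker-compose rm -f caddy
docker volume rm root_caddy_data   # actual name depends on the compose project prefix
docker-compose up -d caddy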

It also looks like there are artifacts of your installation in the Docker client. ALREADY_INSTALLED indicates there are already entries in /etc/netclient, and it will re-use those entries. So it's worth removing that folder (and related interfaces, if they exist) before moving on.
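
A cleanup along these lines should do it (a sketch; the network name software-ml is taken from the log above, and the interface name assumes the usual nm-<network> convention):

docker stop netclient && docker rm netclient
sudo rm -rf /etc/netclient
sudo ip link delete nm-software-ml 2>/dev/null || true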

One last note: your security groups look correct, but for an install I would just use the "quick-install" command from the README; I think that will do a better job of setting things up correctly.

ethanfowler commented 2 years ago

I am using docker-compose.contained.yml from master; I replaced the placeholder variables and added Google OAuth.

As I said above, I have moved to a new domain (nm.domain.com instead of netmaker.domain.com); all of the subdomains are resolving correctly, the dashboard is working, etc.

Your point about Docker artefacts has helped: I had checked that there were no Docker volumes persisting on the client, but failed to notice the bind mount to /etc/netclient on the host machine. Having run sudo rm -rf /etc/netclient on the host, the Docker container now seems to connect correctly. I'll test this on other clients, along with the original ping issue.

ethanfowler commented 2 years ago

OK, so, back to the original issue: a fresh install on a fresh EC2 instance, as per the quick-install instructions, taking docker-compose.contained.yml from master, replacing the variables as instructed, and with the AWS security group set up as instructed.

Nodes can connect to the Netmaker server, join networks, and ping Netmaker on its network IP. Weirdly, Netmaker cannot ping other nodes, and nodes cannot ping each other. DNS is also not working.

ethanfowler commented 2 years ago

Ok, I can ping other nodes from inside the netmaker container on the server, but not from the server itself:

$ sudo docker exec -it netmaker sh
~ # ping 10.251.1.3
PING 10.251.1.3 (10.251.1.3): 56 data bytes
64 bytes from 10.251.1.3: seq=1 ttl=64 time=15.632 ms
64 bytes from 10.251.1.3: seq=2 ttl=64 time=15.236 ms
$ ping 10.251.1.3
PING 10.251.1.3 (10.251.1.3) 56(84) bytes of data.

hangs.

afeiszli commented 2 years ago

yup, that's what the 'contained' version does. It confines networking to the container, which keeps things a lot cleaner on the host. If you need to use the host as a bastion, the best option is to deploy an additional client.

If nodes are unable to reach each other, the most likely scenario is that the UDP hole punching feature is unable to get the correct addresses. Though in this case, nodes are typically unable to reach the server.

I would check "wg show" on the nodes and see if there is a handshake with the other nodes. If there is a handshake but you cannot ping, you may need to reduce MTU. If there is no handshake, try "netclient pull -n ". If that doesnt work, you may need to try a network with UDP hole punching turned off.
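
Roughly (a sketch; the network and interface names are placeholders):

# Check for recent handshakes with peers
sudo wg show

# Re-pull the config for a given network
sudo netclient pull -n <network>

# If handshakes exist but pings fail, lower the interface MTU as a quick test
sudo ip link set dev nm-<network> mtu 1200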

ethanfowler commented 2 years ago

Thanks for the response. For context, pre-0.9.0 all of these machines talked fine, but there was a memory leak that kept crashing my EC2 instance, hence the upgrade. My point is: the MTUs and hole punching were fine. I would also point out that a single ping packet is only 84 bytes all-in, so how would adding WireGuard encapsulation push it over the MTU?

So I think I'm seeing two different failure modes.

  1. LAN: Both machines are on the same LAN, with IPs 10.251.1.3 and .4. They have a handshake but are unable to ping. The LAN MTU is the usual 1500, and the Netmaker default was 1280. Lowering their Netmaker MTUs to 1000 and pulling the config has resulted in some sporadic pinging: I leave ping running, and every few minutes a burst of around 10 pings gets through. No guarantee this is down to the reduced MTU, though; maybe a burst simply didn't occur before the change for other reasons. (See the MTU probe sketch after this list.)
  2. WAN: No handshake between machines behind separate firewalls. As mentioned above, these machines talked fine pre-0.9.0, and both can ping the server over wg.
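
To test the MTU theory directly, one option is to ping across the tunnel with the don't-fragment bit set (a sketch; the address is from the LAN case above, and the payload sizes assume the 1280 and 1000 interface MTUs minus 28 bytes of ICMP/IP headers):

ping -M do -s 1252 10.251.1.4   # 1252 + 28 = 1280, the default Netmaker MTU
ping -M do -s 972 10.251.1.4    # 972 + 28 = 1000, after lowering the MTU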
ethanfowler commented 2 years ago

I have just created (another) new EC2 instance and run the nm-quick.sh install script with the domain and email arguments, and ended up with exactly the same symptoms as above.

ethanfowler commented 2 years ago

So, doing some more testing this morning, it seems the problem is the docker client, i.e. setting up a client with:

docker stop netclient && docker rm netclient && rm -rf /etc/netclient  # For a clean install
docker run -d --network host  --privileged -e TOKEN=<token> -v /etc/netclient:/etc/netclient --name netclient gravitl/netclient:v0.9.1

Results in the symptoms described above.

Whereas if I use the standard non-Docker Linux install:

sudo su
docker stop netclient && docker rm netclient && rm -rf /etc/netclient  # For a clean install
curl -sfL https://raw.githubusercontent.com/gravitl/netmaker/master/scripts/netclient-install.sh | VERSION=v0.9.1 KEY=<token> sh -

I can ping even remote clients, and even DNS works (although only on 20.04 clients, not 18.04, but that's for a separate issue).

So I think the response to this issue should be an investigation into the dockerised client.

I am still in the situation wherein my EC2 instance goes to full CPU and stops responding after a couple of days; possibly a memory leak, but EC2 doesn't log memory usage, so I need to collect more data before raising an issue.

hagaibarel commented 2 years ago

I'm seeing this same behavior while trying to run netclient as a Kubernetes DaemonSet on hosts with kernel WireGuard installed. Running wg show lists the peers and I get handshakes from them, but I can't ping them from the host.

Installing with the script results in a working netclient setup, and I can ping other nodes and also get DNS working.

hagaibarel commented 2 years ago

Some more digging: it seems like missing route table entries prevent the ping from going through. I have two peers with WireGuard IPs 10.20.0.2 and 10.20.0.1. wg show lists them and the handshake looks fine. However, it seems no IP route is set up between the two; running ip route get shows the default route:

~# ip route get 10.20.0.2
10.20.0.2 via 10.207.0.1 dev ens4 src 10.207.0.19 uid 0 
    cache 

Running wg-quick down /etc/netclient/config/nm-pme.conf followed by wg-quick up /etc/netclient/config/nm-pme.conf removes and recreates the interface, this time with the proper routes set up. I can now get ping working between hosts, and the route looks fine:

~# ip route get 10.20.0.2
10.20.0.2 dev nm-pme src 10.20.0.1 uid 0 
    cache
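
For reference, while debugging, the missing route can also be added by hand (a sketch; the /24 is assumed from the peer addresses above):

ip route add 10.20.0.0/24 dev nm-pme
ip route get 10.20.0.2   # should now go via nm-pme instead of the default route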
afeiszli commented 2 years ago

@HagaiBarel That is a good find. What OS are you running? I have noticed this issue, and it seems to depend on the Linux distribution. I was thinking we should just add a few lines of code after bringing up the interface to confirm that it gets created correctly.

hagaibarel commented 2 years ago

I'm running Ubuntu on the hosts:

~# cat /etc/os-release 
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
hagaibarel commented 2 years ago

By the way, I still don't have DNS set up properly. I guess it might be a different issue, but the two seem related. Running resolvectl on the hosts produces the following:

~# resolvectl status nm-pme
Link 74 (nm-pme)
      Current Scopes: none
DefaultRoute setting: no  
       LLMNR setting: yes 
MulticastDNS setting: no  
  DNSOverTLS setting: no  
      DNSSEC setting: no  
    DNSSEC supported: no

Whereas when installing with the script, these settings were inserted correctly.
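
For comparison, what the script-based install presumably ends up doing can be approximated by hand with systemd-resolved (a sketch; the DNS server address and search domain below are placeholders, not values taken from this setup):

resolvectl dns nm-pme 10.20.0.1        # placeholder: the network's CoreDNS/server address
resolvectl domain nm-pme '~netmaker'   # placeholder: the network's DNS search domain
resolvectl status nm-pme               # should now show a current scope and the DNS server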

hagaibarel commented 2 years ago

Some more info: I deleted the DaemonSet, removed the WireGuard settings and netclient config (wg-quick down /etc/netclient/config/nm-pme.conf && rm -rf /etc/netclient), and relaunched the DaemonSet with WG_QUICK_USERSPACE_IMPLEMENTATION="". Now routes are created successfully and I can ping between nodes.

So it seems that this env var somehow interferes with the routes being created.
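
If the same thing affects the plain Docker client discussed earlier in this thread, a possible (untested) workaround would be to override the variable at run time:

docker run -d --network host --privileged \
  -e TOKEN=<token> \
  -e WG_QUICK_USERSPACE_IMPLEMENTATION="" \
  -v /etc/netclient:/etc/netclient \
  --name netclient gravitl/netclient:v0.9.1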

Still no DNS...

hagaibarel commented 2 years ago

I've opened https://github.com/gravitl/netmaker/issues/540 for further discussion of the DNS issues.

afeiszli commented 2 years ago

Fixed the route issue in 0.9.2.