Closed ethanfowler closed 2 years ago
Hi Ethan, I'm noticing a couple things.
First of all, are your security groups open? You need to make sure none of the ports are blocked by AWS.
Second, try using the docker-compose.contained.yml. That's what we're using for quick starts now.
Hi, thanks. I'm pretty sure my security groups are good. Inbound: [image] Outbound: [image]
I think I originally used that docker-compose but it has moved on. Anyway, I've started again from docker-compose.contained.yml as you suggest; everything comes up, Google auth works, and I can create networks etc. However, when clients try to connect, e.g. with the docker client:
docker run -d --network host --privileged -e TOKEN=<token> -v /etc/netclient:/etc/netclient --name netclient gravitl/netclient:v0.9.1
No network adapter is added, and docker logs netclient shows:
[netclient] joining network
2021/12/02 10:27:21 running userspace WireGuard with wireguard-go
2021/12/02 10:27:21 [netclient] success
2021/12/02 10:27:21 [netclient] ALREADY_INSTALLED. Netclient appears to already be installed for software-ml. To re-install, please remove by executing 'sudo netclient leave -n software-ml'. Then re-run the install command.
[netclient] Starting netclient checkin
2021/12/02 10:27:21 running userspace WireGuard with wireguard-go
2021/12/02 10:27:21 [netclient] running checkin for all networks
2021/12/02 10:27:21 [netclient] error checking in for software-ml network: rpc error: code = Unauthenticated desc = Empty record
My new docker-compose.yml for reference:
version: "3.4"
services:
  netmaker:
    container_name: netmaker
    image: gravitl/netmaker:v0.9.1
    volumes:
      - dnsconfig:/root/config/dnsconfig
      - /usr/bin/wg:/usr/bin/wg
      - sqldata:/root/data
    cap_add:
      - NET_ADMIN
    restart: always
    privileged: true
    environment:
      SERVER_HOST: "<public_ip>"
      SERVER_API_CONN_STRING: "api.netmaker.<domain>.com:443"
      SERVER_GRPC_CONN_STRING: "grpc.netmaker.<domain>.com:443"
      COREDNS_ADDR: "<public_ip>"
      GRPC_SSL: "on"
      DNS_MODE: "on"
      SERVER_HTTP_HOST: "api.netmaker.<domain>.com"
      SERVER_GRPC_HOST: "grpc.netmaker.<domain>.com"
      API_PORT: "8081"
      GRPC_PORT: "50051"
      CLIENT_MODE: "on"
      MASTER_KEY: "<key>"
      SERVER_GRPC_WIREGUARD: "off"
      CORS_ALLOWED_ORIGIN: "*"
      DISPLAY_KEYS: "on"
      DATABASE: "sqlite"
      NODE_ID: "netmaker-server-1"
      AUTH_PROVIDER: "google"
      CLIENT_ID: "<id>.apps.googleusercontent.com"
      CLIENT_SECRET: "<secret>"
      FRONTEND_URL: "https://dashboard.netmaker.<domain>.com"
    ports:
      - "51821-51921:51821-51921/udp"
      - "8081:8081"
      - "50051:50051"
  netmaker-ui:
    container_name: netmaker-ui
    depends_on:
      - netmaker
    image: gravitl/netmaker-ui:v0.9.1
    links:
      - "netmaker:api"
    ports:
      - "8082:80"
    environment:
      BACKEND_URL: "https://api.netmaker.<domain>.com"
    restart: always
  coredns:
    depends_on:
      - netmaker
    image: coredns/coredns
    command: -conf /root/dnsconfig/Corefile
    container_name: coredns
    restart: always
    ports:
      - "<private-ip>:53:53/udp"
      - "<private-ip>:53:53/tcp"
    volumes:
      - dnsconfig:/root/dnsconfig
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    network_mode: host # Wants ports 80 and 443!
    volumes:
      - /root/Caddyfile:/etc/caddy/Caddyfile
      # - $PWD/site:/srv # you could also serve a static site in site folder
      - caddy_data:/data
      - caddy_conf:/config
volumes:
  caddy_data: {}
  caddy_conf: {}
  sqldata: {}
  dnsconfig: {}
And Caddyfile:
{
    # LetsEncrypt account
    email software@<domain>.com
}

# Dashboard
https://dashboard.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8082
}

# API
https://api.netmaker.<domain>.com {
    reverse_proxy http://127.0.0.1:8081
}

# gRPC
https://grpc.netmaker.<domain>.com {
    reverse_proxy h2c://127.0.0.1:50051
}
In the meantime, out of desperation, I created a brand new EC2 instance and ran through the "quick install" instructions again. This hit the LetsEncrypt rate limits. To get around this, I created new subdomains and modified the URLs in my docker-compose.yml to point at the new subdomain. It is reachable, and I can log in and create networks, but when I try to join a network using the docker method I get:
2021/12/02 12:27:54 running userspace WireGuard with wireguard-go
2021/12/02 12:27:54 [netclient] running checkin for all networks
2021/12/02 12:27:54 [netclient] error checking in for software-ml network: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"
2021/12/02 12:28:04 running userspace WireGuard with wireguard-go
2021/12/02 12:28:04 [netclient] running checkin for all networks
2021/12/02 12:28:05 [netclient] error checking in for software-ml network: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"
I'm at my wits' end here. This must be my 10th attempt to get netmaker up and running stably. I used to be a systems engineer designing IP cryptos and am now a software engineer, so I (broadly) know what I'm doing.
In case it sheds any light, 1 out of 3 clients can join the network, the other two give that TLS error above.
It still looks like you are not using the "contained" docker-compose (docker-compose.contained.yml).
The TLS issue is likely caused by re-using the same domain with Caddy. If you run the install multiple times with the same subdomain, Caddy will start failing to automatically generate certs.
It also looks like there are artifacts of your installation in the docker client. ALREADY_INSTALLED indicates there are already entries in /etc/netclient and it will re-use those entries. So worth removing that folder (and related interfaces, if they exist) before moving on.
One last note: your security groups look correct, but for an install I would just use the "quick-install" command from the readme. I think that will do a better job of setting things up correctly.
I am using docker-compose.contained.yml from master; I replaced the placeholder variables and added Google OAuth.
As I said above, I have moved to a new domain (nm.domain.com instead of netmaker.domain.com) and all of the subdomains are resolving correctly, dashboard working etc.
Your point about docker artefacts has helped; I had checked that there were no docker volumes persisting on the client, but failed to notice the mapping to the host machine /etc/netclient. Having run sudo rm -rf /etc/netclient on the host, the docker container now seems to connect correctly. I'll test this on other clients, along with the original ping issue.
Ok, so, back to original issue; fresh install on a fresh EC2 instance, as per the quick install instructions, taking docker-compose.contained.yml from master and replacing variables as instructed, AWS security group set up as instructed.
Nodes can connect to netmaker server, join networks, and ping netmaker on its network IP. Weirdly, netmaker cannot ping other nodes. Nodes cannot ping each other. DNS is also not working.
Ok, I can ping other nodes from inside the netmaker container on the server, but not from the server itself:
$ sudo docker exec -it netmaker sh
~ # ping 10.251.1.3
PING 10.251.1.3 (10.251.1.3): 56 data bytes
64 bytes from 10.251.1.3: seq=1 ttl=64 time=15.632 ms
64 bytes from 10.251.1.3: seq=2 ttl=64 time=15.236 ms
$ ping 10.251.1.3
PING 10.251.1.3 (10.251.1.3) 56(84) bytes of data.
hangs.
Yup, that's what the 'contained' version does. It confines networking to the container, which keeps things a lot cleaner on the host. If you need to use the host as a bastion, the best option is to deploy an additional client.
If nodes are unable to reach each other, the most likely scenario is that the UDP hole punching feature is unable to get the correct addresses. Though in this case, nodes are typically unable to reach the server.
I would check "wg show" on the nodes and see if there is a handshake with the other nodes. If there is a handshake but you cannot ping, you may need to reduce MTU. If there is no handshake, try "netclient pull -n
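The handshake check suggested above can be scripted. This is a minimal sketch, not part of netclient: it assumes the `wg show <interface> latest-handshakes` output format (one line per peer, public key then a Unix timestamp, with 0 meaning no handshake yet), and the function name `peers_without_handshake` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list peers that have never completed a handshake.
# Reads the output of `wg show <interface> latest-handshakes` on stdin,
# i.e. one "<public-key> <unix-timestamp>" line per peer.
# A timestamp of 0 means no handshake has happened yet.
peers_without_handshake() {
    # NF == 2 skips blank or malformed lines before comparing the timestamp.
    awk 'NF == 2 && $2 == 0 { print $1 }'
}
```

On a node this would be used as `wg show <interface> latest-handshakes | peers_without_handshake`; empty output means every peer has completed a handshake at least once.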
Thanks for the response. For context, pre 0.9.0, all of these machines talked fine, but there was a memory leak that kept crashing my EC2 instance, hence the upgrade. My point is: MTUs and hole punching were fine. I would also point out that a single-ping packet is 84 bytes all-in - how are we expecting the addition of a wireguard encapsulation to take a packet over the MTU?
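A rough back-of-the-envelope sketch of the packet-size point above. It assumes an IPv4 underlay and the usual WireGuard data-message overhead (20-byte outer IPv4 header, 8-byte UDP header, 16-byte WireGuard header, 16-byte authentication tag, payload padded to a multiple of 16 bytes). The function name `wg_outer_size` is made up for illustration.

```shell
#!/bin/sh
# Approximate on-the-wire size of one tunnelled packet over an IPv4 underlay.
# Assumed overhead: 20 (outer IPv4) + 8 (UDP) + 16 (WireGuard data header)
# + 16 (authentication tag) = 60 bytes; the inner packet is padded up to a
# multiple of 16 bytes before encryption.
wg_outer_size() {
    inner=$1
    padded=$(( (inner + 15) / 16 * 16 ))
    echo $(( padded + 60 ))
}
```

`wg_outer_size 84` prints 156, far below a typical 1500-byte link MTU, which is consistent with MTU not being the limiting factor for a single default-sized ping.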
So I think I'm seeing two different failure modes.
I have just created (another) new EC2 instance and run the nm-quick.sh install script with the domain and email arguments, and ended up with exactly the same symptoms as above.
So, doing some more testing this morning, it seems the problem is the docker client, i.e. setting up a client with:
docker stop netclient && docker rm netclient && rm -rf /etc/netclient # For a clean install
docker run -d --network host --privileged -e TOKEN=<token> -v /etc/netclient:/etc/netclient --name netclient gravitl/netclient:v0.9.1
Results in the symptoms above, namely:
netmaker
If I use the non-docker Linux standard:
sudo su
docker stop netclient && docker rm netclient && rm -rf /etc/netclient # For a clean install
curl -sfL https://raw.githubusercontent.com/gravitl/netmaker/master/scripts/netclient-install.sh | VERSION=v0.9.1 KEY=<token> sh -
I can ping even remote clients, and even DNS works (although only on 20.04 clients, not 18.04, but that's for a separate issue).
So I think the response to this issue should be an investigation into the dockerised client.
I am still in the situation wherein my EC2 instance goes to full CPU and stops responding after a couple of days; possibly a memory leak, but EC2 doesn't log memory usage, so I need to collect more data before raising an issue.
Seeing this same behavior while trying to run netclient as a kubernetes daemonset on hosts with kernel WireGuard installed. Running wg show lists the peers and I get handshakes from them, but I can't ping them from the host.
Installing with the script results in a working setup of netclient, and I can ping other nodes and also get DNS working.
Some more digging: it seems like missing route table entries prevent the ping from going through. I have 2 peers with wg IPs of 10.20.0.2 and 10.20.0.1. wg show lists them and the handshake looks fine. However, it seems no ip route is set up between the two. Running ip route get shows the default route:
~# ip route get 10.20.0.2
10.20.0.2 via 10.207.0.1 dev ens4 src 10.207.0.19 uid 0
cache
Running wg-quick down /etc/netclient/config/nm-pme.conf and wg-quick up /etc/netclient/config/nm-pme.conf removes and recreates the interface, this time with the proper routes set up, and I can get ping working between hosts. The route now looks fine:
~# ip route get 10.20.0.2
10.20.0.2 dev nm-pme src 10.20.0.1 uid 0
cache
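The before/after comparison above can be turned into a quick check. This is only a sketch: it parses `ip route get` output in the shape shown in this thread, and the function names `route_dev` and `uses_interface` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the device chosen in `ip route get` output (stdin),
# i.e. the word following "dev" on the first line that contains it.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Succeeds only if the route read from stdin uses the expected interface ($1).
uses_interface() {
    [ "$(route_dev)" = "$1" ]
}
```

For example, `ip route get 10.20.0.2 | uses_interface nm-pme || echo "route bypasses the tunnel"` would flag the broken state shown in the first output, where traffic leaves via ens4 instead of nm-pme.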
@HagaiBarel That is a good find. What OS are you running? I have noticed this issue, and it seems to be dependent on linux OS. Was thinking we should just add in a few lines of code after bringing up the interface to confirm the interface gets created.
I'm running Ubuntu on the hosts:
~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
Btw, I still don't have DNS set up properly. I guess it might be a different issue, but the two seem related. Running resolvectl on the hosts produces the following:
~# resolvectl status nm-pme
Link 74 (nm-pme)
Current Scopes: none
DefaultRoute setting: no
LLMNR setting: yes
MulticastDNS setting: no
DNSOverTLS setting: no
DNSSEC setting: no
DNSSEC supported: no
And while using the installation script the settings were inserted correctly
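A small sketch for spotting the same symptom on other hosts. It assumes the `resolvectl status <link>` layout shown above, where a link with no DNS configured reports "Current Scopes: none"; the function name `link_has_dns_scope` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the link reports any active resolver scope.
# Reads `resolvectl status <link>` output on stdin and inspects the
# "Current Scopes:" line; "none" (or a missing line) counts as failure.
link_has_dns_scope() {
    scopes=$(sed -n 's/^ *Current Scopes: *//p')
    [ -n "$scopes" ] && [ "$scopes" != "none" ]
}
```

Usage would be along the lines of `resolvectl status nm-pme | link_has_dns_scope || echo "no DNS scope on nm-pme"`.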
Some more info: I've deleted the daemonset, removed the WireGuard settings and netclient config (wg-quick down /etc/netclient/config/nm-pme.conf && rm -rf /etc/netclient), and relaunched the daemonset with WG_QUICK_USERSPACE_IMPLEMENTATION="". Now routes are created successfully and I can ping between nodes.
So it seems that this env var somehow messes up the routes being created.
Still no DNS...
I've opened https://github.com/gravitl/netmaker/issues/540 for further discussion on DNS issues
Fixed route issue in 0.9.2.
Hi, standard AWS setup as per the docs, on an EC2 Micro 20.04.2 instance. DNS, dashboard etc. are working. Tunnels are up, but no-one can ping anyone. Even on the netmaker server:
My docker-compose.yml:
Caddyfile