gravitl / netmaker

Netmaker makes networks with WireGuard. Netmaker automates fast, secure, and distributed virtual networks.
https://netmaker.io

[Bug]: Suddenly unable to work #3183

Open wuwo1952368901 opened 2 weeks ago

wuwo1952368901 commented 2 weeks ago

Contact Details

No response

What happened?

Suddenly unable to ping between nodes.

Version

v0.24.2

What OS are you using?

No response

Relevant log output

No response

wuwo1952368901 commented 2 weeks ago

After running normally for a while, some nodes can no longer be pinged. Restarting the netclient service restores connectivity, but the problem returns after some time. How can we investigate the specific cause? @afeiszli

abhishek9686 commented 2 weeks ago

Can you provide more information about your environment?

  1. Which OS are the clients running on?
  2. Are they behind NAT?

wuwo1952368901 commented 2 weeks ago

They are not behind NAT.

OS:

  Debian
  debian_version:12.7
  kernel: Linux  6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

  Debian
  debian_version:12.4
  kernel: Linux  6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
yabinma commented 2 weeks ago

When the issue happens, there are usually several places to check:

  1. Run the wg command to check whether the target host's IP is in the peer list.
  2. Export the netclient log with journalctl -u netclient > ./netclient.log and check it for errors, or for what netclient was doing at the time the issue occurred.
  3. It may also be worth checking the system log for anything unusual around that time.

wuwo1952368901 commented 2 weeks ago

Using the wg command, I found that the endpoint IPs of the peers are incorrect. They were automatically set to IPs from my k8s cluster network.

peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32
  transfer: 0 B received, 4.47 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: 10.42.9.197:51821
  allowed ips: 10.103.0.9/32
  transfer: 0 B received, 4.60 MiB sent
  persistent keepalive: every 20 seconds
yabinma commented 2 weeks ago

Auto endpoint detection is enabled by default, so that hosts can communicate with each other over their internal IPs when they are in the same subnet.

In your setup, the hosts cannot reach each other over the k8s cluster network IPs, so you may want to disable auto endpoint detection. In netmaker.env, set ENDPOINT_DETECTION=false and restart the containers with docker compose down && docker compose up -d
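
To spot this situation quickly, one option is to scan the wg output for peers whose endpoint is an RFC 1918 private address. The following is only a minimal Python sketch, assuming the plain-text wg output format shown earlier in this thread. The function name private_endpoints and the sample text (including the 203.0.113.7 documentation address) are illustrative and not part of netmaker or the WireGuard tools.

```python
import ipaddress
import re

# RFC 1918 private ranges; the k8s pod network from this thread (10.42.x.x) is in the first one.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Matches lines such as "  endpoint: 10.42.6.133:51821" (IPv4 only in this sketch).
ENDPOINT_RE = re.compile(r"^\s*endpoint:\s*(\d{1,3}(?:\.\d{1,3}){3}):(\d+)\s*$")


def private_endpoints(wg_output: str) -> list[str]:
    """Return the endpoint IPs found in wg text output that fall in RFC 1918 ranges."""
    flagged = []
    for line in wg_output.splitlines():
        match = ENDPOINT_RE.match(line)
        if match is None:
            continue
        try:
            address = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # not a valid IPv4 address, e.g. a redacted value
        if any(address in network for network in RFC1918):
            flagged.append(str(address))
    return flagged


# Hypothetical sample: the first peer is copied from this thread, the second is made up.
SAMPLE = """\
peer: publickey
  endpoint: 10.42.6.133:51821
  allowed ips: 10.103.0.6/32

peer: publickey
  endpoint: 203.0.113.7:51821
  allowed ips: 10.103.0.9/32
"""

print(private_endpoints(SAMPLE))  # ['10.42.6.133']
```

A private endpoint is not wrong in itself: it is what endpoint detection is meant to produce when hosts share a subnet. It only points to a problem when the peers cannot actually reach each other on that address, as here.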

wuwo1952368901 commented 1 day ago

After synchronizing the configuration with "netclient pull", the node still cannot ping. The "wg show" command gives the following output:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 3 seconds ago
  transfer: 209.23 KiB received, 143.68 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 35 seconds ago
  transfer: 5.31 MiB received, 958.77 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.5/32
  transfer: 0 B received, 39.17 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 39.31 KiB sent
  persistent keepalive: every 20 seconds

The last two nodes cannot be pinged properly. The wg show command shows that the problematic nodes do not have a "latest handshake".

@yabinma @afeiszli
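
The same check can be automated when there are many peers. Below is a rough Python sketch, assuming the text layout of the wg show output above. The function name peers_without_handshake and the shortened sample are hypothetical and not part of any netmaker tooling.

```python
def peers_without_handshake(wg_output: str) -> list[str]:
    """Return the 'allowed ips' value of each peer block lacking a 'latest handshake' line."""
    blocks = []     # one dict of fields per peer block
    current = None  # the peer block being read; None while inside the interface block
    for raw in wg_output.splitlines():
        line = raw.strip()
        if line.startswith("peer:"):
            current = {}
            blocks.append(current)
        elif line.startswith("interface:"):
            current = None
        elif current is not None and ":" in line:
            # Split on the first colon only, so "endpoint: host:port" keeps its port.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return [
        fields.get("allowed ips", "<unknown>")
        for fields in blocks
        if "latest handshake" not in fields
    ]


# Shortened copy of the output posted above (two of the four peers).
SAMPLE = """\
interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 3 seconds ago
  transfer: 209.23 KiB received, 143.68 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.5/32
  transfer: 0 B received, 39.17 KiB sent
  persistent keepalive: every 20 seconds
"""

print(peers_without_handshake(SAMPLE))  # ['10.104.0.5/32']
```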

abhishek9686 commented 1 day ago

Hi, can you share the output of wg show from peer 10.104.0.5?

wuwo1952368901 commented 1 day ago

This is the "wg show" output on 10.104.0.5:

interface: netmaker
  public key: publickey
  private key: (hidden)
  listening port: 51821

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.4/32
  latest handshake: 1 minute, 11 seconds ago
  transfer: 11.06 MiB received, 56.23 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.3/32
  latest handshake: 1 minute, 21 seconds ago
  transfer: 368.95 MiB received, 321.13 MiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.2/32
  transfer: 0 B received, 489.67 KiB sent
  persistent keepalive: every 20 seconds

peer: publickey
  endpoint: xxx.xxx.xxx.xxx:51821
  allowed ips: 10.104.0.1/32
  transfer: 0 B received, 465.68 KiB sent
  persistent keepalive: every 20 seconds
wuwo1952368901 commented 17 hours ago

A tcpdump packet capture shows that the netmaker interface carries the ping packets, but the external interface (eth0) shows only outgoing UDP packets to the peer and no replies. All of the commands below were run on peer 10.104.0.1:

tcpdump -i netmaker host 10.104.0.2 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on netmaker, link-type RAW (Raw IP), snapshot length 262144 bytes
12:18:18.400768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 616, length 64
12:18:19.424768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 617, length 64
12:18:20.448792 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 618, length 64
12:18:21.472789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 619, length 64
12:18:22.496784 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 620, length 64
12:18:23.520791 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 621, length 64
12:18:24.544723 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 622, length 64
12:18:25.568768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 623, length 64
12:18:26.592790 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 624, length 64
12:18:27.616776 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 625, length 64
12:18:28.640789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 626, length 64
12:18:29.664798 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 627, length 64
12:18:30.688800 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 628, length 64
12:18:31.712777 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 629, length 64

tcpdump -i eth0 host xxx.xxx.xxx.xxx
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
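
The UDP length in the eth0 capture helps interpret it. WireGuard handshake messages have fixed sizes: an initiation is 148 bytes and a response is 92 bytes, while an empty keepalive is 32 bytes. The sketch below is an illustrative Python helper, not part of tcpdump or WireGuard, with assumed names classify and FIXED_SIZES. It labels tcpdump lines by that length. Treat it as a heuristic, since the authoritative message type is the first byte of the UDP payload.

```python
import re
from collections import Counter
from typing import Optional

# UDP payload sizes that identify a WireGuard message by length alone.
FIXED_SIZES = {
    148: "handshake initiation",
    92: "handshake response",
    32: "keepalive",
}

UDP_LENGTH_RE = re.compile(r"UDP, length (\d+)")


def classify(tcpdump_line: str) -> Optional[str]:
    """Label a tcpdump text line by UDP length; None if the line has no UDP length."""
    match = UDP_LENGTH_RE.search(tcpdump_line)
    if match is None:
        return None
    return FIXED_SIZES.get(int(match.group(1)), "other")


# The four eth0 lines from the capture above.
CAPTURE = """\
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
"""

counts = Counter(classify(line) for line in CAPTURE.splitlines())
print(dict(counts))  # {'handshake initiation': 4}
```

Read this way, the capture would show 10.104.0.1 retrying the handshake roughly every 5 seconds with no response seen from 10.104.0.2. That is consistent with the missing "latest handshake" for that peer. It suggests checking whether UDP port 51821 is reachable on the other side, for example by running the same capture on 10.104.0.2.
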
abhishek9686 commented 16 hours ago

Can you share your network diagram?

wuwo1952368901 commented 15 hours ago

Is this what you want?

image