wuwo1952368901 opened 2 weeks ago
After running normally for a while, some nodes stop responding to ping. Restarting the netclient service restores connectivity, but after some time the problem may come back. How can we investigate the root cause? @afeiszli
Can you provide more information about your environment?
They are not behind NAT.
OS:
- Debian, debian_version 12.7, kernel: Linux 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
- Debian, debian_version 12.4, kernel: Linux 6.1.0-17-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
When the issue happens, there are usually several places to check:
- Run the wg command to check whether the target host's IP is in the peer list.
- Export the netclient log with journalctl -u netclient > ./netclient.log and check whether there are any errors, or what netclient was doing, at the time the issue occurred.
- It may also be worth checking the system log for anything unusual around that time.
Using the wg command, I found that the peers' endpoint IPs are incorrect: they were automatically set to IPs from my k8s cluster network.
peer: publickey
endpoint: 10.42.6.133:51821
allowed ips: 10.103.0.6/32
transfer: 0 B received, 4.47 MiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: 10.42.9.197:51821
allowed ips: 10.103.0.9/32
transfer: 0 B received, 4.60 MiB sent
persistent keepalive: every 20 seconds
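The wrong endpoints above share the 10.42.x.x prefix. Below is a minimal Python sketch for spotting such peers quickly, assuming the input is the output of wg show netmaker endpoints (one public key, a tab, and the endpoint per line). The helpers parse_endpoints and suspicious_peers and the SUSPECT_CIDRS list are hypothetical. The 10.42.0.0/16 range is only an assumption (it is the k3s default pod CIDR and matches the endpoints above), so substitute your own cluster's ranges.

```python
import ipaddress

# ASSUMPTION: 10.42.0.0/16 is the k3s default pod CIDR and matches the
# 10.42.x.x endpoints above; replace with your own cluster's pod/service CIDRs.
SUSPECT_CIDRS = [ipaddress.ip_network("10.42.0.0/16")]


def parse_endpoints(wg_endpoints_output):
    """Parse `wg show <interface> endpoints` output: 'pubkey<TAB>endpoint' per line."""
    endpoints = {}
    for line in wg_endpoints_output.splitlines():
        if not line.strip():
            continue
        pubkey, endpoint = line.split("\t", 1)
        endpoints[pubkey] = endpoint.strip()
    return endpoints


def suspicious_peers(wg_endpoints_output, cidrs=SUSPECT_CIDRS):
    """Return {pubkey: endpoint} for peers whose endpoint IP lies in one of `cidrs`."""
    flagged = {}
    for pubkey, endpoint in parse_endpoints(wg_endpoints_output).items():
        if endpoint == "(none)":
            continue  # wg prints "(none)" for a peer without an endpoint
        host, _, _port = endpoint.rpartition(":")
        ip = ipaddress.ip_address(host.strip("[]"))  # IPv6 endpoints are bracketed
        # An IPv6 address is never "in" an IPv4 network, so mixed input is safe.
        if any(ip in net for net in cidrs):
            flagged[pubkey] = endpoint
    return flagged
```

Any peer returned by suspicious_peers is using a cluster-internal address as its WireGuard endpoint, which is the situation described above.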
Auto endpoint detection is enabled by default, so that hosts in the same subnet can communicate with each other using their internal IPs.
In your setup, the hosts cannot reach each other through the k8s cluster network IPs.
Alternatively, you can disable auto endpoint detection: in netmaker.env, set ENDPOINT_DETECTION=false
and restart the containers with docker compose down && docker compose up -d
After syncing the configuration with "netclient pull", the node still cannot ping. "wg show" reports the following:
interface: netmaker
public key: publickey
private key: (hidden)
listening port: 51821
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.4/32
latest handshake: 1 minute, 3 seconds ago
transfer: 209.23 KiB received, 143.68 KiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.3/32
latest handshake: 1 minute, 35 seconds ago
transfer: 5.31 MiB received, 958.77 KiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.5/32
transfer: 0 B received, 39.17 KiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.2/32
transfer: 0 B received, 39.31 KiB sent
persistent keepalive: every 20 seconds
The last two peers cannot be pinged. In the "wg show" output, these problematic peers have no "latest handshake" line.
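A similar sketch for finding the peers with no handshake, assuming the input is the output of wg show netmaker latest-handshakes, which prints each peer's public key and the Unix time of its last handshake (0 if there has never been one). The function peers_without_handshake and the 180-second STALE_AFTER_SECONDS threshold are hypothetical choices, not netclient defaults.

```python
import time

# ASSUMPTION: treat a handshake older than 180 s as stale. With traffic or
# persistent keepalives flowing, WireGuard re-handshakes roughly every 2 minutes.
STALE_AFTER_SECONDS = 180


def peers_without_handshake(latest_handshakes_output, now=None, stale_after=STALE_AFTER_SECONDS):
    """Parse `wg show <interface> latest-handshakes` output ('pubkey<TAB>unix-time' per line).

    Returns (never, stale): peers that never completed a handshake, and peers
    whose last handshake is older than `stale_after` seconds.
    """
    now = time.time() if now is None else now
    never, stale = [], []
    for line in latest_handshakes_output.splitlines():
        if not line.strip():
            continue
        pubkey, ts = line.split("\t", 1)
        ts = int(ts)
        if ts == 0:
            never.append(pubkey)  # wg prints 0 when no handshake has happened yet
        elif now - ts > stale_after:
            stale.append(pubkey)
    return never, stale
```

On the node above, the two peers without a "latest handshake" line would be expected to appear in the never list.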
@yabinma @afeiszli
Hi, can you share the "wg show" output from peer 10.104.0.5?
Here is the "wg show" output on 10.104.0.5:
interface: netmaker
public key: publickey
private key: (hidden)
listening port: 51821
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.4/32
latest handshake: 1 minute, 11 seconds ago
transfer: 11.06 MiB received, 56.23 MiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.3/32
latest handshake: 1 minute, 21 seconds ago
transfer: 368.95 MiB received, 321.13 MiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.2/32
transfer: 0 B received, 489.67 KiB sent
persistent keepalive: every 20 seconds
peer: publickey
endpoint: xxx.xxx.xxx.xxx:51821
allowed ips: 10.104.0.1/32
transfer: 0 B received, 465.68 KiB sent
persistent keepalive: every 20 seconds
A tcpdump capture shows the packets on the netmaker interface, but not on the external interface (eth0). The commands were as follows (all run on peer 10.104.0.1):
tcpdump -i netmaker host 10.104.0.2 and icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on netmaker, link-type RAW (Raw IP), snapshot length 262144 bytes
12:18:18.400768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 616, length 64
12:18:19.424768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 617, length 64
12:18:20.448792 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 618, length 64
12:18:21.472789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 619, length 64
12:18:22.496784 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 620, length 64
12:18:23.520791 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 621, length 64
12:18:24.544723 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 622, length 64
12:18:25.568768 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 623, length 64
12:18:26.592790 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 624, length 64
12:18:27.616776 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 625, length 64
12:18:28.640789 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 626, length 64
12:18:29.664798 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 627, length 64
12:18:30.688800 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 628, length 64
12:18:31.712777 IP 10.104.0.1 > 10.104.0.2: ICMP echo request, id 24016, seq 629, length 64
tcpdump -i eth0 host xxx.xxx.xxx.xxx
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:18:14.305050 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:19.328962 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:24.545006 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
12:18:29.665054 IP peer.10.104.0.1.51821 > peer.10.104.0.2.51821: UDP, length 148
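One possible reading of the eth0 capture: the UDP packets are all 148 bytes long and about 5 seconds apart. In the WireGuard protocol, 148 bytes is the size of a handshake initiation message, and an unanswered initiation is retried after about 5 seconds, so 10.104.0.1 may be sending handshake initiations to 10.104.0.2 that never get a response. The sketch below classifies tcpdump's one-line UDP output by payload length to check this. The names classify, summarize and WG_SIZES are hypothetical, and matching on length alone is a heuristic, since this tcpdump output does not show the WireGuard message type.

```python
import re

# WireGuard UDP payload sizes: handshake initiation = 148 bytes, handshake
# response = 92 bytes, keepalive (empty transport message) = 32 bytes.
# Anything else is treated as transport data or a cookie reply.
WG_SIZES = {148: "handshake-initiation", 92: "handshake-response", 32: "keepalive"}

# Matches lines like "12:18:14.305050 IP a.51821 > b.51821: UDP, length 148".
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}\.\d+) IP \S+ > \S+: UDP, length (\d+)")


def classify(tcpdump_lines):
    """Return (seconds_since_midnight, kind) for each UDP line; assumes no midnight rollover."""
    events = []
    for line in tcpdump_lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip tcpdump banners and non-UDP lines
        hh, mm, ss, length = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + float(ss)
        events.append((t, WG_SIZES.get(int(length), "data-or-cookie")))
    return events


def summarize(events):
    """Count each kind and compute the mean gap between handshake initiations."""
    counts = {}
    for _, kind in events:
        counts[kind] = counts.get(kind, 0) + 1
    inits = [t for t, kind in events if kind == "handshake-initiation"]
    gaps = [b - a for a, b in zip(inits, inits[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return counts, mean_gap
```

Fed the four eth0 lines above, it reports four handshake initiations and no handshake responses, with a mean gap of roughly 5.1 seconds.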
Can you share your network diagram?
Is this what you want?
What happened?
Suddenly unable to ping between nodes.
Version
v0.24.2