Open sicutdeux opened 11 months ago
I encountered the same issue today and 3 weeks ago, on different hosts (but with otherwise identical config, deployed via Ansible). Only completely resetting the netclient and re-registering solved it.
netclient 24.1
@cocoonkid is this a self-hosted setup or SaaS?
Self-hosted. Apart from 2 running Docker containers, the systems are identical Rocky Linux 9.3 hosts; identical because they were set up with Ansible. I will update all of them to the latest netclient today too.
Can you share the complete log file? You can send it to me over Discord, if that's fine.
I would, but I upgraded the Netmaker host to 24.2, along with all clients and the OS itself, so Docker got restarted. So there are no server-side logs.
On the client, I rm'ed /etc/netclient and enrolled it anew, with a fresh IP, in the network it was supposed to live in.
Journalctl still had the service logs. DM'ed you now.
It happened again today, again on a random client. The whole network is on the latest netclient and nm-server.
I DM'ed @abhishek9686 about it. Same error again:
{"time":"2024-07-03T17:43:48.913852722Z","level":"ERROR","source":"handlers.go 95}","msg":"failed to decrypt message for host","id":"da00efc7-baa0-4ee7-b79f-2819aea88e28","name":"starry-fahrenheit","error":"could not decrypt message, [49 14
Here is a screenshot from syslog from the moment it suddenly starts going weird. netclient] 2024-07-03 00:25:07 error publishing checkin cannot publish ... Mqclient not connected --> this one happens all the time, every few minutes, so it is expected and the tunnels still work. But then it died for real.
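To see which hosts are affected and how often, the decrypt errors can be counted per host from the logs. The following is only a sketch: it assumes the log lines look like the JSON line quoted above (possibly truncated by the pager, and possibly prefixed by a syslog header), and the function name count_decrypt_errors is made up for illustration.

```python
import json
import re
from collections import Counter

# Message fragments taken from the log lines quoted in this thread.
DECRYPT_MARKERS = (
    "failed to decrypt message",
    "error decrypting message",
    "could not decrypt message",
)


def count_decrypt_errors(lines):
    """Count decrypt errors per host name; lines without a name count as 'unknown'."""
    counts = Counter()
    for line in lines:
        if not any(marker in line for marker in DECRYPT_MARKERS):
            continue
        name = "unknown"
        start = line.find("{")  # skip a possible syslog prefix before the JSON
        if start != -1:
            try:
                record = json.loads(line[start:])
                name = record.get("name", "unknown")
            except json.JSONDecodeError:
                # Truncated line (as in the paste above): fall back to a regex.
                match = re.search(r'"name":"([^"]*)"', line)
                if match:
                    name = match.group(1)
        counts[name] += 1
    return counts
```

It accepts any iterable of lines, so it can be fed sys.stdin when piping in journal output, for example from journalctl -u netclient -o cat (the unit name netclient is an assumption about the local setup).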
I am getting the same issue out of the blue. I was running 0.23 for a while and have upgraded to 0.24.3, but the issue is still there. It is limited to certain hosts, which is quite odd.
I have manually upgraded the affected endpoints and it now looks better; let's see if it stays stable.
We couldn't get to the bottom of this behaviour yet. By any chance, is the machine where this issue has been observed short on resources? Also, the only remediation if you encounter this behaviour is to delete the host and re-join the network.
I don't think so. Could it be some certificate expiring and not getting renewed? I have not checked exactly how certificates are used for MQTT/API.
In my case I have all servers under Zabbix monitoring, so resources are clearly not an issue. They are all under 20% load on average.
Upgraded all servers to 0.24.3 today and will continue to watch for symptoms. Currently I have a ping running that targets the Netmaker server, and if that ping fails, the following is executed on the clients:
netclient leave <network>
netclient join -t <code>
(Originally it also did an rm -rf /etc/netclient, but I found this unnecessary.)
I found that this sometimes also changes the IP when it is done via systemd. When doing it manually, the server keeps its WireGuard IP. Is there a way to pin the address (like a static DHCP lease) so I can make sure the servers keep their IP?
I don't seem to get consistent behaviour... very weird.
This is the script I am currently using to fix a failed tunnel (and to adjust the Zabbix monitoring IP accordingly IF it changed).
import logging
import signal
import subprocess
import sys
import time

TUNNEL_TEST_HOSTNAME = '<netmaker-wireguard-ip address>'
TARGET_NETWORK = '<name-of-network>'
ENROLLMENT_CODE = '<enrollment-code>'
DNS_SERVER = '<netmaker-wireguard-ip address>'
CONFIG_FILE = "/etc/zabbix/zabbix_agent2.conf"

hostname = subprocess.check_output("hostname", shell=True).decode().strip()
WIREGUARD_HOSTNAME = hostname + ".customer_endpoints"

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def check_ip(target):
    """Check if the target IP is responding."""
    try:
        ip_address = resolve_dns(target)
        if not ip_address:
            logging.error(f"DNS resolution failed for {target}")
            return False
        time.sleep(1)
        logging.info(f"Resolved {target} to {ip_address}")
        output = subprocess.check_output(["ping", "-c", "3", "-w", "3", ip_address], stderr=subprocess.STDOUT)
        time.sleep(1)
        logging.info(f"Ping output: {output.decode().strip()}")
        return True
    except subprocess.CalledProcessError as e:
        logging.error(f"Ping failed: {e.output.decode().strip()}")
        return False
    except Exception as e:
        logging.error(f"Unexpected error when checking IP: {e}")
        return False


def leave_network():
    """Leave the network using netclient."""
    logging.info("Leaving the network...")
    try:
        subprocess.check_output(["netclient", "leave", TARGET_NETWORK], stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        logging.error(f"Failed to leave network: {e.output.decode().strip()}")
    except Exception as e:
        logging.error(f"Unexpected error when leaving network: {e}")


def join_network():
    """Join the network using netclient."""
    logging.info("Rejoining the network...")
    try:
        subprocess.check_output(["netclient", "join", "-t", ENROLLMENT_CODE], stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        logging.error(f"Failed to join network: {e.output.decode().strip()}")
    except Exception as e:
        logging.error(f"Unexpected error when joining network: {e}")


def signal_handler(sig, frame):
    logging.info('Received termination signal. Exiting...')
    sys.exit(0)


def resolve_dns(name):
    """Resolve a name against the Netmaker DNS server; return None on failure."""
    try:
        output = subprocess.check_output(["dig", "+short", name, f"@{DNS_SERVER}"], stderr=subprocess.STDOUT, timeout=5)
        logging.debug(f"dig output: {output.decode().strip()}")
        ip_address = output.decode().strip()
        if ip_address:
            return ip_address
    except subprocess.TimeoutExpired:
        logging.error("dig command timed out")
    except subprocess.CalledProcessError as e:
        logging.error(f"dig command failed: {e.output.decode().strip()}")
    except Exception as e:
        logging.error(f"Unexpected error in resolve_dns: {e}")
    return None


def get_current_listen_ip():
    """Return the ListenIP value from the Zabbix agent config, or None."""
    with open(CONFIG_FILE, 'r') as file:
        for line in file:
            if line.startswith("ListenIP="):
                return line.strip().split('=', 1)[1]
    return None


def update_listen_ip(new_ip):
    """Rewrite the ListenIP line in the Zabbix agent config."""
    with open(CONFIG_FILE, 'r') as file:
        lines = file.readlines()
    with open(CONFIG_FILE, 'w') as file:
        for line in lines:
            if line.startswith("ListenIP="):
                file.write(f"ListenIP={new_ip}\n")
            else:
                file.write(line)


def restart_zabbix_agent():
    subprocess.run(["systemctl", "restart", "zabbix-agent2.service"], check=True)


def zabbix_ip_watcher():
    """Point the Zabbix agent at the current WireGuard IP if it changed."""
    new_ip = resolve_dns(WIREGUARD_HOSTNAME)
    logging.debug(f"Resolved {WIREGUARD_HOSTNAME} to {new_ip}")
    if new_ip:
        current_ip = get_current_listen_ip()
        if new_ip != current_ip:
            update_listen_ip(new_ip)
            logging.info(f"Changed {current_ip} to {new_ip}")
            restart_zabbix_agent()
            logging.info("Zabbix Agent restarted with new ListenIP")
            return
    # Only reached when nothing was changed (the original logged this unconditionally).
    logging.info("zabbix_ip_watcher executed, no change.")


def wireguard_ip_watcher():
    """Main watchdog function."""
    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)
    if check_ip(TUNNEL_TEST_HOSTNAME):
        logging.info(f"{TUNNEL_TEST_HOSTNAME} is responsive")
    else:
        logging.warning(f"{TUNNEL_TEST_HOSTNAME} is not responsive. Reconnecting network.")
        leave_network()
        time.sleep(1)
        join_network()
        time.sleep(8)
        if check_ip(TUNNEL_TEST_HOSTNAME):
            logging.info(f"{TUNNEL_TEST_HOSTNAME} is responsive AGAIN")
    zabbix_ip_watcher()


if __name__ == "__main__":
    wireguard_ip_watcher()
It runs as a systemd service templated via Ansible/Jinja. A timer executes it every minute.
[Unit]
Description=WireGuard IP Watcher Service
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /manifests/{{ inventory_hostname }}/configuration/wireguard_ip_watcher/wireguard_ip_watcher.py
User=root
{% if customer_endpoint | default(false) %}
Environment=WIREGUARD_HOSTNAME={{inventory_hostname}}.endpoints
{% endif %}
[Install]
WantedBy=multi-user.target
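Note that the unit above sets a WIREGUARD_HOSTNAME environment variable (with an .endpoints suffix), while the script as posted builds the name from a hard-coded .customer_endpoints suffix and never reads the environment. Whether the script was meant to pick up that variable is not stated here; if it was, a minimal sketch could look like this. The helper name wireguard_hostname and its parameters are hypothetical.

```python
import os
import socket


def wireguard_hostname(default_suffix=".customer_endpoints", environ=os.environ):
    """Prefer WIREGUARD_HOSTNAME from the environment, else <hostname><suffix>."""
    override = environ.get("WIREGUARD_HOSTNAME", "").strip()
    if override:
        return override
    # Fallback mirrors the script: local hostname plus a fixed suffix.
    return socket.gethostname() + default_suffix
```

With this, WIREGUARD_HOSTNAME = wireguard_hostname() would follow the unit file when the variable is rendered and keep the current behaviour otherwise.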
The really weird thing is that so far I have the symptoms in only one of three networks.
But the servers are ALL set up from the same base Ansible playbook, so it should happen everywhere...
All servers now log with verbosity 4.
I'll catch the next failure with additional logs and update here.
I have the same issue.
I noticed that some peers' ports change randomly when the "could not decrypt message" error appears.
Running netclient pull restores the connection.
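One way to confirm the port changes is to snapshot the peer endpoints before and after the error appears and diff them. This is a rough sketch that relies on the output of wg show <interface> endpoints (one public key and one endpoint per line); the interface name netmaker and the function names are assumptions.

```python
import subprocess


def parse_endpoints(text):
    """Parse `wg show <iface> endpoints` output into {public_key: endpoint}."""
    endpoints = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            endpoints[parts[0]] = parts[1]
    return endpoints


def changed_endpoints(before, after):
    """Return {public_key: (old, new)} for peers present in both snapshots whose endpoint differs."""
    return {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }


def snapshot(interface="netmaker"):
    """Take a live snapshot; needs root and the wg tool. Interface name is an assumption."""
    output = subprocess.check_output(["wg", "show", interface, "endpoints"])
    return parse_endpoints(output.decode())
```

Calling snapshot() once while the tunnel is healthy and again when the error shows up, then passing both results to changed_endpoints, lists the peers whose address or port moved.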
In recent changes (including v0.24.3), some of the communication from netclient to the server moved from MQ messages to RESTful API calls. It is therefore unlikely that the "could not decrypt message" issue will happen on the Netmaker server side. Please keep an eye on it.
In this thread, the "could not decrypt message" issue was reported on the client side. Unfortunately, I have not been able to reproduce it yet. I added more debug logging for further investigation. It would be helpful if you could share more information when the issue happens again on the netclient side. When it occurs, please check the netclient.yml and servers.yml files in /etc/netclient/ to verify that traffickey and traffickeyprivate are in place and correct.
A code fix for the decrypt issue on the client side was added in v0.25.0. Please test and verify.
Updated all nodes and will report back. I had no real time to grep the logs over the last weeks. Unbelievable, sorry for that.
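The check suggested above can be scripted so it runs the moment the error is detected. The sketch below only looks for the two key names mentioned above and reports whether each has a value; it deliberately avoids a YAML parser because the exact file layout is an assumption here (the thread does not say which file holds which key, or whether the values are strings or byte lists). It cannot tell whether a key is correct, only whether it is present.

```python
import re
from pathlib import Path

# Key names come from the comment above; everything else about the layout is assumed.
KEY_PATTERN = re.compile(r"^\s*(traffickey|traffickeyprivate)\s*:\s*(.*)$", re.IGNORECASE)


def find_traffic_keys(text):
    """Return {key_name: has_value} for every traffic key found in the text."""
    lines = text.splitlines()
    found = {}
    for index, line in enumerate(lines):
        match = KEY_PATTERN.match(line)
        if not match:
            continue
        name = match.group(1).lower()
        inline = match.group(2).strip().strip("\"'")
        if inline == "":
            # The value may continue as a YAML list on the following line.
            following = lines[index + 1].strip() if index + 1 < len(lines) else ""
            has_value = following.startswith("-")
        elif inline in ("[]", "null", "~"):
            has_value = False
        else:
            has_value = True
        found[name] = found.get(name, False) or has_value
    return found


def check_files(directory="/etc/netclient"):
    """Scan netclient.yml and servers.yml; a missing file maps to None."""
    report = {}
    for filename in ("netclient.yml", "servers.yml"):
        path = Path(directory) / filename
        report[filename] = find_traffic_keys(path.read_text()) if path.exists() else None
    return report
```

Printing check_files() as root right after a "could not decrypt message" error gives a quick yes/no view to attach to a report, without pasting the key material itself.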
Contact Details
axzelmarin@gmail.com
What happened?
I've seen this happen multiple times in the past. The only fix for me was to reinstall everything, but I have a bunch of networks with almost 30 client machines and don't want to go through the same process again. Let me describe the issue to see if anyone here can help, since Discord does not allow me to post questions and my anxiety is through the roof with this:
I use CheckMK over the network that Netmaker creates. 9h ago the CheckMK host could not connect to the Netmaker server, and after rebooting both VMs the connection did not come back. In netclient on the Netmaker server I get this error:
Nov 22 18:16:01 netmaker netclient[918998]: {"time":"2023-11-22T18:16:01.50537199Z","level":"ERROR","source":"mqhandlers.go 199}","msg":"error decrypting message","error":"could not decrypt message, [139 199 75 75 156 222 13 97 14 169 >
On my laptop I use the WireGuard client with a client config generated by the dashboard; from this host I can't reach any client machine in the network.
2023-11-22 13:24:23.092 [NET] peer(o9jA…vISs) - Sending handshake initiation
I keep getting this message in the WireGuard client,
and in the dashboard, for all the networks, I have this:
which is the Netmaker server that is not reachable on any of those networks.
I've attached the last lines of the Docker containers' logs.
Version
v0.21.2
What OS are you using?
Linux
Relevant log output