Forgot to mention related strongswan issues: https://github.com/strongswan/strongswan/issues/268 https://github.com/strongswan/strongswan/issues/566
At first glance I wouldn't expect these tickets to relate to the same problem; OPNsense 22.7.x is still using stroke.
Sorry, I'm not fully aware of everything inside of IPsec / strongswan.
I know there is a migration happening from ipsec.conf to swanctl.conf.
Does this include the change from using stroke to using swanctl as the control facility (as in: the command to start/stop tunnels)?
Maybe this is part of the problem as I'm using both commands in some of my "repair" scripts?
For example:
ipsec start con2
swanctl --initiate --child con4
In general I would recommend to try a newer released version before reporting an issue that might already be fixed. I've made the same comment in the forum thread.
Cheers, Franco
Will do it tonight
The server is now running with OPNsense 23.1_6-amd64 :-) I did not yet migrate the tunnels (ipsec.conf) to connections (swanctl.conf) though. I will report back if another crash occurs.
The migration to swanctl.conf is automatic for old tunnels so underneath they work like the new connections.
Cheers, Franco
Nice, good to know
We just had another crash (after about 14 hours with 20 active SAs). All tunnels down, "Status Overview" in webgui empty, ipsec service not running.
netstat -Lan | grep charon.vici
said:
unix 5/0/3 /var/run/charon.vici
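(For context: with FreeBSD's netstat -L the second column should be the listen queue as qlen/incqlen/maxqlen, so a non-zero first number on charon.vici means clients are queued on the socket while charon is no longer accepting them. A small sketch to pull out just that number, which is also what the workaround script later in this thread keys off:)
# extract the current vici listen-queue length (first field of qlen/incqlen/maxqlen)
netstat -Lan | awk '/charon.vici/ { split($2, q, "/"); print q[1] }'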
ipsec stop
said:
Stopping strongSwan IPsec failed: starter is not running
ipsec start
said:
no files found matching '/usr/local/etc/strongswan.opnsense.d/*.conf'
Starting weakSwan 5.9.9 IPsec [starter]...
charon is already running (/var/run/charon.pid exists) -- skipping daemon start
no files found matching '/usr/local/etc/ipsec.conf'
failed to open config file '/usr/local/etc/ipsec.conf'
unable to start strongSwan -- fatal errors in config
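The "charon is already running" message is based on the pid file, so a quick way to see whether that PID still belongs to a live process is something like this (a sketch, assuming the /var/run/charon.pid path from the output above):
# does the PID recorded in charon.pid still belong to a running process?
pid=$(cat /var/run/charon.pid)
ps -p "$pid" -o pid,state,comm || echo "stale pid file: $pid"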
In top I saw the charon process in sigwait state:
64269 root 17 52 0 82M 17M sigwai 3 5:24 0.00% charon
I killed the process and tried to start ipsec once again, of course it did not work. So I had to restart the server to fix the problem.
You need swanctl (https://docs.strongswan.org/docs/5.9/swanctl/swanctl.html) to control the daemon nowadays. Stop/start from the GUI should forcefully stop the daemon; if that doesn't work, you might also be looking at a kernel/driver/hardware issue, by the way.
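For reference, a few of the swanctl calls that replace the old stroke-based ipsec commands (the connection names are just placeholders):
swanctl --list-conns                # show loaded connections
swanctl --list-sas                  # show active IKE and child SAs
swanctl --initiate --child con4     # bring up a phase 2 / child SA
swanctl --terminate --ike con2      # tear down an IKE SA including its children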
Stop/start from the gui did not work.
What can I do to further investigate the issue? Do you need me to add more detailed information, try specific commands, or change settings to get a clue?
When stalled, I would certainly try to manually kill the processes with a kill -9 and see if that stops them; if not, driver/hardware is starting to sound more and more logical. What type of equipment is this?
It's a virtual machine running on a Hyper-V failover cluster (Microsoft Windows Server 2012 R2).
Another crash today...
netstat -Lan | grep charon.vici
said:
unix 1/0/3 /var/run/charon.vici
I was able to kill the process "daemon: /usr/local/libexec/ipsec/charon[88628] (daemon)", but for "/usr/local/libexec/ipsec/charon --use-syslog" I had to use kill -9.
After that I was able to login via web gui again and start the ipsec service there.
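Roughly the recovery sequence, as a sketch (killall is what the workaround script below uses as well; the plain kill only worked for one of the two charon processes here):
# try a normal TERM first; the charon --use-syslog process ignored it and needed SIGKILL
/usr/bin/killall charon
sleep 5
/usr/bin/killall -9 charon
# then bring the service back up the OPNsense way
/usr/local/sbin/pluginctl -c ipsec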
There is a possible fix from the honorable Tobias Brunner (strongSwan) which at least seems to fix the issue on pfSense. Any chance we could get this into one of the next OPNsense patches? 🙏 ❤️ https://github.com/strongswan/strongswan/commit/f33cf9376e90f371c9eaa1571f37bd106cbf3ee4 https://github.com/strongswan/strongswan/issues/566
Frankly, all we need is for StrongSwan to push this fix into a release.
Hi, just to contribute: I have a pfSense running with about 25 site-to-site IPsec tunnels and 56 SAs, and I'm facing exactly the same issue. Version: pfSense community edition, version 2.6.0.
I have found a script to run via cron that checks the connection status and issues a service stop and then a start. This helped in some way, but the problem persists. Will continue waiting for a solution.
I added said patch to a snapshot build for a customer... but it appears it's not helpful for them so I wouldn't get my hopes up.
# opnsense-revert -z strongswan
Commit in question is https://github.com/opnsense/ports/commit/73f93c86d2
Cheers, Franco
I have found a script to run via cron that checks the connection status and issues a service stop and then a start. This helped in some way, but the problem persists. Will continue waiting for a solution.
The script helped me too as a workaround, I forgot to mention. Before I realized that the crashes depend on the total number of (phase 2) tunnels, I had to split up the tunnels from my OPNsense cluster (CARP) onto currently 5 distinct servers. But one of them still has 19 connections which can't be separated, so there the cron script repairs the crash that still occurs once in a while. I have adapted the script from https://forum.netgate.com/topic/172075/my-ipsec-service-hangs/73 for OPNsense as follows.
#!/usr/local/bin/bash
log="/var/log/ipsec_restart.log"
queue=$(netstat -Lan | grep charon.vici | tr -s ' ' | cut -d' ' -f2)
queue_length=$(echo $queue | cut -d'/' -f1)
if [ $queue_length -gt 0 ]
then
    echo "$(date +%F\ %H:%M:%S) charon crash (vici queue: $queue)" >> $log
    /usr/bin/killall -9 charon
    sleep 5
    /usr/local/etc/rc.d/strongswan onestop; /usr/local/sbin/pluginctl -c ipsec
    sleep 5
    /usr/local/etc/rc.d/strongswan onestop; /usr/local/sbin/pluginctl -c ipsec
fi
@Charburner maybe you want to test the snapshot build instead
@fichtner Thanks for the build, I've already scheduled server maintenance for next weekend :-)
I tried
opnsense-patch -r ports 73f93c86d2
but it said
Fetched 73f93c86d2 via https://github.com/opnsense/ports I can't seem to find a patch in there anywhere.
opnsense-patch -l -r ports
73f93c86d24 security/strongswan: add patch for vici stalls
I never installed a single patch/commit before, maybe I did it wrong?
Instructions have been posted. Trying to patch the ports tree is a bit ineffective and was never suggested anywhere before.
Oh yeah, I missed that part and instead searched google for how to install a patch/commit on opnsense... (https://forum.opnsense.org/index.php?topic=7537.0)
opnsense-revert -z strongswan
Patch successfully installed, reboot is scheduled for tonight. It might need several weeks to see if another crash occurs.
Hmm, so does this mean no crashes?
Almost lost it because of this bug...
The Dashboard became unresponsive because of the IPsec widget. (Async calls for widgets to mitigate bugs like this, maybe?)
I'm going to test this as well.
You guys are funny. This has been shipped to stable weeks ago.
I thought you were waiting for his tests.
I was using the latest version (until this morning) and the crashes kept happening.
Going to update to 23.7 and try again.
Ok so the reported commit was never the issue. Unlikely that it’s been fixed on 23.7.
Cheers, Franco
Patch successfully installed, reboot is scheduled for tonight. It might need several weeks to see if another crash occurs.
Hey guys,
last charon crash: 2023-06-15
patch installed: 2023-06-20
No crashes for seven weeks (on this system with 19 active phase 2 connections) - looks good to me. Of course I have no way to be really sure, since I can't reproduce a crash or test actively.
ok closing as per positive result from @Charburner who opened this ticket. If other issues still exist please make a separate ticket.
Just to add more contextual information: before applying the patch, crashes occurred multiple times per month:
2023-04-02 18:49:01 charon crash (vici queue: 1/0/3)
2023-04-03 13:06:05 charon crash (vici queue: 2/0/3)
2023-04-06 23:51:55 charon crash (vici queue: 1/0/3)
2023-04-10 14:57:06 charon crash (vici queue: 1/0/3)
2023-04-12 08:51:15 charon crash (vici queue: 2/0/3)
2023-04-16 19:05:00 charon crash (vici queue: 1/0/3)
2023-04-16 23:56:00 charon crash (vici queue: 1/0/3)
2023-04-20 11:12:00 charon crash (vici queue: 1/0/3)
2023-04-23 14:01:00 charon crash (vici queue: 1/0/3)
2023-05-06 03:31:00 charon crash (vici queue: 1/0/3)
2023-05-13 05:02:00 charon crash (vici queue: 1/0/3)
2023-06-04 06:13:00 charon crash (vici queue: 1/0/3)
2023-06-09 18:56:00 charon crash (vici queue: 1/0/3)
2023-06-15 19:33:00 charon crash (vici queue: 1/0/3)
Hello, we have the exact same problem on OPNsense 24.4.1-amd64. Should it be fixed there?
Describe the bug
IPsec / strongswan suddenly crashes as reported in the German forum: https://forum.opnsense.org/index.php?topic=31857.0
The ipsec service becomes unresponsive (in web gui and via ssh "ipsec" command), all tunnels fail, the log just stops. No hints for a root cause on my side so far - but other reports state that it is a problem with the "charon vici queue" (see links below). What helps is to stop and start ipsec via SSH or reboot the server.
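For reference, the stop/start via SSH boils down to the same two calls the workaround script earlier in this thread uses (a sketch, paths as on a stock OPNsense install):
# stop the hung strongswan service, then let OPNsense reconfigure and start ipsec again
/usr/local/etc/rc.d/strongswan onestop
/usr/local/sbin/pluginctl -c ipsec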
Other similar reports I found: https://forum.opnsense.org/index.php?topic=22224.0 https://forum.netgate.com/topic/172075/my-ipsec-service-hangs/68 https://forum.netgate.com/topic/165661/charon-becoming-unresponsive/33 https://redmine.pfsense.org/issues/13014
My system is actually running as a CARP failover cluster in Hyper-V, and the IPsec part was stable for about 4 months (Sept 22 - Jan 23) after I set it up. Back then I started with 1 SA and added more over the following months; currently there are 21. When I added the last 3 SAs at the beginning of January, the first crash occurred. Maybe that was some kind of threshold for my system with regard to IPsec / charon / vici.
To Reproduce
This is quite difficult and possibly the reason why there is no solution after about 1.5 years (referring to the oldest reports from pfSense). As far as I know there is no reliable way to reproduce it yet. Maybe have a system with at least 10-20 SAs, maybe some inactive tunnels but with keep-alive enabled and/or "interesting traffic", maybe a lot of manual attempts to start inactive tunnels. For example, I have some monitoring scripts in place which look at "ipsec status" or "ipsec statusall" and, depending on the tunnel, try to restart phase 1 / phase 2 connections with "ipsec up $con_id" or "swanctl --initiate --child $con_id".
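A minimal sketch of what such a per-tunnel check can look like (the connection name is a placeholder and the INSTALLED grep is an assumption about the status output; adapt to your own naming, and use full paths if run from cron):
#!/usr/local/bin/bash
# re-initiate one phase 2 / child SA if "ipsec status" no longer shows it as installed
con="con4"
if ! ipsec status "$con" | grep -q INSTALLED; then
    swanctl --initiate --child "$con"
fi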
Expected behavior
The ipsec service should be rock solid, and even if some connections may fail from time to time for different reasons, the daemon itself should never get into such a broken state that it fully stops working and even possibly blocks the web GUI from loading.
Describe alternatives you considered
Updated from 22.7.3 to 22.7.10
Considered splitting the cluster into multiple independent servers to lower the crash chance by distributing all SAs equally
Environment
OPNsense 22.7.10_2-amd64
FreeBSD 13.1-RELEASE-p5
OpenSSL 1.1.1s 1 Nov 2022