opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

IPsec crashing #6308

Closed Charburner closed 1 year ago

Charburner commented 1 year ago

Describe the bug

IPsec / strongswan suddenly crashes as reported in the German forum: https://forum.opnsense.org/index.php?topic=31857.0

The ipsec service becomes unresponsive (in the web GUI and via the "ipsec" command over SSH), all tunnels fail, and the log just stops. I have found no hints for a root cause on my side so far, but other reports state that it is a problem with the "charon vici queue" (see links below). What helps is stopping and starting ipsec via SSH, or rebooting the server.

Other similar reports I found: https://forum.opnsense.org/index.php?topic=22224.0 https://forum.netgate.com/topic/172075/my-ipsec-service-hangs/68 https://forum.netgate.com/topic/165661/charon-becoming-unresponsive/33 https://redmine.pfsense.org/issues/13014

My system is actually running as a CARP failover cluster in Hyper-V, and the IPsec part was stable for about 4 months (Sept 22 - Jan 23) after I set it up. Back then I started with 1 SA and added more over the following months; currently there are 21. When I added the last 3 SAs at the beginning of January, the first crash occurred. Maybe that was some kind of threshold for my system regarding IPsec / charon / vici.

To Reproduce

This is quite difficult, and possibly the reason why there is no solution after about 1.5 years (referring to the oldest reports from pfSense). As far as I know there is no reliable way to reproduce it yet. Maybe have a system with at least 10-20 SAs, maybe some inactive tunnels but with keep-alive enabled and/or "interesting traffic", maybe a lot of manual attempts to start inactive tunnels. For example, I have some monitoring scripts in place which look at "ipsec status" or "ipsec statusall" and, depending on the tunnel, try to restart phase 1 / phase 2 connections with "ipsec up $con_id" or "swanctl --initiate --child $con_id".
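A stripped-down sketch of that idea (simplified; "con1" is just a placeholder name, and the ESTABLISHED match and binary paths assume the default FreeBSD/OPNsense layout):

#!/bin/sh
# Hypothetical example: re-initiate a child SA if its connection does not show as established.
con_id="con1"

# "ipsec status <name>" prints the IKE/child SA state for that connection.
if ! /usr/local/sbin/ipsec status "$con_id" | grep -q "ESTABLISHED"; then
    # try to bring the child SA up again via the vici interface
    /usr/local/sbin/swanctl --initiate --child "$con_id"
fi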

Expected behavior

The ipsec service should be rock solid, and even if some connections fail from time to time for different reasons, the daemon itself should never get into such a broken state that it fully stops working and possibly even blocks the web GUI from loading.

Describe alternatives you considered

Updated from 22.7.3 to 22.7.10.
Considered splitting the cluster into multiple independent servers to lower the crash chance by distributing all SAs evenly.

Environment

OPNsense 22.7.10_2-amd64
FreeBSD 13.1-RELEASE-p5
OpenSSL 1.1.1s 1 Nov 2022

Charburner commented 1 year ago

Forgot to mention related strongswan issues: https://github.com/strongswan/strongswan/issues/268 https://github.com/strongswan/strongswan/issues/566

AdSchellevis commented 1 year ago

At first glance I wouldn't expect these tickets to relate to the same issue; OPNsense 22.7.x is still using stroke.

Charburner commented 1 year ago

Sorry, I'm not fully aware of everything inside IPsec / strongSwan. I know there is a migration happening from ipsec.conf to swanctl.conf. Does this include the change from using stroke to using swanctl as the control facility (as in: the command to start/stop tunnels)? Maybe this is part of the problem, as I'm using both commands in some of my "repair" scripts? For example:

ipsec start con2
swanctl --initiate --child con4

fichtner commented 1 year ago

In general I would recommend trying a newer release before reporting an issue that might already be fixed. I've made the same comment in the forum thread.

Cheers, Franco

Charburner commented 1 year ago

Will do it tonight

Charburner commented 1 year ago

The server is now running OPNsense 23.1_6-amd64 :-) I did not migrate the tunnels (ipsec.conf) to connections (swanctl.conf) yet, though. I will report back if another crash occurs.

fichtner commented 1 year ago

The migration to swanctl.conf is automatic for old tunnels so underneath they work like the new connections.

Cheers, Franco

Charburner commented 1 year ago

Nice, good to know

Charburner commented 1 year ago

We just had another crash (after about 14 hours with 20 active SAs). All tunnels down, "Status Overview" in webgui empty, ipsec service not running.

netstat -Lan | grep charon.vici said:

unix 5/0/3 /var/run/charon.vici

ipsec stop said:

Stopping strongSwan IPsec failed: starter is not running

ipsec start said:

no files found matching '/usr/local/etc/strongswan.opnsense.d/*.conf'
Starting strongSwan 5.9.9 IPsec [starter]...
charon is already running (/var/run/charon.pid exists) -- skipping daemon start
no files found matching '/usr/local/etc/ipsec.conf'
failed to open config file '/usr/local/etc/ipsec.conf'
unable to start strongSwan -- fatal errors in config

In top I saw the charon process with sigwait state:

64269 root 17 52 0 82M 17M sigwai 3 5:24 0.00% charon

I killed the process and tried to start ipsec once again, which of course did not work, so I had to restart the server to fix the problem.

AdSchellevis commented 1 year ago

You need swanctl (https://docs.strongswan.org/docs/5.9/swanctl/swanctl.html) to control the daemon nowadays. Stop/start from the GUI should forcefully stop the daemon; if that doesn't work, you might also be looking at a kernel/driver/hardware issue, by the way.
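For reference, the rough swanctl counterparts of the old ipsec commands (connection names below are just placeholders):

# list established IKE and child SAs (roughly what "ipsec status" showed)
swanctl --list-sas

# reload connections and credentials from swanctl.conf
swanctl --load-all

# initiate or terminate a specific child SA
swanctl --initiate --child con4
swanctl --terminate --child con4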

Charburner commented 1 year ago

Stop/start from the gui did not work.

What can I do to further investigate the issue? Do you need me to add more detailed information, try specific commands, or change settings to get a clue?

AdSchellevis commented 1 year ago

When it stalls, I would certainly try to manually kill the processes with a kill -9 and see if that stops them. If not, a driver/hardware issue starts to sound more and more logical. What type of equipment is this?
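Something along these lines would do as a quick check (plain FreeBSD tools, process names as they show up in ps/top):

# find the charon processes
pgrep -lf charon

# try a normal kill first, then force it
killall charon
sleep 5
killall -9 charon

# if a kill -9 still leaves the process behind, kernel/driver/hardware becomes the prime suspect
pgrep -lf charon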

Charburner commented 1 year ago

It's a virtual machine running on a Hyper-V failover cluster (Microsoft Windows Server 2012 R2).

Charburner commented 1 year ago

Another crash today... netstat -Lan | grep charon.vici said

unix 1/0/3 /var/run/charon.vici

I was able to kill the process

daemon: /usr/local/libexec/ipsec/charon[88628] (daemon)

but for

/usr/local/libexec/ipsec/charon --use-syslog

I had to use kill -9. After that I was able to log in via the web GUI again and start the ipsec service there.

Charburner commented 1 year ago

There is a possible fix from the honorable Tobias Brunner (strongSwan) which at least seems to fix the issue on pfSense. Any chance we could get this into one of the next OPNsense patches? 🙏 ❤️ https://github.com/strongswan/strongswan/commit/f33cf9376e90f371c9eaa1571f37bd106cbf3ee4 https://github.com/strongswan/strongswan/issues/566

fichtner commented 1 year ago

Frankly, all we need is for StrongSwan to push this fix into a release.

danielrlauck commented 1 year ago

Hi, just to contribute: I have a pfSense running with about 25 site-to-site IPsec tunnels and 56 SAs, and I'm facing exactly the same issue. Version: pfSense Community Edition 2.6.0.

I have found a script to run via cron that checks the connection status and issues a service stop and then a start. This helped in some way, but the problem persists. I will continue waiting for a solution.

fichtner commented 1 year ago

I added said patch to a snapshot build for a customer... but it appears it's not helpful for them so I wouldn't get my hopes up.

# opnsense-revert -z strongswan

Commit in question is https://github.com/opnsense/ports/commit/73f93c86d2

Cheers, Franco

Charburner commented 1 year ago

I have found a script to run via cron that checks the connection status and issues a service stop and then a start. This helped in some way, but the problem persists. I will continue waiting for a solution.

The script helped me too as a workaround, I forgot to mention. Before I realized that the crashes depend on the total number of (phase 2) tunnels, I had to split up the tunnels from my OPNsense cluster (CARP) across what are currently 5 distinct servers. But one of them still has 19 connections which can't be separated, so there the cron script repairs the crash that still occurs once in a while. I have adapted the script from https://forum.netgate.com/topic/172075/my-ipsec-service-hangs/73 for OPNsense as follows.

#!/usr/local/bin/bash

log="/var/log/ipsec_restart.log"
queue=$(netstat -Lan | grep charon.vici | tr -s ' ' | cut -d' ' -f2)
queue_length=$(echo $queue | cut -d'/' -f1)
if [ $queue_length -gt 0 ]
then
    echo "$(date +%F\ %H:%M:%S) charon crash (vici queue: $queue)" >> $log
    /usr/bin/killall -9 charon
    sleep 5
    /usr/local/etc/rc.d/strongswan onestop; /usr/local/sbin/pluginctl -c ipsec
    sleep 5
    /usr/local/etc/rc.d/strongswan onestop; /usr/local/sbin/pluginctl -c ipsec
fi

fichtner commented 1 year ago

@Charburner maybe you want to test the snapshot build instead

Charburner commented 1 year ago

@fichtner Thanks for the build, I've already scheduled server maintenance for next weekend :-)

Charburner commented 1 year ago

I tried

opnsense-patch -r ports 73f93c86d2

but it said

Fetched 73f93c86d2 via https://github.com/opnsense/ports
I can't seem to find a patch in there anywhere.

opnsense-patch -l -r ports

73f93c86d24 security/strongswan: add patch for vici stalls

I never installed a single patch/commit before, maybe I did it wrong?

fichtner commented 1 year ago

Instructions have been posted. Trying to patch the ports tree is a bit ineffective and was never suggested anywhere.

Charburner commented 1 year ago

Oh yeah, I missed that part and instead searched Google for how to install a patch/commit on OPNsense... (https://forum.opnsense.org/index.php?topic=7537.0)

opnsense-revert -z strongswan

Patch successfully installed, reboot is scheduled for tonight. It might take several weeks to see whether another crash occurs.
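For what it's worth, verifying what actually got installed only needs the standard FreeBSD tools (just a sanity check, nothing OPNsense-specific):

# installed strongswan package version according to pkg
pkg info strongswan

# version reported by the strongSwan tools themselves
swanctl --version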

fichtner commented 1 year ago

Hmm, so does this mean no crashes?

rmundel commented 1 year ago

Almost lost it because of this bug...

The Dashboard became unresponsive because of the IPsec widget. (Async calls for widgets to mitigate bugs like this, maybe?)

I'm going to test this as well.

fichtner commented 1 year ago

You guys are funny. This was shipped to stable weeks ago.

rmundel commented 1 year ago

I thought you were waiting for his tests.

I was using the latest version (until this morning) and the crashes kept happening.

Going to update to 23.7 and try again.

fichtner commented 1 year ago

Ok so the reported commit was never the issue. Unlikely that it’s been fixed on 23.7.

Cheers, Franco

Charburner commented 1 year ago

Patch successfully installed, reboot is scheduled for tonight. It might take several weeks to see whether another crash occurs.

Hey guys,

last charon crash: 2023-06-15
patch installed: 2023-06-20

No crashes for seven weeks (on this system with 19 active phase 2 connections) - looks good to me. Of course I have no way to be really sure, since I can't reproduce a crash or test it actively.

fichtner commented 1 year ago

ok closing as per positive result from @Charburner who opened this ticket. If other issues still exist please make a separate ticket.

Charburner commented 1 year ago

Just to add more contextual information: before applying the patch, crashes occurred multiple times per month.

2023-04-02 18:49:01 charon crash (vici queue: 1/0/3)
2023-04-03 13:06:05 charon crash (vici queue: 2/0/3)
2023-04-06 23:51:55 charon crash (vici queue: 1/0/3)
2023-04-10 14:57:06 charon crash (vici queue: 1/0/3)
2023-04-12 08:51:15 charon crash (vici queue: 2/0/3)
2023-04-16 19:05:00 charon crash (vici queue: 1/0/3)
2023-04-16 23:56:00 charon crash (vici queue: 1/0/3)
2023-04-20 11:12:00 charon crash (vici queue: 1/0/3)
2023-04-23 14:01:00 charon crash (vici queue: 1/0/3)
2023-05-06 03:31:00 charon crash (vici queue: 1/0/3)
2023-05-13 05:02:00 charon crash (vici queue: 1/0/3)
2023-06-04 06:13:00 charon crash (vici queue: 1/0/3)
2023-06-09 18:56:00 charon crash (vici queue: 1/0/3)
2023-06-15 19:33:00 charon crash (vici queue: 1/0/3)

Marvo2011 commented 4 months ago

Hello, we have the exact same problem on OPNsense 24.4.1-amd64. Should it be fixed there?