opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License
3.39k stars 759 forks source link

IPsec config reload uninstalls manual SPD entries #6950

Closed Daggolin closed 1 year ago

Daggolin commented 1 year ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

When triggering an action on the web interface that leads to a new swanctl.conf file being written and the config being reloaded any IPsec tunnel relying on manual SPD entries (VPN > IPsec > Security Policy Database) stops working.

Additional context

I am not sure where the core issue originates from, but I have noticed the following:

The manual SPD entries used in our setup don't have an explicit Reqid set, they are associated to the phase 2 using the Connection child dropdown menu.

Expected behavior

Either ipsec_configure_spd should skip the removal of manual SPD entries for active phase 2 connections or it should invoke the same logic as ipsec/updown_event.py to restore them after their deletion.

The active IPsec tunnels that rely on manual SPD entries should keep functioning when the IPsec config gets updated. Adding a new PSK (VPN > IPsec > Pre-Shared Keys) or enabling/disabling an IPsec tunnel (VPN > IPsec > Connections; VPN > IPsec > Tunnel Settings [legacy]) should not remove the installed manual SPD entries and break tunnels.

Environment

Versions:

CPU type:

AdSchellevis commented 1 year ago

Which option was used to build the tunnels? The legacy tunnel one or the new connections?

Daggolin commented 1 year ago

The new connections. Currently in the process of migrating from legacy to the new connections. The legacy tunnels had a field to specify the SPD entries and those still work fine. The OPNsense instance still has several legacy tunnels running, but only the one running as new connection is using the VPN > IPsec > Security Policy Database manual entries (cause the new connections don't have an easy to use field like the legacy connections do). The issue only affects the new connections, because the legacy instances still use the field.

AdSchellevis commented 1 year ago

The new one is more flexible as it can also hook multiple clients, but this issue can likely be triggered when using both in combination. If you remove the ipsec_configure_spd() in the following line, does it work correctly then?

https://github.com/opnsense/core/blob/4bef809bd031f0aa3d55963e57a82d988fd2d45a/src/etc/inc/plugins.inc.d/ipsec.inc#L1613

I'm not sure I can offer a 100% solid fix, but preventing this from happening when manual spd's aren't used in legacy shouldn't be an issue.

Daggolin commented 1 year ago

Yes, I had tested this in the meantime to confirm the observations and conclusions I initially posted.

AdSchellevis commented 1 year ago

I'm thinking maybe we can just remove the items attached to any of the legacy tunnels, when a tunnel is dropped and the reqid isn't used, the garbage left shouldn't hurt anyone.

AdSchellevis commented 1 year ago

@Daggolin looking at the ticket and the code again, I think you're reusing the same reqid as being used in the legacy tunnel settings. According to the following code section it should only drop entries which belong to the legacy tunnel already:

https://github.com/opnsense/core/blob/4bef809bd031f0aa3d55963e57a82d988fd2d45a/src/etc/inc/plugins.inc.d/ipsec.inc#L770-L778

If you use a higher reqid (or link the entry to the correct child) it should work as expected as far as I can see.

Daggolin commented 1 year ago

@Daggolin looking at the ticket and the code again, I think you're reusing the same reqid as being used in the legacy tunnel settings.

As I said in the inital description of the issue, the shown reqid for the installed manual SPD entries is 1, which is the ID of another tunnel. So yes, a reuse is very likely happening under the hood, but it's not explicitly set by me to collide. I may have worded that unclear in my initial description of the issue. I have now manually checked swanctl.conf and swanctl -l to confirm this.

The new connection does not have an explicit reqid set for its child on the WebUI, because the help text suggests that OPNsense or swanctl is going to automatically pick incremental ones (This might be helpful in some scenarios, like route based tunnels (VTI), but works only if each CHILD_SA configuration is instantiated not more than once. The default uses dynamic reqids, allocated incrementally). Looking at the swanctl.conf file generated by OPNsense it seems that the new connection doesn't get an explicit reqid assigned by OPNsense, but it's rather left unset which implicates the default-value of 0 for reqid, causing swantctl to allocate a reqid incrementally on start. In our case - according to swanctl -l - this means that the new connection's child got the reqid 1 at swanctl restart/reconnect, colliding with the legacy tunnel that OPNsense explicity assigned the same id to. It seems the existence of legacy tunnels is enough to make the automatic value unreliable.

Manually assigning a high reqid to the child of the new tunnel solves the issue of SPD entries getting deleted. However, this still means that the use of automatically incrementally allocated reqid in new connections does not work reliable if you have even a single legacy connection configured (even if it’s disabled the WebUI might draw the wrong conclusions, haven’t tested that). Maybe OPNsense could handle those incrementally allocated reqid internally, taking legacy tunnels into consideration and explicity writing a reqid to the swantctl.conf file instead of leaving it up to swanctl by not setting it. I honestly think that mixed setups with legacy connections and new connections existing side-by-side are going to be quite common for a while when users migrate their existing legacy connections to new connections one by one.

AdSchellevis commented 1 year ago

I'm afraid we can't fix this, legacy pins the reqids in which case you will have to pin them in the connections as well. We can make a note somewhere in the documentation that there are constraints when using both options, but there's not much else we can do.

The reason for not pinning reqid's in connections in quite simple, when multiple clients connect, they all will use a unique reqid at strongswan's end unless you enforce anything. One other option you could try is to replace the lower numbers in the tunnels manually (they're stored in the config.xml) with higher ones so their less likely to collide.

Daggolin commented 1 year ago

The reason for not pinning reqid's in connections in quite simple, when multiple clients connect, they all will use a unique reqid at strongswan's end unless you enforce anything.

I was afraid something like that was the reason why you decided to pass no value (0) into the swanctl config in the first place.

One other option you could try is to replace the lower numbers in the tunnels manually (they're stored in the config.xml) with higher ones so their less likely to collide.

I had already considered this option, but I don’t think it offers any advantages over assigning high reqid to the new connections within the setup I am working with, but thanks for the suggestion.

We can make a note somewhere in the documentation that there are constraints when using both options, but there's not much else we can do.

It would be very helpful if the help text of the reqid field of the new connections would explicitly mention that using both, legacy and new connections, can cause issues and users should always assign a unique value themselves if they still have legacy tunnels set up.

Alternatively, as OPNsense explicitly assigns the values for legacy tunnels, OPNsense could automatically assign high numbers to them, but that would just shift the issue elsewhere and if you would automatically reassign existing config values during an update it could mess with users who already set manual values, potentially breaking their setup. So that‘s not a good option really.

As you said this is not something you can fix currently and as possible workarounds exist, feel free to close this issue if you don‘t need any infos or input from me unless you want to keep it around until documentation and/or possible button adjustments are done.

AdSchellevis commented 1 year ago

Let me think a bit about wording and close the issue later. I do agree explaining the constraints would be helpful, although our documentation is likely a better place to do that.