opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License
WireGuard/OpenVPN: tun naming collision #6566

Closed patschi closed 10 months ago

patschi commented 1 year ago

Describe the bug

It looks like there is a naming collision on /dev/tun interfaces when OpenVPN and WireGuard-go are used on the same OPNsense machine. I needed to configure my first OpenVPN server for connecting mobile clients, but after setting it up according to the OPNsense documentation, the start failed with:

<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="4"] MANAGEMENT: unix domain socket listening on /var/etc/openvpn/server1.sock
<28>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="5"] WARNING: using --duplicate-cn and --client-config-dir together is probably not what you want
<28>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="6"] NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="7"] Diffie-Hellman initialized with 4096 bit key
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="8"] Outgoing Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="9"] Outgoing Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="10"] Incoming Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="11"] Incoming Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
<27>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="12"] Cannot open TUN/TAP dev /dev/tun1: Device busy (errno=16)
<29>1 2023-05-16T19:16:13+00:00 fw openvpn 15222 - [meta sequenceId="13"] Exiting due to fatal error

So essentially the culprit is:

Cannot open TUN/TAP dev /dev/tun1: Device busy (errno=16)

I wanted to understand how the naming scheme works, so I checked the code. This seems to be the most interesting piece: https://github.com/opnsense/core/blob/bebf3a2a7c2de1eb102ff41aefc401d8e999e52f/src/etc/inc/plugins.inc.d/openvpn.inc#L461-L469

Here you can see that [mode][vpnid] is always used, so the first OpenVPN client in tun mode gets tun1.

As you can see in the logs above, in my case the $vpnid is 1:

unix domain socket listening on /var/etc/openvpn/server1.sock

Here in the code we can see it starts its numbering scheme with 1: https://github.com/opnsense/core/blob/bebf3a2a7c2de1eb102ff41aefc401d8e999e52f/src/etc/inc/plugins.inc.d/openvpn.inc#L220-L224

While this code looks for the next available vpnid, it only checks the configuration and not whether the backing /dev/tun device is actually free.
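To illustrate the gap, here is a minimal Python sketch (not the OPNsense PHP code). `next_vpnid` mirrors the config-only loop described above; `busy_tun_units` and `next_vpnid_avoiding_devices` are hypothetical helpers that additionally skip ids whose tun unit already exists.

```python
import re


def next_vpnid(used_vpnids):
    """Config-only allocation: return the first unused id, starting at 1."""
    vpnid = 1
    while vpnid in used_vpnids:
        vpnid += 1
    return vpnid


def busy_tun_units(dev_entries):
    """Extract unit numbers from names such as 'tun0' or '/dev/tun1'."""
    units = set()
    for name in dev_entries:
        match = re.fullmatch(r"(?:/dev/)?tun(\d+)", name)
        if match:
            units.add(int(match.group(1)))
    return units


def next_vpnid_avoiding_devices(used_vpnids, dev_entries):
    """Same loop, but ids whose tun unit already exists count as taken."""
    taken = set(used_vpnids) | busy_tun_units(dev_entries)
    return next_vpnid(taken)
```

With no OpenVPN instances configured and /dev/tun0 and /dev/tun1 present, the config-only loop yields 1 while the device-aware variant yields 2.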

The issue

The collision seems to occur with WireGuard, the wireguard-go implementation to be exact. To my knowledge, this specific userspace implementation uses those tun interfaces.

I have 2 VPN tunnels in total and both are backed by WG, therefore also two tun interfaces:

root@fw:~ # ls -lsh /var/run/wireguard/
total 1
1 srwx------  1 root  wheel     0B Sep  9  2022 wg1.sock
1 srwx------  1 root  wheel     0B Sep  9  2022 wg2.sock

root@fw:~ # ls -lsh /dev/tun*
0 crw-------  1 uucp  dialer   0x64 May 16 20:24 /dev/tun0
0 crw-------  1 uucp  dialer   0x66 May 16 20:24 /dev/tun1

So /dev/tun0 and /dev/tun1 are used by WireGuard. But my first OpenVPN server with VPNID=1 also wants to use /dev/tun1 as per above code, hence the VPN server fails.

To Reproduce

Steps to reproduce the behavior:

  1. Create 2 WG tunnels
  2. Create the first OpenVPN server ever (as vpnid is incrementing)

Expected behavior

The OpenVPN server to be created and able to start.

Describe alternatives you considered

When you create disabled placeholder OVPNs to artificially increase the VPNIDs, you can get the OVPN to work just fine - with the identical configuration:

<29>1 2023-05-16T20:35:34+00:00 fw openvpn 82961 - [meta sequenceId="16"] /usr/local/etc/inc/plugins.inc.d/openvpn/ovpn-linkup ovpns2 1500 1623 10.0.8.1 255.255.255.0 init
[...]
<29>1 2023-05-16T20:35:34+00:00 fw openvpn 82961 - [meta sequenceId="19"] Listening for incoming TCP connection on [AF_INET]IP:443
[...]
<29>1 2023-05-16T20:35:34+00:00 fw openvpn 82961 - [meta sequenceId="25"] Initialization Sequence Completed

Screenshots

n/a

Relevant log files

Relevant piece attached above.

Additional context

n/a

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 22.7.2-amd64

fichtner commented 1 year ago

This is a known issue stemming from an implementation of OpenVPN dating back decades. Tun behaviour hasn't changed and /dev/tunX devices are still not providing alias support. The whole thing is designed to fail and other VPNs using tun devices will inherently break this. I know that @AdSchellevis is rewriting OpenVPN in MVC at the moment and we might discuss options, but I'm a bit pessimistic about the outcome given the constraints that the tun driver gives us.

Cheers, Franco

patschi commented 1 year ago

Ah, good to know! I was able to find a few threads and a similar GitHub issue, but nothing saying that this specific behavior is a known issue.

Doing some brainstorming, I think the easiest way might be sharing $vpnid across WireGuard, OpenVPN and IPsec (for tun modes). So the first WG VPN is vpnid=1, the second VPN, being an OpenVPN one, then gets vpnid=2, and so on. The while loop could stay, identifying the next available, unused vpnid.

But granted, I have not checked how complicated or feasible the implementation would be.
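A minimal Python sketch of that shared-counter idea, assuming a hypothetical registry that maps each service name to the ids it already holds (no such registry exists in the code referenced above):

```python
def allocate_shared_vpnid(assigned):
    """Return the first id not used by any service.

    `assigned` is a hypothetical mapping of service name -> iterable of
    ids already in use, e.g. {"wireguard": [1], "openvpn": []}.
    """
    used = set()
    for ids in assigned.values():
        used.update(ids)
    vpnid = 1
    while vpnid in used:  # the same while loop, over the union of all ids
        vpnid += 1
    return vpnid
```

For example, with one WireGuard tunnel holding id 1, the next OpenVPN instance would receive id 2.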

fichtner commented 1 year ago

Yep, you'd need a shared implementation. I think zerotier, openconnect and others are also using the tun driver... Not sure how feasible this is depending on the individual service's constraints. And the last bit is having all of it on MVC, which OpenVPN currently is not, but at least it will be in the near future as mentioned earlier.

There might be another option using the "original" interface name which is actually shown by ifinfo (but oddly enough not ifconfig):

# ifinfo ovpnc1
Interface ovpnc1 (tun1):
[...]

So after rename we could still derive the actual tun/tap index number...
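As a rough Python sketch of that derivation: the parser below relies only on the header line format shown above (`Interface ovpnc1 (tun1):`); everything else about the ifinfo output is unknown here, and `original_device` is a hypothetical helper name.

```python
import re

# Matches the header line quoted above, e.g. "Interface ovpnc1 (tun1):"
IFINFO_HEADER = re.compile(r"^Interface (\S+) \((tun|tap)(\d+)\):")


def original_device(ifinfo_output):
    """Return (renamed_name, driver, unit) from ifinfo text, or None."""
    for line in ifinfo_output.splitlines():
        match = IFINFO_HEADER.match(line.strip())
        if match:
            return match.group(1), match.group(2), int(match.group(3))
    return None
```

Called on the output above, this would return `("ovpnc1", "tun", 1)`, from which the device node /dev/tun1 follows.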

Cheers, Franco

AdSchellevis commented 1 year ago

> There might be another option using the "original" interface name which is actually shown by ifinfo (but oddly enough not ifconfig):

For the new MVC version we could easily expose the device (tun/tap) number to the user in the advanced settings as well, we are storing and validating it anyway.

fichtner commented 1 year ago

The only challenge I see is that the device number is a dynamic value at device creation time (depending on how much else was configured in the meantime by other VPNs).

fichtner commented 1 year ago

In particular OpenVPN dev-node needs to be switched to a runtime value from ifinfo output which would probably fix most of the concerns...

AdSchellevis commented 1 year ago

I can take a look at that for the new version. I thought we could influence the number in /dev/ as well, but I'm probably wrong.

AdSchellevis commented 1 year ago

ah, yes you can:

ifconfig tun99 create

creates /dev/tun99

fichtner commented 1 year ago

That is what we do, but we fail to consider that it might already be used by something else so we cannot rely on $vpnid being the actual device node number.

fichtner commented 1 year ago

The workflow is not very complicated since we rename the OpenVPN device anyway before we start using it. We just have to give the right number to openvpn config as mentioned above.

AdSchellevis commented 1 year ago

Unfortunately openvpn can't map it back by itself, so it still feels a bit silly to tell it both which device to use and what it's linked to. But when you're able to choose the number, it would be easier to choose one that's likely not taken (start at 100, for example). The downside of asking for the number when generating the config is that the two are less loosely coupled (as you can't generate it statically anymore).

fichtner commented 1 year ago

Yep, fixed device range offset might be a good idea. Just offset $vpnid for tun/tap creation and done :)
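A minimal Python sketch of the offset idea. The value 100 is only the example mentioned in the thread, and `device_for_vpnid` is a hypothetical helper, not an existing OPNsense function.

```python
TUN_UNIT_OFFSET = 100  # assumption: "start at 100" was only given as an example


def device_for_vpnid(mode, vpnid, offset=TUN_UNIT_OFFSET):
    """Map a config-level vpnid to a device node in a reserved unit range."""
    if mode not in ("tun", "tap"):
        raise ValueError("mode must be 'tun' or 'tap'")
    return f"/dev/{mode}{vpnid + offset}"
```

With this mapping, vpnid 1 in tun mode becomes /dev/tun101, which stays clear of the low unit numbers that other tun users pick up first. It does not help if another service also creates units in the offset range.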

patschi commented 1 year ago

> Yep, you'd need a shared implementation.

Might be possible using the global $config variable here.

To add: how about having a pre-start hook when starting up OpenVPN? So instead of calling the openvpn binary directly, we call a script which does the preparation ahead of the start.

In other words: https://github.com/opnsense/core/blob/bebf3a2a7c2de1eb102ff41aefc401d8e999e52f/src/etc/inc/plugins.inc.d/openvpn.inc#LL933C19-L933C52

Before we call the binary, we can determine an unused /dev/tunX interface, rewrite the OpenVPN config accordingly and then start it.
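A rough Python sketch of such a rewrite step, assuming the generated config selects its device through a `dev-node` line (the option mentioned earlier in the thread). The set of busy units is passed in by the caller, and `rewrite_dev_node` is a hypothetical helper.

```python
import re


def rewrite_dev_node(config_text, busy_units, mode="tun"):
    """Point dev-node at the lowest unit that is not in busy_units."""
    unit = 0
    while unit in busy_units:
        unit += 1
    new_line = f"dev-node /dev/{mode}{unit}"

    lines = config_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if re.match(r"\s*dev-node\s", line):
            lines[index] = new_line
            replaced = True
    if not replaced:
        # No dev-node line present yet: append one.
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

Note that this is inherently racy: another service could claim the chosen unit between the check and the moment openvpn opens the device.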

I'm not sure how many VPN sessions the largest OPNsense instances out there have, or whether there might be scenarios where the /dev/tun numbering exceeds e.g. 100 because of other VPN sessions (for example application issues where tun devices get stuck).

fichtner commented 1 year ago

> To add: How's the idea in having a pre-start-hook when starting up OpenVPN?

OPNsense OpenVPN integration is a giant pre-start-hook ;)

But this is what Ad meant by dynamic data influencing the configuration file. I think the point is really that OpenVPN will not figure out the device node name itself. And the fixed interface offset is probably the easiest fix too.

OPNsense-bot commented 10 months ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.