opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Improve multi-WAN failover resiliency: multiple IP monitoring per gateway before taking down, and auto DHCP renewal when gateway comes back up (when using virtual Linux Bridges from Proxmox as interfaces) #5866

Closed kwand closed 1 year ago

kwand commented 2 years ago


Is your feature request related to a problem? Please describe.

As you may have heard, there was a recent nationwide outage in Canada at one of our telecoms (Rogers), with very odd behaviour during the service restoration process. Thankfully, I have a second, slower connection through a different ISP (Bell); the outage finally convinced me to set up multi-WAN failover yesterday and link that connection to my OPNsense box.

However, I noticed some potential flaws in the process (which many others in the pfSense/OPNsense community seem to have hit in real-world failures), as well as problems when switching back from the secondary gateway (Bell) once the primary (Rogers) finally came back up this morning:

1) Single monitor IP per gateway. This seems a bit problematic as:

Describe the solution you like

  1. Use multiple monitor IPs per gateway to decide whether it is up or down. Multiple issues have been opened in both the pfSense and OPNsense repos/forums over the years (see #4163, https://redmine.pfsense.org/issues/1189, https://forum.opnsense.org/index.php?topic=27355.0, https://forum.netgate.com/topic/84721/hack-for-multiple-ips-for-gateway-monitoring/2, etc.), and multiple solutions have been proposed (by others more knowledgeable than me, so I will leave the details to them). I particularly like this solution, proposed in the pfSense forums years ago:

B brainloss Sep 15, 2015, 5:31 AM

I would like to see a "proper" solution. Single IP monitoring is causing us no end of issues. Gateways are being marked as down when really the monitor IP has disappeared, or ICMP is blocked but real-world TCP/UDP traffic is flowing perfectly.

My concept would include many IPs and have some weighted rules. Something like www.policyd-weight.org comes to mind. This would allow a list of, say, 20 IPs to monitor, allow for x of them to be down, and let some be marked with a higher "number value" than others; the gateway would then only be marked as down if the sum of these values is below y. The same IPs could even be used for many gateways, so that if one IP is down on one gateway it can be checked against another gateway.

I have no development skills, but would be willing to test and give feedback.

–Paul
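To make the weighted rule in the quoted post concrete, here is a minimal Python sketch. It is not how OPNsense or pfSense monitor gateways today; the names `Probe` and `gateway_is_up`, the weights, the threshold and the example addresses (taken from the documentation range 192.0.2.0/24) are all hypothetical. It also assumes that "the sum of these values" means the summed weights of the monitor IPs that currently respond.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Probe:
    """Result of checking one monitor IP through a given gateway (hypothetical type)."""
    address: str     # monitor IP that was probed
    weight: int      # how much this monitor counts towards the decision
    reachable: bool  # outcome of the most recent probe


def gateway_is_up(probes: Iterable[Probe], threshold: int) -> bool:
    """Treat the gateway as up while the reachable monitors' weights sum to at least `threshold`."""
    score = sum(p.weight for p in probes if p.reachable)
    return score >= threshold


# Example: one heavily weighted monitor still answers while two light ones do not.
probes = [
    Probe("192.0.2.1", weight=3, reachable=True),
    Probe("192.0.2.2", weight=1, reachable=False),
    Probe("192.0.2.3", weight=1, reachable=False),
]
still_up = gateway_is_up(probes, threshold=2)
```

With this rule a single monitor IP disappearing, or a single host starting to drop ICMP, does not take the gateway down by itself unless that monitor carries most of the weight.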

  2. Allow for an option in the Gateway settings to reload the DHCP services when a gateway goes down and comes back up.
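As a rough illustration of the DHCP request above, the Python sketch below fires a renewal hook only on a down-to-up transition of a gateway. The class name `GatewayWatcher`, the gateway name `WAN_DHCP` and the callback are hypothetical, and the actual renewal command is deliberately left to the caller because it depends on the system.

```python
from typing import Callable, Dict


class GatewayWatcher:
    """Remembers each gateway's last known state and reports recoveries (hypothetical helper)."""

    def __init__(self, on_recovered: Callable[[str], None]) -> None:
        self._on_recovered = on_recovered
        self._last_state: Dict[str, bool] = {}

    def update(self, gateway: str, is_up: bool) -> None:
        """Record the latest state; call the hook only when the gateway goes from down to up."""
        was_up = self._last_state.get(gateway)  # None if the gateway has not been seen before
        self._last_state[gateway] = is_up
        if was_up is False and is_up:
            self._on_recovered(gateway)


# Example: collect recoveries in a list instead of really renewing a DHCP lease.
renewals = []
watcher = GatewayWatcher(on_recovered=renewals.append)
watcher.update("WAN_DHCP", True)   # first observation, nothing to renew
watcher.update("WAN_DHCP", False)  # gateway goes down
watcher.update("WAN_DHCP", True)   # gateway recovers, hook is called once
```

Keying the hook on the monitored gateway state rather than on interface link state matters in the virtual-bridge setup described in this issue, where the link may never be reported as down.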

Describe alternatives you considered

I'm sure that I could put together 'hacks' using scripts and cron jobs to achieve what I want, but I would much prefer a GUI solution as this doesn't seem too complicated to implement. (Personally, I don't have much experience with networking CLI tools and FreeBSD. I'm more comfortable with OpenWRT (especially their uci command) and Linux networking - there's actually also a very easy way to achieve 1) with mwan3 in OpenWRT)

Running OPNsense on bare metal to fix 2) (so that interface link states are reported properly) is not an option, as I also run transparent OpenWRT VMs for CAKE traffic shaping (CAKE isn't available in FreeBSD yet).

Additional context

See:

ThomasTr commented 2 years ago

Hi, I'm currently investigating whether my problem correlates with yours. One of my WAN connections is a Zyxel 5G modem (NR7101), which failed twice in the middle of the night over the last few days and didn't recover by itself. I tried a few things (unplugging and restarting the modem...); the last resort was to reboot the firewall. Perhaps it has to do with DHCP renewal; I need to investigate further.

OPNsense-bot commented 1 year ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.