No error message for non-supported multi WAN w/ single gateway IP setup

sjjh commented 1 year ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

[x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

I can create a multi-WAN setup with two (PPPoE) gateways which get their IP addresses via DHCP from the ISP and which both have the same upstream gateway IP. This setup is apparently not supported by FreeBSD (as multipath is disabled due to other issues), see: https://forum.opnsense.org/index.php?topic=34189.0 Although this setup is not supported, no error message or warning is shown in the Web GUI.

To Reproduce

Just create two gateways having the same upstream gateway address.

Expected behavior

It works or an error message is shown.

Describe alternatives you considered

Not using multi WAN.

Additional context

The resulting feature request would be to check the IP address of the respective gateway of the two gateways. In case both gateways have the same gateway IP address, show a big error message in the Web GUI and log an error message to the log file.

Environment

OPNsense 23.1.7_3-amd64 FreeBSD 13.1-RELEASE-p7 OpenSSL 1.1.1t 7 Feb 2023

fichtner commented 1 year ago

Quick test adding a second gateway...

The following input errors were detected:

The gateway IP address "10.3.0.2" already exists.

fichtner commented 1 year ago

If you talk about unspecified dynamic gateways delivered by the ISP at runtime.. I'm not even sure how and where to present that.

sjjh commented 1 year ago

Sorry, did not test it with a manual config. It's about DHCP (as stated in the initial post). Probably in most cases when a second gateway is created using DHCP, the connection will be established very quickly and thus the issue will be noticeable directly. I could therefore imagine presenting the error message on the gateway > single gateway screen. Even if the problem only occurs later, I would imagine someone will sooner or later look at the gateway screen and see an error message there. Additionally, I believe a log entry might be helpful.

fichtner commented 1 year ago

PPPoE and DHCP are distinct, but routers are provided in both cases. These routers are written to files on the disk:

# ls /tmp/*_router

To my knowledge the problem first and foremost is that you cannot simultaneously push traffic through both WANs if they have the same gateway address. The second one is dead. Default gateway switching still works, but since the single point of failure is your ISP gateway the failover point is rather moot.

So typically you only use such a setup if you want to bundle two connections in order to use doubled bandwidth. All these constraints may or may not apply to the case at hand. My reluctance here is adding an error to a functional setup as well. Some people don't mind or haven't noticed. Not sure where the sweet spot for this request shall be.

Cheers, Franco

sjjh commented 1 year ago

In our setup we are using a 1Gbit/s link for "most" traffic, and a 30Mbit/s link dedicated only to VoIP traffic (so no bundling to double the bandwidth). The gateway to use is selected by firewall / NAT rules (the internal VoIP traffic comes from a separate VLAN). This seems to work, and it does not look as if one interface is completely dead. We experience intermittent issues when the web-gateway goes down (e.g. after connection breakages or reboots): the web traffic then uses the VoIP-gateway and does not come back to the web-WAN it should use. (The web-gateway is marked as upstream, default, and has a higher priority (i.e. a lower number) than the VoIP-gateway.)

I understood (but might be wrong due to my limited understanding) that a setup with two gateways and only one upstream gateway address is not supported at all. Thus I thought that an error message would be helpful (it would at least have saved me quite some hours of research on the net). If one gateway is dead (even if people do not notice), it still sounds sensible to me to show an error message to make them aware of that fact. I, at least, wasn't aware of the root cause, and it cost me quite some time to research.

sjjh commented 1 year ago

Probably obvious, but in case it helps, yes, both files contain the same IP address:

$ ls /tmp/pppoe*_router
/tmp/pppoe1_router  /tmp/pppoe2_router
$ diff /tmp/pppoe1_router /tmp/pppoe2_router
$
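
The manual diff above could be generalized into a small check along the lines of the original request. This is only a sketch: it assumes the router files are named <interface>_router and each hold a single address, as in the listing above. The function name find_shared_routers and its directory argument are made up for illustration and are not part of OPNsense.

```shell
# Sketch only: report upstream router addresses shared by more than one
# interface. find_shared_routers is a hypothetical helper, not an OPNsense
# command. Assumes files named <interface>_router holding one address each.
find_shared_routers() {
    dir="${1:-/tmp}"
    for f in "$dir"/*_router; do
        [ -f "$f" ] || continue              # glob matched nothing
        ifname=$(basename "$f" _router)
        addr=$(head -n 1 "$f" | tr -d '[:space:]')
        if [ -n "$addr" ]; then
            printf '%s %s\n' "$addr" "$ifname"
        fi
    done | sort | awk '
        # Collect the interface names per address.
        {
            count[$1]++
            if ($1 in names) names[$1] = names[$1] "," $2
            else names[$1] = $2
        }
        # Print only addresses seen on two or more interfaces.
        END { for (a in count) if (count[a] > 1) print a, names[a] }
    ' | sort
}
```

Called as find_shared_routers /tmp, it prints one line per shared address followed by the interfaces using it, and prints nothing when every router address is unique. A non-empty result could be handed to logger to produce the log entry asked for in the initial post.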

fichtner commented 1 year ago

Do you have a gateway group set? Loss and delay triggers are broken currently, see #6231

Cheers, Franco

sjjh commented 1 year ago

No, no gateway group is used. We also disabled gateway monitoring: as we have no fallback anyway, it adds no value (and could potentially only lead to false positives).

fichtner commented 1 year ago

But you are using default gateway switching? I’m not sure how that works without proper monitoring.

sjjh commented 1 year ago

No, no switching at all. Just two gateways, for specific traffic:

GW_Internet_WAN (1Gbit/s) -> all the web surfing traffic, email, ...
GW_VoIP_WAN (30Mbit/s) -> only VoIP traffic (we have a PBX on premise, using SIP trunking)

Reason for that setup: phone calls should still work, even if surfing takes all the bandwidth. QoS, Shaping, ... did not work very well, thus we decided to use two separate gateways.

fichtner commented 1 year ago

Are both set to upstream gateway? Can you explain "web-gateway goes down" a little more?

Thanks, Franco

sjjh commented 1 year ago

Only the "web" gateway is set to upstream.

With "web-gateway goes down" I mean occasions such as power loss of the firewall, a disconnected cable, a reboot of the firewall, taking the gateway down in software, the gateway being forced down by a (false positive) gateway monitoring result, ... all situations in which the interface is not up. Not in all but in some of these cases we then have the issue described: all the traffic uses the other (VoIP) gateway and sticks there, even when both gateways are available again. My expectation was that as soon as the web-gateway comes up again, it will be used again (due to priority, marking as upstream, ...), but it is not. Often the only thing that helps is taking the VoIP-gateway down; after a while the traffic then switches back to the web-gateway.

fichtner commented 1 year ago

Ok, when the traffic is stuck on the VOIP WAN will this resolve it?

# /usr/local/etc/rc.filter_configure

If this doesn't work you could also try

# /usr/local/etc/rc.routing_configure

But I suspect the first one will work.

Cheers, Franco

sjjh commented 1 year ago

Will try, when I experience the problem next time, and report back.

sjjh commented 1 year ago

So, after overnight maintenance work by our ISP, which cut off the uplink, we had the same issue this morning: the web traffic was using the wrong gateway. I tried both # /usr/local/etc/rc.filter_configure and # /usr/local/etc/rc.routing_configure, and neither worked. I also tried reconnecting both gateways in the web UI under Interfaces > Overview > reload, which did not work either. Only editing the gateways under System > Gateways > Single (enabling the monitoring and reapplying the changes) helped to bring traffic back to the correct gateway.

fichtner commented 1 year ago

That seems to indicate gateway monitoring (dpinger) plays a bigger role in the decision here. It would perhaps appear dpinger is "stuck" on the second link. Have you tried to disable host routes for the gateways?

Can you share the gateway log during the event and fix?

The development version has improved gateway monitor handling and recovery, but perhaps due to the same gateway IP this might still be an OS problem of sorts.

Cheers, Franco

sjjh commented 1 year ago

Have you tried to disable host routes for the gateways?

sry, not sure. Can you point me to the setting you are talking about?

Can you share the gateway log during the event and fix?

root@fw:/var/log/gateways # ls -l
total 184
-rw-------  1 root  wheel  10557 Mar 30 13:55 gateways_20230330.log
-rw-------  1 root  wheel  57752 Mar 31 23:58 gateways_20230331.log
-rw-------  1 root  wheel  99903 Apr  1 20:56 gateways_20230401.log
-rw-------  1 root  wheel   3875 Apr 24 12:35 gateways_20230424.log
-rw-------  1 root  wheel    115 May 23 20:08 gateways_20230523.log
-rw-------  1 root  wheel    932 Jun 22 07:53 gateways_20230622.log
lrwxr-x---  1 root  wheel     39 Jun 22 08:01 latest.log -> /var/log/gateways/gateways_20230622.log
root@fw:/var/log/gateways # cat latest.log 
<12>1 2023-06-22T07:51:55+02:00 fw.example.com dpinger 29060 - [meta sequenceId="1"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.8.8  bind_addr n.n.n.1  identifier "GW_INTERNET_WAN_PPPOE "
<12>1 2023-06-22T07:51:55+02:00 fw.example.com dpinger 30434 - [meta sequenceId="2"] send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  alarm_hold 10000ms  dest_addr 8.8.4.4  bind_addr n.n.n.2  identifier "GW_VOIP_WAN_PPPOE "
<12>1 2023-06-22T07:52:48+02:00 fw.example.com dpinger 29060 - [meta sequenceId="3"] exiting on signal 15
<12>1 2023-06-22T07:52:48+02:00 fw.example.com dpinger 30434 - [meta sequenceId="4"] exiting on signal 15
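
For longer logs, the dpinger startup lines can be condensed to the fields that matter here. A rough sketch, assuming the "key value" layout seen in the excerpt above stays the same; summarize_monitors is a hypothetical helper name, not an existing tool.

```shell
# Sketch only: print "identifier bind_addr dest_addr" for each dpinger
# startup line read from stdin. Assumes the "key value" pairs shown in the
# log excerpt above; summarize_monitors is a made-up name.
summarize_monitors() {
    awk '
        /dpinger/ && /identifier/ {
            bind = dest = id = ""
            for (i = 1; i < NF; i++) {
                if ($i == "bind_addr")  bind = $(i + 1)
                if ($i == "dest_addr")  dest = $(i + 1)
                if ($i == "identifier") id   = $(i + 1)
            }
            gsub(/"/, "", id)    # strip the quotes around the name
            print id, bind, dest
        }
    '
}
```

It would be fed from the log file, e.g. summarize_monitors < /var/log/gateways/latest.log, and skips lines such as "exiting on signal 15" that carry no identifier.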

fichtner commented 1 year ago

It's a setting for each individual gateway: "Disable Host Route"

So after 07:52:48 the first line was up being used again? It's a bit strange since rc.routing_configure will also restart all monitors.

Cheers, Franco

sjjh commented 1 year ago

It's a setting for each individual gateway: "Disable Host Route"

Sorry, overlooked that one. It's not disabled. Shall I disable it for both and check if it makes a difference next time? (if there is a next time -- due to the ongoing issues we are currently considering to abandon the second gateway and just use one, as long as bandwidth permits it)

So after 07:52:48 the first line was up being used again?

Yes, after the mentioned steps in my above post the gw internet WAN worked again as expected.

fichtner commented 1 year ago

The fix via "apply" is relatively strange: in a nutshell, the GUI is calling /usr/local/etc/rc.routing_configure. Just to make sure gateway monitor is now disabled (option checked).

Yes you can check disable host route setting, but it only makes sense if monitor itself is enabled (option unchecked).

Cheers, Franco

sjjh commented 1 year ago

Just to make sure gateway monitor is now disabled (option checked).

It initially was disabled (option checked), I enabled it (to have a change I could apply), and then disabled it again (and applied again), for both gateways respectively.

Yes you can check disable host route setting, but it only makes sense if monitor itself is enabled (option unchecked).

Which it is not (the monitor is disabled). I'll nevertheless enable the "Disable Host Route" setting, as it cannot hurt, and we might see next time whether it makes any difference.

sjjh commented 1 year ago

FYI: We removed the second gateway (as it is not supported, as stated in the initial posting) to erase this as a root cause for other connection problems. Thus I will not be able to test/debug this any further. The initial bug/feature request is IMHO nevertheless valid, thus leaving this bug open. :)

syserr0r commented 1 year ago

We have something similar with 3 WAN links to the same ISP and currently with the same gateway address (this was not always the case):

WAN [priority 10]
WAN2 [priority 20]
WAN_VOIP [priority 100, 'upstream gateway' unticked in gateways]

We use gateway rules to enforce traffic from our VOIP server over the WAN_VOIP interface. We use similar rules for assigning certain traffic to certain interfaces. Remaining traffic goes over a gateway group balancing WAN and WAN2.

I was honestly not aware this was unsupported.

Things that I have noticed that might not be working:

- Gateway monitoring for WAN2 and WAN_VOIP shows 100% loss even though these links appear to be working. WAN seems OK.
  - This likely also "breaks" the gateway group, causing it to only use WAN and not really balance.
- Static routes added with a gateway of WAN2 or WAN_VOIP show as WAN in the routes status once applied (presumably the route is added by IP, not by interface, which is likely also why gateway monitoring is broken).

We are currently in contact with the ISP to see if we can get different gateway IPs assigned.

I am happy to provide some testing, although it is a production system, so I am wary of anything that might affect client connectivity.

I have now disabled gateway monitoring on WAN2/WAN_VOIP and will see what the impact (if any) is on the gateway groups.

AdSchellevis commented 1 year ago

Overlapping networks break normal (destination) routing constraints, this is an issue on most platforms. It's like instructing the mailman the same address is located at different locations, in which case a letter might be delivered randomly.
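
The mailman analogy can be made concrete with a toy model. The sketch below is not how the FreeBSD kernel stores routes; it only assumes a table keyed by destination address in which an existing entry is not replaced. The helper name build_route_table and the sample addresses are made up for illustration.

```shell
# Toy model only: a routing table keyed by destination address.
# Input lines on stdin: "<destination> <interface>".
# A destination that already has an entry keeps it; the later entry is
# rejected, similar in spirit to a second host route for a shared
# gateway address.
build_route_table() {
    awk '
        $1 in table {
            printf "rejected %s via %s (already via %s)\n", $1, $2, table[$1]
            next
        }
        {
            table[$1] = $2
            printf "added %s via %s\n", $1, $2
        }
    '
}
```

Feeding it the lines "192.0.2.1 pppoe1" and "192.0.2.1 pppoe2" adds the first entry and rejects the second, since the destination address alone cannot tell the two links apart.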

In theory it should be possible to define virtual overlapping networks using fibs (https://man.freebsd.org/cgi/man.cgi?query=setfib), but it comes with quite some constraints (the running application has to choose on which virtual network it lives). Unfortunately that's not a scenario that is easy to support from our end. If I'm not mistaken, the problem is similar on Linux, but solvable using VRF (https://docs.kernel.org/networking/vrf.html), which probably has similar challenges.

OPNsense-bot commented 11 months ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

No error message for non-supported multi WAN w/ single gateway IP setup #6576