opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Multi-WAN Failover fails on unplugging cable of WAN1 / WAN2 #4160

Closed pete1019 closed 4 years ago

pete1019 commented 4 years ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[X] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug

Multi-WAN failover fails after unplugging the cable of WAN1 / WAN2, even for a very short period. It then behaves strangely and does not go back to Tier1. Hitting save on any interface fixes the issue until the next time.

To Reproduce

  1. Follow this guide exactly: https://docs.opnsense.org/manual/how-tos/multiwan.html
  2. Create a failover group with WAN1 as Tier1 and WAN2 as Tier2.
  3. Unplug the cable of Tier1 for only 0.1 seconds and plug it back in.
  4. Cut the internet connection on WAN1 without pulling the cable (the link needs to stay up). The trouble shows up after that. Test further by cutting internet on Tier1 and Tier2 by some means other than pulling the cable, and check the behavior.

Expected behavior

Failover should work as set up (Tier 1, Tier 2), even when ports flap or when hardware connected to WAN1 or WAN2 reboots.
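To make the expectation concrete, here is a minimal Python sketch of the tier logic as described above. It is only a model for discussion, not OPNsense's implementation; the `Gateway` class, its fields, and `pick_active` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gateway:
    name: str  # hypothetical label, e.g. "WAN1"
    tier: int  # 1 = preferred, 2 = fallback
    up: bool   # outcome of whatever monitoring is in place


def pick_active(gateways: Sequence[Gateway]) -> Optional[Gateway]:
    """Return the reachable gateway with the lowest tier, or None if all are down."""
    candidates = [gw for gw in gateways if gw.up]
    return min(candidates, key=lambda gw: gw.tier, default=None)
```

With this model, a Tier1 link that flaps and comes back is selected again as soon as it is reported up, which is the "go back to Tier1" step that does not happen in this report.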

OPNsense 20.1.7 APU4 PC-Engines

I am willing to show the problems via anydesk or teamviewer.

THANK YOU!

arch1mede commented 4 years ago

This actually happened to me as well, you can read the post on it @ https://forum.opnsense.org/index.php?topic=17198.msg78209#msg78209

pete1019 commented 4 years ago

I tested again and can sum up:

Any physical un- and replugging of an Ethernet cable on the appliance (link going down and then back up) on any tier (EDIT: if on DHCP) will mess up failover afterwards. (FYI: I just saw Unbound DNS not running for a few seconds in the dashboard.)

If I use another switch to cut the internet (so no direct un- and replugging of the cable on the port), failover works as intended. But keep in mind: failover will fail in the future if, for example, a device connected to the Tier1 or Tier2 WAN interface reboots (link down, link up).

mimugmail commented 4 years ago

AFAIK this only happens on dual DHCP or dual PPPoE WANs. If you use static this will not occur.

arch1mede commented 4 years ago

Yeah, but it is very rare that you can get static IPs. Neither of my WAN providers has static IPs available, or the cost of getting one is not very enticing.

arch1mede commented 4 years ago

I also followed that same doc and I didn't see anywhere that static IPs were mandatory for this to function properly. It shouldn't matter though.

pete1019 commented 4 years ago

AFAIK this only happens on dual DHCP or dual PPPoE WANs. If you use static this will not occur.

I have dual DHCP. But why would this not be considered a bug? The same would apply to dual PPPoE.

I will test with one of them being static.

Update: Changed Tier2 from DHCP to static and pulled Tier1, which was still on DHCP. Same problem. I will test with both being static right now.

Update2: It seems un- and replugging on either tier with DHCP enabled causes this. With static it seems to be fine. But it's not about dual DHCP, it is about DHCP in general. Using DHCP on a WAN makes failover fail if that interface goes physically down and up again.

mimugmail commented 4 years ago

Can you let the modem do the dial-up so you can use a static IP behind it on OPNsense (for testing)? Same for DHCP.

Don't forget we are all volunteers here; not every developer has dual DHCP or dual PPPoE available for testing all the time.

pete1019 commented 4 years ago

Can you let the modem do the dial-up so you can use a static IP behind it on OPNsense (for testing)? Same for DHCP.

Don't forget we are all volunteers here; not every developer has dual DHCP or dual PPPoE available for testing all the time.

You would only need one DHCP WAN to reproduce this. See my update above. I will try to help as best I can.

I am not able to test PPPOE at the moment.

pete1019 commented 4 years ago

Final result:

WAN interface with DHCP will mess up Failover if it gets physically un- and replugged.

Easy to reproduce:

If I use static there is no such problem.

FYI: with DHCP I can see gateways disappearing on the dashboard. This does not happen with static.

If you are interested, this Windows batch file (.bat) prints your public IP again and again, so you can see which gateway you are really on.

@Echo off
:loop
SET /A XCOUNT+=1
echo %XCOUNT%
For /f %%A in (
  'powershell -command "(Invoke-Webrequest "http://api.ipify.org" -TimeoutSec 1).content"'
) Do Set ExtIP=%%A
Echo External IP is : %ExtIP%
goto loop 
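
For testing from a machine that is not running Windows, roughly the same loop can be written in Python. This is a sketch, not part of the original report: it assumes the same echo service (`http://api.ipify.org`) and the same one-second timeout as the batch file, and the function names `fetch_public_ip` and `poll` are made up for illustration.

```python
import time
import urllib.request
from typing import Callable, Iterator, Tuple


def fetch_public_ip(url: str = "http://api.ipify.org", timeout: float = 1.0) -> str:
    """Ask the echo service which address our traffic appears to come from."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def poll(fetch: Callable[[], str], count: int, delay: float = 0.0) -> Iterator[Tuple[int, str]]:
    """Yield (attempt number, address) pairs; lookup failures are reported, not raised."""
    for n in range(1, count + 1):
        try:
            ip = fetch()
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            ip = f"unavailable ({exc.__class__.__name__})"
        yield n, ip
        time.sleep(delay)
```

Iterating over `poll(fetch_public_ip, count=60, delay=1.0)` and printing each pair gives a counter plus the current external address, so a change of gateway shows up as a change of address.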

arch1mede commented 4 years ago

FYI: with DHCP i can see Gateways going down on Dashboard. This does not happen with static.

Isn't that the point though, detect a gateway down then switch over? If the gateways do not report down then this isn't a true failover, just a re-route.

pete1019 commented 4 years ago

FYI: with DHCP i can see Gateways going down on Dashboard. This does not happen with static.

Isn't that the point though, detect a gateway down then switch over? If the gateways do not report down then this isn't a true failover, just a re-route.

The whole list is not seen anymore, as if there are no gateways. I changed "going down" to "disappearing".

pete1019 commented 4 years ago

Any news on this? Thank you!

mimugmail commented 4 years ago

No, it's vacation time ..

pete1019 commented 4 years ago

No, it's vacation time ..

Do my tests help if someone has spare time? Thanks

ischilling commented 4 years ago

I have the same / similar issue.

Infrastructure

Both Internet connections have fixed IPv4/IPv6 addresses; however, using DHCP is mandatory.

Tests done

Problem

In addition, I found that the result is the same as unplugging the connection between OPNsense and the modem when I do not unplug anything manually but the connection breaks on the provider's side of the modem, or the provider has severe latency issues. With Kabel-Vodafone this has been a 'standard issue' since the beginning of this year.

Maybe this behavior is another bug, however, it may help to track down this issue since it isn't only annoying, it is simply rendering OPNsense in multi-wan environments for small-midsize-businesses worthless.

Side-Notes

All cards were Intel cards.

The issue persists.

ischilling commented 4 years ago

I can, btw, confirm that Multi-WAN on fixed IP between modems and OPNsense, as well as Multi-WAN on fixed IP between router (DC infrastructure) and OPNsense as described here, works like a charm. In the latter environment there is also no difference between copper and fibre, nor in speed (tested up to 40 Gbit with Mellanox cards).

pete1019 commented 4 years ago

No, it's vacation time ..

I was just wondering: where is "vacation time"? In Germany it starts at the end of June. And in times of Covid-19 everything is different, right? Thanks

mimugmail commented 4 years ago

This and last week was homeoffice-only, no chance to test, sorry.

pete1019 commented 4 years ago

This and last week was homeoffice-only, no chance to test, sorry.

But you are still at it so everything is fine. It's a big issue. Hope to hear from you soon. Thanks

arch1mede commented 4 years ago

There also needs to be a solution for those that do not have access to static IP's.

marjohn56 commented 4 years ago

It looks as if, because there is no 'interface down/up' when the cable is not removed, DHCP is not re-triggered. I'll TRY to re-create this locally, although I don't have multi-WAN per se.

pete1019 commented 4 years ago

It looks as if, because there is no 'interface down/up' when the cable is not removed, DHCP is not re-triggered. I'll TRY to re-create this locally, although I don't have multi-WAN per se.

Great, if you need help or inspiration: let us know. Just one of the two WANs needs to be DHCP and you can reproduce the issue easily.

marjohn56 commented 4 years ago

OK.. my test consists of this, and remember I'm running on 20.7b.

I have a test router set up as failover, getting its two WAN addresses from my main router: two LAN networks, so independent addresses. I have one LAN out of the test router and this is v4 only; v6 is disabled. Between the WAN input on my test router and the switch port carrying the primary router LAN I have added another switch, which we'll call 'Switch B'. This allows me to unplug that network without taking down the primary WAN interface on the test router. I think that pretty much matches what you are saying.

I've added some extra logging to the rc.syshook.d/monitor/10-dpinger script so I can see it being called; it echoes some junk to a log file for me.

It's working perfectly. I can unplug the input side of 'Switch B', leaving the WAN port of the test router connected, and watch the gateway loss increasing. I waited until it showed 100% loss, then checked my temp log file to see that the 10-dpinger script had been called, and did a tracert from my PC that confirmed the gateways had switched. I then left it for around 5 minutes; I went and made a cup of tea. :) I then reconnected the input to 'Switch B' and watched the loss start to decline. At around 30% the indicator turned amber, but I waited until it went green. I then checked whether the 10-dpinger script had run, and it had. I also did a tracert from my PC, which confirmed the gateway had switched back to the primary gateway. So it appears fine on v4 only.

I might add that on one occasion it took around 30 seconds for the route to switch back to the primary, but it did switch.

Are you using v6 too? Perhaps there might be an issue when dual stack is used.

pete1019 commented 4 years ago

No ipv6 here either.

Please keep this in mind: first unplug and replug the DHCP WAN on your OPNsense appliance. Then try your test again (Switch B) without unplugging and replugging the WAN on OPNsense.

Please see here (step by step): https://github.com/opnsense/core/issues/4160#issuecomment-641789848

Thanks!

marjohn56 commented 4 years ago

I have TWO opnsense appliances, primary router and test router. Which one do you mean?

pete1019 commented 4 years ago

I have TWO opnsense appliances, primary router and test router. Which one do you mean?

The one that is handling the Multi-WAN. The one you set up like this: https://docs.opnsense.org/manual/how-tos/multiwan.html

I am just using one Opnsense with two WAN and a dumb switch for testing as explained above.

marjohn56 commented 4 years ago

Did that... works fine.

pete1019 commented 4 years ago

Did anything change regarding this in the beta you are using? Could you please test with 20.7. (non Beta)? And are you using DHCP on the WAN-Interfaces?

marjohn56 commented 4 years ago

No, 20.7 is not out yet. Do you mean try with 20.1.7? :) Yes, can do that. It'll take me a few minutes to back up the configs and install that version. BBS

Yup, DHCP on both interfaces.

marjohn56 commented 4 years ago

Works fine on 20.1.1, now I'll update.

marjohn56 commented 4 years ago

20.1.7 appears to have a problem.. I'll see if I can try and find it.

mimugmail commented 4 years ago

If it worked with 20.1.1 and breaks with 20.1.7, only this patch can be the reason (and it should fix it rather than break it): https://github.com/opnsense/core/issues/3961

marjohn56 commented 4 years ago

That's in rc-linkup, which doesn't get called in this scenario as far as I can see. This is the case where the interface is still up, but the other side of the switch in the middle is down. I'm going to re-test 20.1.1 first to confirm that it was working.

marjohn56 commented 4 years ago

This is a strange one. I haven't been able to make it fail in 20.1.1, but 20.1.7 is a bit weird. As @pete1019 says, if you flip it a couple of times the secondary WAN monitor works fine, but there's no route to host from the PC. It gets as far as OPNsense but that's it. It really doesn't matter which way you do it either: the first changeover and back appears to work every time, after that it doesn't, even if the interface goes down and back up.

marjohn56 commented 4 years ago

@pete1019 can you re-open this please.

pete1019 commented 4 years ago

@pete1019 can you re-open this please.

This one here is still open.

I am not able to reopen this one because I am not the owner: https://github.com/opnsense/core/issues/3961

marjohn56 commented 4 years ago

The default route is missing..

marjohn56 commented 4 years ago

@pete1019 can you re-open this please. i am not able to open this one because i am not the owner: #3961

Sorry my bad... just saw closed. :)

marjohn56 commented 4 years ago

Yup, add the default route and it's working again. @pete1019 could you check that as well please. Do a netstat -4rW, and see if the default route vanishes.
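For anyone who wants to script that check while flapping the links, here is a small Python sketch that looks for the `default` entry in the routing table text. It assumes the usual FreeBSD `netstat -rn` column layout with a header line containing `Netif`; the function name `find_default_route` and the sample addresses used when trying it out are illustrative only.

```python
from typing import Optional, Tuple


def find_default_route(netstat_output: str) -> Optional[Tuple[str, str]]:
    """Return (gateway, interface) of the default route, or None if it is missing."""
    netif_col = None
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Destination" and "Netif" in fields:
            # Remember which column holds the interface name.
            netif_col = fields.index("Netif")
        elif fields[0] == "default" and len(fields) >= 2:
            known = netif_col is not None and netif_col < len(fields)
            return fields[1], fields[netif_col] if known else ""
    return None
```

Feeding it the captured output of `netstat -4rW` after each unplug and replug shows whether the default route is still present; a `None` result corresponds to the vanished route described above.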

pete1019 commented 4 years ago

Yup, add the default route and it's working again. @pete1019 could you check that as well please. Do a netstat -4rW, and see if the default route vanishes.

I am not able to; I don't have an appliance here. I could set up a VM again in the next few days. But I remember: just hitting save on any interface did the trick as well.

Maybe somebody else could try? @ischilling @arch1mede

Thanks

marjohn56 commented 4 years ago

Yes save interface would restore the gateway. So that appears to be the issue.

AdSchellevis commented 4 years ago

@marjohn56 the system log probably contains more details if there's a race between dhcpc and the link-up event somehow.

The "default gateway" switching calls inside the filter code: https://github.com/opnsense/core/blob/e2f6272957d8f3e60b107d3eca450929415de4cb/src/etc/inc/system.inc#L416

marjohn56 commented 4 years ago

I don't believe it to be a DHCP issue: when I disconnect the primary interface, the secondary interface is still up and running, with an address on the interface. It's just that the default gateway doesn't get added although the old one is removed.

AdSchellevis commented 4 years ago

usually this should leave some content in the logs, but since dhcpc is responsible for providing the gateway and it doesn't exist with a static address, it sounds quite related to me.

marjohn56 commented 4 years ago

Maybe I misunderstand then, I've never looked at failover before. My assumption was that both interfaces would have a dhcp assigned address from the ISP, in the event of the primary interface going down the secondary interface would already have the information and it would just be a case of setting the default route/filters - or am I misreading this?

AdSchellevis commented 4 years ago

I don't think you are, but at first glance there are only a couple of things that can go wrong: either the gateways are not known (see the files in /tmp/) or the process responsible for detecting a failure doesn't provide the correct signal (which would logically be dpinger). I haven't looked into this issue, but to me it doesn't look like https://github.com/opnsense/core/issues/3961 killed a feature; more likely it worked by accident (not switching where it was supposed to).

Personally, I would start looking at the events triggered in the (system) log, currently I don't have time to test this locally.

marjohn56 commented 4 years ago

I've checked that the 10-dpinger script is being called, at least I did on 20.7, so dpinger etc is doing its thing. I'm just trying to work my way through the gateways group stuff to work out what SHOULD happen when an interface goes down. Think I'll do a compare as this seems to be a regression but I could be wrong.

marjohn56 commented 4 years ago

Checked, dhclient is still running on the secondary interface. More mysterious still, whilst checking through the code I saw this:

if (isset($config['system']['gw_switch_default'])) {
  // When gateway switching is enabled, we might consider a different default gateway.
  // although this isn't really the right spot for the feature (it's a monitoring/routing decision),

I set that in general config - Allow gateway switching and it's worked every time now. Could that be the missing link?
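To check a saved configuration for that option without clicking through the GUI, something like the following Python sketch could be used. It assumes the PHP key `$config['system']['gw_switch_default']` corresponds to a `gw_switch_default` element under `system` in the exported config XML, and that the mere presence of the element means "enabled"; both are assumptions based on the snippet above, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def gateway_switching_enabled(config_xml: str) -> bool:
    """True if a <gw_switch_default> element exists under <system> (assumed layout)."""
    root = ET.fromstring(config_xml)
    return root.find("./system/gw_switch_default") is not None
```

Running it against the text of a configuration backup before and after ticking "Allow gateway switching" should show the value flip from False to True if the assumed layout is right.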

AdSchellevis commented 4 years ago

Euh, yes, if gateway switching isn't enabled it won't try to update the standard gateway. I would expect it to stay stale though (so in this case the question is which event led to the gateway removal).

marjohn56 commented 4 years ago

OK... well, I've also made sure that the DNS servers in general are NOT the same as those specified in the gateways, so I'm just leaving the gateways at the default address. If this all tests out (I'll do some more testing tomorrow), then I think we need to change the how-to, as it shows the Google DNS addresses. So it looks like you and @Franco don't need to worry about this; you can take a look at the IPv6 link-local monitoring issues. 😁