opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Multi-WAN Failover fails on unplugging cable of WAN1 / WAN2 #4160

Closed pete1019 closed 4 years ago

pete1019 commented 4 years ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md

[X] I have searched the existing issues and I'm convinced that mine is new.

Describe the bug

Multi-WAN failover fails after the cable of WAN1 / WAN2 is unplugged for a very short period. The gateway group then behaves unpredictably and does not go back to Tier 1. Hitting save on any interface fixes the issue until the next occurrence.

To Reproduce

  1. Follow this guide exactly: https://docs.opnsense.org/manual/how-tos/multiwan.html
  2. Create a failover gateway group with WAN1 as Tier 1 and WAN2 as Tier 2.
  3. Unplug the cable of Tier 1 for only 0.1 seconds and plug it back in.
  4. Cut the internet connection on WAN1 without pulling the cable (the link needs to stay up). The trouble appears after that. Test further by cutting the internet on Tier 1 and Tier 2 by some means other than pulling the cable, and check the behavior.

Expected behavior

Failover should work as set up (Tier 1, Tier 2) even when ports flap or when hardware connected to WAN1 or WAN2 reboots.
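The expected tier behavior can be sketched as a small selection rule: among the gateways currently reported as up, the one with the lowest tier should be chosen, and the choice should return to Tier 1 as soon as it is healthy again. This is only an illustration of the expectation, not OPNsense's actual code; the `pick_gateway` function and its input format are hypothetical.

```shell
#!/bin/sh
# Sketch of the expected failover choice, NOT OPNsense's implementation.
# Input lines use a hypothetical format: "<gateway-name> <tier> <up|down>"
pick_gateway() {
    # Keep only gateways that are up, sort by tier, print the first name.
    awk '$3 == "up" { print $2, $1 }' | sort -n | head -n 1 | awk '{ print $2 }'
}

# Both WANs healthy: Tier 1 should win.
both_up=$(printf '%s\n' 'WAN1 1 up' 'WAN2 2 up' | pick_gateway)
# WAN1 lost upstream connectivity: fall back to Tier 2.
wan1_down=$(printf '%s\n' 'WAN1 1 down' 'WAN2 2 up' | pick_gateway)
# WAN1 recovered: the selection should return to Tier 1.
recovered=$(printf '%s\n' 'WAN1 1 up' 'WAN2 2 up' | pick_gateway)

echo "both up:   $both_up"
echo "WAN1 down: $wan1_down"
echo "recovered: $recovered"
```

The bug described here corresponds to the third case: after a brief link flap, the real system reportedly stays on Tier 2 instead of returning to Tier 1.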

OPNsense 20.1.7 on a PC Engines APU4

I am willing to show the problems via AnyDesk or TeamViewer.

THANK YOU!

marjohn56 commented 4 years ago

I tested this some more, and after setting gateway switching to On it appears to behave itself. The OP and others need to confirm this.

pete1019 commented 4 years ago

You mean this? "When using Unbound for DNS resolution you should also enable Default Gateway Switching via System->Settings->General, as local generated traffic will only use the current default gateway which will not change without this option." From here: https://docs.opnsense.org/manual/how-tos/multiwan.html

Nope, this was always active in my tests on 20.1.7 and I can still reproduce the problem.

Please always check which gateway the traffic really uses, for example with a site like www.ipcheck.com.
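As a local complement to an external IP check, the system default route can be inspected on the firewall itself. The sketch below parses output in the style of FreeBSD's `route -n get default`; the sample text, the address 192.168.12.1, and the interface name igb1 are hypothetical placeholders, and `default_gateway` is a helper invented for this example.

```shell
#!/bin/sh
# Extract the gateway address from `route -n get default` style output.
default_gateway() {
    awk '$1 == "gateway:" { print $2 }'
}

# Hypothetical sample output. On a live FreeBSD-based system one would run:
#   route -n get default | default_gateway
sample='   route to: 0.0.0.0
destination: 0.0.0.0
       mask: 0.0.0.0
    gateway: 192.168.12.1
  interface: igb1'

current_gw=$(printf '%s\n' "$sample" | default_gateway)
echo "default gateway: $current_gw"
```

This shows only the system default route. Traffic steered by a gateway group through firewall rules may take a different path, so an external check of the public IP remains the more direct test.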

marjohn56 commented 4 years ago

How long are you waiting for recovery? On mine it takes around 60 seconds.

pete1019 commented 4 years ago

You mean 60 seconds after packet loss is back under 10%?

With a STATIC IP on WAN it is back instantly (once packet loss is under the threshold). I did not wait that long (60 seconds) in my tests.

marjohn56 commented 4 years ago

Interesting. OK, a bit of deeper delving may have got me somewhere. It appears that the call to configctl filter reload in 10-dpinger doesn't actually do anything; changing the line to /usr/local/sbin/configctl filter reload does.

@pete1019 - Try editing the file; you'll find it in /usr/local/etc/rc.syshook.d
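The edit described above can be sketched as follows. The example works on a scratch copy whose contents are a hypothetical stand-in, since the real 10-dpinger script is not shown in this thread; only the command `configctl filter reload` and the path `/usr/local/sbin/configctl` come from the comment above. It also assumes the command starts its line, possibly after indentation.

```shell
#!/bin/sh
# Demonstrates the edit on a scratch copy. The file contents below are a
# hypothetical stand-in, not the real 10-dpinger script.
workdir=$(mktemp -d)
hook="$workdir/10-dpinger"
printf '%s\n' '#!/bin/sh' 'configctl filter reload' > "$hook"

# Keep a backup, then prefix the bare command with its absolute path.
cp "$hook" "$hook.bak"
sed 's|^\([[:space:]]*\)configctl filter reload|\1/usr/local/sbin/configctl filter reload|' \
    "$hook.bak" > "$hook"

patched_line=$(grep 'filter reload' "$hook")
echo "$patched_line"
```

Because the pattern only matches a line that begins with the bare command, running it a second time leaves an already patched line unchanged. A manual edit like this may be replaced by a later update, so it is best treated as a stopgap until a fixed release is installed.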

mimugmail commented 4 years ago

Ok, I finally found the time to debug.

My test machine has 20.7b (ISO, not only UI). WAN1 is DHCP; its gateway has prio 251, is marked as upstream, and has monitoring enabled. WAN2 is static, 192.168.12.X, prio 255, marked as upstream, monitoring enabled. In System : Settings : General, default gateway switching is enabled. I do NOT use gateway groups or similar, just gateway switching. I shut the switchport where WAN1 sits (like unplugging the cable or a modem defect) and it fails over to static. I re-enable the port and it fails back to the DHCP gateway. I did this 3 times and it always set the correct gateway.

Can't reproduce.

pete1019 commented 4 years ago

Please do exactly as I stated here: https://github.com/opnsense/core/issues/4160#issuecomment-641789848

It looks like you always physically unplug and replug. Please do this only once, on the DHCP port. The second time, please use a dumb switch and cut the connection there, so you don't unplug the cable from the WAN port of OPNsense. It is important not to physically detach the cable again.

Also: I use a gateway group, just as explained in the official Multi-WAN tutorial: https://docs.opnsense.org/manual/how-tos/multiwan.html

The reason why it is so important for this to work: imagine your modem reboots for some reason. OPNsense will see a link down on your WAN. Later, only the internet fails (no link down and link up again) because your provider is down. It will switch to WAN2 but it will never (or not as soon as intended) switch back to WAN1.

mimugmail commented 4 years ago

I can't test this from home. Maybe next week when I get back to work.

pete1019 commented 4 years ago

No dumb little switch at home? But again: THANKS everyone for your time!

mimugmail commented 4 years ago

The machine is at work and needs cabling. But I'm happy that the gateway selection code is fine. I never use gateway groups, but we will see next week.

pete1019 commented 4 years ago

If I physically unplug and replug the WAN with DHCP, everything works fine for me as well. That's why it is important to do it like this: https://github.com/opnsense/core/issues/4160#issuecomment-641789848

Excited to see how your tests go next week.

marjohn56 commented 4 years ago

Yes, it DOES work fine if you physically unplug; that's because a WAN down/up event is triggered. If, however, there is an upstream failure and dpinger should do the detection, THAT is where the issue is. As I said, the reason it fails is what I pointed out in an earlier message: the problem is in 10-dpinger. It doesn't run 'configctl filter reload', nor does it write anything to the log to say it hasn't. If you give the full path to configctl then it does work.
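One plausible explanation for this pattern, not confirmed anywhere in this thread, is that the hook runs with a PATH that does not include /usr/local/sbin, so the bare command name cannot be resolved while the absolute path can. The sketch below reproduces that general effect with a hypothetical stand-in named `configctl_demo`; it does not call the real configctl.

```shell
#!/bin/sh
# Stand-in for configctl, installed in a directory that is NOT on the PATH
# used below. The name and its output are hypothetical.
bindir=$(mktemp -d)
printf '%s\n' '#!/bin/sh' 'echo "ran: $*"' > "$bindir/configctl_demo"
chmod +x "$bindir/configctl_demo"

# Bare name under a minimal PATH: the shell cannot find it (exit status 127).
bare_status=0
env PATH=/bin:/usr/bin /bin/sh -c 'configctl_demo filter reload' >/dev/null 2>&1 || bare_status=$?

# Absolute path: works regardless of PATH.
full_out=$(env PATH=/bin:/usr/bin /bin/sh -c "'$bindir/configctl_demo' filter reload")

echo "bare name exit status: $bare_status"
echo "absolute path output:  $full_out"
```

If the error output of such a failed lookup is discarded, nothing reaches the log, which would match the silent failure described above.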

pete1019 commented 4 years ago

Thanks, so who is able to fix that and release it? I think pfSense does not have this issue, as someone stated here before.

I think I should set up another test VM here. But I need to think about how to get a second WAN, since I don't have the LTE device here anymore. Can you please give more detailed instructions on what exactly I should do to test your fix? Log into OPNsense via SSH, open the file in "/usr/local/etc/rc.syshook.d" with nano, and change which line? Will this survive an update? Thanks

marjohn56 commented 4 years ago

You are, until an update is released. You can fix it yourself; I've posted how. I don't really see the relevance of pfSense in the conversation.

Now fixed and will be in the next release or you can patch it yourself.

pete1019 commented 4 years ago

So is this commit fixing the issue? Can anyone confirm? Thank you.

fichtner commented 4 years ago

Was fixed in 20.1.8 most likely. :)