MultiWAN / Gateway group connectivity issues since OPNsense upgrade

Rapterron commented 3 years ago

Dear all,

As suggested by your bot in ticket #5089, I am opening a new issue: I am affected by the same problem and would like to raise its priority by following your template and providing a detailed bug report. Please also take into account the feedback in ticket #5089, initially opened by "Malunke".

[X] I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[X] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Note: as suggested by OPNsense-bot, I opened this as a new issue to get more traction, since losing the multi-WAN and load-balancing capability of OPNsense is a major issue.

Describe the bug

Load-balanced multi-WAN routing across 2 Internet gateways stops working after anywhere from a few minutes to several hours. After a reboot, the firewall works again for some time.

Detailed report

My setup:

OPNsense is installed on a hardware platform with a Celeron CPU and Intel NICs. For troubleshooting I also installed OPNsense on a second, spare device, which shows the same issue (read more below).

I have 2 Internet gateways from 2 different ISPs, each terminating on an AVM Fritzbox (a widely used xDSL modem and router). OPNsense sits behind both AVM VDSL routers, connected via 2 separate VLANs. Between OPNsense and each gateway router is a small transfer network with statically assigned IPs. Traffic from OPNsense to the gateway routers is plain IP routing, with no NAT on this leg. NAT happens later, on the gateway VDSL routers towards the Internet.

In the configuration, both gateways are in the same tier of one gateway group, and traffic is routed through that group via a floating firewall rule. To keep sessions consistent, I enabled the option to lock sessions under MultiWAN in the advanced firewall settings, with a custom time of 300 seconds.

This setup had been working fine for years (it was set up in mid-2019). Since the update to OPNsense 21, I experience sporadic Internet outages from all internal networks. While the bug is triggered I can still ICMP-ping endpoints behind the gateways, but nothing else works on any client. It seems as if OPNsense completely mixes up the sessions and the gateway allocation. After a restart of the whole firewall, it works again for anywhere from a few minutes to a few hours (no pattern recognized yet).

Steps taken for troubleshooting.

Reviewed and cleaned up the configuration and firewall rules. Checked both gateway routers and made sure the Internet connection is up on each. Used debugging tools on OPNsense while the bug was triggered, such as pinging the gateways and running a health audit. Reinstalled the firewall from scratch on the same hardware and imported my configuration. Reinstalled the firewall on a spare hardware device and imported the configuration.

Since this is a production firewall, I need to be careful and document every step. As I could not find a solution with multi-WAN enabled, my next step is a full downgrade on the spare firewall. Once I can confirm the last working version, I will update you here.

Current version: OPNsense 21.1.8_1-amd64
Last known working version: OPNsense 20.xx (exact subversion unclear, as I applied several updates at once)

To Reproduce

Put 2 Internet gateways in the same tier of a gateway group and use that group in a floating rule.

Expected behavior

Load balancing works in the current version as it did before the upgrade.

Describe alternatives you considered

Disable load balancing by changing the rule to use only one gateway. -> Works, but without the feature.
Downgrade OPNsense. (The exact last working version still needs to be determined, which is difficult because the issue is sporadic and I don't want to take my whole network offline several times.)

Environment

Main firewall: a server-grade Intel Celeron CPU with Intel NICs, on hardware made for firewalls. (I do not have the exact model in mind, but it should not matter, as the issue also occurs on the spare firewall.)

Spare firewall (currently in use), with the same issue:
Intel(R) Atom(TM) CPU N450 @ 1.66GHz (2 cores)
OPNsense 21.1.8_1-amd64
FreeBSD 12.1-RELEASE-p19-HBSD
OpenSSL 1.1.1k 25 Mar 2021

I hope this information is helpful. Please let me know if you need any further details.

Rapterron commented 2 years ago

Hello @AdSchellevis,

My apologies for not responding over the last 2 weeks.

Yes, I know policy-based routing can be very tricky. To rule out all other possible network setup errors, I set up a clean environment with only one rule. It is a very simple setup: LAN and 2x WAN in a separate VLAN, with no connection to the production network, and 3 test devices (2x Android phones, 1x Windows client).

After the kernel update it was still working. My next plan is to upgrade all other modules step by step, as you mentioned, until the issue is triggered. I need some more time for this, as each step of updating and testing takes a few hours.

As a quick side note: I also have a virtual installation of the latest OPNsense, which I upgraded recently, and the issue is still present there.

This may also be helpful: I have a pfSense test installation based on the very latest version, and in a similar setup multi-WAN load balancing works there. I don't know how many modules OPNsense and pfSense have in common, but maybe this helps you narrow down the root cause.

Anyway, I will let you know once I have updated step by step.

Cheers Christian

Rapterron commented 2 years ago

Hello all.

Over the last week I patched my test instance version by version through the 20.x releases. After each step I tested for at least 1 to 2 days.

By accident I made a discovery which might help resolve this issue.

While patching from 20.1.9_1.... towards 20.7.5 (a version which had shown the issue before), it was still working, and I wondered what had happened and why it was working now. Later I remembered my other test: my configuration from it was still there (I guess I forgot to save).

Luckily, this mistake may lead to the root cause and the solution.

I had disabled the option "Use sticky connections" in the advanced firewall settings UI. [screenshot]

It seems this resolved the connectivity issues.

Right now I am on 20.7.5 and it still works. My other test container with the latest 21.7.3_3 now works as well.

To double-check: when I turn sticky connections back on, the bug is triggered, and within a few minutes to a couple of hours connectivity breaks and I need to reboot the firewall.

I will continue testing with this setting and try to identify possible downsides, but for now this seems to be the best starting point for checking the code.

Cheers Christian

AdSchellevis commented 2 years ago

@Rapterron if you're experiencing issues with sticky connections, I would inspect the size of the source tracking table at the time you have issues. From the command line: pfctl -vvsinfo

Rapterron commented 2 years ago

Hello @AdSchellevis and thank you for your ongoing support!

The issue was triggered this evening, and I left it in that state for some time. 2 of 3 devices reported no Internet and were unable to do anything online.

The output is attached: pfctl -vvsinfo.txt

Note: this was in my 3rd test container, running OPNsense 21.7.3_3. If needed, I can switch to another version.

Cheers Christian

AdSchellevis commented 2 years ago

@Rapterron The number of current source entries is far too small to be problematic.

Source Tracking Table
  current entries                        5
  searches                            3581            0.1/s
  inserts                               57            0.0/s
  removals                              52            0.0/s
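
For anyone who wants to track these counters over time rather than read them by eye, here is a minimal Python sketch that extracts the Source Tracking Table counters from saved pfctl -vvsinfo output. It assumes the section layout shown above (an unindented header followed by indented counter lines). The function name parse_source_tracking is made up for this illustration, and the sample text is the excerpt quoted above.

```python
import re

# Excerpt quoted in this thread; a full `pfctl -vvsinfo` dump contains more sections.
SAMPLE = """\
Source Tracking Table
  current entries                        5
  searches                            3581            0.1/s
  inserts                               57            0.0/s
  removals                              52            0.0/s
"""


def parse_source_tracking(text):
    """Return {counter name: int} for the 'Source Tracking Table' section."""
    counters = {}
    in_section = False
    for line in text.splitlines():
        if line.strip() == "Source Tracking Table":
            in_section = True
            continue
        if in_section:
            # Counter lines are indented; a blank or unindented line ends the section.
            if not line.startswith(" ") or not line.strip():
                break
            # Name and value are separated by a run of at least two spaces.
            match = re.match(r"\s+(.+?)\s{2,}(\d+)", line)
            if match:
                counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    print(parse_source_tracking(SAMPLE))
```

Logging the "current entries" value at the moment clients lose connectivity would show whether the table is anywhere near its limit.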

When a couple of devices report loss of connectivity, the next thing I would do is check whether there are still states assigned to these machines (Firewall->Diagnostics->States). If so, kill them and try again.

FlavioDF commented 2 years ago

I also have the same problem.

OPNsense 21.7.3_3-amd64
FreeBSD 12.1-RELEASE-p20-HBSD
OpenSSL 1.1.1l 24 Aug 2021

Is a fix planned?

Thanks in advance

Rapterron commented 2 years ago

Hello all, hello @AdSchellevis,

Sorry for the delayed answer.

Apparently the issue kicks in almost instantly when I activate sticky connections. Under Firewall->Diagnostics->States I can see the states, and indeed, when I clear them, connectivity works again. I will keep an eye on the stats and check whether a reset resolves the issue, and for how long.

As a further note for everyone else reading: I have found a workaround for my setup by disabling sticky connections. I have not experienced any side effects yet; if I do, I would either tune the gateway priority or route via explicit firewall rules.

Cheers Christian

tsouza85 commented 2 years ago

OPNsense 21.7.5 (amd64/OpenSSL) FreeBSD 12.1-RELEASE-p21-HBSD

I have the same problem... I fix it by turning off sticky... it's a bug and no developer admits it...

AdSchellevis commented 2 years ago

> Sorry for the delayed answer.
>
> Apparently the issue kicks in almost instantly when I activate sticky connections. Under Firewall->Diagnostics->States I can see the states, and indeed, when I clear them, connectivity works again. I will keep an eye on the stats and check whether a reset resolves the issue, and for how long.

@Rapterron no problem, it has been busy here too. The problem so far is that we still don't know in which version the behaviour changed, which makes it nearly impossible to make much sense of the reports, to be honest. Currently it doesn't look like there is any difference between the versions tested.

> Under Firewall->Diagnostics->States I can see the states, and indeed, when I clear them, connectivity works again. I will keep an eye on the stats and check whether a reset resolves the issue, and for how long.

If you kill sessions selectively for the machine that doesn't seem to have Internet anymore, does that also solve the issue? If so, what is the "state" of the related states? (The output of pfctl -s Source -vv might be useful too.) Also, is there any change in connectivity on the upstream connections? When states are pinned to an upstream gateway which isn't responding anymore, we may expect similar behaviour.
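
The last point, a source pinned to a gateway that has stopped responding, can be illustrated with a toy model. The Python sketch below is not pf's implementation: the class StickyBalancer, the gateway names and the client addresses are hypothetical, and the rule that every use refreshes the pin is a simplifying assumption. It only shows why a client that keeps retrying stays on the same gateway until its pin is cleared.

```python
import itertools


class StickyBalancer:
    """Toy model: round-robin over gateways with per-source pinning (TTL in seconds)."""

    def __init__(self, gateways, ttl=300):
        self.ttl = ttl
        self._cycle = itertools.cycle(gateways)
        self._pins = {}  # source address -> (gateway, expiry time)

    def pick(self, source, now):
        pin = self._pins.get(source)
        if pin and pin[1] > now:
            gateway = pin[0]             # pin still valid: reuse the same gateway
        else:
            gateway = next(self._cycle)  # no pin or expired pin: take the next gateway
        # Assumption of this model: every use extends the pin.
        self._pins[source] = (gateway, now + self.ttl)
        return gateway

    def clear(self, source):
        """Drop the pin, loosely analogous to killing the source's tracking entry."""
        self._pins.pop(source, None)


if __name__ == "__main__":
    balancer = StickyBalancer(["WAN1_GW", "WAN2_GW"], ttl=300)
    print(balancer.pick("10.0.0.10", now=0))    # first client
    print(balancer.pick("10.0.0.20", now=0))    # second client gets the other gateway
    print(balancer.pick("10.0.0.20", now=299))  # retry inside the TTL: same gateway
    balancer.clear("10.0.0.20")
    print(balancer.pick("10.0.0.20", now=300))  # after clearing: a fresh choice
```

In this model, if the second gateway stops forwarding, the second client keeps failing for as long as it retries within the timeout, which is consistent with the observation that clearing states restores connectivity.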

@tsouza85

> I have the same problem... I fix it by turning off sticky... it's a bug and no developer admits it...

Very helpful, what did you do to help track your issue and deliver a reproducible test case?

Rapterron commented 2 years ago

Hello all,

First of all, for everyone who runs into this problem:

The workaround is to disable sticky connections. [screenshot]

This has the effect (and benefit) that multi-session TCP connections will use all gateways in the group, which sometimes results in more bandwidth, but it can also lead to unexpected side effects. In case of issues, I recommend creating several gateway groups with multiple tiers (failover setup) and routing your clients / networks through those groups via static rules. For example:
Servers: GW A as default, GW B as failover
Clients: GW B as default, GW A as failover
...
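
The tier logic behind such groups can be sketched in a few lines of Python. This is a conceptual model only, not OPNsense code: the function active_gateways and the group dictionaries are hypothetical, and it assumes that the lowest-numbered tier with at least one online gateway is used and that gateways sharing that tier are balanced.

```python
def active_gateways(group, online):
    """Return the gateways of the lowest-numbered tier that has an online member.

    group:  dict mapping gateway name -> tier number (1 = preferred)
    online: set of gateway names currently considered up
    """
    for tier in sorted(set(group.values())):
        members = [gw for gw, t in group.items() if t == tier and gw in online]
        if members:
            return members
    return []  # every gateway in the group is down


# Failover groups as suggested above (names are placeholders).
SERVERS_GROUP = {"GW_A": 1, "GW_B": 2}
CLIENTS_GROUP = {"GW_B": 1, "GW_A": 2}
# Load-balancing group: both gateways in the same tier.
BALANCED_GROUP = {"GW_A": 1, "GW_B": 1}

if __name__ == "__main__":
    both_up = {"GW_A", "GW_B"}
    print(active_gateways(SERVERS_GROUP, both_up))   # servers prefer GW_A
    print(active_gateways(CLIENTS_GROUP, both_up))   # clients prefer GW_B
    print(active_gateways(SERVERS_GROUP, {"GW_B"}))  # GW_A down: fail over to GW_B
    print(active_gateways(BALANCED_GROUP, both_up))  # same tier: both are used
```

With two mirrored failover groups, each line carries part of the traffic in normal operation, but a single client never switches gateways mid-session unless its preferred gateway goes down.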

I upgraded my production network to the latest version, "OPNsense 21.7.5-amd64", and can confirm that this setting works for me. I think I will keep it, as it solves the issue and also gives me the benefit of combining all my gateways for more bandwidth.

For example, a test via speedtest.net uses all 3 of my gateways (2x VDSL + SpaceX STARLINK). [screenshot]

So thank you all for working on this solution until this point.

Back to the issue

However, even though I am "fine now", I am sure the developers are happy about every constructive input, even if they cannot always answer promptly.

I still have my test network and will keep it up and running for further troubleshooting.

Unfortunately, tracking and testing takes some time, but it seems that after clearing the sessions it keeps working for much longer (10 days now!), and I have to reboot the firewall to trigger the issue again (this might also be due to the low traffic in the test environment).

From what I have in my notes, the last working version was the first version of the 20.x release (OPNsense 20.1-amd64). Then we upgraded only the kernel to 20.1.7 (to rule out any kernel-related issues) and it still worked, but soon after I update the remaining packages the issue sneaks in, and it stays through to the very latest version.

Thanks to snapshots, I can quickly jump back and forth between versions in my test environment.

@AdSchellevis I am now back at the first known version that has the issue and will follow your advice and check the sessions / clear them for one device only.

And yes, I can say that as soon as I pin the network to one single gateway via a rule, the connection works, but I haven't tested pinning a particular device to one gateway by rule. I will check this as well and let you know.

Cheers Christian

AdSchellevis commented 2 years ago

Hi Christian,

> From what I have in my notes, the last working version was the first version of the 20.x release (OPNsense 20.1-amd64). Then we upgraded only the kernel to 20.1.7 (to rule out any kernel-related issues) and it still worked, but soon after I update the remaining packages the issue sneaks in, and it stays through to the very latest version.

Let's go into this a bit deeper; the amount of text in the issue (including the me-too comments) is making it hard to keep focus on the relevant data points. So 20.1 works, and 20.1 with the kernel of 20.1.7 works, but it stops working after upgrading the rest of the packages to 20.1.7 as well, right? (core would be the only relevant one here, by the way.)

If we can answer that with a yes, then it's getting a bit weird. Earlier you checked the contents of /tmp/rules.debug, if I'm not mistaken; if these are the same between both versions, it doesn't make sense that there's a difference at all. When it comes to source (policy-based) routing, pf(4) should be the only relevant component here (kernel + ruleset).

> And yes, I can say that as soon as I pin the network to one single gateway via a rule, the connection works, but I haven't tested pinning a particular device to one gateway by rule. I will check this as well and let you know.

Not exactly what I meant. The question here is whether anything changed on any of the upstream gateways that led to the session going stale in some way. Gateway events might be relevant, for example. No need to test single-gateway rules. As soon as it's stale, the next question is: what is the "state" of the states for that source, and what does source tracking report (hence the pfctl -s Source -vv)?

Best regards,

Ad
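
To make the per-host check repeatable, the commands involved can be collected in a small helper. The Python sketch below only builds the command lines and prints them; it does not run anything. pfctl -vvsinfo and pfctl -s Source -vv are the commands quoted in this thread. The -k (kill states originating from a host) and -K (kill source tracking entries originating from a host) options come from the pfctl(8) manual page, not from this thread, so treat that part as an assumption to verify on your system. The helper name pf_commands_for_host and the address 192.168.1.50 are hypothetical.

```python
import ipaddress
import shlex


def pf_commands_for_host(host):
    """Build (but do not run) pfctl command lines for one client address."""
    # Validate the address first; raises ValueError for anything that is not an IP.
    addr = str(ipaddress.ip_address(host))
    return {
        "inspect": [
            ["pfctl", "-vvsinfo"],             # counters, incl. source tracking table
            ["pfctl", "-s", "Source", "-vv"],  # source tracking entries
        ],
        "kill": [
            ["pfctl", "-k", addr],  # kill states originating from this host
            ["pfctl", "-K", addr],  # kill source tracking entries from this host
        ],
    }


if __name__ == "__main__":
    # 192.168.1.50 is a hypothetical client address.
    for phase, commands in pf_commands_for_host("192.168.1.50").items():
        for argv in commands:
            print(phase + ":", shlex.join(argv))
```

Capturing the "inspect" output before running the "kill" commands preserves the evidence asked for above, since killing the states removes it.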

hydrosIII commented 2 years ago

Same issue here. I am getting connectivity issues. It is also happening in pfSense, so it is something related to FreeBSD. It was working before. I get a ping of more than 1500 to 8.8.8.8 when enabling multi-WAN, in either failover mode or load-balancing mode.

Disabling sticky connections does not seem to help.

Updating to the latest kernel (kernel-21.7.7) seems to have improved the issue for now, but I will have to test it for a few days before I can say whether it is solved.

OPNsense-bot commented 2 years ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

Malunke commented 2 years ago

reopen because there is no solution yet

fichtner commented 2 years ago

There's no progress for three months. Let's please start to be realistic about this issue and its lack of actionable information. As per contribution guidelines updates can be made and if clarity arises the ticket will be reopened.

Cheers, Franco

Malunke commented 2 years ago

I don't want to start another fundamental discussion about the way "rare" problems are treated, as that would not be helpful.

However, several people have described plausibly that the problem occurs for them and can be reproduced. Unfortunately, I can't jump back and forth between versions at will, otherwise I could contribute more. But I can say that the problem still exists.

I would almost rule out a misconfiguration, and I have always offered a TeamViewer session for a quick assessment.

The reason so little information is coming in may be resignation on the part of those involved. In the end I also resorted to the crutch of putting the gateways in two different tiers. That still gives me failover, but not the bandwidth sharing I was aiming for, since only gateway 1 is used in normal operation.

Since no solution is in sight, all that is left for me is to work around the problem and possibly look for another firewall distribution.

Best regards

fichtner commented 2 years ago

> I don't want to start another fundamental discussion about the way "rare" problems are treated, as that would not be helpful.

I'd really appreciate it. :) I have been working on shared forwarding for FreeBSD 13 for a couple of weeks now, and this constant nagging here for free support isn't really useful.

Cheers, Franco

Malunke commented 2 years ago

Have you tried your shared forwarding project on FreeBSD 13 with the same configuration we're using here (some screenshots show the configuration), especially with VLANs (and, on my end, with a vmxnet interface)? I can also send my XML file if you are interested. A TeamViewer session is also possible if someone from the developer team is interested.

I can only say that the problem still exists in the current version. And even if it is a rare condition (or perhaps not a rare one, nobody knows), everybody should be interested in investigating this issue and resolving it, either in the web frontend, by not allowing such a configuration, or in the underlying system, by fixing the underlying bug.

(By the way, my time isn't free of charge either, so constant nagging about, or ignoring of, bug reports from different voluntary bug reporters isn't really useful either.)

fichtner commented 2 years ago

I'm not here to offer free support. 3 beta versions have been posted and on Wednesday we will have the first release candidate. If you have energy to discuss how community rules don't apply in this particular case you also have energy to test the FreeBSD 13 code.

Cheers, Franco

Malunke commented 2 years ago

no comment - from here I'm out

Rapterron commented 2 years ago

Hi guys,

Sorry, I should have answered earlier, but you know, life happens, and I totally lost track of this.

I started a new job and no longer have access to my old test lab, so unfortunately I would need to rebuild everything from scratch.

Since I now work for a German ISP in the hosting and network area, I have gained a deeper understanding of professional firewall systems and bandwidth. I think I have mentioned this already, but on most routers and firewalls you use failover instead of active load balancing, since backbone firewalls of this kind mostly have bandwidths of 10 Gbit/s and beyond. That makes load balancing small-bandwidth VDSL lines on an enterprise firewall like OPNsense an exotic use case.

Unfortunately, and I think I speak for all German IT people here, the Internet infrastructure in Germany is a joke and way behind the standard. Most cities don't have fibre lines and use old copper connections with mostly 10 Mbit/s up to a maximum of 100 Mbit/s, and sometimes 500 Mbit/s via cable. So we need to play with uncommon solutions like bonding multiple WAN lines together, which is indeed uncommon in the industrial world. For example, our STARLINK satellite Internet has more bandwidth than the fastest Internet line you could rent...

I don't blame the devs or anyone else. It's just a matter of understanding for a rarely used case. I tried my best to provide as much information as possible, but some problems are hard to track, especially when you are the only one with a test lab (well, who "had" a test lab).

I will continue following the OPNsense project and may rebuild my lab, but that is something for the far future, as I am fully loaded with my new job.

Cheers Christian

denschub commented 2 years ago

I've now run into the same issue on my OPNsense 22.4.2-amd64, which is a Business Edition setup on a DEC750. I spent a bit of time debugging this and now hit a road block where I don't know how to progress further. Here is some additional, hopefully useful, information.

This issue appears to only affect new connections. Existing connections are not interrupted. New connections can "get stuck" (i.e. have 100% packet loss), but usually, killing that connection and retrying after a few seconds makes it work.
It indeed appears to be related to Sticky Connections. I'm able to reproduce with Sticky Connections enabled at least once per hour or so, but never without it.
Outbound NAT seems irrelevant, I've removed all rules and it still reproduces.
When connections are "broken", I see two states in Firewall > Diagnostics > States: The "in" policy-based WAN rule, and the "out" autogenerated "let out anything from firewall host itself" rule. A broken state doesn't look different from a working state.
Dropping the aforementioned two states when they're broken brings the connection back to life immediately. For example, in a broken continuous ping, which will never unstick itself, dropping the two state entries makes the ping work.
Running a packet capture on all interfaces with a "broken ping" running, I can see the ping requests arriving at the firewall on the LAN interface, but I never see them leave on any WAN interface, and I also do not see responses. It looks like the packets just don't get forwarded to any WAN interface.
Also while having a "broken ping" running, I do not see any connection attempt in the Firewall logs. Not even with logging for the policy-based WAN rule enabled. Just nothing.

It seems like the traffic is dropped somewhere between the firewall's LAN interface and the firewall rules. Unfortunately, I have no clue how to debug that. And while I'd be happy to pay for a support subscription, I doubt this is something that can be resolved in the 2 hours included, and extra hours might be a bit too expensive for my home network. :) Maybe someone finds this information useful and can give me a pointer on how to debug this further.

AdSchellevis commented 2 years ago

@denschub it might be better to open a new ticket with the relevant information; unfortunately, we haven't been able to reproduce the issue so far. There are some things you can try, the first likely being to disable shared forwarding and check how that changes behaviour. We can also offer a beta kernel for 22.7 (FreeBSD 13.1) for testing.

opnsense / core

MultiWAN / Gateway group connectivity issues since OPNsense upgrade #5094
