opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

MultiWAN / Gateway group connectivity issues since OPNsense upgrade #5094

Closed. Rapterron closed this issue 2 years ago.

Rapterron commented 3 years ago

Dear OPNsense team,

As suggested by your bot in ticket #5089, I am also affected by this issue and want to raise its priority by following your templates and providing a detailed bug report. Please also consider the feedback in ticket #5089, initially opened by "Malunke".

Describe the bug

Load-balanced multi-WAN routing using 2 Internet gateways stops working after anywhere from a few minutes up to several hours. After a reboot the firewall works again for some time.

Detailed report

My setup:

OPNsense is installed on a hardware platform with a Celeron CPU and Intel NICs. For troubleshooting I also installed OPNsense on a second spare device, which shows the same issue (read more below).

I have 2 Internet gateways from 2 different ISPs, each terminating on an AVM FritzBox (a widely used xDSL modem and router). OPNsense sits behind both AVM VDSL routers via 2 separate VLANs. Between OPNsense and each gateway router is a small transfer network with statically assigned IPs. The OPNsense -> gateway router leg is plain IP routing with no NAT; NAT happens later on the gateway VDSL routers towards the Internet.

In the configuration I have both gateways in the same tier group and route the traffic via a floating firewall rule. To ensure session consistency I enabled sticky connections in the advanced firewall settings under Multi-WAN, with a custom source tracking timeout of 300 seconds.
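
For reference, the effective source-tracking timeout can be verified from the firewall shell; this is just a suggested check, assuming the sticky connections option maps to pf's source tracking:

# show pf timeout values; "src.track" should reflect the configured 300 seconds
pfctl -s timeouts | grep src.track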

This setup had been working fine over the last few years (set up mid 2019); after the update to OPNsense 21 I experience sporadic Internet outages from all internal networks. I can still ICMP ping endpoints behind the gateways while the bug is triggered, but everything else on every client stops working. It seems like OPNsense completely messes up the sessions and gateway allocation. After a restart of the whole firewall it works for a few minutes to hours (no pattern recognized yet).

Steps taken for troubleshooting.

Reviewed and cleaned up the configuration and firewall rules. Checked both gateway routers and made sure the Internet connection is up on each. Used debugging tools on OPNsense while the bug was triggered, such as pinging the gateways and running a health audit. Freshly reinstalled the firewall on the same hardware and imported my configuration. Reinstalled the firewall on a spare hardware device and imported the configuration.

Since it's a production firewall I need to be careful and document every step, but as I could not find a solution with multi-WAN enabled, my next step is a full downgrade on the spare firewall. Once I can confirm the last working version I will update you here.

Current version: OPNsense 21.1.8_1-amd64. Last known working version: OPNsense 20.xx (exact subversion not clear, as I made several updates at once).

To Reproduce: have 2 Internet gateways in the same tier group and use that group in a floating rule.

Expected behavior: load balancing works as it did before the upgrade, but on the current version.

Describe alternatives you considered: disable the load balancing by changing the rule to use only one gateway -> works without the feature. Downgrade OPNsense (the exact last working version still needs to be determined, which is difficult as it's a sporadic issue and I don't want to take my whole network offline several times).

Environment

Main firewall: a server-grade Intel Celeron CPU with Intel NICs on hardware made for firewalls (I don't recall the exact model, but it doesn't matter, as this also happens on the spare firewall).

Spare firewall (currently in use) with the same issue: Intel(R) Atom(TM) CPU N450 @ 1.66GHz (2 cores) OPNsense 21.1.8_1-amd64 FreeBSD 12.1-RELEASE-p19-HBSD OpenSSL 1.1.1k 25 Mar 2021

I hope this information is helpful; please let me know if you need any further details.

Rapterron commented 3 years ago

Hello, please find below some updates.

I stepped back through the versions and found that the last working version is:

OPNsense 19.7.10_1-amd64 FreeBSD 11.2-RELEASE-p16-HBSD OpenSSL 1.0.2u 20 Dec 2019

As soon as I update to the next major version the bug is triggered.

I did not notice this issue earlier because this is a production firewall and I needed a maintenance window to update, so I did multiple updates at once.

Configuration in detail

In this working configuration under advanced Multi-WAN I have:

If you need more details about my configuration, please let me know.

This configuration has worked for the several years I have been using OPNsense (it was the main reason to switch to OPNsense). For now I will leave my production firewall on this software version, but for testing I can quickly set up a new one, since I used this opportunity to virtualize the firewall (on Hyper-V Server Core).

Thank you and have a great Weekend.

AdSchellevis commented 3 years ago

It's probably best to compare the contents of /tmp/rules.debug between both versions as a starting point; checking the counters (inspect button) might also help identify issues with the ruleset (another rule matching first, for example).

If your issue is related to the ruleset, comparing differences between those files might be the fastest way to narrow it down.

From a kernel/driver perspective 20.7 and 21.1 are roughly the same, so if your problem doesn't exist on 20.7.x, it's not likely a kernel or driver issue. That might be worth the effort to test as well.
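
For example, a simple way to save and compare the generated ruleset between two versions could look like this (the copy path is just an example):

# before upgrading, keep a copy of the currently generated ruleset
cp /tmp/rules.debug /root/rules.debug.old
# after the upgrade, diff the freshly generated ruleset against the saved copy
diff -u /root/rules.debug.old /tmp/rules.debug | less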

sschueller commented 3 years ago

I have seen this issue (I think it's the same) on the pfSense 2.5.1 release as well. I have 2 periodic speedtests running, one on the active and one on the backup/failover WAN. After the upgrade I can only run the speedtest on the active gateway. It may be related to this https://forum.netgate.com/topic/163070/pfsense-2-5-1-multi-wan-routing-trouble/4 and https://reviews.freebsd.org/R10:41063b40168b69b38e92d8da3af3b45e58fd98ca ?

I do the following: install the speedtest app

pkg add "https://install.speedtest.net/app/cli/ookla-speedtest-1.0.0-freebsd.pkg"

igb0 = Primary (active) igb2 = Failover/Backup WAN

Run on Interface 0 (Works)

/usr/local/bin/speedtest -I igb0

Run on Interface 2 (Fails)

/usr/local/bin/speedtest -I igb2

Now, if I make igb2 active by marking igb0 as down, the speed test works on igb2 but no longer on igb0.

AdSchellevis commented 3 years ago

@sschueller usually this behaviour is caused by missing reply-to rules on outgoing traffic, assuming that in your case a ping from either of the source addresses doesn't work either (ping -S <ip of igb0|igb2> 8.8.8.8), which is what I just tested on one of our machines (and it works without issues).

sschueller commented 3 years ago

@sschueller usually this behaviour is caused by missing reply-to rules on outgoing traffic, assuming that in your case a ping from either of the source addresses doesn't work either (ping -S <ip of igb0|igb2> 8.8.8.8), which is what I just tested on one of our machines (and it works without issues).

Both of those work for me without issue.

root@XX:~ # ping -S xx.xx.xx.xx 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from xx.xx.xx.xx: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=118 time=13.290 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=16.490 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.290/14.890/16.490/1.600 ms

root@XX:~ # ping -S yy.yy.yy.yy 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from yy.yy.yy.yy: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=116 time=36.244 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=33.177 ms
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 33.177/34.711/36.244/1.534 ms

Rapterron commented 3 years ago

Hello,

@AdSchellevis Good indicator; this may explain why the last working version is 19.7.10 while 20.x and 21.x have this problem.

Right now I am back on "OPNsense 19.7.10_1-amd64" with the same configuration, which works fine, and I see an almost perfect balance between both Internet gateways on my dashboard.

A firewall rule going wild was also my first guess, but as I do not have that many rules it is quite easy to review them, and there are no obviously contradicting entries.

For test purposes I could also disable all other rules and/or place the affected gateway rule first to rule out any rule-related issue. I also checked the counters under inspect and found nothing unexpected.

I also considered completely reconfiguring the firewall with a minimal rule set to test the connectivity, but then I read about other users with similar issues.

I will update again to the latest version, try the mentioned steps, and also check the diff between both /tmp/rules.debug files.

Will update the results here.

fichtner commented 3 years ago

Did you check "Disable State Killing on Gateway Failure" option under Firewall: Settings: Advanced yet?

Cheers, Franco

sschueller commented 3 years ago

Did you check "Disable State Killing on Gateway Failure" option under Firewall: Settings: Advanced yet?

Cheers, Franco

This has no effect on my setup. I can still only run a speedtest on the active gateway.

AdSchellevis commented 3 years ago

@sschueller maybe speedtest doesn't bind to the address of the interface. Just keep in mind that this scenario (keep traffic on originating interface) has no relation to the use of gateway groups (the topic of this issue).

Rapterron commented 3 years ago

Good evening.

I have a few more updates and have narrowed down the issue. One thing to mention: I do not have any plugins installed (besides the dark theme).

I upgraded the running firewall (exactly the same config) from 19 to 21 again and immediately triggered the issue.

For debugging purposes I deactivated all my rules and placed the gateway routing rule first (it would match all traffic and send it to a gateway). This also immediately triggered the issue, and all sessions went crazy on multiple clients in the network.

Firefox, for example, returned the error message (translated from German): unable to process the request, protocol violation. After a few ping tests even my workstation OS (Windows 10 Pro) ran into a blue screen, and my Android mobile phone did not work properly until I disabled WiFi.

This absolutely points to packets from one session being spread round-robin across the 2 independent gateways (2 different public IPs), which violates the protocol and leads to unexpectedly broken sockets.

I suspect the (mandatory) multi-WAN feature "Firewall -> Settings -> Advanced -> Sticky connections" is no longer working.

I also compared the "rules.debug" file between both versions (a rough diff), but they look almost the same.

To be very sure the issue is on the OPNsense side, I also downgraded and replaced both gateway routers (because I had also upgraded them recently).

As a next step I could capture some traffic at the gateways to see what comes out of OPNsense, but I am fairly sure the non-working sticky connections and the resulting split routing are the root cause.

I could also install a fresh OPNsense with the absolute minimum configuration for multi-WAN in a test network segment, but this would take a few days to set up.

@AdSchellevis is there anything else I can do? Anything that would help you narrow down and fix the issue? Should I switch to the development update channel?

Thank you very much for your support, and please do not hesitate to let me know if you need any further information. Cheers, Christian

AdSchellevis commented 3 years ago

Hi Christian,

I'm not expecting general issues to be honest; upstream (https://bugs.freebsd.org) doesn't seem to indicate an issue with the sticky-address keyword (and I'm quite sure a lot of people use this feature). Best try to upgrade to 20.7 first and test what that does, so the kernels are roughly aligned, then work your way up to the ruleset.

Problems with unstable connections often also relate to MTU, by the way; as a wild guess you could also try to lower the MTU values to something like 1300.
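
As a quick check for the sticky behaviour itself, assuming the gateway group is rendered via pf's route-to pools, you could verify that the load-balancing pool and the sticky option actually show up in the generated ruleset and that source-tracking entries are being created:

# look for the load-balancing pool and the sticky option in the generated ruleset
grep -n "route-to" /tmp/rules.debug
grep -n "sticky-address" /tmp/rules.debug
# list pf's source-tracking entries (used by sticky connections)
pfctl -s Sources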

Best regards,

Ad

Rapterron commented 3 years ago

@AdSchellevis yes, that is strange; I also did not hear about a general issue in FreeBSD, but I have read here and there (and also in the forum linked in #5089) about this issue. As soon as I upgrade to anything greater than 19.7.10_1, the issue is triggered.

I could imagine that this issue does not affect that many users because they might use failover, or have such a large upstream gateway that they are not interested in load balancing. However, in my setup I have 2x 100 Mbit (with 40 Mbit upstream) VDSL connections for a few servers, my test network, and around 15 users, so I want to utilize the bandwidth of both connections.

Of course I could stay on version 19 forever, which works, but I guess this is not the best solution for other users or for the developer team. I understand this might not be a high-priority issue for the team, but it should at least be on the radar. I use OPNsense, I am happy with OPNsense, so I am also happy to contribute to making this firewall better.

Back to the issue: yes, I played around with the multi-WAN and gateway-related settings such as Disable State Killing on Gateway Failure, Sticky connections, and the stickiness time limit, but it does not matter what I set.

I will do now the following:

One last word regarding #5089 and @Malunke: I created this ticket as recommended by the OPNsense bot, using the templates, to get some traction; as I hadn't seen any comments in that ticket for a few days I was not sure whether there would be any movement at all. I sincerely apologize for any misunderstanding.

Will let you know my results. Cheers, Christian

AdSchellevis commented 3 years ago

Hi Christian,

If 20.1 is the first version which doesn't work correctly for you, it might be easier to track the changes (and focus on 20.1 first). The kernel between 19.7.7 and 20.1 is basically the same one if I'm not mistaken.

Best regards,

Ad

Rapterron commented 3 years ago

Hello all,

I had some time and set up a fresh OPNsense in my test network, parallel to my production network but using the same gateway routers.

Quick summary: the issue also appears in a minimal configuration.

Let me try to report in as much detail as possible, providing all the important configuration.

Starting with a network diagram:

Untitled Diagram

Fresh installation of OPNsense 21.1.8_1-amd64 with the minimum configuration/rules, currently running on a Hyper-V server (parallel to the production instance): system_information

Here you see the interfaces. Please don't mind the 3rd WAN interface (called WAN3); it is not used in the rule or the group (see below).

interfaces

Here are the gateways, with gateway monitors configured to determine their status:

gateways_config

As an example, the configuration for one gateway:

edit_gateway

The gateway group with both gateways in a tier 1 group for load balancing:

gateway_group

gateway_group_config

Used in a floating firewall rule:

Notice: the deactivated rule would route the traffic via one WAN interface without load balancing. This is one of the workarounds.

floating rule

The content of the floating rule:

floating rule config

And to finish this off, here are the additional firewall settings such as sticky connections:

firewall_advanced_settings

This is basically a very small setup using mostly default settings, and it is also referenced in some online documentation.

As this is a very generic configuration without any sensitive data, please find the configuration file below:

config-OPNsense.localdomain-20210729172111.zip

I hope this information is helpful; please let me know if you need any further details.

I will leave my working production OPNsense on 19.7.10_1, as this points to a more generic issue. For any further testing I can easily start my test router.

Cheers Christian

AdSchellevis commented 3 years ago

Hi Christian,

I would really advise checking between 19.7.x and 20.1 as suggested in my previous comment so we can compare differences; moving to (almost) the latest version is quite a gap in time.

The old installer for 20.1 is still available on our webserver (https://pkg.opnsense.org/releases/); if you're looking for a 19.7 one, I can check whether I can find one for you.

Best regards,

Ad

Rapterron commented 3 years ago

Hello Ad,

Thank you very much for your quick response.

Yes, I had this in mind, but then I thought it would be better to go tabula rasa and build a config from scratch on a current version. I have had the firewall for several years now and want to rule out an issue caused by my configuration and rules, which grew over time.

Thank you, I still have all the ISO images on my file server and have already prepared 3 VMs:

I am now on 20.1.9_1 and testing, with the same config as in my previous post above.

grafik

From your point of view, is there anything in the example config that might be incorrect or unwanted?

Thank you. Cheers Christian

AdSchellevis commented 3 years ago

Hi Christian,

At a first glance this looks rather normal, except that if you're also running DNS on the box you're likely forwarding local traffic to the next hop now (which would also disrupt your internet traffic).

The reason why I'm asking to test both older versions is so that we can match code differences. As I mentioned before, at my end multiple gateways using either sticky or non-sticky connections over IPv4 work like a charm... on the latest version.

Over the years I've seen quite a few configuration issues from customers at our support desk, but that's really outside the scope of community support.

Best regards,

Ad

fichtner commented 3 years ago

I am now on 20.1.9_1 and testing.

It would be beneficial to test specifically the initial 20.1 release, unless the error is not present there. 20.1 -> 20.1.9_1 still covers a lot of ground.

Rapterron commented 3 years ago

Hello Ad,

thanks again for the fast response.

Yes, I am absolutely aware that this is "community" support and I am more than happy that you are taking a look at this case. You should not have to debug my production configuration, especially since other users are seeing this too; that is why I reduced it to the bare minimum to narrow it down at a low level.

Yes, this is a separate test network and all traffic should be routed straight to one of the two gateways. For DNS I let the clients query the Google DNS servers directly (in production I have my own server).

Okay, while I was testing it also happened on version 20.1.9_1. I have 2 Windows 10 clients and 2 Android mobiles; all systems show a similar reaction.

On Windows you see that nothing is loading anymore; nslookup, traceroute and ping work, but nothing else.

tempsnip

On the Android mobiles nothing loads and WiFi reports: connected, no internet.

On the router's dashboard everything is "green", and after a reboot it works again. Next time this happens I will check whether other actions, like modifying the group or turning a gateway off and on, help temporarily.

@fichtner Hm, okay, that's a good point. I will reinstall the initial 20.1 release and test again.

Cheers Christian

Malunke commented 3 years ago

Is there any news here yet? Unfortunately I can't actively contribute to the testing, but I would be happy if the developer community would actually acknowledge this bug as a bug and not blame it on some exotic configuration. I have already shown in several posts that it occurs on different hardware, and Rapterron has shown in an exemplary way that merely updating a working OPNsense machine triggers the problem. At the latest this should awaken the developer community's sense of honor - with other projects this would have been enough long ago.

I would be happy if there were finally some news.

AdSchellevis commented 3 years ago

@Malunke not sure why you seem to have the urge to hijack this thread, but let me reference my last comment; maybe you forgot to read it: https://github.com/opnsense/core/issues/5089#issuecomment-885870493

Malunke commented 3 years ago

(Before Reading - normally the following has nothing to do with a bug report. I still want to push this topic and want a solution.)

Hello, I am not hijacking the thread and I have not forgotten the last comment. My problem is that simply nothing happens. In my opinion everything on the topic has already been said in the community forum in various threads - but nothing happened. After that I opened a bug report - nothing happened. After that Rapterron opened this bug report and has done excellent work up to today - but still nothing has happened.

I'm sorry, but I just can't understand this. I also can't understand why the developers don't already react in the community forum, although it is almost certainly a bug. The steps are quite simple: 1) install the firewall plain vanilla, 2) set up 2 WAN gateways, 3) put both gateways into one gateway group (same tier), 4) set up a firewall rule pointing at this gateway group, 5) watch the error appear for yourself.

There is nothing extraordinary about these configurations. To top it all off, Rapterron did a great job and found out at which version jump the error occurs - all this should prove that it is a bug in the firewall and not an RTFM error! But nothing has happened to date - and if the developers have no infrastructure to test on, I am speechless. The bug analysis and code comparison can't be demanded from the community either, although with almost every problem it is pointed out that you didn't buy a paid support package from OPNsense!

I am still politely waiting for a bug fix. I am also in a position to advise users or customers for or against a paid firewall product, and OPNsense is not doing well here at the moment.

AdSchellevis commented 3 years ago

@Malunke Apparently you really don't want to listen. As reported earlier I did set up a gateway group (https://github.com/opnsense/core/issues/5089#issuecomment-885782165), which didn't have issues; quite a few people use this feature without issues, so there's likely something different in your setup. I don't mind digging into something, as long as someone can explain what to look into (hence the question to pinpoint the exact version). I'm not spending any more time on this; good luck complaining about other people not helping you out, a bit of self-reflection would probably help.

Rapterron commented 3 years ago

Good evening.

I am still testing and thought it would be best to wait with my report until the tests are finished. I got notified by GitHub about new posts, so I think it's a good time to jump in again.

As mentioned, I am not testing in the production environment at this time, but I use the same upstream gateways. My live firewall is still on OPNsense 19.7.10_1-amd64.

The test firewall is now on OPNsense 20.1-amd64, freshly installed from the ISO and not patched, as recommended by @fichtner.

grafik

Not that much traffic, but both gateways are used for load balancing. grafik

grafik

Minimal configuration based on the diagram and configuration file I mentioned above: https://github.com/opnsense/core/issues/5094#issuecomment-889262266.

3 test devices: 2x mobiles, 1x Windows 10. The test environment has now been running for about 2 days and 8 hours and NO issue was triggered! Apparently @fichtner's idea was a good hint, so the last working version is OPNsense 20.1-amd64.

@AdSchellevis you mentioned you have already set up a test? Can you provide the configuration file so I can compare it with my minimal setup? Do you have NAT enabled? This might be the only difference. As far as I remember, by default OPNsense creates an automatic outbound NAT rule towards the gateways, but as my gateway routers (AVM FritzBox) have a static route for the returning traffic pointing to OPNsense, it is not necessary to NAT the LAN side (10.133.7.0/24) on OPNsense towards the gateway routers. This saves me some CPU time and avoids double NAT, as the gateway routers will NAT the traffic towards the public Internet anyway.

grafik Here you see the back route. 192.168.12.10 is the leg of my production OPNsense, 192.168.12.11 is the leg of the test OPNsense.
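
To double-check that this routed (un-NATed) return path is actually used, one option would be to watch the WAN transfer network while a LAN client generates traffic; the interface name below is only a placeholder for the WAN1 VLAN interface:

# packets to/from the LAN net (10.133.7.0/24) seen on the WAN leg confirm the routed, un-NATed path
tcpdump -ni <wan1-vlan-if> net 10.133.7.0/24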

I have the production and test firewall on the same server hardware (Hyper-V) and can load, test, and switch versions and configs very easily.

Hope that helps, and please let me know what to test next.

Side note on load balancing: as I work for an ISP and mostly use Fortinet hardware, I see that the majority of customers use failover instead of real load balancing. Mostly they have a big main line with gigabit or even 10 gigabit and a smaller backup line via LTE / 5G / xDSL or even Starlink satellite.

In a smaller setup like our apartment block, we want to combine multiple xDSL lines (with a maximum bandwidth of 100 Mbit each) to distribute the bandwidth better to everyone.

Cheers Christian

mimugmail commented 3 years ago

Please do double NAT; the days of avoiding it are gone and it is much clearer this way. Also try the floating rule with your LAN as source instead of any. There might be cases where returning packets get sent out again, since they also match with source any.
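
If you want to verify that, one rough check (the interface name is a placeholder, and this assumes your tcpdump build supports direction filtering with -Q) is to watch a WAN leg for reply traffic leaving the box again:

# replies destined to LAN addresses should not normally be sent back out via the WAN transfer networks
tcpdump -ni <wan1-vlan-if> -Q out dst net 10.133.7.0/24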

fichtner commented 3 years ago

https://github.com/opnsense/changelog/blob/613960454b7da72e18f282211f7cbd5f1bf844b7/community/20.1/20.1.7#L23 This one looks like a candidate. I think all the kernels of 20.1 are still on the mirror to try if you want.

AdSchellevis commented 3 years ago

Hi Christian,

My test was fairly simple, including outbound NAT rules. It would be good to rule out the patch in pf first; it shouldn't have an effect when the interface is there, but if I'm not mistaken that specific change originates from https://github.com/opnsense/src/issues/52 (https://github.com/opnsense/src/commit/923c95cfee2d3c0e4540b4efd6f188aa089c533f).

The old kernels are still there, so you could try to only update base and kernel to it with opnsense-update -bkr 20.1.7 if I'm not mistaken.
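
For illustration, a possible sequence would be the following (assuming the flags select base, kernel and release, and that a reboot is required for the new kernel to become active):

# fetch base and kernel from the 20.1.7 release sets, then reboot to activate them
opnsense-update -bkr 20.1.7
shutdown -r now
# after the reboot, the kernel build shown by uname should reflect the change
uname -a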

Best regards,

Ad

Malunke commented 3 years ago

Perhaps I'm able to help a little bit. I'm using Hybrid NAT (see my attachment) so I think NAT won't be the right direction.

However, what might differentiate Rapterron and me from other setups - we both use VLANs. Normally, this should not make a difference - but this could distinguish us from users with purely physical ports. I only have 2 network cables connected to my ESXi server, the rest is divided via tagging and actually runs super stable.

The only trunk ports are the uplink between the 2 switches and the two connections to the ESXi server. So 4 trunk ports; the rest are edge ports without a VLAN tag.

Aufbau

Malunke commented 3 years ago

There is a small typo in my picture - the Speedport IP is 10.1.200.1 instead of 10.1.200.10 (which is OPNsense's IP).

Malunke commented 3 years ago

I would also like to briefly apologize - I don't want to troll. However, I am hardly used to being waved off so quickly with everything directed back to the customer (and I deal with the most diverse company support teams).

Since the problem occurs in my particular constellation and I am far from alone, Deciso should nevertheless voluntarily get to the bottom of this instead of simply ignoring it. I am also willing to help in any way - config export, TeamViewer session, ...

I just need to know how I can help diagnose it.

mimugmail commented 3 years ago

"I just need to know how to help diagnose it."

"The old kernels are still there, so you could try to only update base and kernel to it with opnsense-update -bkr 20.1.7 if I'm not mistaken."

This would be a good start... as your firewall is virtual, it shouldn't be a huge risk.

Malunke commented 3 years ago

Dear mimugmail,

is it also possible to downgrade with this command (opnsense-update -bkr 20.1.7)? I'm running the current version, OPNsense 21.7.1-amd64.

It is no problem to take a snapshot and play a little with my production firewall (but after testing for some minutes or hours I have to revert to the snapshot). A parallel installation is not possible for me because my second WAN is only a modem and I can connect only one firewall at that end.

mimugmail commented 3 years ago

You can clone the VM, install a fresh 20.1, restore config.xml, update to the latest 20.1 and revert back to the mentioned kernel/base.

Malunke commented 3 years ago

I'll try and report.

Rapterron commented 3 years ago

Please do double NAT; the days of avoiding it are gone and it is much clearer this way. Also try the floating rule with your LAN as source instead of any. There might be cases where returning packets get sent out again, since they also match with source any.

@mimugmail

I guess we have something here.

I enabled outbound NAT and changed the floating rule to match my LAN net as source, and it seems to be working without triggering the issue (test time around 5 hours).

grafik

To narrow down the root cause I disabled NAT but left the rule with the defined source net. Still works (test time around 4 hours).

Then I changed back to source any and the issue was triggered after a short while (~45 minutes).

grafik

The rule marked in red triggers the issue; the one in blue seems to work.

I took this configuration and tested it in another VM with the latest OPNsense (21.7.1-amd64). Absolutely the same behavior. It seems like it's the floating rule.

I will change the rule in my live environment, create a snapshot, and upgrade it to the latest version during a maintenance window.

@Malunke Could you post a screenshot of your floating rule where you point to the gateway group?

mimugmail commented 3 years ago

Good news, thx for the feedback! So it seems a warning in the docs should be sufficient

Rapterron commented 3 years ago

@mimugmail Oh, hold on. Some clients stopped working. This is something new: 2 out of 3 clients are still working, but one client has no connectivity. I rebooted the firewall and all 3 work again.

This is new behaviour; usually no client was working once the issue was triggered.

I will do further tests with more clients and more traffic. Keep you posted.

AdSchellevis commented 3 years ago

@Rapterron Which interfaces did you select in your floating rule? To be honest, these rules look quite dangerous and prone to errors to start with. We usually advise sticking to interface rules and making sure to only select the correct traffic (exclude other traffic with pass rules without gateways first). (https://docs.opnsense.org/manual/firewall.html#policy-based-routing)

If, for example, you accidentally match local traffic for whatever reason, very weird things can happen (loss of DNS, which was probably cached for some time at the clients' end, suddenly blocking all internet traffic). Source (policy-based) routing doesn't care about rules other than its own, so a local or known path doesn't mean much anymore as soon as you match the traffic.
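
To see which rule your traffic actually ends up matching, you can list the loaded ruleset with per-rule counters; this is a generic pf inspection step, nothing OPNsense-specific:

# verbose rule listing with counters (evaluations, packets, states) per rule
pfctl -v -s rules | less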

Rapterron commented 3 years ago

Hello @AdSchellevis

In the floating rule I selected the LAN interface as inbound interface only. grafik

In my production environment I have a few more interfaces selected as the "LAN" zone (for example client network, server network, guest network, wifi, etc.).

For historically grown reasons I only use floating rules in my production environment, but the gateway rule is always at the end of the ruleset. Correct me if I am wrong, but as far as I remember the floating rules are processed before the interface-specific rules, right?

Anyway, in the test setup I can play around, so I will now set the gateway routing policy on the LAN interface as you mentioned. Will test and report.

Cheers Christian

AdSchellevis commented 3 years ago

Hi Christian,

Correct me if I am wrong, but as far as I remember the floating rules are processed before the interface-specific rules, right?

yes, for more context see https://docs.opnsense.org/manual/firewall.html#processing-order

When only a single interface is selected, the rules should end up similar to those in the interfaces section; with floating rules it's just easier to accidentally over-select traffic. If the behaviour is different (which I don't expect), you can always compare statements in /tmp/rules.debug.

Best regards,

Ad

Malunke commented 3 years ago

Hello, I don't have floating rules, I only use rules in the respective interface tabs.

But I will try changing source=* to source=LAN networks. I will report whether this makes any difference.

Unfortunately, a "new" OPNsense as a test system is a bit more difficult for me (I forgot to explain). I have "locked" all infrastructure components into their own VLAN with OPNsense as the routing instance, following the BSI baseline protection manual (BSI Grundschutzkompendium). As soon as I shut down my OPNsense I no longer have access to ESXi, switches, etc. This also makes it difficult to easily switch to a test system.

Of course I have a disaster recovery plan (but it requires changing IP ranges, repatching outlets, ...) to get into the management network without a working OPNsense, but this is very time consuming. The first thing I will try is to adjust the rule above.

Malunke commented 3 years ago

I just verified - my standard rule for gateway selection already had source=LAN set (I thought I had source=*, but that wasn't the case). Unbenannt

Don't be surprised by the screenshot - at the moment it is gateway=* so that the gateway group is not used, because of this issue.

Rapterron commented 3 years ago

Hello all.

I want to provide some updates. I did some more long-term tests.

At first it seemed to be working for several days, but then I noticed I had forgotten to enable the policy-based routing rule, so the firewall was routing through the default route on one gateway.

Soon after I enabled the rule, a few hours later (with very little traffic, only <100 MB), the issue was triggered. It does not matter if it's:

I can share the "rules.debug" if it helps, but as there is only one rule on this test firewall I haven't spotted anything suspicious in it.

As @Malunke mentioned, could it be related to the VLAN tagging? This seems to be the only difference to @AdSchellevis' test setup. I could reconfigure the hypervisor and add 2 more virtual interfaces to connect untagged to the gateway networks.

AdSchellevis commented 3 years ago

@Rapterron but which was the last functioning version for you? I'm reading through the thread again, but I can't find what the result of opnsense-update -bkr 20.1.7 was in relation to the plain 20.1 version which didn't have issues on your end. It's still imperative to know when the behaviour changed for you in order to point you in a direction.

Rapterron commented 3 years ago

Hello, @AdSchellevis Thank you for your answer.

Yes OPNsense 20.1-amd64 was the last working version.

OPNsense 20.1-amd64 FreeBSD 11.2-RELEASE-p16-HBSD OpenSSL 1.1.1d 10 Sep 2019

I just verified it. Sorry if I am a bit slow; I am on vacation and have only limited access to the network.

In the meantime I also have a 2nd parallel test instance up and running with the latest OPNsense (and found a possible workaround), but let's keep the chronological order so we don't mix things up.

Oh, I must have overlooked the manual kernel update proposal and did it just now, with opnsense-update -bkr 20.1.7 directly in the shell via the VM console.

Even though the update was successful and had no errors, OPNsense still shows the same version on the dashboard. Is this expected? OPNsense 20.1-amd64 FreeBSD 11.2-RELEASE-p16-HBSD OpenSSL 1.1.1d 10 Sep 2019

From the shell, however, it thinks it is up to date.

grafik

Will test this now and let you know about the outcome.

PS: the workaround on the latest version (well, not new, as this was already mentioned in other threads) was to disable "Shared forwarding". Even though this could be a last-resort workaround for me (in my current setup), it is not helpful for the others suffering from this issue.

AdSchellevis commented 3 years ago

@Rapterron no rush, the 20.1.7 kernel should be correctly installed now; uname -a would show a difference in the build hash. We'll wait and see what the outcome is without changing anything else; if this works, we can discuss the next test step.

Rapterron commented 3 years ago

Hello @AdSchellevis,

almost 3 days working now without issues.

grafik

AdSchellevis commented 3 years ago

@Rapterron ok, that's good news; at first glance it doesn't look kernel-related in that case. Did you try the latest 20.1 as well earlier, by the way? If that didn't work, we'd probably best take small steps from here.

Rapterron commented 3 years ago

Hello,

@AdSchellevis Hm, yes, I already tried 20.1. The smallest steps I took were: installing from the image -> testing -> upgrading via the UI and repository -> testing. The test setup still works fine and has no issues. If you can tell me every single step to update, we should easily be able to locate the module which does not work as intended.

A bit off-topic but similar: I came across another instance of wrong behavior on my "old" production firewall (19.7.10_1). At the beginning of this week I added a 3rd gateway to the system (I got invited to the SpaceX Starlink beta) and added this gateway to the group (same tier). As the Starlink router does not support custom routing, I needed to add an outbound NAT rule, and at that point my firewall started to load-balance packets wrongly, so I disabled Starlink again.

Routing a few clients explicitly via this gateway, or having it alone on tier 1 or on tier 2 (failover), works fine, but once it is in a gateway group load-balanced with the other gateways in the same tier it does not work.

Today I replaced the Starlink router with a router that supports custom routing, set up all 3 gateways basically the same way, and now it works fine. All 3 gateways in the same tier 1 group load-balance the traffic correctly.

It feels like the whole gateway group and load balancing setup is really fragile.

But anyway, let's focus on this issue with the most recent version, as on my old firewall I might have run into a bug which has already been fixed.

Thank you and have a great weekend! Cheers Christian

AdSchellevis commented 3 years ago

@Rapterron There are likely a lot of wrong assumptions about policy-based routing; in our experience it's pretty stable, but it often requires more knowledge about your network and the expected traffic flow.

So if I understand the current situation correctly, your setup works well with the first releases of 20.1, but no longer works in the latest version. We have excluded the kernel part, but we can also step through the versions in the 20.1 branch; if I'm not mistaken you can upgrade to a specific version using the "firmware flavour" selection.

All available versions can be found on our mirror (https://pkg.opnsense.org/FreeBSD:12:amd64/21.1/MINT/); setting the flavour to custom + "20.1/MINT/20.1.2/OpenSSL" should offer the option to upgrade to 20.1.2. Can you upgrade + test until the non-working 20.1.x is identified?