Gateway monitoring dies randomly

ouafnico commented 9 months ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

[x] I have read the contributing guide lines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
[x] I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue

Describe the bug

When using multiple gateways, you can configure each one to ping a specific IP address to validate that the gateway is working.

We see the same behavior on multiple firewalls (latest version as of this date): at random times, the gateway is considered down because the monitoring IP is not responding, and the system can no longer ping it.

To Reproduce

Steps to reproduce the behavior:

1. Go to System / Gateways / Single
2. Edit a gateway and add a "monitor IP" to it
3. Confirm "Disable gateway monitoring" is not checked
4. Save and reload

At random times, the gateway is marked down and the monitoring IP no longer responds.

Restarting the gateway service does not fix it, but editing the gateway and saving it without any changes brings the gateway back up, and the monitoring IP responds again.

Expected behavior

The gateway must stay alive when the monitoring IP is responding.

Logs

I didn't find anything in the logs related to this "down" gateway.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 23.7.4 (amd64). Intel(R) Atom(TM) CPU C3558

The Internet connections concerned are fiber or stable VDSL.

Let me know if I can provide anything else.

ouafnico commented 9 months ago

Any idea?

The same problem appeared today. I checked on the firewall: it can ping the monitored IP, but the gateway interface still shows "100% LOSS".

I edited the gateway without modifying anything, saved, and applied: the loss is gone and the gateway is back.

iMiMx commented 9 months ago

I seem to recall there were a number of gateway monitoring changes - in .5, .6, etc.

https://github.com/opnsense/changelog/commit/5a1a69863eb136361117c785e45e6c9a8133ab3c
https://github.com/opnsense/changelog/commit/a6845075b04722c872ae93e270afb0795ae710c2

Commits to changelog:

https://github.com/opnsense/changelog/commits/master

As with all problems, upgrading to the latest stable release is usually a good first step? You note you're on .4 when .9 is the latest.

Then, maybe try adding 'disable host route' on the gateway.

ouafnico commented 9 months ago

I got the bug on 23.7.6, every week.

iMiMx commented 9 months ago

My point is this: from what you've noted, you're not running the latest version with the latest fixes.

I can't see that this report will attract much attention, until you are.

ouafnico commented 9 months ago

I misread the latest version. I'm trying with .9.

p-rintz commented 7 months ago

I'm running OPNsense 23.7.12-amd64 and have the same issue.

One of my gateway monitors randomly dies, even though it should work.

Edit: Issue is also present with OPNsense 24.1_1-amd64

ouafnico commented 7 months ago

I tried different versions of OPNsense, including the latest one, and got the same problem. Disabling the host route is not a good option in my case.

p-rintz commented 7 months ago

I also see that if I disable the host route, after some time the monitor reports 50-80% packet loss (only on that IPv4 gateway; the corresponding IPv6 gateway and all other gateways are unaffected).

~~Completely removing a remote monitor ip, seems to work ok as a workaround. (So i use the gateway as the ping address)~~

~~But this is not optimal to check connectivity.~~

Edit: I guess this didn't actually work. The gateway still times out randomly with my supposed workaround above. It took about 24h this time.

ouafnico commented 6 months ago

Any idea how we can address this issue?

It's happening a lot on a router where the ADSL is a little unstable: every time the line dies and comes back, the gateway gets stuck and won't come back.

OPNsense-bot commented 3 months ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

ouafnico commented 3 months ago

This issue is still present.

fichtner commented 3 months ago

I don't doubt it, but I'm unable to tell from the little info about your WAN setup in general and not sure how to reproduce.

Nothing is really random here. There should at least be log messages correlating to the time when this stops working.

ouafnico commented 3 months ago

> I don't doubt it, but I'm unable to tell from the little info about your WAN setup in general and not sure how to reproduce.
>
> Nothing is really random here. There should at least be log messages correlating to the time when this stops working.

There are two firewalls, connected to a Juniper Virtual Chassis with LACP. WAN is a dedicated VLAN with a conventional network such as 192.168.8.0/24. Firewall 1 is 192.168.8.251 and firewall 2 is 192.168.8.252. A CARP address, 192.168.8.254, is shared between them.

The first gateway is 192.168.8.2 (dedicated ADSL) and the second is 192.168.8.1 (4G).

The first gateway is configured with a monitoring IP on the Internet (one of our public IPs, always up) and priority 10. The second gateway has "Disable gateway monitoring" checked and priority 20. Both have "Upstream gateway" checked.

A gateway group is created with both gateways, the first as "Tier 1" and the second as "Tier 2".

What happens: when the first gateway dies (it happens sometimes) and comes back, it stays marked as down in OPNsense until a reboot, or until a fake edit of the gateway followed by an apply, even though nothing is modified.

The only logs I found are these, saying that the first gateway is down when in reality it is not (ADSL is up). From the OPNsense box, the public IP used for monitoring does not respond, but it responds again after the fake edit described above.

Let me know if I can provide more info.

fichtner commented 3 months ago

Ok so it means you use gw_switch_default (System: Settings: General: Gateway switching).. Is WAN_GW connected to opt1 or another interface? It looks like the default switching isn't doing anything since it never switches away from opt1?

OPNsense-bot commented 3 months ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.

ouafnico commented 3 months ago

> Ok so it means you use gw_switch_default (System: Settings: General: Gateway switching).. Is WAN_GW connected to opt1 or another interface? It looks like the default switching isn't doing anything since it never switches away from opt1?

Yes, I use gateway switching. WAN_GW is connected to opt1. WAN_GW is the first gateway; the second is attached to opt1 too.

fichtner commented 3 months ago

Sorry about the bot. I'll reopen and hopefully block it from closing again at this point.

So you have two gateways for the same interface? I can see where the trouble could start here.

ouafnico commented 3 months ago

> Sorry about the bot. I'll reopen and hopefully block it from closing again at this point.
>
> So you have two gateways for the same interface? I can see where the trouble could start here.

Yes. We have a lot of devices set up like this, but some use different interfaces for the multiple gateways. I need to check whether the problem happens there too.

fichtner commented 3 months ago

I'm not entirely sure if it doesn't end up tripping over itself here, but the notion that this did work in the past would indicate we can fix it. Just need to find the era where this is happening.

fichtner commented 3 months ago

When this gets stuck can you do the following?

Collect data from:

# pluginctl -r return_gateways_status

Restart the monitors:

# pluginctl -c monitor

Collect data again (hopefully in a working state):

# pluginctl -r return_gateways_status

Cheers, Franco
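
The three steps above (snapshot, restart monitors, snapshot again) can be scripted so that both snapshots are captured close together. The following is a minimal Python sketch, not part of OPNsense: it only wraps the two `pluginctl` invocations quoted above and assumes that `pluginctl -r return_gateways_status` prints JSON. The function names and the injectable `run` parameter are made up for illustration.

```python
import json
import subprocess
from typing import Callable, Sequence


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its standard output as text."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout


def capture_restart_capture(run: Callable[[Sequence[str]], str] = run_command) -> dict:
    """Snapshot gateway status, restart the monitors, then snapshot again.

    Assumption: the status command prints a JSON document on stdout.
    """
    status_cmd = ["pluginctl", "-r", "return_gateways_status"]
    before = json.loads(run(status_cmd))
    run(["pluginctl", "-c", "monitor"])  # restart the gateway monitors
    after = json.loads(run(status_cmd))
    return {"before": before, "after": after}


# Usage on the firewall itself (as root):
#   print(json.dumps(capture_restart_capture(), indent=4))
```

Passing a different `run` callable makes it possible to try the logic on a workstation without touching a real firewall.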

ouafnico commented 3 months ago

Sure, I currently have the same issue on another firewall.

{
    "dpinger": {
        "WAN_GW": {
            "status": "down",
            "monitor": "xxx.xxx.xxx.xxx", #voluntary hidden
            "name": "WAN_GW",
            "stddev": "0.0 ms",
            "delay": "0.0 ms",
            "loss": "100.0 %"
        },
        "WAN_GW_4G": {
            "status": "none",
            "monitor": "~",
            "name": "WAN_GW_4G",
            "stddev": "~",
            "delay": "~",
            "loss": "~"
        }
    }
}

After restart:

{
    "dpinger": {
        "WAN_GW": {
            "status": "none",
            "monitor": "xxx.xxx.xxx.xxx", #voluntary hidden
            "name": "WAN_GW",
            "stddev": "0.2 ms",
            "delay": "11.2 ms",
            "loss": "0.0 %"
        },
        "WAN_GW_4G": {
            "status": "none",
            "monitor": "~",
            "name": "WAN_GW_4G",
            "stddev": "~",
            "delay": "~",
            "loss": "~"
        }
    }
}
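
To compare two snapshots like the ones above without reading them by eye, a small helper can list the monitored gateways that report full loss. This is a hedged Python sketch that assumes the structure shown in this comment (a top-level `dpinger` object, `~` as the placeholder for unmonitored gateways, loss formatted like `100.0 %`). The `192.0.2.1` monitor address is a documentation placeholder, and the hand-added `#voluntary hidden` annotation has to be removed before the text parses as JSON.

```python
from typing import List, Optional


def loss_percent(gateway: dict) -> Optional[float]:
    """Parse a loss string such as '100.0 %'; return None for the '~' placeholder."""
    value = gateway.get("loss", "~")
    if value == "~":
        return None
    return float(value.rstrip("% "))


def stuck_gateways(status: dict) -> List[str]:
    """Names of monitored gateways reported as down with 100% loss."""
    names = []
    for name, gateway in status.get("dpinger", {}).items():
        if gateway.get("status") == "down" and loss_percent(gateway) == 100.0:
            names.append(name)
    return names


# Sample data in the same shape as the snapshots above (placeholder address).
BEFORE = {"dpinger": {
    "WAN_GW": {"status": "down", "monitor": "192.0.2.1", "name": "WAN_GW",
               "stddev": "0.0 ms", "delay": "0.0 ms", "loss": "100.0 %"},
    "WAN_GW_4G": {"status": "none", "monitor": "~", "name": "WAN_GW_4G",
                  "stddev": "~", "delay": "~", "loss": "~"},
}}
AFTER = {"dpinger": {
    "WAN_GW": {"status": "none", "monitor": "192.0.2.1", "name": "WAN_GW",
               "stddev": "0.2 ms", "delay": "11.2 ms", "loss": "0.0 %"},
    "WAN_GW_4G": BEFORE["dpinger"]["WAN_GW_4G"],
}}

print(stuck_gateways(BEFORE), stuck_gateways(AFTER))
```

Note that this only reports what dpinger believes; it cannot tell a stuck monitor from a link that is really down.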

fichtner commented 3 months ago

Ok, I think when it switches it may kill the route used by WAN_GW, causing it to malfunction, so it's missing a restart somewhere.

Does that pluginctl -c monitor fix the condition btw?

ouafnico commented 3 months ago

> Ok, I think when it switches it may kill the route used by WAN_GW, causing it to malfunction, so it's missing a restart somewhere.
>
> Does that pluginctl -c monitor fix the condition btw?

Hmm, interesting.

Yes, it fixed the problem.

fichtner commented 3 months ago

I'm assuming https://github.com/opnsense/core/commit/3786caf568a340 plays a role here being introduced in 23.7.6 like clockwork... Will post a debug/workaround soon.

fichtner commented 3 months ago

@ouafnico can you see if 00bd8b7 works better as a debug option? It has one additional output but I need it in the full context to make sense of it.

# opnsense-patch 00bd8b7

Cheers, Franco

ouafnico commented 3 months ago

Hello @fichtner

I've patched one of the affected machines.

The firewall was in the bugged state, as usual, when I applied the patch.

Should I wait for the bug to reappear and then send you the general logs, or something else?

fichtner commented 3 months ago

@ouafnico wait for it to reappear, report back in 2-3 days if it doesn't reappear too. I've added a debug line for every monitor it tries to restart and I think it won't restart the one we need it to (but the restart for the time being in this patch will try to fix all so it might be a workaround already). Thanks!

ouafnico commented 3 months ago

The bug came back quickly.

Your log entry does not seem to appear.

fichtner commented 3 months ago

I think I know why...

https://github.com/opnsense/core/blob/79312f47eabd8d55eb776dc6f25bd19d911e0e5e/src/etc/inc/plugins.inc.d/dpinger.inc#L292

Leads to

https://github.com/opnsense/core/blob/79312f47eabd8d55eb776dc6f25bd19d911e0e5e/src/opnsense/service/conf/actions.d/actions_interface.conf#L135

Which ends up not reloading the gateway monitors

https://github.com/opnsense/core/blob/79312f47eabd8d55eb776dc6f25bd19d911e0e5e/src/etc/rc.routing_configure#L38

I'm not sure how to fix because the reload avoidance is there for a reason, but I understand where the issue comes from. Not clear on the link monitor failing, which is the actual cause but might be something we have to work with.

I'll think more about it and report back.

Cheers, Franco

ouafnico commented 1 month ago

Hello @fichtner

Do you have any news on this part?

Cheers,

fichtner commented 1 month ago

@ouafnico sorry I couldn't chase this due to 24.7 related work but I will take another look now.

fichtner commented 2 weeks ago

@ouafnico could you try f02f580?

# opnsense-patch f02f580

meganie commented 2 weeks ago

@fichtner This worked for me. Thanks!

Before:

2024-08-20T04:00:07 Notice  dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: none -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)    
2024-08-20T04:00:03 Notice  dpinger Reloaded gateway watcher configuration on SIGHUP    
2024-08-20T04:00:03 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.65.39 identifier "WAN_GW "  
2024-08-20T04:00:02 Notice  dpinger Reloaded gateway watcher configuration on SIGHUP    
2024-08-20T04:00:02 Warning dpinger exiting on signal 15    
2024-08-20T04:00:02 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65    
2024-08-20T04:00:01 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65    
2024-08-20T04:00:00 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65

After:

2024-08-23T04:00:18 Notice  dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: down -> none RTT: 13.6 ms RTTd: 0.3 ms Loss: 0.0 %) 
2024-08-23T04:00:18 Notice  dpinger Reloaded gateway watcher configuration on SIGHUP    
2024-08-23T04:00:08 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.13.118 identifier "WAN_GW " 
2024-08-23T04:00:08 Warning dpinger exiting on signal 15    
2024-08-23T04:00:07 Notice  dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: none -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)    
2024-08-23T04:00:03 Notice  dpinger Reloaded gateway watcher configuration on SIGHUP    
2024-08-23T04:00:03 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.13.118 identifier "WAN_GW " 
2024-08-23T04:00:02 Notice  dpinger Reloaded gateway watcher configuration on SIGHUP    
2024-08-23T04:00:02 Warning dpinger exiting on signal 15    
2024-08-23T04:00:01 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65    
2024-08-23T04:00:00 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
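
The difference between the two excerpts is the final `down -> none` alert in the second one. A short Python sketch can pull the alarm transitions out of such lines and report the most recent state per gateway. It assumes the exact `dpinger ALERT:` layout shown above with ISO timestamps in the first column; the function names are hypothetical and other log formats are not handled.

```python
import re
from typing import Iterable, Optional

# Matches lines such as:
# 2024-08-23T04:00:18 Notice  dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: down -> none RTT: ...)
ALERT_RE = re.compile(
    r"^(?P<time>\S+)\s+\S+\s+dpinger\s+ALERT:\s+(?P<gateway>\S+)\s+"
    r"\(Addr:\s+(?P<addr>\S+)\s+Alarm:\s+(?P<old>\S+)\s+->\s+(?P<new>\S+)\s+"
    r"RTT:\s+(?P<rtt>[\d.]+)\s+ms\s+RTTd:\s+(?P<rttd>[\d.]+)\s+ms\s+"
    r"Loss:\s+(?P<loss>[\d.]+)\s+%\)"
)


def parse_alert(line: str) -> Optional[dict]:
    """Return the fields of a dpinger ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    for key in ("rtt", "rttd", "loss"):
        alert[key] = float(alert[key])
    return alert


def latest_alarm_state(lines: Iterable[str], gateway: str) -> Optional[str]:
    """Return the newest alarm state logged for a gateway, or None if none was logged."""
    alerts = [a for a in map(parse_alert, lines) if a and a["gateway"] == gateway]
    if not alerts:
        return None
    # ISO 8601 timestamps sort correctly as plain strings.
    return max(alerts, key=lambda a: a["time"])["new"]
```

Applied to the "Before" excerpt this yields `down` for `WAN_GW`, and for the "After" excerpt it yields `none`, which is the recovery the patch was meant to produce.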

fichtner commented 2 weeks ago

@meganie thanks I'll add this to the development version for further testing then 👍

ouafnico commented 1 week ago

> @ouafnico could you try f02f580?
> # opnsense-patch f02f580

Checking it ;)

ouafnico commented 1 week ago

@fichtner Unless I did something wrong, it did not work on my side.

fichtner commented 1 week ago

@ouafnico ok thanks so far!

@AdSchellevis would it make sense to offer this via cron with 0c9d8c9 in place?

https://github.com/opnsense/core/blob/0b42c910c4ab3b5869d7acc00d43a61de352920c/src/opnsense/service/conf/actions.d/actions_interface.conf#L136-L140

AdSchellevis commented 1 week ago

@fichtner I don't know, what would it solve?

fichtner commented 1 week ago

Getting dpinger unstuck in these cases. Could also do it without the routing restart, but I thought why not this script. It would be a permanent workaround for people having these issues. There are plenty more in the forum, e.g. https://forum.opnsense.org/index.php?topic=42330.0

fichtner commented 1 week ago

To be precise: offer a description for the action only.

AdSchellevis commented 1 week ago

@fichtner I don't mind adding a description and letting people schedule it. I was just wondering why we would want it; as a workaround for undetectable issues it's fine.

fichtner commented 1 week ago

https://github.com/opnsense/core/commit/c0bee56c10 then, just needs a minor docs note

note this is a workaround. I'll keep the ticket open.
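
For readers who want to experiment before the cron action is available, the idea of a scheduled "unstick" job can be sketched as follows. This is a hypothetical Python illustration, not the action added in the commit above: it restarts the monitors only when a gateway is reported down while its monitor address still answers an independent probe, which matches the symptom described earlier in this thread. `can_reach` and `restart_monitors` are placeholders the caller must supply (for example a ping wrapper, and a call to `pluginctl -c monitor`).

```python
from typing import Callable, List


def gateways_to_unstick(status: dict, can_reach: Callable[[str], bool]) -> List[str]:
    """Gateways reported down whose monitor address still answers a probe."""
    stuck = []
    for name, gateway in status.get("dpinger", {}).items():
        monitor = gateway.get("monitor", "~")
        if monitor == "~" or gateway.get("status") != "down":
            continue  # unmonitored gateway, or not reported as down
        if can_reach(monitor):
            stuck.append(name)
    return stuck


def watchdog(status: dict,
             can_reach: Callable[[str], bool],
             restart_monitors: Callable[[], None]) -> List[str]:
    """Restart the monitors once if at least one gateway looks stuck."""
    stuck = gateways_to_unstick(status, can_reach)
    if stuck:
        restart_monitors()
    return stuck
```

Requiring the independent probe keeps the job from restarting the monitors over and over while a link is really down.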

fichtner commented 1 week ago

I messed up the whole thing due to refactoring after testing this effectively and I noticed while adding the documentation for it https://github.com/opnsense/docs/commit/69ef46844

Will prepare a proper backport for 24.7.2. Maybe it works after all.

fichtner commented 1 week ago

https://github.com/opnsense/core/commit/b7331952f3

# opnsense-revert opnsense && opnsense-patch b7331952f3 && service configd restart

Please try again :) Cron job is there as well but first without...

ouafnico commented 1 week ago

opnsense-revert opnsense && opnsense-patch b7331952f3 && service configd reload

Tried it too, and at the end I got:

=====
Message from opnsense-23.7.12_5:

--
Beep! Beep!
Fetched b7331952f3 via https://github.com/opnsense/core
1 out of 1 hunks failed while patching opnsense/service/conf/actions.d/actions_interface.conf

fichtner commented 1 week ago

this is for 24.7.2 actually

fichtner commented 1 week ago

one sec :)

fichtner commented 1 week ago

You likely just need 105ecf9a5af809 as well:

# opnsense-revert opnsense && opnsense-patch 105ecf9 b733195 && service configd restart

ouafnico commented 1 week ago

opnsense-revert opnsense && opnsense-patch 105ecf9 b733195 && service configd restart

Ok this one worked.

Do I need to test the gateways again?

fichtner commented 1 week ago

Yes please, try to see if it makes a difference just with these things applied.

Gateway monitoring dies randomly #7027