Open ouafnico opened 9 months ago
Any idea?
The same problem appeared again today. I checked on the firewall: it is pinging the monitored IP, but the gateway interface still shows "100% LOSS".
Edit: I opened the gateway, changed nothing, saved and applied: the loss is gone and the gateway is back.
I seem to recall there were a number of gateway monitoring changes - in .5, .6, etc.
https://github.com/opnsense/changelog/commit/5a1a69863eb136361117c785e45e6c9a8133ab3c https://github.com/opnsense/changelog/commit/a6845075b04722c872ae93e270afb0795ae710c2
Commits to changelog:
https://github.com/opnsense/changelog/commits/master
As with all problems, upgrading to the latest stable release is usually a good first step. You note you're on .4 while .9 is the latest.
Then, maybe try adding 'disable host route' on the gateway.
I hit the bug on 23.7.6, every week.
My point is this: from what you've noted, you're not running the latest version with the latest fixes.
I can't see that this report will attract much attention, until you are.
I misread the latest version. I'm trying with .9.
I'm running OPNsense 23.7.12-amd64 and also have the same issue.
One of my gateway monitors randomly dies, even though it should work.
Edit: Issue is also present with OPNsense 24.1_1-amd64
I tried different versions of OPNsense, including the latest one, and got the same problem. Disabling the host route is not a good option in my case.
I actually also have the issue that, if I disable the host route, after some time the monitor reports 50-80% packet loss (on just that IPv4 gateway; the corresponding IPv6 gateway and any other gateways are unaffected).
Completely removing the remote monitor IP seems to work OK as a workaround (so I use the gateway itself as the ping address).
But this is not optimal for checking connectivity.
Edit: Guess this didn't actually work. The gateway still times out randomly with my supposed workaround above. It took about 24h this time.
Any idea how we can address this issue?
It happens a lot on a router where the ADSL line is a bit flaky: every time the line drops and comes back, the gateway gets stuck and won't come back up.
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.
This issue is still present.
I don't doubt it, but I'm unable to tell from the little info about your WAN setup in general and not sure how to reproduce.
Nothing is really random here. There should at least be log messages correlating to the time when this stops working.
There are two firewalls, connected over a Juniper virtual chassis with LACP. WAN is a dedicated VLAN with a classic network such as 192.168.8.0/24. Firewall 1 is 192.168.8.251 and firewall 2 is .252. A CARP VIP of .254 is shared between them.
The first gateway is 192.168.8.2 (dedicated ADSL), the second is 192.168.8.1 (4G).
The first gateway is configured with a monitoring IP out on the Internet (one of our public IPs; always up) and priority 10. The second gateway is configured with "Disable gateway monitoring" checked and priority 20. Both have "Upstream gateway" checked.
A gateway group is created with both gateways, the first as "Tier 1", the second as "Tier 2".
What happens: when the first gateway dies (which happens sometimes) and comes back, the gateway stays marked down in OPNsense, until a reboot or a "fake edit" of the gateway (open it, modify nothing, save and apply).
The only logs I found are these, saying the (first) gateway is down, while in reality it is not (the ADSL line is up). From OPNsense, the public IP used for monitoring does not respond, but responds again after the fake edit described above.
Let me know if I can provide more info.
Ok, so it means you use gw_switch_default (System: Settings: General: Gateway switching). Is WAN_GW connected to opt1 or another interface? It looks like the default switching isn't doing anything, since it never switches away from opt1?
This issue has been automatically timed-out (after 180 days of inactivity).
For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.
If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.
Yes, I use gateway switching. WAN_GW is connected to opt1; WAN_GW is the first gateway, and the second is attached to opt1 too.
Sorry about the bot. I'll reopen and hopefully block it from closing again at this point.
So you have two gateways for the same interface? I can see where the trouble could start here.
Yes. We have a lot of devices set up like this, but some don't use the same interface for multiple gateways; I need to check whether the problem happens there too.
I'm not entirely sure if it doesn't end up tripping over itself here, but the notion that this did work in the past would indicate we can fix it. Just need to find the era where this is happening.
When this gets stuck can you do the following?
Collect data from:
# pluginctl -r return_gateways_status
Restart the monitors:
# pluginctl -c monitor
Collect data again (hopefully in a working state):
# pluginctl -r return_gateways_status
Cheers, Franco
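For convenience, the three steps above can be wrapped in one small script. This is only a sketch of the diagnostic procedure, assuming it runs on an OPNsense host where `pluginctl` is on the PATH (it degrades gracefully elsewhere); the function name is made up for illustration:

```python
import shutil
import subprocess

def cycle_gateway_monitors() -> dict:
    """Collect dpinger status, restart the monitors, collect again.

    Sketch of the diagnostic steps above; only meaningful on an
    OPNsense host, where pluginctl exists.
    """
    if shutil.which("pluginctl") is None:
        # Not an OPNsense box; nothing to diagnose here.
        return {"error": "pluginctl not found; run this on an OPNsense host"}

    def plugin(*args: str) -> str:
        result = subprocess.run(["pluginctl", *args],
                                capture_output=True, text=True)
        return result.stdout

    before = plugin("-r", "return_gateways_status")  # stuck state
    plugin("-c", "monitor")                          # restart the monitors
    after = plugin("-r", "return_gateways_status")   # hopefully working state
    return {"before": before, "after": after}

print(cycle_gateway_monitors())
```

Comparing the two captured dumps should show whether the restart cleared the bogus 100% loss.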
Sure, I'm actually seeing the same issue right now on another box.
{
"dpinger": {
"WAN_GW": {
"status": "down",
"monitor": "xxx.xxx.xxx.xxx", #voluntary hidden
"name": "WAN_GW",
"stddev": "0.0 ms",
"delay": "0.0 ms",
"loss": "100.0 %"
},
"WAN_GW_4G": {
"status": "none",
"monitor": "~",
"name": "WAN_GW_4G",
"stddev": "~",
"delay": "~",
"loss": "~"
}
}
}
After restart:
{
"dpinger": {
"WAN_GW": {
"status": "none",
"monitor": "xxx.xxx.xxx.xxx", #voluntary hidden
"name": "WAN_GW",
"stddev": "0.2 ms",
"delay": "11.2 ms",
"loss": "0.0 %"
},
"WAN_GW_4G": {
"status": "none",
"monitor": "~",
"name": "WAN_GW_4G",
"stddev": "~",
"delay": "~",
"loss": "~"
}
}
}
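The stuck state in the first dump can also be detected programmatically. Below is a minimal sketch; the field names are taken from the output above, while the monitor address is a documentation placeholder (192.0.2.1), not the real hidden IP:

```python
import json

def stuck_monitors(status: dict) -> list:
    """Return gateway names whose dpinger entry reports total packet loss.

    `status` is the parsed JSON from `pluginctl -r return_gateways_status`;
    the key layout matches the dumps above. Gateways without a monitor
    (loss == "~") are skipped.
    """
    stuck = []
    for name, gw in status.get("dpinger", {}).items():
        loss = gw.get("loss", "~")
        if loss == "~":
            continue  # monitoring disabled for this gateway
        if float(loss.rstrip(" %")) >= 100.0:
            stuck.append(name)
    return stuck

# Sample modeled on the first dump above; 192.0.2.1 stands in for the
# voluntarily hidden monitor address.
before = json.loads("""
{
  "dpinger": {
    "WAN_GW": {"status": "down", "monitor": "192.0.2.1",
               "stddev": "0.0 ms", "delay": "0.0 ms", "loss": "100.0 %"},
    "WAN_GW_4G": {"status": "none", "monitor": "~",
                  "stddev": "~", "delay": "~", "loss": "~"}
  }
}
""")
print(stuck_monitors(before))  # ['WAN_GW']
```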
Ok, I think when it switches it may kill the route used by WAN_GW, causing it to malfunction, so a restart is missing somewhere.
Does that pluginctl -c monitor fix the condition btw?
Hmm, interesting.
Yes, it fixed the issue.
I'm assuming https://github.com/opnsense/core/commit/3786caf568a340 plays a role here being introduced in 23.7.6 like clockwork... Will post a debug/workaround soon.
@ouafnico can you see if 00bd8b7 works better as a debug option? It has one additional output but I need it in the full context to make sense of it.
# opnsense-patch 00bd8b7
Cheers, Franco
Hello @fichtner
I've patched one machine affected by the problem.
The firewall was stuck as usual when I pulled the patch.
Should I wait for the bug to reappear and send you the general logs, or something else?
@ouafnico wait for it to reappear, report back in 2-3 days if it doesn't reappear too. I've added a debug line for every monitor it tries to restart and I think it won't restart the one we need it to (but the restart for the time being in this patch will try to fix all so it might be a workaround already). Thanks!
The bug came back quickly.
Your log entry does not seem to appear.
I think I know why...
Leads to
Which ends up not reloading the gateway monitors
I'm not sure how to fix because the reload avoidance is there for a reason, but I understand where the issue comes from. Not clear on the link monitor failing, which is the actual cause but might be something we have to work with.
I'll think more about it and report back.
Cheers, Franco
Hello @fichtner
Do you have any news on this part?
Cheers,
@ouafnico sorry I couldn't chase this due to 24.7 related work but I will take another look now.
@ouafnico could you try f02f580?
# opnsense-patch f02f580
@fichtner This worked for me. Thanks!
Before:
2024-08-20T04:00:07 Notice dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: none -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)
2024-08-20T04:00:03 Notice dpinger Reloaded gateway watcher configuration on SIGHUP
2024-08-20T04:00:03 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.65.39 identifier "WAN_GW "
2024-08-20T04:00:02 Notice dpinger Reloaded gateway watcher configuration on SIGHUP
2024-08-20T04:00:02 Warning dpinger exiting on signal 15
2024-08-20T04:00:02 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
2024-08-20T04:00:01 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
2024-08-20T04:00:00 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
After:
2024-08-23T04:00:18 Notice dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: down -> none RTT: 13.6 ms RTTd: 0.3 ms Loss: 0.0 %)
2024-08-23T04:00:18 Notice dpinger Reloaded gateway watcher configuration on SIGHUP
2024-08-23T04:00:08 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.13.118 identifier "WAN_GW "
2024-08-23T04:00:08 Warning dpinger exiting on signal 15
2024-08-23T04:00:07 Notice dpinger ALERT: WAN_GW (Addr: 8.8.8.8 Alarm: none -> down RTT: 0.0 ms RTTd: 0.0 ms Loss: 100.0 %)
2024-08-23T04:00:03 Notice dpinger Reloaded gateway watcher configuration on SIGHUP
2024-08-23T04:00:03 Warning dpinger send_interval 1000ms loss_interval 4000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 0ms loss_alarm 0% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 2.241.13.118 identifier "WAN_GW "
2024-08-23T04:00:02 Notice dpinger Reloaded gateway watcher configuration on SIGHUP
2024-08-23T04:00:02 Warning dpinger exiting on signal 15
2024-08-23T04:00:01 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
2024-08-23T04:00:00 Warning dpinger WAN_GW 8.8.8.8: sendto error: 65
@meganie thanks I'll add this to the development version for further testing then 👍
Checking it ;)
@fichtner Unless I did something wrong, it did not work on my side.
@ouafnico ok thanks so far!
@AdSchellevis would it make sense to offer this via cron with 0c9d8c9 in place?
@fichtner I don't know, what would it solve?
Getting dpinger unstuck in these cases. Could also do it without the routing restart, but I thought why not this script. It would be a permanent workaround for people having these issues. There are plenty more in the forum, e.g. https://forum.opnsense.org/index.php?topic=42330.0
To be precise: offer a description for the action only.
@fichtner I don't mind adding a description and let people schedule it, I was just wondering why we would want it, as workaround for undetectable issues it's fine.
https://github.com/opnsense/core/commit/c0bee56c10 then, just needs a minor docs note. Note this is a workaround; I'll keep the ticket open.
I messed up the whole thing due to refactoring after effectively testing it, and I noticed while adding the documentation for it: https://github.com/opnsense/docs/commit/69ef46844
Will prepare a proper backport for 24.7.2. Maybe it works after all.
https://github.com/opnsense/core/commit/b7331952f3
# opnsense-revert opnsense && opnsense-patch b7331952f3 && service configd restart
Please try again :) Cron job is there as well but first without...
opnsense-revert opnsense && opnsense-patch b7331952f3 && service configd reload
Tried that too, and at the end got:
=====
Message from opnsense-23.7.12_5:
--
Beep! Beep!
Fetched b7331952f3 via https://github.com/opnsense/core
1 out of 1 hunks failed while patching opnsense/service/conf/actions.d/actions_interface.conf
this is for 24.7.2 actually
one sec :)
You likely just need 105ecf9a5af809 as well:
# opnsense-revert opnsense && opnsense-patch 105ecf9 b733195 && service configd restart
opnsense-revert opnsense && opnsense-patch 105ecf9 b733195 && service configd restart
Ok this one worked.
Should I test the gateways again?
Yes please, try to see if it makes a difference just with these things applied.
Describe the bug
When using multiple gateways, you can configure each one to ping a specific IP address to validate that the gateway is working.
We see the same behavior on multiple firewalls (latest version as of this date): randomly, a gateway is marked down because the monitoring IP is not responding: the system can't ping it anymore.
To Reproduce
Steps to reproduce the behavior:
Randomly, the gateway goes down and the monitoring IP stops responding.
Restarting the gateway service does not fix anything, but editing the gateway and saving without modifying anything brings the gateway back; the monitoring IP responds again.
Expected behavior
The gateway should stay up as long as the monitoring IP is responding.
Logs
I didn't find anything in the logs related to this "down" gateway.
Environment
Software version used and hardware type if relevant, e.g.:
OPNsense 23.7.4 (amd64). Intel(R) Atom(TM) CPU C3558
The Internet connections concerned are fiber network access or stable VDSL.
Let me know if I can provide anything else.