I have done some further investigations, verified the documentation carefully (Trigger Level Explained) and can conclude that the implementation does not follow the documentation:

Trigger Level Explained
Member Down: Triggers when the gateway has 100% packet loss.
Packet Loss: Triggers when the packet loss to a gateway is higher than the defined threshold.
High Latency: Triggers when the latency to a gateway is higher than its defined threshold.
Packet Loss or High Latency: Triggers for either of the above conditions.

Observation
I spent some time, dug into the code, and it became clear what's going on.
On a dpinger alarm, the monitor syshook runs:

configctl filter reload skip_alias

The next step is reloading the filter rules:
https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/rc.syshook.d/monitor/10-dpinger
https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/rc.filter_configure
https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/inc/filter.inc#L154
Here we set up the gateways: https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/inc/filter.lib.inc#L85
pluginctl -r 'return_gateways_status'
The current values are retrieved from dpinger and compared to the thresholds set up per gateway. This gives the status per gateway:
Status is down (corresponds to offline), loss or delay (corresponds to warning), or none (online).

Now we can explain the observed behaviour:
$is_up = false;
$is_up = true;
Case B)
a1) Gateway group trigger downlatency -> $is_up = stristr($gw_group->trigger, 'latency') === false; -> results in $is_up = false;
a2) Gateway group trigger downloss -> $is_up = stristr($gw_group->trigger, 'loss') === false; -> results in $is_up = false;
b) Gateway group trigger down -> default -> $is_up = true;
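Pieced together, the evaluation in getGroups reads roughly like the sketch below. Only the stristr() comparisons are quoted from the code linked above; the helper member_is_up() and its parameters are made up for this illustration, so treat it as a simplified reconstruction rather than the verbatim implementation:

<?php
// Simplified reconstruction (not the verbatim OPNsense code): evaluate one
// gateway group member from its dpinger-derived status and the group trigger.
function member_is_up(string $status, string $trigger): bool
{
    switch ($status) {                                  // 'down', 'loss', 'delay' or 'none'
        case 'down':
            return false;                               // offline always counts as down
        case 'loss':
            return stristr($trigger, 'loss') === false;    // only triggers containing 'loss' mark it down
        case 'delay':
            return stristr($trigger, 'latency') === false; // only triggers containing 'latency' mark it down
        default:
            return true;                                // 'none' -> online
    }
}

var_dump(member_is_up('delay', 'downlatency'));  // false -> matches case a1) above
var_dump(member_is_up('loss', 'down'));          // true  -> matches case b) above, trigger 'down' stays up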
Test case
configctl filter reload skip_alias
Conclusion
There is obviously an incongruence between the documentation, the dpinger Alarm/Clear behaviour and the public function getGroups().
I've also noticed this behavior, and it looks to have been going on for a while: https://forum.opnsense.org/index.php?topic=31147.0;prev_next=prev#new
@iqt4 while reading briefly through the writeup, am I right in assuming you miss an event when the gateway comes back after a failure? (Not sure if dpinger supports that or if we miss anything, but it helps classify the ticket.) So rerunning a filter reload after the first gateway restores would bring you the situation you're looking for in that case.
@AdSchellevis I'm not really missing an event. dpinger has only one trigger level, but in your gateway definition there are two thresholds defined. The problem is in Case B); thus, re-running the filter reload won't help (see a1, a2).
Edit: One more thought. I assume there is a fundamental design flaw, as the trigger levels (dpinger) are defined per gateway, but the decision is independently defined per gateway group. Example: dpinger fires Alarm, e.g. due to high latency, but the gateway group is set to trigger level Packet Loss only. Result: the respective gateway stays up. Now we get increased packet loss, but as dpinger is still in alarm state, you won't get another Alarm trigger and the gateway group state is not consistent.
Proposal: Keep it simple! If dpinger fires Alarm, the respective gateway is down; on Clear, the gateway is up again. In the gateway group, just define failover or load balancing mode.
I'm having difficulties figuring out which change you're looking for, if you could open a PR for discussion (either on core or docs) that might be helpful.
Please see my previous edit. Do you want me to copy this into a PR?
I agree the concept is likely flawed by design; we didn't change it over the years as far as I know. If you have a proposal for a change, I'm certainly not against a PR, it makes discussion easier. Simplifying the groups might also help refactoring them later to MVC (and API-enabling them).
@AdSchellevis: Opening a PR requires having a new branch, which I'm not able to provide in that case. There are fundamental questions that have to be clarified, e.g., do you want to keep all four trigger cases or do you want to simplify the scenarios? Proposal: Simplify it - go for the single case Member Down (means offline in the dashboard) and use the dpinger Alarm/Clear signals. Details are defined on the gateway level (dpinger parameters for loss and latency). Btw. why is it called Multi WAN - shouldn't it be Multi Gateway!?
I'm willing to support, but I'm not in the position to decide, and refactoring to MVC goes beyond my knowledge.
Best Dirk
We might need to postpone in that case; with a proposal we can discuss how/if that affects others, without it we probably have to wait until we plan to refactor to MVC ourselves and work through the scenarios then. I agree there's probably an improvement possible, our time is just limited and we do have to choose our priorities.
I'm taking a brief look here and the documentation should say:
Member Down: Triggers when the gateway is higher than the threshold for high packet loss or latency.
Packet Loss: Triggers when the packet loss to a gateway is higher than the respective low threshold.
Packet Latency: Triggers when the latency to a gateway is higher than the respective low threshold.
Packet Loss or [...] Latency: Triggers for either of the above conditions.
From that perspective the code makes sense from both sides (dpinger and getGroup) so I'm unsure what is missing / not working correctly.
Cheers, Franco
Thanks Franco for joining the discussion and clarifying the documentation. Highly appreciated.
In the current design, the dpinger alert functionality is based on the high thresholds. I.e. if you set the gateway group to Packet Loss/High Latency, which are based on the low thresholds, dpinger will not alert unless the gateway exceeds the high thresholds as well. In the other case, when the gateway recovers (drops below the high thresholds), the result depends (see https://github.com/opnsense/core/issues/6231#issuecomment-1374828144).
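To make this concrete with made-up numbers: assume a gateway with a low latency threshold of 200 ms and a high latency threshold of 500 ms, and a group trigger of High Latency. A measured latency of 300 ms exceeds the low threshold the group trigger is based on, but dpinger only alarms on the 500 ms high threshold, so no alarm fires and the group is never re-evaluated.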
I can easily reproduce the behaviour in GNS3.
With your explanation, the proposal to circumvent the issue would be not to use the dpinger alert command (-C /usr/local/etc/rc.syshook monitor), but instead use e.g. cron to regularly verify the gateway status and update the firewall rules if thresholds were passed (not sure whether configctl filter reload skip_alias would be overkill or needs some tweaking).
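A minimal sketch of such a periodic check, using the two commands quoted in this thread (pluginctl -r 'return_gateways_status' and configctl filter reload skip_alias); the JSON handling, the status keys and the state file are assumptions for illustration only:

<?php
// Hedged sketch only (e.g. run from cron): poll the gateway status and reload
// the filter when any gateway changed state. Both shell commands are the ones
// quoted in this thread; the JSON shape and the state file are assumptions.
$raw = (string)shell_exec("pluginctl -r 'return_gateways_status'");
$status = json_decode($raw, true) ?? [];
$previous = json_decode((string)@file_get_contents('/tmp/gw_poll_state.json'), true) ?? [];

$current = [];
$changed = false;
foreach ($status as $name => $gw) {
    $state = $gw['status'] ?? 'none';                // e.g. 'down', 'loss', 'delay', 'none'
    $current[$name] = $state;
    if (($previous[$name] ?? null) !== $state) {
        $changed = true;                             // this gateway changed since the last run
    }
}
file_put_contents('/tmp/gw_poll_state.json', json_encode($current));

if ($changed) {
    // Re-evaluate the gateway groups and rewrite the policy routing rules
    shell_exec('configctl filter reload skip_alias');
}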
Cheers Dirk
Hi Dirk,
I see what you mean now, latencylow and losslow are not passed to dpinger, likely not even supported. Must have been from the apinger days. Now I'm unsure what "down" for dpinger even means. The easiest way to describe it would be member down and packet loss/latency cases being the same? Or member down meaning both loss and latency must trigger?
Cheers, Franco
PS: I favour member down being loss && delay so we still have 4 separate cases. I do think, however, that the whole latency/loss thing is a bit pointless the way dpinger implements it.
Dpinger 3.2:
alarm_flag is set to 1 if either latency or loss is in alarm state
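As a minimal illustration of the two readings being discussed (the boolean variable names are made up for this example):

<?php
// Illustration only: both flags stand for "the respective high threshold is exceeded".
$latency_alarm = true;   // latency above its threshold
$loss_alarm = false;     // loss below its threshold

$alarm_flag = $latency_alarm || $loss_alarm;   // dpinger 3.2 as quoted above: either one trips -> true here
$member_down = $latency_alarm && $loss_alarm;  // "member down" as loss && delay: both must trip -> false here

var_dump($alarm_flag, $member_down);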
My initial proposal is already described here: https://github.com/opnsense/core/issues/6231#issuecomment-1435913675
In consequence, if you try to handle loss and delay separately with your own code, you will be in trouble again. Example: dpinger is already in alarm state due to latency; when the packet loss later rises above its threshold, no new alarm is fired, so the separate loss handling never triggers.
It’s starting to look like dpinger was never a replacement for apinger then. I’m unsure if changing all of this makes sense for a minor iteration, and having the low thresholds as GUI-only gimmicks seems counter-intuitive. In that case they should be removed and the terminology simplified.
Cheers, Franco
Don't know about apinger, but let's consider the topic from a requirement/use-case perspective, which is in the end high availability on the network layer with a) failover, b) load balancing. A gateway status is up or down - I would keep it simple (other implementations, e.g. RouterOS, behave similarly). The question now is, what does up/down mean? For these use cases it depends on the dpinger results, and the user can set the loss/latency parameters (which may depend on the line - DSL, fibre, cable, provider ...).
Another aspect concerns existing connections when the gateway status changes. I have seen many discussions in the forum. I assume a connection stays on its initial route/gateway until it dies/times out. There is no enforcement yet to switch the gateway - correct? This would be a new feature.
With respect to an intuitive solution, I feel this is implemented more from a technical rather than the user perspective: the gateways (groups) are defined in System: Gateway: Single (Group). The gateway switching option is in System: Settings: General. Multi-WAN/Gateway Monitoring options are in Firewall: Settings: Advanced. Maybe the gateway-specific options can be consolidated under System: Gateway: Options.
Just my thoughts Dirk
Let's try to keep this simple for now: if latencylow and losslow aren't used, remove those and simplify as Franco suggested. A larger overhaul of this section isn't planned on our end at the moment and likely doesn't have priority either. The previous program (apinger) did require different parameters, which lost their usefulness back in 2018 when we moved to dpinger, it seems.
The upstream documentation for dpinger (https://github.com/dennypage/dpinger) notes loss and latency triggers, which relate to the high numbers on our end.
As dpinger doesn't report anything other than:

<Average Latency in μs> <Standard Deviation in μs> <Percentage of Loss>

According to their documentation, down doesn't really exist in this case, but might be best described as high loss and high latency (in which case that option can safely be removed from the gateway).
When the trigger levels match the actual ones dpinger supports, logically the groups themselves will be configured properly as well when choosing only these (high latency, high loss, down (both)).
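For illustration, a minimal sketch of turning one such report line into the status values used earlier in this thread (down / loss / delay / none); the function name, the thresholds and the classification order are made up for this example:

<?php
// Hedged sketch: parse one dpinger report line of the quoted form
// "<latency_us> <stddev_us> <loss_pct>" and derive a status from it.
// Thresholds and classification order are placeholders, not OPNsense defaults.
function classify_report(string $line, int $latency_high_ms, int $loss_high_pct): string
{
    [$latency_us, $stddev_us, $loss_pct] = array_map('intval', preg_split('/\s+/', trim($line)));

    if ($loss_pct >= 100) {
        return 'down';                                  // 100% loss treated as member down (one possible reading)
    } elseif ($loss_pct > $loss_high_pct) {
        return 'loss';
    } elseif (($latency_us / 1000) > $latency_high_ms) {
        return 'delay';
    }
    return 'none';
}

// 502 ms latency with 0% loss against 400 ms / 20% thresholds -> 'delay'
echo classify_report('502128 312011 0', 400, 20) . PHP_EOL;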
Best regards,
Ad
@AdSchellevis: [Let's try to keep this simple ...] --> agreed
Most likely the following - as already implemented - satisfies the requirements.
Member Down: Triggers when the gateway is higher than the threshold for high packet loss or latency.
All other Packet Loss/High Latency possibilities are not needed. In consequence, Trigger Level is no longer an option.
I'm not in the position to take decisions but I would conclude the following:
Cheers Dirk
The funny part about all of this is that the code we have is not wrong and works on the metrics given by dpinger; it's just that its alarm policy is unworkable. We might be better off disabling the -C option and writing our own small handler instead to trigger /usr/local/etc/rc.syshook monitor.
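A minimal sketch of what such a handler could look like, assuming it receives per-gateway report values and the high thresholds; the state file and the argument layout passed to the syshook are assumptions based on the 10-dpinger script quoted further down:

<?php
// Hedged sketch: replacement for dpinger's -C callback. Compare the latest
// report against the high thresholds and invoke the monitor syshook only on
// an alarm transition. File name and argument order are assumptions.
function handle_report(
    string $gateway, string $addr, int $latency_us, int $stddev_us, int $loss_pct,
    int $latency_high_ms, int $loss_high_pct
): void {
    $alarm = (($latency_us / 1000) > $latency_high_ms || $loss_pct > $loss_high_pct) ? 1 : 0;

    $state_file = "/tmp/monitor_state_{$gateway}";
    $previous = (int)@file_get_contents($state_file);
    file_put_contents($state_file, (string)$alarm);

    if ($alarm !== $previous) {
        // Roughly the same argument layout as the logger line in 10-dpinger
        shell_exec(sprintf(
            '/usr/local/etc/rc.syshook monitor %s %s %d %d %d %d',
            escapeshellarg($gateway), escapeshellarg($addr), $alarm, $latency_us, $stddev_us, $loss_pct
        ));
    }
}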
Sounds good?!
Cheers, Franco
@iqt4 maybe you can take a look at c5b8dc4 and share your thoughts?
Thanks, Franco
@fichtner sorry Franco, I have been ooo for a while. As far as I understand, instead of using the dpinger trigger (which does not cover all the trigger cases of a gateway group) you have written the standalone script gateway_alarm.php. I have reviewed it in https://github.com/opnsense/core/commit/c5b8dc483638253cf24f440cde54e939c4b13482 and made a few remarks in the code - I hope you can read them.
Overall, I assume this design does not thoroughly solve the issue. gateway_alarm.php is run per gateway, not per gateway group. The trigger to switch a gateway is by design a feature of the gateway group.
Maybe running gateway_alarm.php just once is sufficient, containing the following steps: trigger /usr/local/etc/rc.syshook monitor; if the relevant config (default gw switching or gw groups) changes, restart gateway_alarm.php.
My thoughts as requested - Cheers Dirk
Thanks for your valuable feedback! For now I would hope to avoid moving the goalposts too much and seek a quick replacement for the current limitation you originally reported (which is still the main reason to change something here). The POC was ok, but handling each daemon separately isn't very elegant, so we decided to merge the polling into a single service. The only time this service is not required is when no gateways are being monitored through dpinger. Otherwise it will (regardless of the used settings) trigger the monitor facility as before, including the currently missing triggers.
However, the monitoring facility is already too late to react operationally (losing too much context), so we are going to move the code triggers into the PHP alarm script as well. This would avoid spurious alarm-based reloads and keep the monitor syshook for everyone else using it.
The first goal is to do the right thing for gateway switching and get it into master, then work on the gateway groups part further, but I'd like to avoid multiple threads and wait intervals per monitor, as it would increase complexity and likely introduce new bugs that may be unnecessary for the overall functionality during operation.
Cheers, Franco
@iqt4 I've pushed the work to master branch and it will be featured in the development version of 23.1.8 (not the community one). Essentially it's one additional service being spawned watching over all the dpinger reports and triggering the alarm script conditionally while dpinger no longer gets the opportunity to report an alarm itself. If you can look at it in the next weeks that would be helpful. It'll likely go into 23.7 with a bit more tweaking until then.
Cheers, Franco
@fichtner Sorry for the late response. How can I get the code? root@OPNsense:~ # opnsense-patch ...
Switch to the development release type in the firmware GUI, save, then check for updates and install.
Thanks for the explanation. I have tested the functionality on today's development version (OPNsense 23.7.b_107-amd64) for all four trigger levels (a gateway group with two members: Tier 1 and Tier 2). I didn't find any peculiarities! One observation you may consider regarding logging: as dpinger is no longer the driver for gateway switching, it would be helpful if the new script logged some information.
Sounds promising, thanks!
It logs both “ALERT” and “MONITOR” messages to gateway log. The first is the same as before on triggering the action to reload routes/filter and the second is the same for all gateway transitions that don’t trigger an action.
Describe the bug
I already described the issue in the forum without response (Re: Dual WAN Failover stuck). Thus, I seek support here to analyse and solve the problem.
We have a setup consisting of a WAN network (192.168.0.0/24) with two gateways to the internet (provider A 192.168.0.1 and provider B 192.168.0.2) and other internal networks (10.0.x.0/24). OPNsense interfaces are 192.168.0.10 and 10.0.x.1 respectively. DNS services are provided by one of the internal networks (no Unbound) and OPNsense points to these servers as well.
Gateways are marked as upstream with provider A at priority 127 and provider B at priority 255.
I configured multi WAN with failover according to the documentation (without DNS, Default Gateway Switching disabled): WAN gateway group with provider A as Tier 1 and provider B as Tier 2. Trigger Level: Packet Loss or High Latency.
If gateway A fails, traffic fails over to gateway B as expected. However, if gateway A recovers, as indicated on the dashboard or in System: Gateways: Single, traffic still goes out via gateway B. I verified the firewall rules --> they still point to gateway B (not A).
The setup was implemented during 2022 with the then-current OPNsense version and has never worked stably in production.
To Reproduce
Steps to reproduce the behavior:
pass in quick on vtnet0 route-to (vtnet1 192.168.0.1) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"
pass in quick on vtnet0 route-to (vtnet1 192.168.0.2) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"
Gateway log:
<12>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="1"] WAN_GWv4_1 37.209.40.1: Alarm latency 502128us stddev 312011us loss 0%
<13>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 14062 - [meta sequenceId="2"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 1 RTT: 502128us RTTd: 312011us Loss: 0%)
<12>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="3"] WAN_GWv4_1 37.209.40.1: Clear latency 411311us stddev 298042us loss 1%
<13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 36717 - [meta sequenceId="4"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 0 RTT: 411311us RTTd: 298042us Loss: 1%)
Config daemon log:
<13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="1"] [de0c153b-b628-488d-9aca-6dbc676535d1] Reloading filter
<13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="2"] [12ff4b1c-04a6-46ac-afda-1c78ac9be651] request pf current overall table record count and table-entries limit
<13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="3"] [286febe5-4a77-4691-96b1-2e4c32f6d2d4] Reloading filter
<13>1 2022-12-22T22:56:12+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="4"] [da34e743-0c02-45b8-bfe8-e03091f0cd9d] request pf current overall table record count and table-entries limit
Expected behavior
After recovery of the primary gateway A, the firewall rules need to be set up correctly.
Describe alternatives you considered
I set up new installations from scratch:
If I restart the firewall from the OPNsense dashboard or from the command line (configctl filter reload skip_alias), the rules are correct again.
I assumed a timing problem and modified /usr/local/etc/rc.syshook.d/monitor/10-dpinger with a delay --> no success
/usr/bin/logger -t dpinger "GATEWAY ALARM: ${GATEWAY} (Addr: ${2} Alarm: ${3} RTT: ${4}us RTTd: ${5}us Loss: ${6}%)"
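# added delay before reloading the filter to test for a timing problem (see above)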
sleep 3
echo -n "Reloading filter:"
/usr/local/bin/flock -n -E 0 -o /tmp/filter_reload_gateway.lock configctl filter reload skip_alias
Screenshots
Not relevant
Relevant log files
See above
Environment
OPNsense 22.7.10_2-amd64 (KVM)