opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

system: adjust to dpinger reality in latency/loss handling #6231

Closed iqt4 closed 1 year ago

iqt4 commented 1 year ago


Describe the bug
I already described the issue in the forum (Re: Dual WAN Failover stuck) without a response, so I am seeking support here to analyse and solve the problem.

We have a setup consisting of a WAN network (192.168.0.0/24) with two gateways to the internet (provider A 192.168.0.1 and provider B 192.168.0.2) and other internal networks (10.0.x.0/24). OPNsense interfaces are 192.168.0.10 and 10.0.x.1 respectively. DNS services are provided by one of the internal networks (no Unbound) and OPNsense points to these servers as well.

Gateways are marked as upstream, with provider A at priority 127 and provider B at priority 255.

I configured multi WAN with failover according to the documentation (without DNS, Default Gateway Switching disabled): a WAN gateway group with provider A as Tier 1 and provider B as Tier 2, Trigger Level Packet Loss or High Latency.

If gateway A fails, traffic fails over to gateway B as expected. However, when gateway A recovers, as indicated on the dashboard or in System: Gateways: Single, traffic is still going out via gateway B. I verified the firewall rules --> they still point to gateway B (not A).

The setup was implemented during 2022 with the OPNsense version current at the time and has never worked stably in production.

To Reproduce
Steps to reproduce the behavior:

  1. Implement dual WAN failover as explained above. Traffic over gateway A:
     pass in quick on vtnet0 route-to (vtnet1 192.168.0.1) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"
  2. Wait for an outage of gateway A and its recovery. Gateway B gets active and remains in that state:
     pass in quick on vtnet0 route-to (vtnet1 192.168.0.2) inet from any to ! <private> flags S/SA keep state label "f5a781eeb65a44a79c529c6d7ba4cbb6"
     Gateway log:
     <12>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="1"] WAN_GWv4_1 37.209.40.1: Alarm latency 502128us stddev 312011us loss 0%
     <13>1 2022-12-22T22:55:47+01:00 OPNsense.localdomain dpinger 14062 - [meta sequenceId="2"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 1 RTT: 502128us RTTd: 312011us Loss: 0%)
     <12>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 46446 - [meta sequenceId="3"] WAN_GWv4_1 37.209.40.1: Clear latency 411311us stddev 298042us loss 1%
     <13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain dpinger 36717 - [meta sequenceId="4"] GATEWAY ALARM: WAN_GWv4_1 (Addr: 37.209.40.1 Alarm: 0 RTT: 411311us RTTd: 298042us Loss: 1%)
     Config daemon log:
     <13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="1"] [de0c153b-b628-488d-9aca-6dbc676535d1] Reloading filter
     <13>1 2022-12-22T22:55:48+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="2"] [12ff4b1c-04a6-46ac-afda-1c78ac9be651] request pf current overall table record count and table-entries limit
     <13>1 2022-12-22T22:56:11+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="3"] [286febe5-4a77-4691-96b1-2e4c32f6d2d4] Reloading filter
     <13>1 2022-12-22T22:56:12+01:00 OPNsense.localdomain configd.py 196 - [meta sequenceId="4"] [da34e743-0c02-45b8-bfe8-e03091f0cd9d] request pf current overall table record count and table-entries limit

Expected behavior
After recovery of the primary gateway A, the firewall rules need to be set up correctly again.

Describe alternatives you considered
I set up new installations from scratch:

  1. GNS3 environment to test the design --> same issue observed
  2. A test environment within the productive network --> same issue observed

If I restart the firewall from the OPNsense dashboard or run configctl filter reload skip_alias from the command line, the rules are correct again.

I assumed a timing problem and modified /usr/local/etc/rc.syshook.d/monitor/10-dpinger with a delay --> no success:

/usr/bin/logger -t dpinger "GATEWAY ALARM: ${GATEWAY} (Addr: ${2} Alarm: ${3} RTT: ${4}us RTTd: ${5}us Loss: ${6}%)"
sleep 3
echo -n "Reloading filter:"
/usr/local/bin/flock -n -E 0 -o /tmp/filter_reload_gateway.lock configctl filter reload skip_alias

Screenshots
Not relevant

Relevant log files
See above


Environment


OPNsense 22.7.10_2-amd64 on KVM

iqt4 commented 1 year ago

I have done some further investigations, verified the documentation (Trigger Level Explained) carefully, and can conclude that the implementation does not follow the documentation:

Member Down: Triggers when the gateway has 100% packet loss.
Packet Loss: Triggers when the packet loss to a gateway is higher than the defined threshold.
High Latency: Triggers when the latency to a gateway is higher than its defined threshold.
Packet Loss or High Latency: Triggers for either of the above conditions.

Observation

  1. Failover: Member Down triggers even if the gateway loss is not 100%. In all other cases the trigger seems to work as expected.
  2. Recovery: with Member Down the firewall rules are correct after recovery. In all other cases the firewall rules remain in the failover state.
iqt4 commented 1 year ago

I spent some time digging into the code, and it became clear what's going on.

  1. dpinger triggers an alarm depending on the high thresholds set on each individual interface (dpinger parameters -D and -L). Manual trigger via the command line: configctl filter reload skip_alias
  2. Next step is reloading the filter rules:
     https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/rc.syshook.d/monitor/10-dpinger
     https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/rc.filter_configure
     https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/inc/filter.inc#L154

    Here we set the gateways https://github.com/opnsense/core/blob/7a6a1029130504cd310109d492914b9195717c22/src/etc/inc/filter.lib.inc#L85

  3. Deep dive into gateway determination:

Now we can explain the observed behaviour:

  1. dpinger triggers Alarm depending on the high thresholds - independent of the gateway group trigger!
    • Gateway status: as we are above the high thresholds, the gateway status is always down
    • Group gateway: the respective gateway is always down -> $is_up = false;
  2. dpinger triggers Clear once we are below the high thresholds again
    • Gateway status:
      A) the gateway may recover fast -> values below the low thresholds -> none (online)
      B) the gateway may recover slowly -> values above the low thresholds -> loss or delay (corresponds to warning)
    • Group gateway:
      Case A) -> default -> $is_up = true;
      Case B)
        a1) gateway group trigger downlatency -> $is_up = stristr($gw_group->trigger, 'latency') === false; -> results in $is_up = false;
        a2) gateway group trigger downloss -> $is_up = stristr($gw_group->trigger, 'loss') === false; -> results in $is_up = false;
        b) gateway group trigger down -> default -> $is_up = true;
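
To make the decision path above concrete, here is a minimal paraphrase of the described group logic. The function name member_is_up and the status labels ('down', 'loss', 'delay', 'none') are illustrative assumptions; only the stristr() expressions are quoted from the code referenced above, so treat this as a sketch rather than the exact upstream function:

```php
<?php
// Illustrative sketch only: how a gateway group member's up/down state follows
// from the monitor status and the group trigger, per the analysis above.
// $status: 'down' (above high thresholds), 'loss'/'delay' (warning, between low
// and high thresholds) or 'none' (online); $trigger: 'down', 'downloss',
// 'downlatency' or 'downlosslatency'.
function member_is_up(string $status, string $trigger): bool
{
    switch ($status) {
        case 'down':
            return false;                                   // always down above the high thresholds
        case 'loss':
            return stristr($trigger, 'loss') === false;     // down only if the group triggers on loss
        case 'delay':
            return stristr($trigger, 'latency') === false;  // down only if the group triggers on latency
        default:
            return true;                                    // 'none' -> online
    }
}

// Case B) from above: a slowly recovering gateway ('delay') stays down for a
// latency-triggered group, but counts as up for a plain 'down' trigger.
var_dump(member_is_up('delay', 'downlatency')); // bool(false)
var_dump(member_is_up('delay', 'down'));        // bool(true)
```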

Test case

  1. In my GNS3 environment I set the latency (or packet loss) for the Tier 1 gateway between the low and the high threshold.
  2. Reload the filter: configctl filter reload skip_alias
  3. Result: the Tier 1 gateway is considered down and routing goes over the Tier 2 gateway.

Conclusion
There is obviously an incongruence between the documentation, dpinger Alarm/Clear and the public function getGroups.

buzzkillingtonne commented 1 year ago

I've also noticed this behavior, and it looks to have been going on for a while: https://forum.opnsense.org/index.php?topic=31147.0;prev_next=prev#new

AdSchellevis commented 1 year ago

@iqt4 while reading briefly through the writeup, am I right in assuming that you are missing an event when the gateway comes back after a failure? (Not sure if dpinger supports that or if we are missing something, but it helps in classifying the ticket.) So rerunning a filter reload after the first gateway restores brings you the situation you're looking for in that case.

iqt4 commented 1 year ago

@AdSchellevis I'm not really missing an event. dpinger has only one trigger level, but in your gateway definition there are two thresholds defined. The problem is in Case B), so re-running a filter reload won't help (see a1), a2)).

Edit: One more thought. I assume there is a fundamental design flaw, as the trigger levels (dpinger) are defined per gateway but the decision is defined independently per gateway group. Example: dpinger fires Alarm, e.g. due to high latency, but the gateway group is set to trigger level Packet Loss only. Result: the respective gateway stays up. Now we get increased packet loss, but as dpinger is still in the alarm state you won't get another Alarm trigger, and the gateway group state is not consistent.

Proposal: Keep it simple! If dpinger fires Alarm, the respective gateway is down, and on Clear the gateway is up again. In the gateway group just define failover or load-balancing mode.

AdSchellevis commented 1 year ago

I'm having difficulties figuring out which change you're looking for; if you could open a PR for discussion (either on core or docs), that might be helpful.

iqt4 commented 1 year ago

I'm having difficulties figuring out which change you're looking for; if you could open a PR for discussion (either on core or docs), that might be helpful.

Please see my previous edit. Do you want me to copy this into a PR?

AdSchellevis commented 1 year ago

I agree the concept is likely flawed by design; we didn't change it over the years as far as I know. If you have a proposal for a change, I'm certainly not against a PR, it makes discussion easier. Simplifying the groups might also help with refactoring them later to MVC (and API-enabling them).

iqt4 commented 1 year ago

@AdSchellevis: Opening a PR requires having a new branch, which I'm not able to provide in this case. There are fundamental questions that have to be clarified, e.g., do you want to keep all four trigger cases or do you want to simplify the scenarios? Proposal: Simplify it - go for the single case Member Down (which means offline in the dashboard), and use the dpinger Alarm/Clear signals. Details are defined on the gateway level (dpinger parameters for loss and latency). Btw. why is it called Multi WAN - shouldn't it be Multi Gateway!?

I'm willing to support this, but I'm not in a position to decide, and refactoring to MVC goes beyond my knowledge.

Best Dirk

AdSchellevis commented 1 year ago

We might need to postpone in that case. With a proposal we can discuss how/if it affects others; without it we probably have to wait until we plan to refactor to MVC ourselves and work through the scenarios then. I agree there's probably an improvement possible, our time is just limited and we do have to choose our priorities.

fichtner commented 1 year ago

I'm taking a brief look here and the documentation should say:

Member Down: Triggers when the gateway is higher than the threshold for high packet loss or latency.
Packet Loss: Triggers when the packet loss to a gateway is higher than the respective low threshold.
Packet Latency: Triggers when the latency to a gateway is higher than the respective low threshold.
Packet Loss or [...] Latency: Triggers for either of the above conditions.

From that perspective the code makes sense from both sides (dpinger and getGroup) so I'm unsure what is missing / not working correctly.

Cheers, Franco

iqt4 commented 1 year ago

Thanks Franco for joining the discussion and clarifying the documentation. Highly appreciated.

In the current design, the dpinger alert functionality is based on the high thresholds. I.e. if you set the gateway group to Packet Loss/High Latency, which are based on the low thresholds, dpinger will not alert unless the gateway exceeds the high thresholds as well. In the other case, when the gateway recovers (goes below high), the result depends (see https://github.com/opnsense/core/issues/6231#issuecomment-1374828144).

I can easily reproduce the behaviour in GNS3.

With your explanation, the proposal to circumvent the issue would be not to use the dpinger alert command (-C /usr/local/etc/rc.syshook monitor) and instead use e.g. cron to regularly verify the gateway status and update the firewall rules if thresholds were crossed (not sure whether configctl filter reload skip_alias would be overkill or needs some tweaking).

Cheers Dirk

fichtner commented 1 year ago

Hi Dirk,

I see what you mean now: latencylow and losslow are not passed to dpinger and are likely not even supported. They must be left over from the apinger days. Now I'm unsure what "down" for dpinger even means. Would the easiest way to describe it be that member down and the packet loss/latency cases are the same? Or does member down mean both loss and latency must trigger?

Cheers, Franco

fichtner commented 1 year ago

PS: I favour member down being loss && delay so we still have 4 separate cases. I do think, however, that the whole latency/loss thing is a bit pointless the way dpinger implements it.

iqt4 commented 1 year ago

Dpinger 3.2:

alarm_flag is set to 1 if either latency or loss is in alarm state

My initial proposal is already described here: https://github.com/opnsense/core/issues/6231#issuecomment-1435913675. In consequence:

  1. Only high packet loss/latency parameters count for gateway decision.
  2. There is no trigger choice for gateway groups.
  3. The low thresholds are just "cosmetics" in the dashboard.

If you try to handle loss and delay separately with your own code, you will be in trouble again. Example:

  1. In the gateway group Packet Loss is defined as trigger.
  2. Dpinger sends Alarm due to latency.
  3. OPNsense has to consider the gateway as up.
  4. Gateway starts to lose packets above the loss threshold.
  5. Problem: No second Alarm from dpinger!
fichtner commented 1 year ago

It's starting to look like dpinger was never a replacement for apinger then. I'm unsure if changing all of this makes sense for a minor iteration, and having the low thresholds as GUI gimmicks only also seems counter-intuitive. In that case they should be removed and the terminology simplified.

Cheers, Franco

iqt4 commented 1 year ago

I don't know about apinger, but let's consider the topic from a requirement/use-case perspective, which in the end is high availability on the network layer with a) failover, b) load balancing. A gateway status is up or down - I would keep it simple (other implementations, e.g. RouterOS, behave similarly). The question now is, what does up/down mean? For these use cases it depends on the dpinger results, and the user can set the loss/latency parameters (which may depend on the line - DSL, fibre, cable, provider ...).

Another aspect concerns existing connections when the gateway status changes. I have seen many discussions in the forum. I assume a connection stays on its initial route/gateway until it dies/times out. There is no enforcement yet to switch the gateway - correct? This would be a new feature.

With respect to an intuitive solution, I feel this is implemented more from a technical rather than a user perspective: the gateways (groups) are defined in System: Gateway: Single (Group), the gateway switching option is in System: Settings: General, and the Multi-WAN/Gateway Monitoring options are in Firewall: Settings: Advanced. Maybe the gateway-specific options could be consolidated under System: Gateway: Options.

Just my thoughts Dirk

AdSchellevis commented 1 year ago

Let's try to keep this simple for now: if latencylow and losslow aren't used, remove them and simplify as Franco suggested. A larger overhaul of this section isn't planned on our end at the moment and likely doesn't have priority either. The previous program (apinger) did require different parameters, which it seems lost their usefulness back in 2018 when we moved to dpinger.

https://github.com/opnsense/core/blob/8cf31215f55d14c47d5a44f836ef0a1a2ea9333c/src/etc/inc/gwlb.inc#L111-L122

The upstream documentation for dpinger (https://github.com/dennypage/dpinger) notes loss and latency triggers, which relate to the high numbers on our end.

As dpinger doesn't report anything other than:

<Average Latency in μs> <Standard Deviation in μs> <Percentage of Loss>

According to their documentation, down doesn't really exist in this case, but might be best described as high loss and high latency (in which case that option can safely be removed from the gateway).

When the trigger levels match the actual ones dpinger supports, logically the groups themselves will be configured properly as well when choosing only these (high latency, high loss, down (both) ).
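
As an illustration of how those three reported values could feed the remaining trigger levels (high latency, high loss, and down meaning both at once), here is a minimal sketch; the function name classify_report and the threshold parameters are assumptions made for this example, not existing OPNsense code:

```php
<?php
// Illustrative only: classify a dpinger report line ("<latency_us> <stddev_us> <loss_pct>")
// against configured high thresholds, with "down" meaning both conditions at once.
function classify_report(string $line, int $latency_high_us, int $loss_high_pct): array
{
    [$latency, $stddev, $loss] = array_map('intval', preg_split('/\s+/', trim($line)));

    $high_latency = $latency > $latency_high_us;
    $high_loss = $loss > $loss_high_pct;

    return [
        'latency_us' => $latency,
        'stddev_us' => $stddev,
        'loss_pct' => $loss,
        'high_latency' => $high_latency,
        'high_loss' => $high_loss,
        'down' => $high_latency && $high_loss,  // "member down" as both triggers firing
    ];
}

// Example with values similar to the Alarm log line above (502ms latency, 2% loss)
// against assumed thresholds of 400ms / 10%: high_latency = true, down = false.
print_r(classify_report('502128 312011 2', 400000, 10));
```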

Best regards,

Ad

iqt4 commented 1 year ago

@AdSchellevis: [Let's try to keep this simple ...] --> agreed

Most likely the following - as already implemented - satisfies the requirements.

Member Down: Triggers when the gateway is higher than the threshold for high packet loss or latency.

All other Packet Loss/High Latency possibilities are not needed. In consequence, Trigger Level is no longer an option.

I'm not in a position to take decisions, but I would conclude the following:

Cheers Dirk

fichtner commented 1 year ago

The funny part about all of this is that the code we have is not wrong and works on the metrics given by dpinger; it's just that its alarm policy is unworkable. We might be better off disabling the -C option and writing our own small handler instead to trigger /usr/local/etc/rc.syshook monitor.

Sounds good?!

Cheers, Franco

fichtner commented 1 year ago

@iqt4 maybe you can take a look at c5b8dc4 and share your thoughts?

Thanks, Franco

iqt4 commented 1 year ago

@fichtner Sorry Franco, I have been OOO for a while. As far as I understand, instead of using the dpinger trigger (which does not cover all the trigger cases of a gateway group) you have written the standalone gateway_alarm.php. I have reviewed it in https://github.com/opnsense/core/commit/c5b8dc483638253cf24f440cde54e939c4b13482 and made a few remarks in the code - I hope you can read them.

Overall, I assume this design does not thoroughly solve the issue: gateway_alarm.php is run per gateway, not per gateway group, while the trigger to switch a gateway is by design a feature of the gateway group.

Maybe running gateway_alarm.php just once is sufficient, containing the following steps (see the sketch after this list):

  1. Initialize (read config for gw groups, default gw switching...)
  2. Endless loop start
  3. Read gateway status
  4. Analyse default gw switching (alarm yes|no)
  5. Loop over gw groups and analyse transition (alarm yes|no)
  6. If alarm == yes, call /usr/local/etc/rc.syshook monitor
  7. Wait for the minimum of all the dpinger intervals (which can differ per gateway)
  8. Endless loop end

If the relevant config (default gateway switching or gateway groups) changes, restart gateway_alarm.php.
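
A very rough sketch of that single polling loop is shown below; every helper function (read_gateway_status, transitions_require_action, shortest_probe_interval) is a placeholder assumption that would have to be backed by the real gateway/group code, so this only illustrates the proposed control flow:

```php
<?php
// Rough control-flow illustration of the single-process loop proposed above.
// All helper functions are placeholders (assumptions), not existing OPNsense APIs.

function read_gateway_status(): array
{
    // placeholder: would collect the per-gateway monitor state from dpinger
    return ['WAN_GWv4' => 'none', 'WAN2_GWv4' => 'down'];
}

function transitions_require_action(array $previous, array $current): bool
{
    // placeholder: would evaluate default gateway switching (step 4) and all
    // gateway groups (step 5) against the configured triggers
    return $previous !== $current;
}

function shortest_probe_interval(): int
{
    // placeholder: minimum of the configured dpinger intervals, in seconds
    return 1;
}

$previous = read_gateway_status();               // step 1: initialize

while (true) {                                   // step 2: endless loop start
    $current = read_gateway_status();            // step 3: read gateway status

    if (transitions_require_action($previous, $current)) {
        // step 6: raise the alarm via the existing syshook (arguments omitted here)
        shell_exec('/usr/local/etc/rc.syshook monitor');
    }

    $previous = $current;
    sleep(shortest_probe_interval());            // step 7: wait
}                                                // step 8: endless loop end
```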

My thoughts as requested - Cheers Dirk

fichtner commented 1 year ago

Thanks for your valuable feedback! For now I would hope to avoid moving the goalposts too much and seek a quick replacement for the current limitation you originally reported (which is still the main reason to change something here). The POC was ok, but handling each daemon separately isn't very elegant, so we decided to merge the polling into a single service. The only time this service is not required is when no gateways are being monitored through dpinger. Otherwise it will (regardless of the settings used) trigger the monitor facility as before, including the triggers that are missing now.

However, the monitoring facility is already too late to react operationally (losing too much context), so we are going to move the code triggers into the PHP alarm script as well. This would avoid spurious alarm-based reloads and keep the monitor syshook for everyone else using it.

The first goal is to do the right thing for gateway switching and get it into master, then work further on the gateway groups part.

But I'd like to avoid multiple threads and per-monitor wait intervals, as that would increase complexity and likely introduce new bugs, and it is probably unnecessary for the overall functionality during operation.

Cheers, Franco

fichtner commented 1 year ago

@iqt4 I've pushed the work to the master branch and it will be featured in the development version of 23.1.8 (not the community one). Essentially it's one additional service being spawned that watches over all the dpinger reports and triggers the alarm script conditionally, while dpinger no longer gets the opportunity to report an alarm itself. If you can look at it in the next few weeks, that would be helpful. It'll likely go into 23.7 with a bit more tweaking until then.

Cheers, Franco

iqt4 commented 1 year ago

@fichtner Sorry for the late response. How can I get the code? root@OPNsense:~ # opnsense-patch ...

fichtner commented 1 year ago

Switch to the development release type in the firmware GUI, save, then check for updates and install.

iqt4 commented 1 year ago

Thanks for the explanation. I have tested the functionality on today's development version (OPNsense 23.7.b_107-amd64) for all four trigger levels (a gateway group with two members: Tier 1 and Tier 2). I didn't find any peculiarities! One observation you may want to consider regarding logging: as dpinger is no longer the driver for gateway switching, it would be helpful if the new script logged some information.

fichtner commented 1 year ago

Sounds promising, thanks!

It logs both "ALERT" and "MONITOR" messages to the gateway log. The first is the same as before, emitted when triggering the action to reload routes/filter, and the second is the same for all gateway transitions that don't trigger an action.