opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Gateway monitoring says link is down whereas link actually works (probably not OPNsense related) #7635

Open deajan opened 1 month ago

deajan commented 1 month ago

The following bug has been noticed and experienced from OPNsense 24.1.6 to 24.1.10_3 (current).

I have a multi-WAN setup with two links. As of today, I tend to use 1.1.1.1, 1.1.1.3, 4.4.4.4, 8.8.8.8 and 9.9.9.9 as monitor IPs for gateway monitoring. So far so good, but my secondary WAN link is reported as down by gateway monitoring:

image

Still, I can use the link, and I can also ping through it from the OPNsense interface/diagnostics page: image

I have checked that the monitor IP 1.1.1.1 is not bound to any interface on the DNS page: image Nor is it bound to an interface in the routing table: image

Updating OPNSense and rebooting doesn't help. Gateway monitor settings are pretty basic: image

My ISP blocks ICMP pings to the gateway, hence the reason I use 1.1.1.1 as IP for gateway monitoring.

Is there any reason why the gateway is marked offline when it can ping 1.1.1.1? Perhaps dpinger works like traceroute and doesn't like the fact that the first hop isn't pingable?

[EDIT] I can confirm that the gateway that is marked offline works, since a test from VM behind the OPNsense shows that the offline gateway is used (I'm using a rule with that explicit gateway):

curl 'http://api.ipify.org?format=json'
{"ip":"172.25.XX.XX"}

To Reproduce

Sorry, there's no reproducer that I can suggest here.

Expected behavior

Since OPNsense can ping 1.1.1.1 through the gateway that uses that IP as its monitor IP, I would expect the gateway to be online instead of offline.

Software version used and hardware type if relevant, e.g.:

OPNsense 24.1.10_3 (amd64) on KVM with virtio NICs

deajan commented 1 month ago

I realized that my diagnosis above is a bit botched.

So I added a rule that forces ICMP from one of my machines behind OPNsense through the gateway marked offline. From that machine, pinging 1.1.1.1 still works with that rule, so I can confirm that something isn't right in how dpinger concludes that my gateway is offline when the monitor IP is 1.1.1.1.

I've also tried the following from the OPNsense console:

Ping works

ping -S 172.25.XX.XX 1.1.1.1

PING 1.1.1.1 (1.1.1.1) from 172.252.236.42: 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=53 time=21.643 ms

Dpinger fails from my second WAN which is marked offline

dpinger -f -B 172.25.X.X -s 1s -l 4s -t 60s -d 1 1.1.1.1

send_interval 500ms  loss_interval 2000ms  time_period 60000ms  report_interval 1000ms  data_len 0  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 1.1.1.1  bind_addr 172.25.XX.XX identifier ""
0 0 0
0 0 100

But dpinger works from my primary WAN

dpinger -f -B 194.87.XX.XX -s 1s -l 4s -t 60s -d 1 1.1.1.1
send_interval 1000ms  loss_interval 4000ms  time_period 60000ms  report_interval 1000ms  data_len 1  alert_interval 1000ms  latency_alarm 0ms  loss_alarm 0%  alarm_hold 10000ms  dest_addr 8.8.8.8  bind_addr 194.87.XX.XX  identifier ""
26746 0 0
26672 179 0
26722 92 0
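
For readers unfamiliar with the three-number report lines above: as far as dpinger's documented output format goes, they are the average latency and the latency standard deviation, both in microseconds, followed by the loss percentage. A minimal sketch that makes such a line readable, assuming that format (describe_report is a hypothetical helper name, not part of dpinger or OPNsense):

```shell
# describe_report "AVG_US STDDEV_US LOSS_PCT"
# Assumes dpinger's report format: average latency (microseconds),
# latency standard deviation (microseconds), loss (percent).
describe_report() {
    echo "$1" | LC_ALL=C awk '{
        printf "avg %.1f ms, stddev %.1f ms, loss %d%%\n", $1 / 1000, $2 / 1000, $3
    }'
}

# Example calls with lines taken from the transcripts above:
#   describe_report "26672 179 0"
#   describe_report "0 0 100"
```

Read this way, "0 0 100" on the secondary WAN means that no reply was received at all, while the primary WAN shows roughly 27 ms latency with no loss.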

Investigating further, it seems that dpinger fails on the secondary link when its payload is set to 1. If I set the payload to 4 or higher, it works. I was able to confirm this with ping:

The following works:

ping -S 172.25.XX.XX -s 4 1.1.1.1

The following doesn't:

ping -S 172.25.XX.XX -s 1 1.1.1.1

Playing with the packet size, I realized that every packet size between 4 and 172 bytes works, and others don't. This only happens on the secondary link; the primary link accepts the usual packet sizes.
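
The manual probing described above can be automated. This is a minimal sketch, assuming FreeBSD's ping as shipped with OPNsense (where -c is the packet count, -t a timeout in seconds, -S the source address and -s the payload size). SRC and DST are placeholders to set yourself, and probe and sweep are hypothetical helper names:

```shell
# probe SIZE: succeeds when one echo request with SIZE payload bytes gets a reply.
# Assumes FreeBSD ping; SRC (source address) and DST (target) must be set by the caller.
probe() {
    ping -c 1 -t 2 -S "${SRC:?set SRC to the WAN source address}" \
        -s "$1" "${DST:?set DST to the monitor IP}" >/dev/null 2>&1
}

# sweep FIRST LAST: tries every payload size from FIRST to LAST and prints
# each contiguous range of sizes that received a reply.
sweep() {
    size=$1
    last=$2
    start=""
    while [ "$size" -le "$last" ]; do
        if probe "$size"; then
            # Remember where the current run of working sizes began.
            [ -z "$start" ] && start=$size
        elif [ -n "$start" ]; then
            echo "replies for sizes $start-$((size - 1))"
            start=""
        fi
        size=$((size + 1))
    done
    # Close a run that extends to the end of the tested range.
    [ -n "$start" ] && echo "replies for sizes $start-$last"
    return 0
}

# Example (not run here): SRC=<secondary WAN address> DST=1.1.1.1 sweep 0 200
```

Each failing size waits for the 2-second timeout, so a sweep over a mostly failing range can take several minutes.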

So, end of diagnosis: there seems to be something fishy with the link itself, and not with OPNsense. Sorry for the noise.

I have yet to find what the problem is at my datacenter. If anyone has a clue, I will gladly take it ;)

dwkirw commented 1 month ago

Please let me know if you found anything - I'm having the exact same problem but only since 24.7 came along.

deajan commented 1 month ago

@dwkirw I've pushed the diagnostics further. Whenever I use values for ping -s n where 4 <= n <= 172, the ping succeeds. My datacenter guys told me that they have no filter whatsoever, but I don't have that problem on the other link they provide, so to be honest I doubt that this is an OPNsense problem. Btw, I'm running OPNsense 24.1.10_3-amd64 on this unit, so I had this problem prior to 24.7.

What exactly is your problem? Did you try adding a payload to the gateway monitor? Does it work?

dwkirw commented 1 month ago

Thank you - I've had a better read of what you wrote and tried some of the stuff. Also, disregard the "only on 24.7" thing - I've only had a backup WAN for about 6 weeks and likely hadn't noticed this before. I have not played around with dpinger before. From the primary WAN (which is currently marked as down) I get stuff such as this:

dpinger -f -B 139.5.x.x -s 1s -l 4s -t 60s -d 1 1.1.1.1
7650 0 0
7288 371 0
7370 316 0
7289 324 0

From the backup (currently showing Online) I get this:

dpinger -f -B 192.168.1.2 -s 1s -l 4s -t 60s -d 1 1.1.1.1
36686 0 0
33661 3030 0
35521 3611 0
34479 3608 0

My primary is PPPoE, and its IP changes fairly often. Here are its settings; the IP has been left blank for this reason: image. While the primary WAN appears to be down but is still working, if I go to System>Configuration>Gateways and hit the Apply button, it goes back to Online. While it's marked Offline, though, my port forwards are flaky, sometimes working and sometimes not...

If you can see any dumb things I've missed or should try, I'd appreciate knowing about them, ta!