opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License
3.36k stars 754 forks source link

Gateway monitor route not re-added in dual WAN setup #5956

Closed ashceryth closed 2 years ago

ashceryth commented 2 years ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

I'm running an OPNsense box in a dual WAN failover setup with gateway switching enabled. Since updating to 22.7 there seem to be issues with the re-adding of static routes for gateway monitoring IPs.

I have two WAN gateways (setup via DHCP) configured, monitoring 8.8.8.8 on primary WAN and 8.8.4.4 on secondary WAN. As soon as the monitoring IPs are configured, host routes are added to the routing table using the correspondig gateways. However, if I unplug the ethernet cable for WAN1, the route for 8.8.8.8 via WAN1 gateway gets removed. Reconnecting the ethernet cable does not recreate the route and the WAN1 gateway stays "Offline" and does not come up again. Therefore no fallback occurs. Looking for a workaround I manually created two static routes: 8.8.8.8 via WAN1 gateway and 8.8.4.4 via WAN2 gateway. Now WAN1 is marked "Online" again as soon as the WAN2 interface is up again and the packetloss is below the threshold.

To Reproduce

Steps to reproduce the behavior:

  1. Configure dual WAN with gateway switching
  2. Set monitoring IPs to 8.8.8.8 for WAN1 and 8.8.4.4 for WAN2
  3. Disconnect WAN1
  4. Wait a few moments
  5. Reconnect WAN2
  6. Check gateways page and see that WAN1 gateway stays offline

Expected behavior

WAN1 gateway should come up after the WAN1 link is up again.

Describe alternatives you considered

Creating static routes for the monitoring IPs.

Screenshots

If applicable, add screenshots to help explain your problem.

Relevant log files

Please let me know if you need any.

Additional context

Add any other context about the problem here.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 22.7.1-amd64 FreeBSD 13.1-RELEASE OpenSSL 1.1.1q 5 Jul 2022

fichtner commented 2 years ago

Thanks for the report. I’ll propose a patch on Monday… I have an idea what we could change.

fichtner commented 2 years ago

@ashceryth Just to be sure: are both WANs DHCP? You indicated as much. What I don't understand is why this won't trigger a gateway renewal through rc.newwanip? Maybe it's because the resulting IP is the same... something we've been working on for #5933

If you are interested I could try to prepare a functional patch for you, currently commits are bit clustered.

Cheers, Franco

ashceryth commented 2 years ago

Yes, both are DHCP. WAN1 has a static lease from my ISP, so IP never changes. WAN2 gets its lease from a LTE router in front of it (via passthrough), so the IP stays the same as long as the LTE router stays connected (or gets a new IP assigned by the operator). My tests were just performed by unplugging WAN1, waiting a few moments and plugging in the cable again.

Sure, I'm willing to test as soon as you offer a patch. Please take your time, my workaround with static routes for the monitoring IPs works fine so far.

Thanks!

fichtner commented 2 years ago

@ashceryth ok, this was interesting and the essence of the change is e8d42b6

# opnsense-patch e8d42b6

How does this look?

Cheers, Franco

ashceryth commented 2 years ago

Just applied the patch and tested. It indeed resolved the issue!

Thanks!

fichtner commented 2 years ago

Thanks, I suppose this changed somewhere in 22.1.x for rc.newwanip handling so I'm retagging as bug. We will discuss inclusion for 22.7.2.

Cheers, Franco

ashceryth commented 2 years ago

@fichtner Just a quick question, after applying the change my hooks in /usr/local/etc/rc.syshook.d/monitor/ don't run as soon as WAN1 is up again as there are no gateway alarms upon restore anymore. I suppose I need to hook them in /usr/local/etc/inc/plugins.inc.d/ to trigger on newwanip events now?

fichtner commented 2 years ago

the patch makes sure 'monitor' reconfiguration is called, but it would also seem that dpinger which is in charge of gateway failure will not signal it anymore because it gets exited I suppose. Meh...

ashceryth commented 2 years ago

Ok, for now I added my hooks in /usr/local/etc/inc/plugins.inc.d/ to trigger on newwanip. Works as a workaround.

ashceryth commented 2 years ago

@fichtner I also noticed that disabling a gateway in the GUI does neither trigger a monitor event nor a newwanip event.

fichtner commented 2 years ago

@ashceryth needed to think about this for a bit... I think that monitor is not here to monitor "link" events and that is more in line now with how it should work. monitoring (alerts) are for connectivity errors beyond the immediate link.

ashceryth commented 2 years ago

Ack, just wanted to know if that's desired behaviour. It would just be nice to be able to trigger actions when one gateway goes down, independent of the reason. So, may it be because of dpinger checks or manual deactivation. But I guess thats another topic and nothing to be discussed here.

Thanks!

fichtner commented 2 years ago

Ok, just as an idea: similar to what carp and monitor does, a link-event rc.syshook facility could be useful. consolidation on an upper level is a bit more difficult and probably messy except pushing the same script into multiple native rc.syshook facilities.

Close then? :)

ashceryth commented 2 years ago

Yes, sounds good to me! Maybe - as we are already talking about it - also a rc.syshook facility for newwanip would be great :)

Thanks!

ashceryth commented 2 years ago

@fichtner Since it didn't make its way into 22.7.2, are you planning to include e8d42b6 in one of the next releases?

fichtner commented 2 years ago

You mean 22.7.2? Also see https://www.reddit.com/r/opnsense/comments/wqq1cd/comment/iknw6lm/?utm_source=reddit&utm_medium=web2x&context=3

fichtner commented 2 years ago

So far it has been behaving locally. Maybe you can report back next week if this is all working as you expect and we can certainly meet inclusion for 22.7.3, ok?

ashceryth commented 2 years ago

Sorry, typo, 22.7.2 of course.

Sounds good to me! I reapplied the patch after the update. So far everything has been behaving as expected. Will report back next week :)

Thanks!

tong2x commented 2 years ago

monitoring this issue too, have applied the patch I can confirm patch works as expected.

when reconnecting my 5G/4G router from telco, the gateway monitor will change status to "online" correctly once "internet" connection is restored(roughly within 20seconds). will retest at different time and condition also, but seems to good! becuase before patch, once the gateway is set to offline, it is permanently/indefinitely offline even when connection is restored.

RedVortex commented 2 years ago

I have this issue as well. It happens by itself, every time, a little less than 2 hours after bootup of opnsense. It also happens if I pull the network cable and replug it to flap the interface.

I tested the patch and it unfortunately doesn't fix it for me, I'm running 22.7.2

Sometimes after a flap, the gateway goes DOWN right away. But sometimes it takes 1 dhcp renewal between the flap and the gateway DOWN, I'm not exactly sure why (the logs below shows it happening 1 DHCP renewal after the flap).

Since this is Starlink, the DHCP renewal occurs very often (150 seconds since the lease time is 300 seconds) and yes they also push an option that dhclient doesn't support so it's annoying in the logs but it's just informal....

Here's the syslog (I cleaned it up a bit and spaced the times for readability but did not remove any lines)

13:56:56 dhclient 54780 - unknown dhcp option value 0x52
13:56:56 dhclient 24376 - Creating resolv.conf
13:59:26 dhclient 54780 - unknown dhcp option value 0x52
13:59:26 dhclient 40200 - Creating resolv.conf
14:01:46 opnsense 3836 - /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic opt10(igb0)
14:01:46 dhclient 54780 - connection closed
14:01:46 dhclient 54780 - exiting.
14:01:46 opnsense 3836 - /usr/local/etc/rc.linkup: Clearing states for stale opt10 route on igb0
14:01:46 opnsense 11432 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
14:01:46 opnsense 11432 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway 'WAN3.HEIPV6GIF.123.456::1'
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic opt10(igb0)
14:01:50 dhclient 73402 - unknown dhcp option value 0x52
14:01:50 dhclient 77953 - New IP Address (igb0): WAN2.IP.789.012
14:01:50 dhclient 79218 - New Subnet Mask (igb0): 255.192.0.0
14:01:50 dhclient 81163 - New Broadcast Address (igb0): 100.127.255.255
14:01:50 dhclient 83036 - New Routers (igb0): 100.64.0.1
14:01:50 dhclient 83855 - New Classless Static Routes (igb0):  192.168.100.1/32 0.0.0.0 34.120.255.244/32 0.0.0.0 default 100.64.0.1
14:01:50 dhclient 86789 - Creating resolv.conf
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt10'
14:01:50 opnsense 89779 - /usr/local/etc/rc.newwanip: IPv4 renewal is starting on 'igb0'
14:01:50 opnsense 89779 - /usr/local/etc/rc.newwanip: On (IP address: WAN2.IP.789.012) (interface: Starlink[opt10]) (real interface: igb0).
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv4 default route
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: ROUTING: IPv6 default gateway set to opt9
14:01:50 opnsense 69776 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
14:01:50 opnsense 69776 - plugins_configure ipsec (,opt10)
14:01:50 opnsense 69776 - plugins_configure ipsec (execute task : ipsec_configure_do(,opt10))
14:01:50 opnsense 69776 - plugins_configure dhcp ()
14:01:50 opnsense 69776 - plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
14:01:50 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt10'
14:01:51 opnsense 69776 - plugins_configure dns ()
14:01:51 opnsense 69776 - plugins_configure dns (execute task : dnsmasq_configure_do())
14:01:51 opnsense 69776 - plugins_configure dns (execute task : unbound_configure_do())
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: IPv6 default gateway set to opt9
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
14:01:51 opnsense 89779 - plugins_configure monitor ()
14:01:51 opnsense 89779 - plugins_configure monitor (execute task : dpinger_configure_do())
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: Chose to bind BELL35_PPPOE on WAN1.IP.123.456 since we could not find a proper match.
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: Removing static route for monitor 8.8.4.4 via 10.11.18.17
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: Adding static route for monitor 8.8.4.4 via 10.11.18.17
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: Removing static route for monitor 8.8.8.8 via 100.64.0.1
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: Adding static route for monitor 8.8.8.8 via 100.64.0.1
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: The BELL36_DHCP monitor address is empty, skipping.
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '10.11.18.17'
14:01:51 opnsense 89779 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway 'WAN3.HEIPV6GIF.123.456::1'
14:01:52 php 89779 - plugins_configure vpn (,opt10)
14:01:52 php 89779 - plugins_configure vpn (execute task : ipsec_configure_do(,opt10))
14:01:52 php 89779 - plugins_configure vpn (execute task : openvpn_configure_do(,opt10))
14:01:52 php 89779 - /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface Starlink.
14:01:52 php 89779 - plugins_configure newwanip (,opt10)
14:01:52 php 89779 - plugins_configure newwanip (execute task : dyndns_configure_do(,opt10))
14:01:52 php 89779 - plugins_configure newwanip (execute task : ntpd_configure_do())
14:01:52 php 89779 - plugins_configure newwanip (execute task : opendns_configure_do())
14:01:52 php 89779 - plugins_configure newwanip (execute task : openssh_configure_do(,opt10))
14:01:52 php 89779 - plugins_configure newwanip (execute task : unbound_configure_do(,opt10))
14:01:52 opnsense 46564 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
14:01:52 opnsense 46564 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway 'WAN3.HEIPV6GIF.123.456::1'
14:01:53 php 89779 - plugins_configure newwanip (execute task : vxlan_configure_do())
14:01:53 php 89779 - plugins_configure newwanip (execute task : webgui_configure_do(,opt10))

14:04:21 dhclient 73402 - unknown dhcp option value 0x52
14:04:21 dhclient 96307 - Creating resolv.conf

14:05:09 opnsense 97772 - /usr/local/etc/rc.filter_configure: Ignore down inet gateways : STARLINK_DHCP
14:05:09 opnsense 97772 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
14:05:09 opnsense 97772 - /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : STARLINK_DHCP
14:05:09 opnsense 97772 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway 'WAN3.HEIPV6GIF.123.456::1'

14:06:51 dhclient 73402 - unknown dhcp option value 0x52
14:06:51 dhclient 5884 - Creating resolv.conf
14:09:21 dhclient 73402 - unknown dhcp option value 0x52
14:09:21 dhclient 1439 - Creating resolv.conf

And gateway log

14:01:51 dpinger 30932 - exiting on signal 15
14:01:51 dpinger 33709 - exiting on signal 15
14:01:51 dpinger 33271 - exiting on signal 15
14:01:51 dpinger 16764 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  dest_addr 8.8.4.4  bind_addr WAN1.IP.123.456  identifier "BELL35_PPPOE "
14:01:51 dpinger 18200 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.789.012  identifier "STARLINK_DHCP "
14:01:51 dpinger 19663 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  dest_addr WAN3.HEIPV6GIF.123.456::1  bind_addr WAN3.HEIPV6GIF.123.456::2  identifier "HE_IPV6_TUNNELV6 "

14:05:09 dpinger 18200 - STARLINK_DHCP 8.8.8.8: Alarm latency 36630us stddev 6972us loss 77%
14:05:09 dpinger 96162 - GATEWAY ALARM: STARLINK_DHCP (Addr: 8.8.8.8 Alarm: 1 RTT: 36630us RTTd: 6972us Loss: 77%)

The patch is in place properly it seems and I rebooted after applying it.

root@fw:~ # diff /usr/local/etc/inc/interfaces.inc /usr/local/etc/inc/interfaces.inc.orig
888d887
<             @unlink("/tmp/{$realif}_oldip"); /* XXX dhclient cannot signal a release */
3119d3117
<     @unlink("/tmp/{$wanif}_oldip"); /* XXX dhclient cannot signal a release */
tong2x commented 2 years ago

@RedVortex after reconnecting, it may take a few time for it fully go online in may case. it slowly decreases the loss, until it is tagged online.

try going to settings-gateway, edit your starkink gateway bit dont change anything and just save again and see if it goes online. (you also try to change the monitor IP for a new one, and see if that works)

fichtner commented 2 years ago

Despite the patch after more valuable feedback we might have to deal with some inconsistencies in route reconfiguration where sometimes dpinger reconfiguration is not called. One of those spots where it was called correctly got optimized out and now (because it was called twice!) but now we miss that extra call. I'll propose another patch shortly.

fichtner commented 2 years ago

So 20a60050ee improves this from the rc.linkup side allowing interface reconfiguration to grab the relevant dpinger gateways as well and increase effectiveness of the monitor reload by piggy-backing system_routing_configure() always. Caveat: it doesn't apply to 22.7.x at the moment.

Cheers, Franco

ashceryth commented 2 years ago

Thanks for all your efforts!

Just for clarification: What do you mean by "it doesn't apply to 22.7.x at the moment"? Just that it isn't included in any current release or that we shouldn't test it with 22.7.2?

fichtner commented 2 years ago

opnsense-patch fails due to overlapping changes, but here is a backport that works:

# opnsense-patch 3482a15f2d
tong2x commented 2 years ago

installed patch #3482a15f2d

resolves issue in my case. gateway goes online.

Multi WAN setup, set as primary/tier1, 3 ISP, all of which are NAT DHCP IPs to opnsense (192.168.x.x) primary is a 5G/4G router (2 other are fiber internet, deliberately reversed so I can test and know that the 5G router disconnected, my main line use the fiber as their primary connection) that disconnect alot due to nature of signal. forcefully disconnecting unit 5 times and reconnecting, all the time gateway went online, rebooted the router 3 times and everything opnsense detect at the adapter going down at line going online after reboot.

ashceryth commented 2 years ago

Also installed the patch and did some tests. Working fine so far!

RedVortex commented 2 years ago

The problem persists for me but the logs are a bit different now. The behaviour seems different how the interface flap is handled and I do not see any logs about my WAN1 (Bell - PPPoE - GW: 10.11.18.17) anymore it seems, only my WAN2 now (Starlink - DHCP - GW: 100.64.0.1).

There is also one log line that seems incomplete...

plugins_configure monitor (,STARLINK_DHCP)

The end result is the same though, the gateway is marked as DOWN after the interface flap and

Syslog

09:50:08 dhclient 78388 - unknown dhcp option value 0x52
09:50:08 dhclient 20014 - Creating resolv.conf

09:52:38 dhclient 78388 - unknown dhcp option value 0x52
09:52:38 dhclient 68469 - Creating resolv.conf

09:54:55 opnsense 68060 - /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic opt10(igb0)
09:54:55 dhclient 78388 - connection closed
09:54:55 dhclient 78388 - exiting.
09:54:55 opnsense 68060 - /usr/local/etc/rc.linkup: Clearing states for stale opt10 route on igb0
09:54:55 opnsense 74859 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
09:54:55 opnsense 74859 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic opt10(igb0)
09:54:59 dhclient 18263 - unknown dhcp option value 0x52
09:54:59 dhclient 24564 - New IP Address (igb0): WAN2.IP.123.456
09:54:59 dhclient 27103 - New Subnet Mask (igb0): 255.192.0.0
09:54:59 dhclient 30164 - New Broadcast Address (igb0): 100.127.255.255
09:54:59 dhclient 31558 - New Routers (igb0): 100.64.0.1
09:54:59 dhclient 33346 - New Classless Static Routes (igb0):  192.168.100.1/32 0.0.0.0 34.120.255.244/32 0.0.0.0 default 100.64.0.1
09:54:59 dhclient 37088 - Creating resolv.conf
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt10'
09:54:59 opnsense 39337 - /usr/local/etc/rc.newwanip: IPv4 renewal is starting on 'igb0'
09:54:59 opnsense 39337 - /usr/local/etc/rc.newwanip: On (IP address: WAN2.IP.123.456) (interface: Starlink[opt10]) (real interface: igb0).
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv4 default route
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: ROUTING: IPv6 default gateway set to opt9
09:54:59 opnsense 16262 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
09:54:59 opnsense 16262 - plugins_configure monitor (,STARLINK_DHCP)
09:54:59 opnsense 16262 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt10'
09:55:00 opnsense 16262 - /usr/local/etc/rc.linkup: Removing static route for monitor 8.8.8.8 via 100.64.0.1
09:55:00 opnsense 16262 - /usr/local/etc/rc.linkup: Adding static route for monitor 8.8.8.8 via 100.64.0.1
09:55:00 opnsense 16262 - plugins_configure ipsec (,opt10)
09:55:00 opnsense 16262 - plugins_configure ipsec (execute task : ipsec_configure_do(,opt10))
09:55:00 opnsense 16262 - plugins_configure dhcp ()
09:55:00 opnsense 16262 - plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: IPv6 default gateway set to opt9
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
09:55:00 opnsense 16262 - plugins_configure dns ()
09:55:00 opnsense 16262 - plugins_configure dns (execute task : dnsmasq_configure_do())
09:55:00 opnsense 16262 - plugins_configure dns (execute task : unbound_configure_do())
09:55:00 opnsense 39337 - plugins_configure monitor (,STARLINK_DHCP)
09:55:00 opnsense 39337 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: Removing static route for monitor 8.8.8.8 via 100.64.0.1
09:55:00 opnsense 39337 - /usr/local/etc/rc.newwanip: Adding static route for monitor 8.8.8.8 via 100.64.0.1
09:55:01 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '10.11.18.17'
09:55:01 opnsense 39337 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '2001:470:1c:4::1'
09:55:01 php 39337 - plugins_configure vpn (,opt10)
09:55:01 php 39337 - plugins_configure vpn (execute task : ipsec_configure_do(,opt10))
09:55:01 php 39337 - plugins_configure vpn (execute task : openvpn_configure_do(,opt10))
09:55:01 php 39337 - /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface Starlink.
09:55:01 php 39337 - plugins_configure newwanip (,opt10)
09:55:01 php 39337 - plugins_configure newwanip (execute task : dyndns_configure_do(,opt10))
09:55:02 php 39337 - plugins_configure newwanip (execute task : ntpd_configure_do())
09:55:02 php 39337 - plugins_configure newwanip (execute task : opendns_configure_do())
09:55:02 php 39337 - plugins_configure newwanip (execute task : openssh_configure_do(,opt10))
09:55:02 php 39337 - plugins_configure newwanip (execute task : unbound_configure_do(,opt10))
09:55:02 opnsense 38010 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
09:55:02 opnsense 38010 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
09:55:02 php 39337 - plugins_configure newwanip (execute task : vxlan_configure_do())
09:55:03 php 39337 - plugins_configure newwanip (execute task : webgui_configure_do(,opt10))

09:57:29 dhclient 18263 - unknown dhcp option value 0x52
09:57:29 dhclient 2339 - Creating resolv.conf

09:58:17 opnsense 63815 - /usr/local/etc/rc.filter_configure: Ignore down inet gateways : STARLINK_DHCP
09:58:17 opnsense 63815 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
09:58:17 opnsense 63815 - /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : STARLINK_DHCP
09:58:17 opnsense 63815 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

09:59:59 dhclient 18263 - unknown dhcp option value 0x52
09:59:59 dhclient 39365 - Creating resolv.conf

10:02:29 dhclient 18263 - unknown dhcp option value 0x52
10:02:29 dhclient 45848 - Creating resolv.conf

Gateway

09:55:00 dpinger 20610 - exiting on signal 15
09:55:00 dpinger 58757 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.123.456  identifier "STARLINK_DHCP "
09:55:00 dpinger 58757 - exiting on signal 15
09:55:00 dpinger 3768 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.123.456  identifier "STARLINK_DHCP "

09:58:16 dpinger 3768 - STARLINK_DHCP 8.8.8.8: Alarm latency 46503us stddev 9573us loss 76%
09:58:16 dpinger 62516 - GATEWAY ALARM: STARLINK_DHCP (Addr: 8.8.8.8 Alarm: 1 RTT: 46503us RTTd: 9573us Loss: 76%)
RedVortex commented 2 years ago

I also tested to go in gateway, edit STARLINK_DHCP and just click save to trigger everything to get back UP. As before, the gateway gets flagged UP right after this but the gateway got flagged DOWN again a few minutes later, not sure why but the interface seems to have flapped again. I don't see anything in the STARLINK logs in the apps though in regards to this. Eveen though the interface was up right away after the flap and I was able to ping manually Starlink's gateway, it remained down according to dpinger.

Here's the syslog of everything that happened

11:45:01 dhclient 18263 - unknown dhcp option value 0x52
11:45:01 dhclient 13102 - Creating resolv.conf

11:45:53 configctl 38570 - event @ 1660923952.80 msg: Aug 19 11:45:52 config[26781]: [2022-08-19T11:45:52][INFO] config-event: new_config /conf/backup/config-1660923952.7902.xml
11:45:53 configctl 38570 - event @ 1660923952.80 exec: system event config_changed
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: entering configure using defaults
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: IPv4 default gateway set to wan
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: setting IPv4 default route to 10.11.18.17
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: keeping current default gateway '10.11.18.17'
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: IPv6 default gateway set to opt9
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: setting IPv6 default route to 2001:470:1c:4::1
11:45:55 opnsense 26781 - /system_gateways.php: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:45:55 opnsense 26781 - plugins_configure monitor (,)
11:45:55 opnsense 26781 - plugins_configure monitor (execute task : dpinger_configure_do(,))
11:45:55 opnsense 26781 - /system_gateways.php: Chose to bind BELL35_PPPOE on WAN1.IP.111.111 since we could not find a proper match.
11:45:55 opnsense 26781 - /system_gateways.php: Removing static route for monitor 8.8.4.4 via 10.11.18.17
11:45:55 opnsense 26781 - /system_gateways.php: Adding static route for monitor 8.8.4.4 via 10.11.18.17
11:45:55 opnsense 26781 - /system_gateways.php: Removing static route for monitor 8.8.8.8 via 100.64.0.1
11:45:55 opnsense 26781 - /system_gateways.php: Adding static route for monitor 8.8.8.8 via 100.64.0.1
11:45:55 opnsense 26781 - /system_gateways.php: The BELL36_DHCP monitor address is empty, skipping.
11:45:55 opnsense 78380 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:45:55 opnsense 78380 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:45:56 opnsense 1028 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:45:56 opnsense 1028 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:45:58 opnsense 35259 - /usr/local/etc/rc.dyndns: Dynamic DNS: updating cache file /var/cache/dyndns_wan_mydnshost.mydomain.com_0.cache: WAN1.IP.111.111
11:45:58 opnsense 35259 - /usr/local/etc/rc.dyndns: Dynamic DNS: (Success) IP Address Updated Successfully!

11:46:11 opnsense 76622 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:46:11 opnsense 76622 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:46:13 opnsense 94415 - /usr/local/etc/rc.dyndns: Dynamic DNS: updating cache file /var/cache/dyndns_wan_mydnshost.mydomain.com_0.cache: WAN1.IP.111.111
11:46:13 opnsense 94415 - /usr/local/etc/rc.dyndns: Dynamic DNS: (Success) IP Address Updated Successfully!

11:47:32 dhclient 18263 - unknown dhcp option value 0x52
11:47:32 dhclient 10952 - Creating resolv.conf

11:49:51 opnsense 13978 - /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic opt10(igb0)
11:49:51 dhclient 18263 - connection closed
11:49:51 dhclient 18263 - exiting.
11:49:52 opnsense 13978 - /usr/local/etc/rc.linkup: Clearing states for stale opt10 route on igb0
11:49:52 opnsense 20849 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:49:52 opnsense 20849 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

11:49:55 opnsense 7891 - /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic opt10(igb0)
11:49:56 dhclient 9983 - unknown dhcp option value 0x52
11:49:56 dhclient 15295 - New IP Address (igb0): WAN2.IP.222.222
11:49:56 dhclient 17839 - New Subnet Mask (igb0): 255.192.0.0
11:49:56 dhclient 19968 - New Broadcast Address (igb0): 100.127.255.255
11:49:56 dhclient 22025 - New Routers (igb0): 100.64.0.1
11:49:56 dhclient 23156 - New Classless Static Routes (igb0):  192.168.100.1/32 0.0.0.0 34.120.255.244/32 0.0.0.0 default 100.64.0.1
11:49:56 dhclient 27025 - Creating resolv.conf
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt10'
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: IPv4 renewal is starting on 'igb0'
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: On (IP address: WAN2.IP.222.222) (interface: Starlink[opt10]) (real interface: igb0).
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv4 default route
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: ROUTING: IPv6 default gateway set to opt9
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
11:49:56 opnsense 7891 - plugins_configure monitor (,STARLINK_DHCP)
11:49:56 opnsense 7891 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: Removing static route for monitor 8.8.8.8 via 100.64.0.1
11:49:56 opnsense 7891 - /usr/local/etc/rc.linkup: Adding static route for monitor 8.8.8.8 via 100.64.0.1
11:49:56 opnsense 7891 - plugins_configure ipsec (,opt10)
11:49:56 opnsense 7891 - plugins_configure ipsec (execute task : ipsec_configure_do(,opt10))
11:49:56 opnsense 7891 - plugins_configure dhcp ()
11:49:56 opnsense 7891 - plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt10'
11:49:56 opnsense 7891 - plugins_configure dns ()
11:49:56 opnsense 7891 - plugins_configure dns (execute task : dnsmasq_configure_do())
11:49:56 opnsense 7891 - plugins_configure dns (execute task : unbound_configure_do())
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: IPv6 default gateway set to opt9
11:49:56 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
11:49:56 opnsense 30031 - plugins_configure monitor (,STARLINK_DHCP)
11:49:56 opnsense 30031 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
11:49:57 opnsense 30031 - /usr/local/etc/rc.newwanip: Removing static route for monitor 8.8.8.8 via 100.64.0.1
11:49:57 opnsense 30031 - /usr/local/etc/rc.newwanip: Adding static route for monitor 8.8.8.8 via 100.64.0.1
11:49:57 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '10.11.18.17'
11:49:57 opnsense 30031 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:49:58 php 30031 - plugins_configure vpn (,opt10)
11:49:58 php 30031 - plugins_configure vpn (execute task : ipsec_configure_do(,opt10))
11:49:58 php 30031 - plugins_configure vpn (execute task : openvpn_configure_do(,opt10))
11:49:58 php 30031 - /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface Starlink.
11:49:58 php 30031 - plugins_configure newwanip (,opt10)
11:49:58 php 30031 - plugins_configure newwanip (execute task : dyndns_configure_do(,opt10))
11:49:58 php 30031 - plugins_configure newwanip (execute task : ntpd_configure_do())
11:49:58 opnsense 1247 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:49:58 opnsense 1247 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
11:49:58 php 30031 - plugins_configure newwanip (execute task : opendns_configure_do())
11:49:58 php 30031 - plugins_configure newwanip (execute task : openssh_configure_do(,opt10))
11:49:58 php 30031 - plugins_configure newwanip (execute task : unbound_configure_do(,opt10))
11:49:59 php 30031 - plugins_configure newwanip (execute task : vxlan_configure_do())
11:49:59 php 30031 - plugins_configure newwanip (execute task : webgui_configure_do(,opt10))

11:52:26 dhclient 9983 - unknown dhcp option value 0x52
11:52:26 dhclient 85181 - Creating resolv.conf

11:53:15 opnsense 8298 - /usr/local/etc/rc.filter_configure: Ignore down inet gateways : STARLINK_DHCP
11:53:15 opnsense 8298 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
11:53:15 opnsense 8298 - /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : STARLINK_DHCP
11:53:15 opnsense 8298 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

11:54:56 dhclient 9983 - unknown dhcp option value 0x52
11:54:56 dhclient 58594 - Creating resolv.conf

gateway logs

11:45:55 dpinger 7026 - exiting on signal 15
11:45:55 dpinger 21149 - exiting on signal 15
11:45:55 dpinger 3768 - exiting on signal 15
11:45:55 dpinger 73839 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  dest_addr 8.8.4.4  bind_addr WAN1.IP.111.111  identifier "BELL35_PPPOE "
11:45:55 dpinger 76989 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.222.222  identifier "STARLINK_DHCP "
11:45:55 dpinger 77730 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 20%  dest_addr 2001:470:1c:4::1  bind_addr WAN3:IPV6ONLYGIF::333  identifier "HE_IPV6_TUNNELV6 "
11:45:56 dpinger 73839 - BELL35_PPPOE 8.8.4.4: Alarm latency 1787us stddev 0us loss 50%
11:45:56 dpinger 95312 - GATEWAY ALARM: BELL35_PPPOE (Addr: 8.8.4.4 Alarm: 1 RTT: 1787us RTTd: 0us Loss: 50%)

11:46:11 dpinger 73839 - BELL35_PPPOE 8.8.4.4: Clear latency 5350us stddev 13207us loss 0%
11:46:11 dpinger 75981 - GATEWAY ALARM: BELL35_PPPOE (Addr: 8.8.4.4 Alarm: 0 RTT: 5350us RTTd: 13207us Loss: 0%)

11:49:56 dpinger 76989 - exiting on signal 15
11:49:56 dpinger 40700 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.222.222  identifier "STARLINK_DHCP "
11:49:56 dpinger 40700 - exiting on signal 15
11:49:57 dpinger 66254 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.222.222  identifier "STARLINK_DHCP "

11:53:14 dpinger 66254 - STARLINK_DHCP 8.8.8.8: Alarm latency 42949us stddev 7174us loss 77%
11:53:14 dpinger 6826 - GATEWAY ALARM: STARLINK_DHCP (Addr: 8.8.8.8 Alarm: 1 RTT: 42949us RTTd: 7174us Loss: 77%)
RedVortex commented 2 years ago

I went into gateways, edited and saved/applied again and this time, the gateway remains UP.

Syslog

12:04:56 dhclient 9983 - unknown dhcp option value 0x52
12:04:56 dhclient 52833 - Creating resolv.conf

12:07:16 configctl 38570 - event @ 1660925236.04 msg: Aug 19 12:07:16 myhost.mydomain.com config[22542]: [2022-08-19T12:07:16-04:00][INFO] config-event: new_config /conf/backup/config-1660925236.0311.xml
12:07:16 configctl 38570 - event @ 1660925236.04 exec: system event config_changed
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: entering configure using defaults
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: IPv4 default gateway set to wan
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: setting IPv4 default route to 10.11.18.17
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: keeping current default gateway '10.11.18.17'
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: IPv6 default gateway set to opt9
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: setting IPv6 default route to 2001:470:1c:4::1
12:07:18 opnsense 59956 - /system_gateways.php: ROUTING: keeping current default gateway '2001:470:1c:4::1'
12:07:18 opnsense 59956 - plugins_configure monitor (,)
12:07:18 opnsense 59956 - plugins_configure monitor (execute task : dpinger_configure_do(,))
12:07:18 opnsense 59956 - /system_gateways.php: Chose to bind BELL35_PPPOE on WAN1.IP.111.111 since we could not find a proper match.
12:07:18 opnsense 59956 - /system_gateways.php: Removing static route for monitor 8.8.4.4 via 10.11.18.17
12:07:18 opnsense 59956 - /system_gateways.php: Adding static route for monitor 8.8.4.4 via 10.11.18.17
12:07:18 opnsense 59956 - /system_gateways.php: Removing static route for monitor 8.8.8.8 via 100.64.0.1
12:07:18 opnsense 59956 - /system_gateways.php: Adding static route for monitor 8.8.8.8 via 100.64.0.1
12:07:18 opnsense 59956 - /system_gateways.php: The BELL36_DHCP monitor address is empty, skipping.
12:07:19 opnsense 10897 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
12:07:19 opnsense 10897 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

12:07:26 dhclient 9983 - unknown dhcp option value 0x52
12:07:26 dhclient 57477 - Creating resolv.conf

I'll let it go for some while to see if it remains up (I expect it will) and maybe test a connect/disconnect of the cable to see if that replicates the issue again (I also expect the problem to happen again).

RedVortex commented 2 years ago

A few minutes ago, interface flag occurred. Initially the gateway remained up but was marked down after dhcp renewal a few minutes later. It remains down even though it should be up since it pings and the interface is working.

syslog

13:42:28 dhclient 9983 - unknown dhcp option value 0x52
13:42:28 dhclient 820 - Creating resolv.conf

13:44:48 opnsense 48546 - /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic opt10(igb0)
13:44:48 dhclient 9983 - connection closed
13:44:48 dhclient 9983 - exiting.
13:44:48 opnsense 48546 - /usr/local/etc/rc.linkup: Clearing states for stale opt10 route on igb0
13:44:48 opnsense 53606 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
13:44:48 opnsense 53606 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic opt10(igb0)
13:44:52 dhclient 97705 - unknown dhcp option value 0x52
13:44:52 dhclient 3176 - New IP Address (igb0): WAN2.IP.222.222
13:44:52 dhclient 5639 - New Subnet Mask (igb0): 255.192.0.0
13:44:52 dhclient 7528 - New Broadcast Address (igb0): 100.127.255.255
13:44:52 dhclient 9720 - New Routers (igb0): 100.64.0.1
13:44:52 dhclient 11684 - New Classless Static Routes (igb0):  192.168.100.1/32 0.0.0.0 34.120.255.244/32 0.0.0.0 default 100.64.0.1
13:44:52 dhclient 14951 - Creating resolv.conf
13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: ROUTING: entering configure using 'opt10'
13:44:52 opnsense 18396 - /usr/local/etc/rc.newwanip: IPv4 renewal is starting on 'igb0'
13:44:52 opnsense 18396 - /usr/local/etc/rc.newwanip: On (IP address: WAN2.IP.222.222 ) (interface: Starlink[opt10]) (real interface: igb0).
13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: ROUTING: IPv4 default gateway set to wan
13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv4 default route
13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: ROUTING: IPv6 default gateway set to opt9
13:44:52 opnsense 94609 - /usr/local/etc/rc.linkup: ROUTING: skipping IPv6 default route
13:44:52 opnsense 94609 - plugins_configure monitor (,STARLINK_DHCP)
13:44:52 opnsense 94609 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
13:44:52 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: entering configure using 'opt10'
13:44:53 opnsense 94609 - /usr/local/etc/rc.linkup: Removing static route for monitor 8.8.8.8 via 100.64.0.1
13:44:53 opnsense 94609 - /usr/local/etc/rc.linkup: Adding static route for monitor 8.8.8.8 via 100.64.0.1
13:44:53 opnsense 94609 - plugins_configure ipsec (,opt10)
13:44:53 opnsense 94609 - plugins_configure ipsec (execute task : ipsec_configure_do(,opt10))
13:44:53 opnsense 94609 - plugins_configure dhcp ()
13:44:53 opnsense 94609 - plugins_configure dhcp (execute task : dhcpd_dhcp_configure())
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: IPv4 default gateway set to wan
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv4 default route
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: IPv6 default gateway set to opt9
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: skipping IPv6 default route
13:44:53 opnsense 94609 - plugins_configure dns ()
13:44:53 opnsense 94609 - plugins_configure dns (execute task : dnsmasq_configure_do())
13:44:53 opnsense 94609 - plugins_configure dns (execute task : unbound_configure_do())
13:44:53 opnsense 18396 - plugins_configure monitor (,STARLINK_DHCP)
13:44:53 opnsense 18396 - plugins_configure monitor (execute task : dpinger_configure_do(,STARLINK_DHCP))
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: Removing static route for monitor 8.8.8.8 via 100.64.0.1
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: Adding static route for monitor 8.8.8.8 via 100.64.0.1
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '10.11.18.17'
13:44:53 opnsense 18396 - /usr/local/etc/rc.newwanip: ROUTING: keeping current default gateway '2001:470:1c:4::1'
13:44:54 php 18396 - plugins_configure vpn (,opt10)
13:44:54 php 18396 - plugins_configure vpn (execute task : ipsec_configure_do(,opt10))
13:44:54 php 18396 - plugins_configure vpn (execute task : openvpn_configure_do(,opt10))
13:44:54 php 18396 - /usr/local/etc/rc.newwanip: Resyncing OpenVPN instances for interface Starlink.
13:44:54 php 18396 - plugins_configure newwanip (,opt10)
13:44:54 php 18396 - plugins_configure newwanip (execute task : dyndns_configure_do(,opt10))
13:44:54 php 18396 - plugins_configure newwanip (execute task : ntpd_configure_do())
13:44:55 opnsense 92442 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
13:44:55 opnsense 92442 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
13:44:55 php 18396 - plugins_configure newwanip (execute task : opendns_configure_do())
13:44:55 php 18396 - plugins_configure newwanip (execute task : openssh_configure_do(,opt10))
13:44:55 php 18396 - plugins_configure newwanip (execute task : unbound_configure_do(,opt10))
13:44:56 php 18396 - plugins_configure newwanip (execute task : vxlan_configure_do())
13:44:56 php 18396 - plugins_configure newwanip (execute task : webgui_configure_do(,opt10))

13:47:22 dhclient 97705 - unknown dhcp option value 0x52
13:47:22 dhclient 13705 - Creating resolv.conf

13:48:10 opnsense 21819 - /usr/local/etc/rc.filter_configure: Ignore down inet gateways : STARLINK_DHCP
13:48:10 opnsense 21819 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
13:48:10 opnsense 21819 - /usr/local/etc/rc.filter_configure: Ignore down inet6 gateways : STARLINK_DHCP
13:48:10 opnsense 21819 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'

13:49:52 dhclient 97705 - unknown dhcp option value 0x52
13:49:52 dhclient 48320 - Creating resolv.conf

gateway

13:44:52 dpinger 9786 - exiting on signal 15
13:44:53 dpinger 32566 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.222.222   identifier "STARLINK_DHCP "
13:44:53 dpinger 32566 - exiting on signal 15
13:44:53 dpinger 57398 - send_interval 1000ms  loss_interval 2000ms  time_period 60000ms  report_interval 0ms  data_len 0  alert_interval 1000ms  latency_alarm 500ms  loss_alarm 75%  dest_addr 8.8.8.8  bind_addr WAN2.IP.222.222   identifier "STARLINK_DHCP "
13:48:10 dpinger 57398 - STARLINK_DHCP 8.8.8.8: Alarm latency 50309us stddev 15776us loss 77%
13:48:10 dpinger 19617 - GATEWAY ALARM: STARLINK_DHCP (Addr: 8.8.8.8 Alarm: 1 RTT: 50309us RTTd: 15776us Loss: 77%)

Not sure if this is useful but maybe, I would have expected maybe to have dpinger also restarted after the dhcp renew or something. Or maybe it download get restarted on renew at 13:47, I'm not sure...

root    57398   0.0  0.0  17728  2628  -  Is   13:44      0:00.08 /usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B WAN2.IP.222.222 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 75 -d 0 8.8.8.8

Exact time of the dpginger startup during the event.

root@fw:~ # stat -x /var/run/dpinger_STARLINK_DHCP.*

  File: "/var/run/dpinger_STARLINK_DHCP.pid"
  Size: 6            FileType: Regular File
  Mode: (0640/-rw-r-----)         Uid: (    0/    root)  Gid: (    0/   wheel)
Device: 0,128   Inode: 11958748    Links: 1
Access: Fri Aug 19 14:10:47 2022
Modify: Fri Aug 19 13:44:53 2022
Change: Fri Aug 19 13:44:53 2022
 Birth: Fri Aug 19 13:44:53 2022

  File: "/var/run/dpinger_STARLINK_DHCP.sock"
  Size: 0            FileType: Socket
  Mode: (0666/srw-rw-rw-)         Uid: (    0/    root)  Gid: (    0/   wheel)
Device: 0,128   Inode: 11958754    Links: 1
Access: Fri Aug 19 13:44:53 2022
Modify: Fri Aug 19 13:44:53 2022
Change: Fri Aug 19 13:44:53 2022
 Birth: Fri Aug 19 13:44:53 2022
tong2x commented 2 years ago

@RedVortex this will be a long shot but, try 1.1.1.1 or 1.0.0.1 as monitor IP. what is the average ping time using Starlink link, just wanted to know as I am planning to get one when it is available in my area.

fichtner commented 2 years ago

@RedVortex Patrick, I'm sort of wondering if the 75% loss on the link and the resulting monitoring reload messes with this. For testing, could you disable gateway monitoring for Starlink to see if the behaviour disappears?

Cheers, Franco

RedVortex commented 2 years ago

Sorry about that part, I should have been more clear. The logs only show 75% because that's how this link is configured. Default low/high triggers are 10/20% to mark a GW down but for Starlink it's not enough since it's a satellite link so I changed those to 25/75%.

What you see is when dpinger triggers at 75% but pdinger doesn't log anything else afterwards. The loss is 100% and dpinger show 0ms RTT and RTTd, in others words, it think it is fully down even though it is up if I try a manual ping like this

root@fw:~ # ping -c 10 -S 100.67.159.247 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 100.67.159.247: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=118 time=45.894 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=40.508 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=45.474 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=118 time=39.536 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=118 time=50.298 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=118 time=39.697 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=118 time=45.989 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=118 time=38.811 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=118 time=43.074 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=118 time=42.049 ms

--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 38.811/43.133/50.298/3.523 ms

dpinger is running like this

/usr/local/bin/dpinger -f -S -r 0 -i STARLINK_DHCP -B 100.67.159.247 -p /var/run/dpinger_STARLINK_DHCP.pid -u /var/run/dpinger_STARLINK_DHCP.sock -C /usr/local/etc/rc.syshook monitor -s 1s -l 2s -t 60s -A 1s -D 500 -L 75 -d 0 8.8.8.8

In other words, dpinger shoud show something like RTT: 43ms, RTTd: 3.5ms and packet loss: 0% but it show 0ms, 0ms, 100% and then marks it as down.

I'm not sure exactly why that is but my feeling is that since it doesn't get restarted or something after the interface flaps or something else, it loses it's "binding" to the source address or interface maybe and starts failing. That's why if I restart only dpinger manually, everything starts to work normally and the gateway is marked back up right away as well since it was really up all this time.

And yes, if monitoring is disabled the gateway remains up all the time, since dpinger doesn't run with a configuration like this.

I'm in the process of checking how I could restart automatically dpinger when a DHCP renew occurs since the scripts don't do it and that seems to be part of the problem. Right now I have to do it manually in order for dpinger to continue to work properly after an interface flap.

What's weird is that once I restart dpinger manually after it starts failing, I don't have to do it anymore even though I still have dhcp renewal every 2m30s from Starlink.

So the sequence of what I do for testing is this:

I start from a clean state where Starlink is working and the gateway is marked up. Then I disconnect/reconnect the Starlink network WAN cable (1000Base-T) connected in a Intel 4 ports card. That action leads to this in the logs

opnsense 95002 - /usr/local/etc/rc.linkup: DEVD: Ethernet detached event for dynamic opt10(igb0)
dhclient 97705 - connection closed
dhclient 97705 - exiting.
opnsense 95002 - /usr/local/etc/rc.linkup: Clearing states for stale opt10 route on igb0
opnsense 10913 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '10.11.18.17'
opnsense 10913 - /usr/local/etc/rc.filter_configure: ROUTING: keeping current default gateway '2001:470:1c:4::1'
opnsense 85079 - /usr/local/etc/rc.linkup: DEVD: Ethernet attached event for dynamic opt10(igb0)

Then the interface comes back up, gateway too. dpinger gets restarted and keeps the gateway marked as UP. This part seems to work correctly since the patches. From what I remember, it used to start to fail at this point before the patches. Now, 2m30s later, there is a DHCP renewal that happens and this is where dpinger starts to fail for some reason. The IP doesn't change, it just gets renewed. but dpinger marks the gateway down after a few monitors and ends up at 0ms RTT and RTTd and 100% packet loss. Pinging the gateway manually (like bove) from the Starlink interface shows it pings properly and the latency shows the traffic goes through Starlink (else it would be around 1-2 on my primary Bell provider). So the interface and gateway is really up but dpinger doesn't "see" it this way. So I go in the UI and click on the restart icon on the Starlink dpinger and then a new dpinger process gets started and the gateway becomes UP with the normal stats:

Name | RTT | RTTd | Loss | Status
STARLINK_DHCP   100.64.0.1 | 46.9 ms | 9.9 ms | 0.0 % | Online
RedVortex commented 2 years ago

I also did a tcpdump on igb0 (Starlink) and I'm not seeing any ICMP flow though it (when the gateway is marked as down). Usually I see this (when it is marked as up)

root@fw:~ # tcpdump -i igb0 icmp -A -s0 -nnn -vvv
tcpdump: listening on igb0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:32:08.489535 IP (tos 0x0, ttl 64, id 8839, offset 0, flags [none], proto ICMP (1), length 28)
    100.67.159.247 > 8.8.8.8: ICMP echo request, id 57349, seq 38, length 8
E..."...@.D.dC.............&
14:32:08.528861 IP (tos 0x0, ttl 118, id 0, offset 0, flags [none], proto ICMP (1), length 28)
    8.8.8.8 > 100.67.159.247: ICMP echo reply, id 57349, seq 38, length 8
E.......v.0.....dC.........&..................
14:32:09.510368 IP (tos 0x0, ttl 64, id 61665, offset 0, flags [none], proto ICMP (1), length 28)
    100.67.159.247 > 8.8.8.8: ICMP echo request, id 57349, seq 39, length 8
E.......@.u.dC.............'
14:32:09.555526 IP (tos 0x0, ttl 118, id 0, offset 0, flags [none], proto ICMP (1), length 28)
    8.8.8.8 > 100.67.159.247: ICMP echo reply, id 57349, seq 39, length 8
E.......v.0.....dC.........'..................
14:32:10.512466 IP (tos 0x0, ttl 64, id 48519, offset 0, flags [none], proto ICMP (1), length 28)
    100.67.159.247 > 8.8.8.8: ICMP echo request, id 57349, seq 40, length 8
E.......@...dC.............(

I just found out that the traffic is going through the default gateway (Bell, my other WAN) instead of going through Starlink.

So it seems like there is something at play here in regards to the routes maybe or something.

I had no static route for 8.8.8.8 during the gateway DOWN. But when I restart dpinger I have this route that gets installed

root@fw:~ # netstat -rn | grep 8.8.8.8 8.8.8.8 100.64.0.1 UGHS igb0

and I see that ICMP flows through igb0 again and dpinger marks the gateway as UP.

The weird part...

When the DHCP renewal of Starlink comes in 2m30s later, this route goes away. But for some reason, dpinger continues to send the traffic on the right interface (igb0), as it should unlike what happens after the dhcp renewal after an interface flap.

I think this may all be related to something else you and I have been discussing on another "thread" in regards to DNS / DHCP and ignoring DHCP-provided DNS etc... Especially since Starlink pushes 8.8.8.8 in its DHCP DNS.

When I restart dpinger to get it going I see this

opnsense 96008 - /status_services.php: Removing static route for monitor 8.8.8.8 via 100.64.0.1
opnsense 96008 - /status_services.php: Adding static route for monitor 8.8.8.8 via 100.64.0.1

Personally, I don't think dpinger/opnsense needs to add a route like that since dpinger can bind to the right source IP address and thus, I think, it should bypass the routing table when it sends its traffic.

Now, as to why after interface flap and all, even when dpinger uses the right source address, the traffic goes the the default gateway instead of Starlink, I'm at loss... Unsure if it's a natting issue, kernel issue, something else, etc...

I wouldn't be surprised that if I was to use something else than 8.8.8.8 or 1.1.1.1, this situation would not happen (those are the 2 DNS servers Starlink pushes

Here's a Starlink DHCP response

14:50:25.747902 IP (tos 0x0, ttl 64, id 37389, offset 0, flags [DF], proto UDP (17), length 389)
    100.64.0.1.67 > 100.67.159.247.68: [no cksum] BOOTP/DHCP, Reply, length 361, xid 0x9a6bf025, Flags [none] (0x0000)
      Client-IP 100.67.159.247
      Your-IP 100.67.159.247
      Server-IP 10.10.10.10
      Gateway-IP 192.168.100.100
      Client-Ethernet-Address 90:e2:ba:35:36:20
      Vendor-rfc1048 Extensions
        Magic Cookie 0x63825363
        DHCP-Message Option 53, length 1: ACK
        Subnet-Mask Option 1, length 4: 255.192.0.0
        Server-ID Option 54, length 4: 100.64.0.1
        Default-Gateway Option 3, length 4: 100.64.0.1
        Lease-Time Option 51, length 4: 300
        Domain-Name-Server Option 6, length 8: 1.1.1.1,8.8.8.8
        Classless-Static-Route Option 121, length 23: (192.168.100.1/32:0.0.0.0),(34.120.255.244/32:0.0.0.0),(default:100.64.0.1)
        MTU Option 26, length 2: 1500
        Agent-Information Option 82, length 24:
          Circuit-ID SubOption 1, length 4: ^@^@^CM-^M
          Unknown SubOption 5, length 4:
        0x0000:  6440 0001
          Unknown SubOption 151, length 8:
        0x0000:  0100 2b06 0000 2b06
          Unknown SubOption 152, length 0:
        END Option 255, length 0
        PAD Option 0, length 0, occurs 28
RedVortex commented 2 years ago

I confirm the suspicion... I configured Starlink gateway monitoring to use 9.9.9.9 instead of 8.8.8.8 and the issue is not happening anymore.

The downside is that all traffic to 9.9.9.9 now goes through Starlink since a route (that remains this time, it doesn't go away) get added on Starilnk

root@fw:~ # netstat -rn | grep igb0
9.9.9.9            100.64.0.1         UGHS       igb0
34.120.255.244     link#4             UHS        igb0
100.64.0.0/10      link#4             U          igb0
192.168.100.1      link#4             UHS        igb0

So, I have a "workaround" possible until we can figure out a better way to those DHCP DNS don't play with the routes even though I checked to ignore them in the configs (System: Settings: General) it seems like something is messing around.

I also tested using dpinger to ping 8.8.8.8 using Starlink as source and it works without having to put a static route to 8.8.8.8 to igb0 (Starlink). So I'm quite sure opnsense/dpinger should not be inserting a static route for monitoring and this could probably fix this whole issue since this useless route would probably not get removed later on and mess with the routing table / dpinger traffic. As to why removing the route changes dpinger's traffc even though the source IP remains the same, I'm not sure, still wondering about a kernel, nat or something else in the background that could interfere. That is the "real" bug but we would not trigger it I think if a static route would not be added to the monitoring IP.

RedVortex commented 2 years ago

I also tested this:

I added a static route to 8.8.8.8/32 on igb0 (Starlink). I see the static route properly using netstat -rn After Starlink DHCP renewal 2m30s later, the route is gone.

Could it be that since I have "Allow DNS server list to be overridden by DHCP/PPP on WAN" unchecked and 8.8.8.8 comes up as a DNS in the DHCP renewal, opnsense makes sure there is no route to 8.8.8.8 using this provider (maybe as cleanup or something) and thus removes the route ?

The same thing happens if I use 1.1.1.1 which is the second DNS Starlink pushes, the route gets removed on dhcp renewal.

fichtner commented 2 years ago

After Starlink DHCP renewal 2m30s later, the route is gone.

That's because Starlink insisted on this route and bringing down the interface flushes the host route. Overlaps are terrible, especially for ISPs insisting to push widespread servers. For now ifctl(8) flushes DNS routes belonging to the interface even though it may not have registered them. It's still impossible for the system to know which one came from where and when it can be stripped and when it cannot be. In such situations avoiding using the nameservers provided by the provider is the best thing to do.

Nevertheless we can improve this further. I'm still working on a "hollistic" view for host routes being used in all 4 subsystems that request/imply them, but it will take 1-2 more stable release cycles to come to anything concrete.

On devel you can already list all to-be-registered DNS routes with their source and interface:

# configctl system list nameservers debug
[
    {
        "host": "fd00::de39:6fff:fed8:ee9a",
        "gateway": "fe80::de39:6fff:fed8:ee9a%igb1",
        "interface": "wan",
        "source": "interface"
    },
    {
        "host": "192.168.178.1",
        "gateway": "192.168.178.1",
        "interface": "wan",
        "source": "interface"
    }
]

Cheers, Franco

fichtner commented 2 years ago

@RedVortex maybe we can try to leave static routes where they are and ignore the consequence of users flipping the DNS server override setting... 79546ef

Not sure what your patches look like at the moment, let me make clear which ones are useful:

# opnsense-revert -r 22.7.2 opnsense
# opnsense-patch e8d42b6 3482a15f2d 79546ef

Cheers, Franco

RedVortex commented 2 years ago

@RedVortex maybe we can try to leave static routes where they are and ignore the consequence of users flipping the DNS server override setting... 79546ef

Now that I understand properly what's happening, I'm not sure we should change it. I'm guessing some people know about how this works and they expect this behaviour to remain this way. Changing this could lead to some other issues for them and they may actually consider it a bug from their point of view.

So even though this patch would make "me" happy :-) I think it may not be a good idea to change this behaviour for everyone. Now that I know how it behaves, I can "workaround" it and I can also expect it in the future.

The WAN interface bug is fixed which was the important part of this ticket and that seems to work well and I think it is good to ship for the next release. It's been working perfectly on my side since I changed the monitoring to something that Starlink doesn't push in its DHCP/DNS.

The DNS route thing is, I believe, another topic we could discuss outside of this one. My take on this is that we should either have this configurable (add and remove static route for learned DNS on DHCP renewal, interface flap, release, etc...) on top of the setting that allows to override DNS by the WAN learned ones. This way, we would not add routes nor remove them from what is learned by DHCP for DNS if the user doesn't want this behaviour. It's still important that DHCP learned routes be added though as this is expected to work. Then again, one could argue that it would also be nice to have a setting that would also block them, so that someone doesn't push you a bad route or something that would break your routes or worse, re-route your traffic. I believe this is something that should be configurable for DNS added routes and even maybe for DHCP pushed routes one day, as a feature request maybe. If someone really wants a route added for a DHCP-learned DNS, it should be a decision, not something enforced. On top of it, security-wise, it may be a good idea to give the ability to prevent that from happening for DNS so a provider don't start getting all your DNS queries without your knowing because you thought your 8.8.8.8 was going to a safe provider but your second provider is now getting it because he pushed you a DNS through DHCP. That and the traffic to DNS constantly moving from one provider to the other as your DHCP renewals happen if they both push you 8.8.8.8 by any bad luck. Same goes for DHCP-pushed routes if you ask me. I can definitely see a huge security implication here that cannot be, for now, fixed through configuration. You can get around it partly with firewall rules, maybe... Forcing gateways and denying bad networks, etc... maybe...

Like I said, I think we should keep the current behaviour for now, it's safer, expected behaviour and well understood now.

I just didn't expect this to impact dpinger. And I'm still not sure why dpinger starts flowing the traffic to the wrong interface, it shouldn't be affect by a route change since it sends the ping using the source address of the right interface which should bypass all the routes and prevent it from being affected by the routes added/removed. I'm trying to replicate this issue using a foreground manual dpinger and I cannot replicate it. The one that I run manually continues to use the right interface and the gateway remains up even as routes are added/removed by dhcp renewal and interfaces flapping. However, the one started by opnsense fails. I'll try to figure this one as this may be a bug and if it would not behave like that, the DNS routes being added/removed would not be an issue either like it is for me (or was before I changed my monitoring IP). But I need to do more testing to try to figure out exactly happens with this one...

Thanks a lot for all the time you (and all the others in this thread) spent on this and reading my loooong replies and helping me figure this one out, I appreciate it.

fichtner commented 2 years ago

@RedVortex in essence the option "Allow DNS server list to be overridden by DHCP/PPP on WAN" is what you describe here to avoid ISP server registration "unbeknownst" to you. And even when you enable the option you have the opportunity to turn it off for specific interfaces like Starlink in your case and leave it enabled for the other WAN.

I might still consider 79546eff0bb7 as a larger overhaul. After all it would make sense to not strip them here and instead defer the duty to system_routing_configure() altogether like we did for dpinger routes. So whenever routes are renewed all of them are updated deterministically which is what we want in the first place. Bugs and overlaps may still exist but are a lot easier to handle, fix and avoid. :) Also see #5989

The two discussed patches have been applied to stable/22.7 and will be available in 22.7.3.

fichtner commented 2 years ago

There is also #2517 which seems quite relevant here and with the lack of RELEASE in dhclient we do need to improve this for 23.1.

ashceryth commented 2 years ago

@fichtner As promised: No issues so far, everything running fine with your patch.

Cheers!

fichtner commented 2 years ago

Thanks, already moved both things to stable/22.7 for 22.7.3 next week 😊

tong2x commented 1 year ago

there seems to be a recurrence of the issue but for now I'm not quite sure to replicate. connection is a fiber internet, it is always on. but not as frequent as the issue before (also not as easily replicable)

what I notice is that if the connection, goes down for a "long" time, the gateway will somehow be tagged down indefinitely. I tried restating the system:gateway service. but it is still tagged as down. so again, I click edit, do nothing and just click save. the connection will be good again.

i'm on OPNsense 22.7.10_2-amd64