openwrt / packages

Community maintained packages for OpenWrt. Documentation for submitting pull requests is in CONTRIBUTING.md
GNU General Public License v2.0
3.95k stars 3.46k forks source link

mwan3: interface cannot detect link problems after mwan ifdown/up #18087

Open luizluca opened 2 years ago

luizluca commented 2 years ago

Maintainer: @feckert Environment: openwrt 21.02.1 mwan3 2.10.13-1

Description:

I have two interfaces, both with IPv4 and IPv6 configurations (wan4a, wan4b, wan6a, wan6b). Sometimes the tracker stucks at the "connecting" or "online" state, even if all trackers are down. I could trigger that behavior with:

# mwan3 ifdown wan6b
# mwan3 ifup wan6b
<externally block internet traffic through wan6b>

After that, wan4b was detected as down but not wan6b. What I see is that /var/run/mwan3track/wan6b/STATUS is always "connecting" even with all /var/run/mwan3track/wan6b/TRACK_* marked as "down". LOST, ONLINE, TIME and TURN are still increasing. SCORE is zero.

It does not matter if it is an IPv4 or IPv6 interface. Restarting mwan3 reverts the issue.

I have reproduced it in a virtual env (gns3). It is easy to test or debug if needed.

feckert commented 2 years ago

It is still not clear to me why wan4b should also go down if wan6b was triggered to get down.

This would be the call stack : https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3#L47 https://github.com/openwrt/packages/blob/master/net/mwan3/files/lib/mwan3/mwan3.sh#L993 https://github.com/openwrt/packages/blob/master/net/mwan3/files/lib/mwan3/mwan3.sh#L1005 https://github.com/openwrt/packages/blob/master/net/mwan3/files/etc/hotplug.d/iface/15-mwan3#L76 https://github.com/openwrt/packages/blob/master/net/mwan3/files/etc/hotplug.d/iface/15-mwan3#L82 https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L210 https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L29 https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L31

I would imagine that by quickly executing mwan3 ifdown <interface> followed by anmwan3 ifup <interface>, the IFDOWN_EVENT variable would be set back to 0. Therefore, the mwan3track script is not processed correctly. As the processing of the script requires a certain amount of time.

The Flags for IFDOWN_EVENT and IFUP_EVENT are evaluated there. Maybe they are in a wrong condition?

https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L385 https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L392 https://github.com/openwrt/packages/blob/master/net/mwan3/files/usr/sbin/mwan3track#L398

luizluca commented 2 years ago

It is still not clear to me why wan4b should also go down if wan6b was triggered to get down.

  • Can you check whether an mwan3track is also running for all your interfaces wan4a, wan6a, wan4b, and wan6b??

It looks like they are running. I see new subshell process related to the failed interfaces and pings being called.

  • How much time elapses between the mwan3 ifdown <interace> and mwan3 ifup <interface> comand?Perhaps the execution of the commands was entered too quickly one after the other.

Yes, 1 or 2 seconds. Anyway, I believe the code should withstand even a zero delay with multiple ifdown/ifup calls. A router under heavy load or swapping might get really slow. Sometimes I could reproduce the issue with a 5s delay.

I'm trying to isolate at least one sequence of events that causes the issue:

Thu Mar 17 23:28:02 2022 user.info mwan3track[3193]: Check (ping) success for target "2801:84:0:2::10" on interface wan6a (eth3). Current score: 8
Thu Mar 17 23:28:03 2022 user.info mwan3track[3193]: Check (ping) success for target "2606:4700:4700::1111" on interface wan6a (eth3). Current score: 8
Thu Mar 17 23:28:03 2022 user.info mwan3track[3192]: Check (ping) success for target "1.1.1.1" on interface wan4a (eth3). Current score: 8
Thu Mar 17 23:28:03 2022 user.info mwan3track[3192]: Check (ping) success for target "1.0.0.1" on interface wan4a (eth3). Current score: 8
Thu Mar 17 23:28:12 2022 user.notice mwan3-hotplug[8991]: Execute ifdown event on interface wan4a (unknown)

While still increasing score, I did a "mwan3 ifdown wan4a."

Thu Mar 17 23:28:13 2022 user.info mwan3track[3192]: Detect ifdown event on interface wan4a (eth3)
Thu Mar 17 23:28:13 2022 user.notice mwan3track[3192]: Interface wan4a (eth3) is offline
Thu Mar 17 23:28:16 2022 user.info mwan3-hotplug[8991]: Connection tracking flushed for interface 'wan4a' on action 'ifdown'
Thu Mar 17 23:28:16 2022 user.notice mwan3-hotplug[9090]: Execute disconnected event on interface wan4a (eth3)

5s after, it received the ifup event. In my case, I used a "sleep 2" between calls. The rest of delay is from mwan3 processing.

Thu Mar 17 23:28:21 2022 user.notice mwan3-hotplug[9599]: Execute ifup event on interface wan4a (eth3)
Thu Mar 17 23:28:25 2022 user.info mwan3track[3192]: Detect ifup event on interface wan4a (eth3)
Thu Mar 17 23:28:25 2022 user.info mwan3-hotplug[9599]: Connection tracking flushed for interface 'wan4a' on action 'ifup'
Thu Mar 17 23:28:25 2022 user.info mwan3track[3192]: Check (ping) success for target "1.1.1.1" on interface wan4a (eth3). Current score: 0
Thu Mar 17 23:28:25 2022 user.info mwan3track[3192]: Check (ping) success for target "1.0.0.1" on interface wan4a (eth3). Current score: 0
Thu Mar 17 23:28:25 2022 user.notice mwan3track[3192]: Interface wan4a (eth3) is connecting
Thu Mar 17 23:28:26 2022 user.notice mwan3track[3192]: Interface wan4a (eth3) is online
Thu Mar 17 23:28:26 2022 user.info mwan3track[3192]: Check (ping) success for target "1.1.1.1" on interface wan4a (eth3). Current score: 1
Thu Mar 17 23:28:27 2022 user.info mwan3track[3192]: Check (ping) success for target "1.0.0.1" on interface wan4a (eth3). Current score: 1
Thu Mar 17 23:28:27 2022 user.notice mwan3track[3192]: Interface wan4a (eth3) is connecting

Wasn't it already online 1s ago? This is the last mention of wan4a. Now I break the link used by wan4a and wan6a:

Thu Mar 17 23:28:36 2022 user.info mwan3track[3193]: Check (ping) failed for target "2801:84:0:2::10" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:38 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1111" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:40 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1001" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:42 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8888" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:44 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8844" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:47 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccd::2" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:49 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccc::2" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:51 2022 user.info mwan3track[3193]: Check (ping) failed for target "2600:1419:d000:2a5::e86" on interface wan6a (eth3). Current score: 11
Thu Mar 17 23:28:51 2022 user.notice mwan3track[3193]: Interface wan6a (eth3) is disconnecting
Thu Mar 17 23:28:58 2022 user.info mwan3track[3193]: Check (ping) failed for target "2801:84:0:2::10" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:00 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1111" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:02 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1001" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:04 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8888" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:07 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8844" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:09 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccd::2" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:11 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccc::2" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:13 2022 user.info mwan3track[3193]: Check (ping) failed for target "2600:1419:d000:2a5::e86" on interface wan6a (eth3). Current score: 10
Thu Mar 17 23:29:20 2022 user.info mwan3track[3193]: Check (ping) failed for target "2801:84:0:2::10" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:22 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1111" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:24 2022 user.info mwan3track[3193]: Check (ping) failed for target "2606:4700:4700::1001" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:26 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8888" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:28 2022 user.info mwan3track[3193]: Check (ping) failed for target "2001:4860:4860::8844" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:30 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccd::2" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:32 2022 user.info mwan3track[3193]: Check (ping) failed for target "2620:0:ccc::2" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:34 2022 user.info mwan3track[3193]: Check (ping) failed for target "2600:1419:d000:2a5::e86" on interface wan6a (eth3). Current score: 9
Thu Mar 17 23:29:34 2022 user.notice mwan3track[3193]: Interface wan6a (eth3) is offline
Thu Mar 17 23:29:35 2022 user.notice mwan3-hotplug[11532]: Execute disconnected event on interface wan6a (eth3)

wan6a was correctly disconnected. Meanwhile, there is still a running "/bin/sh /usr/sbin/mwan3track wan4a" (pid 3192) spawning subprocesses but no more logs and wan4a was still "online". wan4a tracker state files shows a zero SCORE but a "connecting" STATE.

In that state, I executed a new ifdown/ifup sequence but with 10s delay. I got a couple of new logs but the interface was still stuck at online:

Thu Mar 17 23:56:32 2022 user.notice mwan3-hotplug[19264]: Execute ifdown event on interface wan4a (unknown)

BTW, is device="unknown" expected? mwan3_interface_hotplug_shutdown() declares a device local var but never set it.
https://github.com/openwrt/packages/blob/f8c8348f32a91ce1f60d9fe09d3c9fce607b0fe9/net/mwan3/files/lib/mwan3/mwan3.sh#L995

Thu Mar 17 23:56:33 2022 user.info mwan3track[3192]: Detect ifdown event on interface wan4a (eth3)
Thu Mar 17 23:56:33 2022 user.info mwan3track[3192]: ifdown event on interface wan4a (eth3) finished
Thu Mar 17 23:56:35 2022 user.notice mwan3track[3192]: Interface wan4a (eth3) is offline
Thu Mar 17 23:56:36 2022 user.info mwan3-hotplug[19264]: Connection tracking flushed for interface 'wan4a' on action 'ifdown'
Thu Mar 17 23:56:37 2022 user.notice mwan3-hotplug[19502]: Execute disconnected event on interface wan4a (eth3)
Thu Mar 17 23:56:48 2022 user.notice mwan3-hotplug[19927]: Execute ifup event on interface wan4a (eth3)
Thu Mar 17 23:56:52 2022 user.info mwan3track[3192]: Detect ifup event on interface wan4a (eth3)
Thu Mar 17 23:56:52 2022 user.info mwan3-hotplug[19927]: Connection tracking flushed for interface 'wan4a' on action 'ifup'
Thu Mar 17 23:56:52 2022 user.info mwan3track[3192]: ifup event on interface wan4a (eth3) finished

It looks like "mwan3 ifdown/ifup" while the interface is "connecting" (increasing score) might trigger the issue. However, I cannot say it is the only way to trigger that issue cause I'm almost sure I saw a stuck tracker with other state (I believe it was "online").