openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

broadcasts delivery problem #598

Open bol-van opened 2 years ago

bol-van commented 2 years ago

I experience the following glitch on my TP-Link Archer c6u (MT7621). I still haven't figured out a condition that reproduces the problem. It just happens: in the evening it works, in the morning it doesn't. It stops working without any specific action on my side. Just leave it running long enough and it will eventually happen.

Wifi clients with power saving enabled (iw dev wlan0 set power_save on) connected to 2.4 GHz wifi (5 GHz not checked, I don't have any compatible devices) stop receiving broadcast frames until they send something. Broadcasts include ARP and NDP frames.

So, if you have a PC connected via LAN and an android connected via wifi, and they are bridged:

arp -d
ping ip_of_android
..nothing..
..nothing..
..nothing..

The android does not receive ARPs on wlan0 (checked with tcpdump in the android root console). The router sees the outgoing ARP on the wlan interface, without replies.

But if you send something from the android, it starts receiving broadcasts for a short time. The PC caches the MAC address and then ping works normally until the cache is cleared or expires.

Restarting the wifi interfaces (OpenWrt command 'wifi') does not help. Only a module reload or a reboot helps.

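#!/bin/sh
# Full wifi driver reload

# Bring wifi down and stop hostapd/wpa_supplicant (wpad)
wifi down
/etc/init.d/wpad stop

# Unload the mt76 drivers first, then the wireless stack underneath them
rmmod mt7603e
rmmod mt7615e
rmmod mt7615_common
rmmod mt76_connac_lib
rmmod mt76
rmmod mac80211
rmmod cfg80211
rmmod compat

# Reload the kernel modules, restart wpad and bring wifi back up
kmodloader
/etc/init.d/wpad start
wifi

bol-van commented 2 years ago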

I tested this on 2 MediaTek-based androids. I'm unsure if it's related to the client wireless chipset. I don't have any Qualcomm-based devices to test at the moment. I have a PC with a Realtek USB wireless adapter, but the PC does not power save.

ryderlee1110 commented 2 years ago

Did you include this one https://patchwork.kernel.org/project/linux-wireless/patch/20210906083559.9109-1-nbd@nbd.name/ ?

bol-van commented 2 years ago

I'm now on today's snapshot with the 5.10 kernel. Yes, that patch is there. I also moved the mt76 backport to the latest commit.

bol-van commented 2 years ago

Typical androids always send/receive something because they are cloud connected via Google apps. This may be the key factor that prevents reproducing this issue. Both of my androids are cleaned of Google apps and do not run WhatsApp or similar apps that are always connected and use the network. A simple solution for reproducing it would be to install a firewall to block all connectivity, send pings and monitor the ARP cache.

bol-van commented 2 years ago

Another way to reproduce the problem:

Windows PC with an 802.11ac USB wireless adapter (Realtek 8812BU) connected to the mt7615 5G AP, VHT80
run the "MetaGeek inSSIDer 4" program on the PC
wait 10-20 seconds
close the program
try on the router console: arping -b
nothing, nothing, nothing, ...

Sometimes it recovers, sometimes not. If not, only a wifi interface restart on the router helps. A PC reboot or reconnection does not help. The problem does not reproduce on the 2.4G mt7603.

bol-van commented 2 years ago

TP-Link c6u <> Realtek 8812BU on a Windows PC, using 5 GHz VHT80.

Bad condition. The AP client IP is 192.168.4.88. Unicasts are delivered normally, because pings work while the MAC address is cached, but broadcasts are not. After clearing the neighbor cache, pings do not work anymore.

root@router:~# ping 192.168.4.88
PING 192.168.4.88 (192.168.4.88) 56(84) bytes of data.
64 bytes from 192.168.4.88: icmp_seq=1 ttl=64 time=295 ms
64 bytes from 192.168.4.88: icmp_seq=2 ttl=64 time=314 ms
64 bytes from 192.168.4.88: icmp_seq=3 ttl=64 time=311 ms
64 bytes from 192.168.4.88: icmp_seq=4 ttl=64 time=329 ms
64 bytes from 192.168.4.88: icmp_seq=5 ttl=64 time=345 ms
^C
--- 192.168.4.88 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 294.801/318.808/345.332/17.048 ms

root@router:~# arping -b 192.168.4.88
ARPING 192.168.4.88 from 192.168.4.1 br-lan
^CSent 76 probes (76 broadcast(s))
Received 0 response(s)

root@router:~# ip neigh del 192.168.4.88 dev br-lan

root@router:~# ping 192.168.4.88
PING 192.168.4.88 (192.168.4.88) 56(84) bytes of data.
From 192.168.4.1 icmp_seq=1 Destination Host Unreachable
From 192.168.4.1 icmp_seq=2 Destination Host Unreachable
From 192.168.4.1 icmp_seq=3 Destination Host Unreachable
^C

good condition after wifi restart on the router

root@router:~# arping -b 192.168.4.88
ARPING 192.168.4.88 from 192.168.4.1 br-lan
Unicast reply from 192.168.4.88 [XX:XX:XX:XX:XX:XX]  3.928ms
Unicast reply from 192.168.4.88 [XX:XX:XX:XX:XX:XX]  3.773ms
Unicast reply from 192.168.4.88 [XX:XX:XX:XX:XX:XX]  3.210ms
^CSent 3 probes (3 broadcast(s))
Received 3 response(s)

root@router:~# ip neigh del 192.168.4.88 dev br-lan

root@router:~# ping 192.168.4.88
PING 192.168.4.88 (192.168.4.88) 56(84) bytes of data.
64 bytes from 192.168.4.88: icmp_seq=1 ttl=128 time=10.0 ms
64 bytes from 192.168.4.88: icmp_seq=2 ttl=128 time=3.25 ms
64 bytes from 192.168.4.88: icmp_seq=3 ttl=128 time=3.66 ms
^C
--- 192.168.4.88 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.253/5.649/10.039/3.108 ms

If I change the wireless adapter on the PC (to a D-Link DWA-160) and reconnect, the problem does not go away! It's solely in the router state, not the client state. The router is subject to denial of service.

The problem also reproduces with the DWA-160 instead of the 8812BU from the very beginning.
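
The bad state in the transcripts above (unicast ping works while the MAC is cached, but arping -b gets zero responses) can be detected from a script. A minimal sketch, assuming an arping that prints the same "Received N response(s)" summary line as in the output above. The function names arping_received and check_client are hypothetical, and br-lan is simply the bridge name used in this thread.

```sh
#!/bin/sh
# Print N from the "Received N response(s)" summary line of arping output (stdin).
arping_received() {
    sed -n 's/^Received \([0-9][0-9]*\) response.*/\1/p'
}

# Classify one client by broadcast ARP only.
# $1 = client IP, $2 = interface to probe on (assumed default: br-lan).
# Prints "ok" if at least one broadcast probe was answered, else "broadcast-dead".
check_client() {
    ip="$1"
    dev="${2:-br-lan}"
    # -b keeps every probe broadcast, -c 5 stops after five probes
    n=$(arping -b -c 5 -I "$dev" "$ip" 2>/dev/null | arping_received)
    if [ "${n:-0}" -gt 0 ]; then
        echo ok
    else
        echo broadcast-dead
    fi
}
```

Calling check_client 192.168.4.88 from cron on the router and logging the result would give a timestamp for when the state flips, which is hard to catch by hand.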

bol-van commented 2 years ago

I've set up a monitor station between the android client and the TP-Link 2.4G mt7603 AP. It's a Linux machine with Wireshark and a 2.4G wifi adapter in monitor mode. It has the WPA keys loaded, so it can decrypt frames if it has captured the WPA handshake beforehand.

I've found that in the bad state the AP actually sends broadcast frames. They are received by the monitor, but the client station does not see them. It does not send acknowledgements back to the AP as it does in the good AP state. All broadcasts sent by the AP have the "more data" flag set in their 802.11 header. I could not find any significant differences between the broadcasts sent by the AP in the good and bad states.
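
For reference when reading such captures: the "more data" flag is bit 13 of the 802.11 Frame Control field, which is mask 0x20 in the second (flags) byte. A small sketch that decodes that flags byte follows. decode_fc_flags is a hypothetical helper, and the value 22 in the usage note is illustrative, not taken from the captures.

```sh
#!/bin/sh
# decode_fc_flags FLAGS
# FLAGS is the second byte of the 802.11 Frame Control field as two hex digits.
# Prints the names of the flags that are set, space separated.
decode_fc_flags() {
    f=$((0x$1))
    out=""
    [ $((f & 0x01)) -ne 0 ] && out="$out to-ds"
    [ $((f & 0x02)) -ne 0 ] && out="$out from-ds"
    [ $((f & 0x04)) -ne 0 ] && out="$out more-frag"
    [ $((f & 0x08)) -ne 0 ] && out="$out retry"
    [ $((f & 0x10)) -ne 0 ] && out="$out pwr-mgt"
    [ $((f & 0x20)) -ne 0 ] && out="$out more-data"
    [ $((f & 0x40)) -ne 0 ] && out="$out protected"
    [ $((f & 0x80)) -ne 0 ] && out="$out order"
    # Drop the leading space left by the first append
    echo "${out# }"
}
```

For example, decode_fc_flags 22 prints "from-ds more-data".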

That's why I'd like to ask: what causes wifi clients to stop receiving broadcasts until the AP driver is restarted and the phy is reset? I have bad and good state captures and can send them privately to devs willing to analyze this case.

JFtico commented 2 years ago

> I've set up monitor station between android client and tplink 2,4G mt7603 AP

Thanks for all the detail provided and for chasing this down to produce usable captures. Hopefully, the devs will be able to get to the bottom of this issue and fix it. This one has been plaguing MT76 for a while.

bol-van commented 2 years ago

It looks like setting dtim_period to 1 instead of the default 2 helps mitigate the problem. Devices become more responsive to broadcasts.

JFtico commented 2 years ago

> It looks like setting dtim_period to 1 instead of default 2 helps mitigate the problem Devices become more responsive to broadcasts

Interesting, as I use DTIM 3 on my configs, as that is what the mobile OS vendors have been recommending due to low-power modes possibly missing the beacons if they are too frequent and the AP might drop the client for being unresponsive.

This being related to power modes, there is a hack referenced in this commit that disables SMPS support for the 7603 that seemed to also help. I'm testing this in a 19.07.8 based build to see if it does work. https://github.com/openwrt/mt76/pull/583/commits/0c5c5dde171acee6565abd4d975f63ebb9e6e8b2

bol-van commented 2 years ago

I checked in the monitor: my mt7603 already broadcasts beacons with SMPS disabled in the HT capabilities, as all of the neighbor APs do.

bol-van commented 2 years ago

I had time to dig deeply into my captures. I also captured a session on an ath9k AP (which works ideally with dtim_period=2). The problem seems to have its roots in a wrong TIM sequence. To deliver broadcasts, the AP should send a beacon frame with DTIM count 0 and the multicast bit set in the bitmap control field. The multicast bit is the least significant bit: 0 = no multicast, 1 = multicast present. Then the AP should immediately send the broadcasts. Look at these pictures and you will understand WHY.

ATH9k - good state

[image: tim_ath9k_good]

mt7603 - good state

[image: tim_mt76_good]

mt7603 - bad state

[image: tim_mt76_bad]
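
The rule described in this comment can be written down as a small check on the TIM element of each beacon. This is only a sketch of the rule as stated here: tim_group_traffic is a hypothetical helper, and its two arguments (the DTIM Count and Bitmap Control bytes, as hex) are assumed to be copied by hand from a capture.

```sh
#!/bin/sh
# tim_group_traffic DTIM_COUNT BITMAP_CONTROL
# Both arguments are single bytes written as two hex digits, e.g. 00 01.
# Prints:
#   wait        not a DTIM beacon (DTIM count != 0)
#   deliver     DTIM beacon with the multicast bit (bit 0 of Bitmap Control) set,
#               so buffered broadcast/multicast frames should follow
#   dtim-empty  DTIM beacon with no buffered broadcast/multicast traffic
tim_group_traffic() {
    count=$((0x$1))
    ctrl=$((0x$2))
    if [ "$count" -ne 0 ]; then
        echo wait
    elif [ $((ctrl & 1)) -eq 1 ]; then
        echo deliver
    else
        echo dtim-empty
    fi
}
```

Broadcast data frames seen in a capture can then be compared against the preceding beacon: under the rule above they are expected right after a beacon that classifies as "deliver".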

ryderlee1110 commented 2 years ago

Does it work after enabling multicast_to_unicast ?

bol-van commented 2 years ago

You are right, I was running with multicast_to_unicast=0, and the same is set on the ath9k AP. In the morning I switched it back to 1. It worked well until the evening. The next morning it did not work again, the same as shown in the pictures above.

bol-van commented 2 years ago

The arping test I did that morning was from the ethernet-connected PC, bridged with the WLAN through the router. After 5 minutes I did the same test from the router itself and it worked again. Then I did the test from the PC again and it also worked.

So, I suppose something makes the DTIM go out of sync, either for a short time under a specific condition or until mt7603e.ko is reloaded. Possibly setting multicast_to_unicast to 1 helps to mitigate it, but I'm not sure yet. I need more time to test. The problem exists anyway, and setting multicast_to_unicast=0 and arping-ing from somewhere other than the router may help to reproduce it. The wireless client should be in power save mode and should not be constantly communicating with the network. A typical android device with cloud/WhatsApp/Telegram/... apps does communicate constantly. If you have this kind of device, you can try a firewall to stop the constant app activity.

bol-van commented 2 years ago

After 1.5 hours - not working again - neither from the router nor from the PC. Not self-recovering

ARPING 192.168.4.29 from 192.168.4.1 br-lan
Unicast reply from 192.168.4.29 [XX:XX:XX:XX:XX:XX]  512.180ms
Unicast reply from 192.168.4.29 [XX:XX:XX:XX:XX:XX]  108.326ms
Unicast reply from 192.168.4.29 [XX:XX:XX:XX:XX:XX]  245.856ms
^CSent 156 probes (156 broadcast(s))
Received 3 response(s)

config device
        option name 'br-lan'
        option type 'bridge'
        option multicast_to_unicast '1'
        option multicast_querier '0'
        list ports 'lan1'
        list ports 'lan2'
        list ports 'lan3'
        list ports 'lan4'

rany2 commented 1 year ago

@bol-van multicast_to_unicast should be set in a config section of type wifi-iface, not on the network device.

rany2 commented 1 year ago

Also, the option is now multicast_to_unicast_all, not multicast_to_unicast; multicast_to_unicast does not do anything anymore.
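
Put together with the dtim_period mitigation mentioned earlier in the thread, the two settings would go on the wifi-iface section roughly as follows. This is a sketch, not a confirmed fix: apply_iface_opts is a hypothetical helper, the section index 0 is an assumption about which wifi-iface is meant, and the UCI variable exists only so the commands can be printed instead of executed.

```sh
#!/bin/sh
# apply_iface_opts [INDEX]
# Sets the options discussed in the thread on wireless.@wifi-iface[INDEX]
# (INDEX defaults to 0, which is an assumption about your config).
# Run with UCI=echo for a dry run that only prints the uci arguments.
apply_iface_opts() {
    idx="${1:-0}"
    # Option name as given in the comment above; it belongs to wifi-iface
    ${UCI:-uci} set "wireless.@wifi-iface[$idx].multicast_to_unicast_all=1"
    # DTIM period 1 instead of the default 2, reported above as a mitigation
    ${UCI:-uci} set "wireless.@wifi-iface[$idx].dtim_period=1"
    ${UCI:-uci} commit wireless
}
```

After apply_iface_opts, run wifi on the router so the new settings take effect.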

bol-van commented 1 year ago

Anyway, problem is not fixed yet