openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

MT7981/MT7975 AX STA doesn't receive messages from AP after sometime #867

Open trunneml opened 8 months ago

trunneml commented 8 months ago

Describe the bug

On my Banana Pi R3 running 23.05.2, my two iPhones are losing network access after some time. This occurs multiple times a day, when the phone is in standby. They are the only AX clients on my Wi-Fi. All other clients/STAs are AC and working fine.

The Wi-Fi connection is still shown on both the router and the phone, and the signal is still good. But the phone can't access any website or LuCI (100% packet loss) and is not pingable. Turning Wi-Fi off and on helps, and it works again for some hours.

I tried

option dtim_period '3'
option wpa_group_rekey '86400'
option disassoc_low_ack '0'

But it didn't help. Syslog doesn't show anything special. See: https://github.com/openwrt/openwrt/issues/14824#issuecomment-2003037670

When the issue occurs, the iPhone with the connection problem no longer has HE-MCS and HE-NSS attributes in LuCI.

Running tcpdump when the error occurs shows that the AP still receives packets from the iPhone and answers them (for example, DNS requests), but it seems that the iPhone doesn't receive the answers:

13:22:17.430825 IP (tos 0x0, ttl 64, id 55241, offset 0, flags [none], proto UDP (17), length 57)
    192.168.1.70.63557 > 192.168.1.1.53: 19527+ HTTPS? openwrt.org. (29)
13:22:17.431386 IP (tos 0x0, ttl 64, id 18739, offset 0, flags [DF], proto UDP (17), length 124)
    192.168.1.1.53 > 192.168.1.70.63557: 19527 0/1/0 (96)
13:22:17.433705 IP (tos 0x0, ttl 64, id 29105, offset 0, flags [none], proto UDP (17), length 57)
    192.168.1.70.53183 > 192.168.1.1.53: 33776+ AAAA? openwrt.org. (29)
13:22:17.433705 IP (tos 0x0, ttl 64, id 54214, offset 0, flags [none], proto UDP (17), length 57)
    192.168.1.70.63373 > 192.168.1.1.53: 51319+ A? openwrt.org. (29)
13:22:17.434177 IP (tos 0x0, ttl 64, id 18740, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.1.1.53 > 192.168.1.70.53183: 33776 1/0/0 openwrt.org. AAAA 2a03:b0c0:3:d0::1a51:c001 (57)
13:22:17.434232 IP (tos 0x0, ttl 64, id 18741, offset 0, flags [DF], proto UDP (17), length 73)
    192.168.1.1.53 > 192.168.1.70.63373: 51319 1/0/0 openwrt.org. A 64.226.122.113 (45)
13:22:17.494128 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:b894:735d:7743:5a2d > ff02::1:ff00:43a: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2003:c3:b732:32f1::43a
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.494139 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:b894:735d:7743:5a2d > ff02::1:ff00:43a: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2003:c3:b732:32f1::43a
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.494129 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fd6e:281:90d1:0:8b3:8f5:ab9:c10a > ff02::1:ff74:91bf: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has fd6e:281:90d1:0:eb:c855:3d74:91bf
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.494204 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fd6e:281:90d1:0:8b3:8f5:ab9:c10a > ff02::1:ff74:91bf: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has fd6e:281:90d1:0:eb:c855:3d74:91bf
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.494261 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:b894:735d:7743:5a2d > ff02::1:ff73:adb9: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2003:c3:b732:32f1:1850:2bb1:c973:adb9
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.494267 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:b894:735d:7743:5a2d > ff02::1:ff73:adb9: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has 2003:c3:b732:32f1:1850:2bb1:c973:adb9
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.495423 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::1c31:a960:b2f3:52d9 > ff02::1:ff75:3675: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has fe80::108d:a1ca:fe75:3675
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.495427 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::1c31:a960:b2f3:52d9 > ff02::1:ff75:3675: [icmp6 sum ok] ICMP6, neighbor solicitation, length 32, who has fe80::108d:a1ca:fe75:3675
      source link-address option (1), length 8 (1): b0:8c:75:ec:15:e7
13:22:17.740743 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:5c58:4eb9:8503:f8cf > 2003:c3:b732:32f1:b894:735d:7743:5a2d: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2003:c3:b732:32f1::43a, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.740756 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:5c58:4eb9:8503:f8cf > 2003:c3:b732:32f1:b894:735d:7743:5a2d: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2003:c3:b732:32f1::43a, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.740743 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fd6e:281:90d1:0:eb:c855:3d74:91bf > fd6e:281:90d1:0:8b3:8f5:ab9:c10a: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is fd6e:281:90d1:0:eb:c855:3d74:91bf, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.740769 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fd6e:281:90d1:0:eb:c855:3d74:91bf > fd6e:281:90d1:0:8b3:8f5:ab9:c10a: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is fd6e:281:90d1:0:eb:c855:3d74:91bf, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.740743 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:5c58:4eb9:8503:f8cf > 2003:c3:b732:32f1:b894:735d:7743:5a2d: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2003:c3:b732:32f1:1850:2bb1:c973:adb9, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.740774 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) 2003:c3:b732:32f1:5c58:4eb9:8503:f8cf > 2003:c3:b732:32f1:b894:735d:7743:5a2d: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is 2003:c3:b732:32f1:1850:2bb1:c973:adb9, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.741164 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::108d:a1ca:fe75:3675 > fe80::1c31:a960:b2f3:52d9: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is fe80::108d:a1ca:fe75:3675, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
13:22:17.741167 IP6 (hlim 255, next-header ICMPv6 (58) payload length: 32) fe80::108d:a1ca:fe75:3675 > fe80::1c31:a960:b2f3:52d9: [icmp6 sum ok] ICMP6, neighbor advertisement, length 32, tgt is fe80::108d:a1ca:fe75:3675, Flags [solicited, override]
      destination link-address option (2), length 8 (1): a4:83:e7:d8:19:73
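
One way to summarise a capture like this is to pair DNS query IDs with the responses the AP sent. The following is only a sketch: the helper name dns_pairs is hypothetical, it assumes tcpdump's verbose text format exactly as shown above (address lines of the form "src.port > dst.port: id ..."), and it only reports what the AP saw on its side, not what the phone actually received over the air.

```shell
# Hypothetical helper: count DNS queries and matching responses in
# tcpdump -v text output read from stdin.
dns_pairs() {
    awk '
        # Query: destination port is 53; field 4 is the ID plus flags, e.g. "19527+".
        $2 == ">" && $3 ~ /\.53:$/ { id = $4; sub(/[^0-9].*$/, "", id); q[id] = 1; nq++ }
        # Response: source port is 53; field 4 is the ID of the query being answered.
        $2 == ">" && $1 ~ /\.53$/  { id = $4; sub(/[^0-9].*$/, "", id); if (id in q) na++ }
        END { printf "queries=%d answered=%d\n", nq, na }
    '
}

# Usage (capture.txt is a placeholder for a saved capture):
#   dns_pairs < capture.txt
```

If every query has an answer on the AP side while the phone still shows no connectivity, that is consistent with the replies being lost on the way to the STA.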

Maybe this bug is related to: https://github.com/openwrt/mt76/issues/792

OpenWrt version

r23630-842932a63d and r23806-03a3a729ec

OpenWrt release

23.05.2 and 23.05-SNAPSHOT

OpenWrt target/subtarget

mediatek/filogic

Device

Bananapi BPI-R3 (MT7981/MT7975P)

trunneml commented 8 months ago

There is a comment suggesting that some corrupted Ethernet frames could freeze RX DMA on that chip: https://github.com/openwrt/openwrt/issues/13198#issuecomment-1777269938

brada4 commented 8 months ago

In the same discussion, another suspected trigger is another STA disconnecting, after which the iPhone(s) freeze.

trunneml commented 8 months ago

In the same discussion, another suspected trigger is another STA disconnecting, after which the iPhone(s) freeze.

No other STA disconnected or joined when the error occurred the last two times.

trunneml commented 8 months ago

Short feedback: my Banana Pi R3 has been running 5 GHz in AC mode for about 4 days now, with no disconnections of my iPhones. I think the bug is somewhere in the AX mode. Can someone tell me how to debug this?

mrkiko commented 7 months ago

Are you able to reproduce this issue on OpenWrt main SNAPSHOT?

Delphius7 commented 6 months ago

I have the same behavior with 2 iPhones and a GL-MT6000 (MT7986AV) about every 3-4 days (then sometimes also 4-5x per day). I cannot ping anything on the network from the phones and cannot ping the phones either. But they are still connected to the wifi. Disable/Enable wifi on the phones solves the issue temporarily.

I have configured the router to use AX@80MHz and I am using the latest snapshot with kernel 6.6, but I have had the issue for months, even with older snapshots using kernel 5.

Unfortunately I cannot yet reliably reproduce it. Does anyone have an idea what I can do?

trunneml commented 6 months ago

As my iPhones are the only AX devices I own, I just turned off AX in OpenWrt, so the 5 GHz channel is running in AC mode. Since then, no more problems.
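
A minimal sketch of this workaround with uci. It assumes the 5 GHz radio is the section named radio1 and that AX is selected through an HE* htmode (for example HE80); the section name and the helper name force_ac_mode are assumptions, not something stated in this thread.

```shell
# Hypothetical helper: switch a radio from an HE* (802.11ax) htmode to the
# VHT* (802.11ac) mode with the same channel width, e.g. HE80 -> VHT80.
force_ac_mode() {
    radio="$1"
    cur="$(uci -q get "wireless.${radio}.htmode")" || return 1
    case "$cur" in
        HE*) new="VHT${cur#HE}" ;;
        *)   echo "$radio: htmode is $cur, nothing to change"; return 0 ;;
    esac
    uci set "wireless.${radio}.htmode=${new}"
    uci commit wireless
    echo "$radio: $cur -> $new"
}

# Usage on the router (radio1 is an assumed section name), followed by a
# Wi-Fi reload so the change takes effect:
#   force_ac_mode radio1 && wifi reload
```

This only avoids the problem for AX clients by not offering AX at all; it does not address the underlying bug.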

Fail-Safe commented 6 months ago

GL-MT6000 (MT7986AV) about every 3-4 days (then sometimes also 4-5x per day). I cannot ping anything on the network from the phones and cannot ping the phones either. But they are still connected to the wifi. Disable/Enable wifi on the phones solves the issue temporarily.

I've noticed this behavior as well with 2 out of my three GL-MT6000s. This, despite my otherwise optimistic sounding post about stability here.

It is a very strange issue in that it only affects an STA or two at any time. Further, it seems to be only one MT6000 + a "problem" STA. If I force the stalled STA off the borked MT6000 to where the STA associates to a neighboring MT6000, the STA will pick up and begin communication again. But if it roams back to the borked MT6000, the STA just goes "cold"--stays associated but gives no sign of two-way data communication.

Given the challenge of trying to isolate the what and when around this issue, I've had a hard time articulating it enough to even consider opening an issue. But I'm glad to see this issue has been opened here and will be following intently now.

If I can provide any other supporting info, or test any patches, I'm very willing to help. Thanks!

Fail-Safe commented 6 months ago

What mt76 firmware versions are you all running on your devices?

At the present time for me:

# dmesg | grep Firmware
[   11.707186] platform 15010000.wed: MTK WED WO Firmware Version: DEV_000000, Build Time: 20240507160523
[   12.355218] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20240507160318
[   12.439591] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20240507160509

trunneml commented 6 months ago

I'm on 23.05-SNAPSHOT:

root@OpenWrt:~# dmesg | grep Firmware
[   18.233083] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20221012174805
[   18.480859] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 2022101217493

Delphius7 commented 6 months ago

I am on snapshot r26399-17ca4cccc6

# dmesg | grep Firmware
[   12.149544] mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20221012174725
[   12.231740] mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20221012174937
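
To make reports like these easier to compare, the build times can be pulled out of the dmesg lines. A small sketch, assuming the line format shown in the outputs above; the helper name fw_build_times is hypothetical, and it deliberately covers only the WM and WA wireless firmware lines.

```shell
# Hypothetical helper: print "<component> <build time>" for the WM and WA
# firmware lines found in dmesg-style text on stdin.
fw_build_times() {
    sed -n 's/.*\(W[MA]\) Firmware Version: [^,]*, Build Time: \([0-9]*\).*/\1 \2/p'
}

# Usage on the router:
#   dmesg | fw_build_times
```

In the outputs posted so far, the build times start with either 20221012 or 20240507.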

filippz commented 6 months ago

I'm having something similar with a pair of TUF-AX4200 connected via mesh, and I have some clients that suffer from the same issue while others seem to be working fine.

Possibly the same issue appears occasionally between the routers: the mesh is up but no traffic goes through, and tcpdump shows that ping packets from one router are received by the second one, which does reply, but the first one doesn't receive the reply. Luckily, I noticed that the mesh breaks immediately if I do iw scan, but I haven't figured out what is actually broken, let alone how to fix it. I even tried custom builds from https://github.com/pesa1234/mt76/ with different firmwares/patches adapted or taken from https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/ but the issue persists, to the point that I reverted to a TP-Link C2600 so I can test the issue without causing internet outages in my home.

Can anyone with a mesh setup try iw scan on the mesh interface and see if it breaks for them? I suspect it's the same issue, just one we can trigger at will, so it should be easier to figure out what's going on.
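
The test asked for above could be scripted roughly like this. It is only a sketch: the helper name scan_repro, the interface name and the peer address are placeholders, and it reads "iw scan" as "iw dev <interface> scan".

```shell
# Hypothetical helper: check whether a mesh peer is still reachable after
# triggering a scan on the mesh interface.
scan_repro() {
    iface="$1"; peer="$2"
    # Reachability before the scan.
    ping -c 3 -W 2 "$peer" >/dev/null 2>&1 && before=ok || before=fail
    # Trigger a scan on the mesh interface; its results are not needed here.
    iw dev "$iface" scan >/dev/null 2>&1
    sleep 2
    # Reachability after the scan.
    ping -c 3 -W 2 "$peer" >/dev/null 2>&1 && after=ok || after=fail
    echo "before scan: $before, after scan: $after"
}

# Usage (mesh0 and 10.0.0.2 are placeholders for your setup):
#   scan_repro mesh0 10.0.0.2
```

A result of "before scan: ok, after scan: fail" would match the behaviour described here.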

DanielRIOT commented 6 months ago

@filippz I think I'm seeing something similar on my Cudy WR3000V1 devices (MT7981). Mesh is configured and runs fine for about 10 minutes before mesh peers become unreachable (right after I get "nl80211: wpa_driver_nl80211_event_receive->nl_recvmsgs failed: -5" on the mesh peers). The device with the Ethernet-to-mesh bridge seems to keep working, and I can see DHCP requests from mesh peers on my DHCP server, but the replies do not seem to reach them. A "wifi up radio1" lets it work again for a few minutes.

 wireless.wmesh5=wifi-iface
 wireless.wmesh5.device='radio1'
 wireless.wmesh5.mode='mesh'
 wireless.wmesh5.mesh_id='M5-MESH'
 wireless.wmesh5.encryption='sae'
 wireless.wmesh5.key='meshmeshmesh'
 wireless.wmesh5.network='mesh5'
 wireless.wmesh5.mesh_fwding='0'
 wireless.wmesh5.mesh_rssi_threshold='0'

mesh_fwding is off because BATMAN-ADV will handle routing for me. The same system config works fine when I run it on a TP-Link EAP225 Outdoor (ath10k instead of mt76).

I wonder if whatever bug @PolynomialDivision was experiencing here is also related: https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/51?u=ddk

filippz commented 6 months ago

@filippz I think I'm seeing something similar on my Cudy WR3000V1 devices (MT7981). Mesh is configured and runs fine for about 10 minutes before mesh peers become unreachable (right after I get "nl80211: wpa_driver_nl80211_event_receive->nl_recvmsgs failed: -5" on the mesh peers).

Does it stop working immediately if you do iw scan on either side? For me, the mesh works for at least some hours and sometimes days. I tried generating more traffic with iperf3 but couldn't trigger the issue. I still believe it breaks sooner the more traffic it gets, but not that quickly.

The device with the Ethernet-to-mesh bridge seems to keep working, and I can see DHCP requests from mesh peers on my DHCP server, but the replies do not seem to reach them.

I tried using arping and noticed that I can get a reply to a broadcast (most of the time, not always) but not to unicast (https://github.com/openwrt/openwrt/issues/13880#issuecomment-2028800436). You can try arping / arping -b to see how it behaves for you.
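
A sketch of that comparison. The helper name arp_probe, the interface name and the peer address are placeholders. By default arping starts with a broadcast and switches to unicast once it has seen a reply, while -b keeps it on broadcast, so comparing the sent/received summaries of the two runs shows whether only unicast is affected.

```shell
# Hypothetical helper: run arping in default mode and in broadcast-only
# mode and print the summary lines of each run.
arp_probe() {
    iface="$1"; peer="$2"
    echo "== default: broadcast first, unicast after the first reply =="
    arping -c 5 -I "$iface" "$peer" 2>&1 | tail -n 2
    echo "== -b: broadcast only =="
    arping -b -c 5 -I "$iface" "$peer" 2>&1 | tail -n 2
}

# Usage (br-lan and 192.168.1.2 are placeholders for your setup):
#   arp_probe br-lan 192.168.1.2
```

The exit status is deliberately not used: a single reply to the initial broadcast is enough for arping to report success even if every following unicast probe goes unanswered.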

DanielRIOT commented 5 months ago

Hi @filippz, you are correct: I have a daemon that runs a scan every 10 minutes (ubus call iwinfo scan {\"device\":\"$phynum\" }) as part of a periodic survey on the system.

Disabling it and running my tests again now.

filippz commented 4 months ago

...a scan every 10 minutes ( ubus call iwinfo scan {"device":"$phynum" } ) ...

I tried building master with the MediaTek feed, and scan also breaks the mesh. I still can't say whether losing the network after some time is actually the same bug that we just trigger differently (with a scan on the mesh) or another bug altogether.

DanielRIOT commented 4 months ago

After removing my "survey daemon", I have 7 Cudy WR3000V1 (MT7981) devices that have been up for the entire weekend without issue (mesh on 2.4 GHz and 5 GHz, with an AP on 2.4 GHz as a VIF), so it seems that scanning breaks something.

filippz commented 4 months ago

@DanielRIOT - is your mesh now stable over longer periods?

DanielRIOT commented 4 months ago

Hi @filippz, yes it is. The mesh (802.11s / mesh point) is stable as long as I do not do any Wi-Fi scans (iw scan, ubus calls to iwinfo scan, or the like). I updated my build to master last Thursday and the behavior is the same as before.

lukasz1992 commented 4 months ago

Maybe you could try adding these patches, then compile and check if it helps?

https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/104-mt7915-add-missing-flush.patch
https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/131-partially-move-channel-change-code-to-core.patch
https://github.com/lukasz1992/openwrt/blob/v23.05.4-lukasz1992/package/kernel/mt76/patches/132-add-separate-tx-scheduling-queue-for-off-channel-tx.patch

filippz commented 4 months ago

@lukasz1992 Thanks for taking a look - sadly the issue persists.

I used the official OpenWrt repo at the v23.05.4 tag and added your patches to package/kernel/mt76/patches. While building I saw "Applying .../package/kernel/mt76/patches/104-mt7915-add-missing-flush.patch using plaintext" lines, so I guess the patches were applied successfully. For me, a broken scan is not a deal breaker, but the mesh fails with the same symptoms after a while, so I'm assuming that scan is just another way of triggering the issue.

lukasz1992 commented 4 months ago

And this? https://github.com/cmonroe/feed-wifi-master/commit/5fdb4113bd85d89bbe577c6cf7bb24b8576d73c9

filippz commented 4 months ago

And this? cmonroe/feed-wifi-master@5fdb411

I tried to build it, but the patch fails to apply. From manually looking at the code, I'd say that in the v23.05.4 kernel+backports the || local->scanning condition was not present, so this patch would not help. In any case, it seems to me that the changed code is not mt76-specific, so we would see the same issue on other drivers as well.

Delphius7 commented 3 months ago

It seems like I found a workaround that works for me. I installed the custom build by pesa1234 (based on r27280-aa1c1b6e29) on my MT6000, and for a bit more than a week I have had no more issues. Before this, I had the issue with 2 iPhones 3-5x per day.

https://github.com/pesa1234/MT6000_cust_build
https://forum.openwrt.org/t/mt6000-custom-build-with-luci-and-some-optimization-kernel-6-6-x/185241

I am not sure what exactly in this build solves the issue and I would be happy to have it solved in a clean snapshot build as well. Maybe someone with more knowledge can figure this out?

filippz commented 2 months ago

Hi @filippz, yes it is. The mesh (802.11s / mesh point) is stable as long as I do not do any Wi-Fi scans (iw scan, ubus calls to iwinfo scan, or the like). I updated my build to master last Thursday and the behavior is the same as before.

This commit fixes WiFi scan on mesh.