openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

MT7603E 2.4GHz interface stability issues #719

Open dfateyev opened 1 year ago

dfateyev commented 1 year ago

I would like to bring up the topic of MT7603E stability in the latest mt76 versions again. It's inspired by https://github.com/openwrt/openwrt/pull/11220 and the upcoming switch of the ramips target to the 5.15 kernel.

I have been testing a WF3526P / ZBT WE1326 device, equipped with MT7603E+MT7612E, for a while with "kernel-5.15" OpenWrt from master, and I am experiencing stability issues with the 2.4GHz (MT7603E) interface.

The problem is that the WiFi connection gets stuck at some point, most likely when the load increases and a lot of data is transferred over the wireless connection. The issue can be easily reproduced within a couple of minutes of running iperf3. When the issue occurs, the client stays connected to the AP (no visible changes), but no wireless traffic can pass over the connection. There are no errors in the logs on either the client side or the AP. The workaround is to reconnect to the AP, after which it works until the next connection hang.

There are several tickets on MT7603E: #669, #576, #419, #411, #391, #390, #375, etc. They're probably related, but show different symptoms. Also worth mentioning:

lukasz1992 commented 1 year ago

how about 21.02 branch?

I am afraid you need to do bisection on your own.

dfateyev commented 1 year ago

how about 21.02 branch?

I haven't tested 21.02, since after the 23.xx release it will be out of support (facing the same situation as with 19.xx). I probably need to gather more details.

I am afraid you need to do bisection on your own.

Probably, but I hope that there is still some interest from developers and the community in these devices. I can test any improvements on this hardware, but I'm mostly oriented towards fixes for master/23.xx versions, since we already got it working on (much) older versions.

khanjui commented 1 year ago

MT7603E sometimes stalls (randomly?) with my MacBook Air 2020 Intel without giving any errors, as described. Using MT76x0E (5 GHz) on the same device is fine. Using the latest snapshot with the 5.10 kernel.

Djfe commented 1 year ago

he asked you to test/bisect it with 21.02 to find out which openwrt version introduced the bug. that might reveal which change actually caused the bug and allow devs to fix it in 22 and/or 23 :)

he didn't ask you to test 21.02 so it could be fixed there 😅

gillg commented 1 year ago

For information, ever since I started using OpenWrt on my router I have had huge instabilities with my b/g/n WiFi. I started 3 or 4 years ago, so it was version 19.x as far as I remember...

In AP mode it disappears randomly, crashes, sometimes disconnects devices, etc. What I noticed is that when I enable an interface in client mode (with mobile phone connection sharing, for example) it magically becomes a lot more stable...

dfateyev commented 1 year ago

There is a regression between openwrt-19.07 and openwrt-21.02 with regard to MT7603E support in mt76. I can reproduce connection issues on an openwrt-21.02 snapshot after massive testing, but they are rare. With newer OpenWrt versions, it's more unstable. I will probably stay with openwrt-21.02 on MT7603E.

choice77 commented 1 year ago

Edit: Didn't work; random 2.4GHz disconnects continue.

I use the snapshot image from 9 January 2023. OpenWrt SNAPSHOT r21728-fc33c41c21 / LuCI Master git-22.361.69865-deed682

And I have the same problem (on this setup: 2.4GHz, 802.11n, HT20, channel 6).

I made the two config changes below and now, 12 hours later, I haven't had any disconnection (will continue testing).

1 - Disable multicast on the phy0-ap0 interface. To do this, go to LuCI and Interfaces > Devices > phy0-ap0

2 - Change the WPA2 WiFi cipher from [auto] to [aes]

PS: This bug doesn't show any warning, notice or information in the system log. I don't know if this bug is a cryptography key problem (TKIP or auto switch) or a multicast flood on wireless clients.

Update: Didn't work; random 2.4GHz disconnects continue.

choice77 commented 1 year ago

What I discovered:

My disconnections are not related to the number of 2.4GHz clients/STAs connected to the router (I have only one 2.4GHz device in my house).

I noticed that the random disconnections are related to weak/poor signal conditions. But these disconnections don't occur with the official TP-Link firmware on the Archer C6 v3.

My theory (I will test for 48 hours and report here): the "Time interval for rekeying GTK" option is too short in the OpenWrt default (set by the driver). The field value is set to only 300 seconds (for example, on DD-WRT the default is 3600 seconds).

In bad signal conditions, the frequent key renewals can cause hangs and disconnections on 2.4GHz wireless clients.
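
For anyone who wants to test the rekeying theory above, the interval can be raised per interface in /etc/config/wireless. This is only a sketch, assuming the standard OpenWrt `wpa_group_rekey` wifi-iface option; the section name 'default_radio0' and the 3600-second value are illustrative, not recommendations:

```
config wifi-iface 'default_radio0'
	# existing options (device, mode, ssid, key, ...) omitted
	# force CCMP ("aes") instead of the automatic cipher choice
	option encryption 'psk2+ccmp'
	# GTK rekey interval in seconds (illustrative value)
	option wpa_group_rekey '3600'
```

The wireless configuration has to be reloaded afterwards (for example with `wifi reload`) for the change to take effect.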

webysther commented 1 year ago

How I fixed it in my case:

Now stable across all clients.

shown19 commented 1 year ago

MT7603e does not handle SMPS well, making 2.4GHz WiFi disappear or system crash or connection unstable or lost #576, but SMPS should be already disabled for #MT7603E

I don't have deep knowledge of the system, but what if, after all this time, the solution is to enable SMPS but with code refined to work with MT7603E? I don't know if it's worth a try, though; like I said, my knowledge is limited.

Djfe commented 1 year ago

looking at the codebase SMPS is supported and enabled as far as I can tell. The code was added in Jan 2019 so it might be part of openwrt-19.07, but I haven't checked. https://github.com/openwrt/mt76/commit/fc31457cd99cb85c8cea9329eedc5edd80038f29

@dfateyev what made you think it was disabled? easyteacher closed their PRs before they were ever merged

shown19 commented 1 year ago

@Djfe Hello, I think he meant this?

sm disabled

if so, mine is disabled too, device is Newifi D2

Djfe commented 1 year ago

makes me curious whether it is also disabled on openwrt-19.07 (I don't own an affected device)

shown19 commented 1 year ago

makes me curious whether it is also disabled on openwrt-19.07 (I don't own an affected device)

by the way, I'm on the latest snapshot build now, and this is also disabled in the stable release v22.03.5, but I'm not sure whether it was already disabled in older versions down to v19.07. I might take a look at it when I have more spare time again.

shown19 commented 1 year ago

makes me curious whether it is also disabled on openwrt-19.07 (I don't own an affected device)

I don't have deep knowledge about this, but when I checked the code based on easyteacher's info (since I'm also learning how to compile), I could actually see that the SMPS code block is enabled. Unfortunately, after flashing, it isn't enabled, and I don't know why. Maybe there's a conditional statement or something that disables it on some mt76 devices? Sorry, my knowledge is limited.

dfateyev commented 1 year ago

what made you think it was disabled? easyteacher closed their PRs before they were ever merged

makes me curious whether it is also disabled on openwrt-19.07 (I don't own an affected device)

Besides the SM power save being in the "disabled" state, I didn't manage to trigger any SMPS-related events while testing this board last year. The SMPS option is already present in v19.07: SM Power Save disabled. I still have one MT7603E under v19.07.
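
For anyone comparing devices: the state quoted above appears in the HT capabilities that `iw` prints for the radio. A small sketch, assuming the 2.4GHz radio is `phy0` (adjust for your device); `smps_state` is just a helper name made up for this example, and the sample input is the line quoted in this thread:

```shell
# Print the SM Power Save line from `iw` output read on stdin.
# smps_state is a hypothetical helper name, not part of iw.
smps_state() {
    grep -i -m1 'SM Power Save' | sed 's/^[[:space:]]*//'
}

# On the router (assuming the radio is phy0):
#   iw phy phy0 info | smps_state

# Demonstration with the line quoted in this thread:
printf '\t\tSM Power Save disabled\n' | smps_state
```

Running the same command on 19.07 and on a newer build makes it easy to see whether the reported state differs between releases.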

shown19 commented 1 year ago

@dfateyev hi, may I know what specific v19.07 of openwrt you're using?

dfateyev commented 1 year ago

may I know what specific v19.07 of openwrt you're using?

OpenWrt 19.07.10, r11427-9ce6aa9d8d, device ZBT WE1326 / WE3526.

malekairmaroc7 commented 1 year ago

This also applies to me. The 2.4GHz is very unstable causing connection crash and reconnect attempts. I am using a TP Link Archer C6 V3 (EU) running OpenWRT 22.03.5 (DISTRIB_DESCRIPTION: OpenWrt 22.03.5 r20134-5f15225c1e)

ShredRum commented 1 year ago

Same. Xiaomi Router 4A (R4AC) OpenWrt SNAPSHOT r23454-01885bc6a3 / LuCI Master git-23.158.78004-23a246e

malekairmaroc7 commented 1 year ago

Aren't there any alternative drivers?

lukasz1992 commented 1 year ago

there are, but they are incompatible with the luci installed by default (there is a mediatek module for luci where it works). Also, uci2dat is needed to sync the config with uci

malekairmaroc7 commented 1 year ago

I see. Too bad.

nbd168 commented 1 year ago

Please try latest OpenWrt master or 23.05 branch

dfateyev commented 1 year ago

@nbd168 I have been testing 23.05 branch on MT7603E for a week (commit c697057b from Aug 05, 2023).

The issue with 2.4GHz stability is still present: with 802.11n, 20MHz bandwidth and WPA2 on the AP, running iperf3 from an AP client to a DMZ host leads to a load average (LA) of 0.8-0.9 on the AP and a stuck WLAN connection. I also disabled NAT and MSS clamping on the AP, but it didn't improve the LA or the stability. There are no relevant logs on either the AP or the client side.

The good news is that legacy 802.11g mode is now fully stable: I hammered it with iperf3 for days without a drop. However, it offers low bandwidth (the LA on the AP doesn't go beyond 0.4), which in general makes the AP much less useful.

nbd168 commented 1 year ago

Please try this patch on top of current mt76: https://nbd.name/p/762e9946

dfateyev commented 1 year ago

Please try this patch on top of current mt76: https://nbd.name/p/762e9946

I applied the patch against mt76 master and used it with an "openwrt-23.05" build (commit b59d02be). I noticed a slightly decreased LA, but while loading the AP with iperf3 from 2 clients, the AP crashed/restarted within 2-3h. I repeated the same test with BW load, and the AP went unresponsive in 2-3h again, this time w/o reboot: although the LEDs are active, there is no WLAN on the air and no LAN access. It seems I cannot provide a crash log from the AP, sorry. During the load test, I also saw an increasing beacon stuck count, similar to https://github.com/openwrt/mt76/issues/793#issuecomment-1680167853.

shown19 commented 1 year ago

AP went unresponsive in 2-3h again, this time w/o reboot, although LEDs are active, there is no WLAN in air and no LAN access

Oh, I thought I was the only one. I also experienced this, and it's one of the reasons why I disabled the WiFi and used another access point.

DragonBluep commented 1 year ago

Can you show us the output of dmesg | grep -i mt76

dfateyev commented 1 year ago

Can you show us the output of dmesg | grep -i mt76

Unfortunately I cannot, since WLAN and LAN access to the AP is lost. It looks like AP is alive but unresponsive via network. I probably need a serial console, but it would require pin soldering, etc.

Djfe commented 1 year ago

but it's alive once you reboot? in that case make your openwrt device send its logs over lan to another device. that way the log should contain the relevant parts of what goes wrong the next time it happens (even without serial access)
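
A sketch of what that could look like in /etc/config/system, assuming the stock OpenWrt logd with its `log_ip`, `log_port` and `log_proto` options; 192.168.1.10 is a made-up address for a LAN host running a syslog receiver:

```
config system
	# existing options (hostname, timezone, ...) omitted
	# hypothetical LAN host that collects the logs
	option log_ip '192.168.1.10'
	option log_port '514'
	option log_proto 'udp'
```

After restarting the log service (`/etc/init.d/log restart`), the last messages before a hang should end up on the receiving host, even if the AP itself becomes unreachable.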

dfateyev commented 1 year ago

The patch above is already in master. I cleaned everything up and re-tested "openwrt-23.05" with mt76 master. I see a big improvement: hammering the AP (both radio interfaces) from several iperf3 clients for 3 days, I saw no hangs with modern wireless clients. The LA on the AP reached 2 and beyond, and the throughput went up and down, but the 2.4GHz connection persisted. The beacon stuck count was 0 after passing 1Tb of wireless traffic.

I still observe periodic 2.4GHz hangs with older WLAN cards (like Intel AC7265), but newer ones (like AX201) work stably. Another issue is performance: with default settings, I cannot get more than 45-50Mbits/sec via 2.4GHz with WPA2+CCMP; with a proprietary driver I got more. Overall, it's a big step forward, and we could provide it to OpenWrt for further testing.

DragonBluep commented 1 year ago

I still observe periodic 2.4GHz hangs with older WLAN cards (like Intel AC7265), but newer ones (like AX201) work stably. Another issue is performance: with default settings, I cannot get more than 45-50Mbits/sec via 2.4GHz with WPA2+CCMP; with a proprietary driver I got more. Overall, it's a big step forward, and we could provide it to OpenWrt for further testing.

Please ensure that you have turned off Bluetooth before the speed test. And Intel 7265AC rev.C is very bad. I have suffered a lot before.

dfateyev commented 1 year ago

Please ensure that you have turned off Bluetooth before the speed test.

I did test with Bluetooth disabled, and experimented with 20/40MHz bandwidth, a max power of 20dBm, etc., but the results were always 45-50Mbits/sec. I remember the proprietary driver gave me about 60+Mbits/sec on this device. Rather sad numbers (~40Mbits/s), considering that from the same client I easily get 97-107Mbits/sec from an MT7915E 2.4GHz OpenWrt AP nearby.

And Intel 7265AC rev.C is very bad. I have suffered a lot before.

Yes, I know; I was just interested in checking 7265 cards for this case. They will be decommissioned soon.

DragonBluep commented 1 year ago

These are my test results from a few days ago. I can get 150+ Mbps when the AP is 3 meters away from my computer. https://github.com/openwrt/mt76/issues/793#issuecomment-1682267025 https://github.com/openwrt/mt76/issues/793#issuecomment-1683783361

I have experienced the issue of Wi-Fi speed reduction caused by WPA2. But after I updated/rebooted the Windows OS, the problem disappeared.

Edit: My NIC is MT7921.

dfateyev commented 1 year ago

Continuing with the Intel AX201 client (running Linux kernel 6.4.12), I noticed a lot of messages like these in the client logs:

[Wed Aug 30 16:43:28 2023] wlp0s20f3: authenticate with 78:a3:51:6a:xx:xx
[Wed Aug 30 16:43:28 2023] wlp0s20f3: 80 MHz not supported, disabling VHT
[Wed Aug 30 16:43:28 2023] wlp0s20f3: send auth to 78:a3:51:6a:xx:xx (try 1/3)
[Wed Aug 30 16:43:28 2023] wlp0s20f3: authenticated
[Wed Aug 30 16:43:28 2023] wlp0s20f3: associate with 78:a3:51:6a:xx:xx (try 1/3)
[Wed Aug 30 16:43:28 2023] wlp0s20f3: RX AssocResp from 78:a3:51:6a:xx:xx (capab=0x431 status=0 aid=1)
[Wed Aug 30 16:43:28 2023] wlp0s20f3: associated
[Wed Aug 30 16:43:37 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
[Wed Aug 30 16:43:38 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
[Wed Aug 30 16:43:39 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:44:02 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:44:06 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:44:29 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
...
[Wed Aug 30 16:54:18 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 6, stopping BA session on TID 0
[Wed Aug 30 16:54:18 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 6, stopping BA session on TID 0
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 6, stopping BA session on TID 0
[Wed Aug 30 16:54:19 2023] net_ratelimit: 8 callbacks suppressed
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:54:19 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:54:21 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 6, stopping BA session on TID 0
[Wed Aug 30 16:54:41 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 11, stopping BA session on TID 0
[Wed Aug 30 16:54:43 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 11, stopping BA session on TID 0
[Wed Aug 30 16:54:44 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 11, stopping BA session on TID 0
...
[Wed Aug 30 16:56:34 2023] net_ratelimit: 4 callbacks suppressed
[Wed Aug 30 16:56:34 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:56:34 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:56:34 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:56:34 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:56:38 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 3, stopping BA session on TID 0
[Wed Aug 30 16:56:38 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 3, stopping BA session on TID 0
[Wed Aug 30 16:56:41 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:56:43 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:56:47 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 3, stopping BA session on TID 0
[Wed Aug 30 16:56:49 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 3, stopping BA session on TID 0
...
[Wed Aug 30 17:06:57 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 5, stopping BA session on TID 0
[Wed Aug 30 17:06:58 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 4, stopping BA session on TID 0
[Wed Aug 30 17:07:00 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 5, stopping BA session on TID 0

The connection doesn't break (which is good), but it slows down when this happens. The beacon stuck and MCU hang counters are 0. I'm still unsure whether I have signal interference from Bluetooth or other devices, or whether there is another issue.
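
To see how often this happens per queue, the messages can be tallied from the client's kernel log. A minimal sketch using standard grep/sed/sort/uniq/awk; `count_ba_stops` is a helper name made up for this example, and the sample input is copied from the log above:

```shell
# Count "stopping BA session" events per queue from kernel log text on stdin.
# count_ba_stops is a hypothetical helper name.
count_ba_stops() {
    grep 'old SN frames' |
        sed -n 's/.* on queue \([0-9][0-9]*\), stopping BA session.*/\1/p' |
        sort -n | uniq -c |
        awk '{ printf "queue %s: %s\n", $2, $1 }'
}

# On the client: dmesg -T | count_ba_stops
# Demonstration with a few lines from the log above:
count_ba_stops <<'EOF'
[Wed Aug 30 16:43:37 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 10, stopping BA session on TID 0
[Wed Aug 30 16:43:56 2023] iwlwifi 0000:00:14.3: Unhandled alg: 0x703
[Wed Aug 30 16:44:29 2023] iwlwifi 0000:00:14.3: reached 20 old SN frames from 78:a3:51:6a:xx:xx on queue 8, stopping BA session on TID 0
EOF
```

Comparing the totals between 802.11n and legacy mode, or between different clients, gives a rough measure of whether a change on the AP side makes the retransmissions better or worse.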

dfateyev commented 1 year ago

The frame retransmission issue above is the same as https://github.com/openwrt/mt76/issues/569. If I switch to legacy mode, the messages disappear, along with a 3x drop in WLAN speed, as expected.

malekairmaroc7 commented 11 months ago

I still notice unstable 2.4 GHz Wi-Fi on the TP-Link Archer C6 (EU V3). It doesn't reconnect very often now, but the internet connection is pretty slow. Not sure if it's still a known driver issue.

Edit: the router is running OpenWRT 23.05.0.

lukasz1992 commented 11 months ago

@malekairmaroc7 create a new issue