openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

WI-FI is unstable at 2.4 GHz #793

Closed ShredRum closed 1 year ago

ShredRum commented 1 year ago

Hello, I have a Xiaomi Router 4A (R4AC) running OpenWrt SNAPSHOT r23454-01885bc6a3 / LuCI Master git-23.158.78004-23a246e

From time to time, when the 2.4 GHz Wi-Fi is under load, the network disappears and then reappears a couple of seconds later. The log shows nothing except the devices disconnecting from and reconnecting to Wi-Fi. I also managed to catch a driver crash once, but I don't think it is related to the problem (it never showed up again).

Disabling WMM mode helps, but the network speed drops below 20 Mbps.

This problem does not appear on a 5 GHz network.

Below is the crash log of the driver, but keep in mind that it is not reproducible and appeared only once.

Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.656151] ------------[ cut here ]------------
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.660885] WARNING: CPU: 0 PID: 511 at target-mipsel_24kc_musl/linux-ramips_mt76x8/mt76-2023-05-13-969b7b5e/mt7603/mac.c:208 mt7603_filter_tx+0x178/0x180 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.675870] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 lzo cfg80211 slhc nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 lzo_rle lzo_decompress lzo_compress libcrc32c crc_ccitt compat sha512_generic sha256_generic libsha256 seqiv jitterentropy_rng drbg hmac cmac crypto_acompress leds_gpio gpio_button_hotplug crc32c_generic
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.744421] CPU: 0 PID: 511 Comm: napi/phy0-3 Not tainted 5.15.118 #0
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.750972] Stack : 00000000 00000000 81a39c7c 808e0000 80720000 8066c410 80e33d00 8071de83
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.759499] 808e33b4 000001ff 00000000 80061ae4 80665a7c 00000001 81a39c38 1a20d335
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.768015] 00000000 00000000 8066c410 81a39ad0 ffffefff 00000000 00000000 ffffffea
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.776537] 00000000 81a39adc 000000d7 807242f8 808e0000 00000009 00000000 81a04688
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.785059] 00000009 00000000 00003a98 80000000 00000018 80340db8 00000000 808e0000
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.793577] ...
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.796060] Call Trace:
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.798535] [<8000702c>] show_stack+0x28/0xf0
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.802998] [<800261c0>] __warn+0x9c/0x124
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.807165] [<800262a4>] warn_slowpath_fmt+0x5c/0xac
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.812230] [<81a04688>] mt7603_filter_tx+0x178/0x180 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.818272] [<81a04818>] mt7603_wtbl_set_ps+0x12c/0x134 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.824492] [<81a01a90>] mt7603_sta_ps+0x38/0x434 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.830184] [<81a75984>] mt76_rx_poll_complete+0x520/0x638 [mt76]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.836417] [<81a72288>] mt76_dma_rx_poll+0x284/0x4fc [mt76]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.842204] [<803f773c>] __napi_poll+0x70/0x1f8
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.846817] [<803f7a00>] napi_threaded_poll+0x13c/0x188
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.852145] [<8004604c>] kthread+0x140/0x164
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.856505] [<80002478>] ret_from_kernel_thread+0x14/0x1c
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.862005]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [ 210.863519] ---[ end trace 64b883a3276bd278 ]---

DragonBluep commented 1 year ago

@DragonBluep Could you reproduce the speed issue with newer firmware? I mean overwrite current mt7603_e2.bin with this: https://raw.githubusercontent.com/ptpt52/mt76/e67f2d76f15cb4120b28d7cb1f566dbff762b89f/firmware/mt7603_e2.bin

Yes, I always use the new firmware. Just in case, I also tested the old firmware in the mt76 repository.

It seems that it only affects the iperf3 results. I can get 90+ Mbps on https://www.speedtest.net/. My carrier only provides 100M bandwidth so I cannot verify it.

DragonBluep commented 1 year ago

try the latest 23.05 snapshot or just snapshot

@lukasz1992 @DragonBluep Sorry guys but is the patch unrelated to the issue we experience from the main post? Does it mean, I don't need this patch? or should I still be needing this for stability?

You need the patch and the mt76 master branch to make sure you get all necessary fixes. Changing the fragmentation threshold may only work for mt7628.

lukasz1992 commented 1 year ago

@DragonBluep I wonder what parameters you pass to the iperf3 client and server. In my case, on mt7915e, iperf3 gave reduced results; increasing the number of streams from 1 helped a lot.

Running iperf3 on the router is not the best solution, as the router then handles the traffic as client/server. A router generally only routes/NATs Internet traffic, so handling it could worsen the speed results.

DragonBluep commented 1 year ago

@DragonBluep I wonder what parameters you pass to the iperf3 client and server. In my case, on mt7915e, iperf3 gave reduced results; increasing the number of streams from 1 helped a lot.

server on router/WAN side: iperf3 -s -D
client Windows 11 with MT7921: iperf3 -c 192.x.x.x -R -P 4

Running iperf3 on the router is not the best solution, as the router then handles the traffic as client/server. A router generally only routes/NATs Internet traffic, so handling it could worsen the speed results.

I agree. But MT7621 can provide a 600+ Mbps iperf3 server rate, so that is not important for MT7603. I also tried running it on the WAN side and did the same test on mt7612/mt7628. Not sure whether it's a driver bug or a hostapd bug.
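
The effect of the stream count can be checked with a small sweep. This is only a sketch, assuming iperf3 is installed on the client and a server is already running (started with iperf3 -s -D as above). The function name sweep_streams, the IPERF3 override variable and the 10-second duration are illustrative choices, not something used in this thread:

```sh
#!/bin/sh
# Run a reverse-mode iperf3 test once per parallel-stream count.
# IPERF3 can be overridden (e.g. IPERF3="echo iperf3") for a dry run.
IPERF3="${IPERF3:-iperf3}"

sweep_streams() {
        server="$1"; shift
        for n in "$@"; do
                echo "== $n stream(s) =="
                # -c: client mode, -R: server sends, -P: parallel streams, -t: seconds
                $IPERF3 -c "$server" -R -P "$n" -t 10
        done
}
```

Example call: sweep_streams 192.x.x.x 1 2 4 8 (substitute the real server address). Comparing the summary lines per stream count shows whether a single stream is the limiting factor rather than the radio.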

DragonBluep commented 1 year ago

@lukasz1992 I updated Windows system today and it seems that iperf3 issue has disappeared.

Another speed test, copy files via samba:

encryption                           open      wpa2
client1 -> lan -> mt7603 -> client2  173 Mbps  170 Mbps
client2 -> mt7603 -> lan -> client1  235 Mbps  234 Mbps

shown19 commented 1 year ago

try the latest 23.05 snapshot or just snapshot

@lukasz1992 @DragonBluep Sorry guys but is the patch unrelated to the issue we experience from the main post? Does it mean, I don't need this patch? or should I still be needing this for stability?

After testing the snapshot build I compiled on August 18 with the patch applied, I can see a big improvement on 2.4 GHz; it is by far the best update I have received after years of instability on my Newifi D2 2.4 GHz (mt7603e). No more CPU warnings and no more beacon stuck or hanging, as far as my testing goes. Thank you guys.

Sadly, on my 5 GHz wireless (MT76x2e) I have been experiencing this issue ever since OpenWrt 19 up to the current version, and the only workaround I could think of is to lower the TX power from 20 dBm to 13 or 12 dBm, so my guess would be a hardware fault, right?

DragonBluep commented 1 year ago

@shown19 MT7612 hardware restart issue is caused by USB3.0 port. Please focus on this issue as it goes beyond the topic here. https://github.com/openwrt/mt76/issues/457#issuecomment-769415292

shown19 commented 1 year ago

@shown19 MT7612 hardware restart issue is caused by USB3.0 port. Please focus on this issue as it goes beyond the topic here. #457 (comment)

I actually commented in that post dated July 21 and don't even have hdd inserted but still got that issue. Anyway, this is unrelated issue to this topic here. I will try to figure this out. Thank you.

nachalni commented 1 year ago

Hi, has anyone tried to do long term tests with MT7628 with most recent version of the driver ?

lukasz1992 commented 1 year ago

@nachalni no, we thought that you would do them :)

Joking - fix is too recent to have long-term test

nachalni commented 1 year ago

We did some tests with multiple clients running heavy HTTP traffic for 24 hrs to/from WAN. The reset file output only had this one exception:
RX PSE busy stuck: 1
MCU didn't hang, clients still connected to AP

dfateyev commented 1 year ago

I actually commented in that post dated July 21 and don't even have hdd inserted but still got that issue.

@shown19 just for the record, I have a similar board:

[   12.168762] pci 0000:00:01.0: enabling device (0000 -> 0003)
[   12.174542] mt76x2e 0000:01:00.0: enabling device (0000 -> 0002)
[   12.180790] mt76x2e 0000:01:00.0: ASIC revision: 76120044
[   13.022770] mt76x2e 0000:01:00.0: ROM patch build: 20141115060606a
[   13.071036] mt76x2e 0000:01:00.0: Firmware Version: 0.0.00
[   13.076604] mt76x2e 0000:01:00.0: Build: 1
[   13.080692] mt76x2e 0000:01:00.0: Build Time: 201607111443____
[   13.110034] mt76x2e 0000:01:00.0: Firmware running!

but remember having "Hardware restart was requested" in logs just once or twice since 19.07. I suspect in your case there may be HW interference or a short circuit, etc.

shown19 commented 1 year ago

but remember having "Hardware restart was requested" in logs just once or twice since 19.07. I suspect in your case there may be HW interference or a short circuit, etc.

That might be the case. Also, whenever I lower the TX power from 20 dBm to 12 dBm, it doesn't log "Hardware restart was requested", so my guess is that the 5 GHz chipset maybe cannot handle higher power anymore. I didn't experience this for months; it started happening frequently all of a sudden.

Linaro1985 commented 1 year ago

@DragonBluep @nbd168 I continue to test the mt7628 devices under hard conditions with many connected clients (about 20) on an OpenWrt 22.03 snapshot. Sometimes Wi-Fi stops working and the "Beacon stuck" counter in /sys/kernel/debug/ieee80211/phy0/mt76/reset keeps incrementing until I run wifi down && wifi up. After that Wi-Fi continues to work.

Update: after a while Wi-Fi recovers itself
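
For anyone who wants to watch these counters, here is a minimal sketch of a helper that prints one counter from the reset file. It assumes the file consists of lines of the form "name: number" with optional leading spaces, which matches the two counter names quoted in this thread but has not been checked against every mt76 version; the function name reset_counter and the RESET_FILE variable are made up for illustration:

```sh
#!/bin/sh
# Print the value of one counter from an mt76 "reset" debugfs file.
# Assumed line format: "<spaces><name>: <number>".
RESET_FILE="${RESET_FILE:-/sys/kernel/debug/ieee80211/phy0/mt76/reset}"

reset_counter() {
        name="$1"
        file="${2:-$RESET_FILE}"
        # Strip leading whitespace and the "name: " prefix, print what remains
        sed -n "s/^[[:space:]]*$name:[[:space:]]*//p" "$file"
}
```

Example call: reset_counter 'Beacon stuck'. The pattern is anchored at the start of the line, so a short name does not accidentally match a longer counter name that merely ends the same way.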

Djfe commented 1 year ago

Does your snapshot already contain the commit from three days ago that contains the latest fixes? https://github.com/openwrt/openwrt/commit/76b1e564d202c09d0035315eb6e958a9b0dd4491

Linaro1985 commented 1 year ago

Does your snapshot already contain the commit from three days ago that contains the latest fixes? openwrt/openwrt@76b1e56

Yes. It contains.

DragonBluep commented 1 year ago

@Linaro1985 Have you tested the main branch? Did the "Beacon stuck" counter increase before the SSID disappears or after the SSID disappears?

In the previous PSE reset issue, about one minute after the SSID disappeared, the PSE reset counter increased by 1, and finally WiFi returned to normal state. It seems that the AP will enter an uncontrolled state in some conditions and watchdog doesn't catch the MCU hang.

Dahhyunnee commented 1 year ago

@Linaro1985 add these 3 lines in /etc/config/wireless - wifi-device

option frag '2346'
option rts '2347'
list ht_capab 'SMPS-STATIC'

Stable Wi-Fi using 23.05-SNAPSHOT r23400 & 22.03-SNAPSHOT r20213 :)
R4A Gigabit - 23.05-SNAPSHOT r23400
R4A 100M - 22.03-SNAPSHOT r20213
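
The frag and rts options can also be set from the shell with uci instead of editing the file by hand. A minimal sketch, assuming the wifi-device section is named radio0 (section names differ between devices); the function name set_wifi_thresholds is made up for illustration, and the ht_capab line is deliberately left out:

```sh
#!/bin/sh
# Set the fragmentation and RTS thresholds on one wifi-device section.
# Only writes the config; Wi-Fi must be restarted afterwards to apply it.
set_wifi_thresholds() {
        radio="${1:-radio0}"
        uci set "wireless.$radio.frag=2346"
        uci set "wireless.$radio.rts=2347"
        uci commit wireless
}
```

Example call: set_wifi_thresholds radio0, followed by wifi down && wifi up to apply the change.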

Linaro1985 commented 1 year ago

add this 3 lines in /etc/config/wireless - wifi-device

@Dahhyunnee thanks! I'll try it. Have you tried without these parameters with the latest changes on OpenWrt 22.03/23.05/main?

Have you tested the main branch?

@DragonBluep Only on devices that are near me, and there are no problems there. But I have multiple access points in a remote office, with many connected clients. After updating from version 19.07 to the latest 22.03 snapshot I have this problem. Because this is a work office, I cannot install a version higher than 22.03 to check the 23.05/main branches.

Did the "Beacon stuck" counter increase before the SSID disappears or after the SSID disappears?

Good question, but I can't reproduce this moment myself. It may happen after some time. The Wi-Fi SSID disappears and no client exchanges data with the AP.

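As a temporary workaround I wrote a small script (run by cron every minute):

#!/bin/sh
# Reload Wi-Fi when the "Beacon stuck" counter grew by more than 50 since the last run.

# Current counter value from the mt76 debugfs reset statistics
bc=$(sed -n 's/        Beacon stuck: //p' /sys/kernel/debug/ieee80211/phy0/mt76/reset)
# Value saved by the previous run (0 on the first run)
[ -f /tmp/bc ] && bcold=$(cat /tmp/bc) || bcold=0
[ $((bc-bcold)) -gt 50 ] && {
        logger -t "wifi" "reload because of beacon stuck detected"
        wifi down
        wifi up
}
# Remember the counter for the next run
echo "$bc" > /tmp/bc

Linaro1985 commented 1 year ago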

@DragonBluep before "Beacon stuck" occurs in the log there are the following suspicious lines

Wed Aug 30 11:41:00 2023 cron.err crond[3755]: USER root pid 4627 cmd /root/wifi_check.sh
Wed Aug 30 11:41:06 2023 daemon.info hostapd: ap1003: STA c2:c8:25:c9:d2:57 IEEE 802.11: authenticated
Wed Aug 30 11:41:06 2023 daemon.info hostapd: ap1003: STA c2:c8:25:c9:d2:57 IEEE 802.11: associated (aid 12)
Wed Aug 30 11:41:07 2023 daemon.notice hostapd: ap1003: AP-STA-CONNECTED c2:c8:25:c9:d2:57
Wed Aug 30 11:41:07 2023 daemon.info hostapd: ap1003: STA c2:c8:25:c9:d2:57 RADIUS: starting accounting session EEDEBF5BA2B7E0DD
Wed Aug 30 11:41:07 2023 daemon.info hostapd: ap1003: STA c2:c8:25:c9:d2:57 WPA: pairwise key handshake completed (RSN)
Wed Aug 30 11:41:07 2023 daemon.notice hostapd: ap1003: EAPOL-4WAY-HS-COMPLETED c2:c8:25:c9:d2:57
Wed Aug 30 11:42:00 2023 cron.err crond[3755]: USER root pid 4638 cmd /root/wifi_check.sh
Wed Aug 30 11:42:07 2023 daemon.notice hostapd: ap1003: AP-STA-DISCONNECTED 9a:2c:da:e5:97:d3
Wed Aug 30 11:42:10 2023 daemon.notice hostapd: ap1001: AP-STA-DISCONNECTED 44:d7:91:0f:bb:23
Wed Aug 30 11:42:11 2023 daemon.info hostapd: ap1001: STA 44:d7:91:0f:bb:23 IEEE 802.11: authenticated
Wed Aug 30 11:42:11 2023 daemon.info hostapd: ap1001: STA 44:d7:91:0f:bb:23 IEEE 802.11: authenticated
Wed Aug 30 11:42:11 2023 daemon.info hostapd: ap1001: STA 44:d7:91:0f:bb:23 IEEE 802.11: authenticated
Wed Aug 30 11:42:11 2023 daemon.info hostapd: ap1001: STA 44:d7:91:0f:bb:23 IEEE 802.11: authenticated
Wed Aug 30 11:42:12 2023 daemon.info hostapd: ap1001: STA 44:d7:91:0f:bb:23 IEEE 802.11: authenticated
Wed Aug 30 11:42:39 2023 daemon.notice hostapd: ap1001: AP-STA-DISCONNECTED a0:88:b4:de:5a:64
Wed Aug 30 11:42:39 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:40 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:40 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:42 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:42 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:43 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:43 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:44 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:44 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:42:45 2023 daemon.info hostapd: ap1001: STA a0:88:b4:de:5a:64 IEEE 802.11: authenticated
Wed Aug 30 11:43:00 2023 cron.err crond[3755]: USER root pid 4641 cmd /root/wifi_check.sh
Wed Aug 30 11:43:00 2023 user.notice wifi: reload because of beacon stuck detected

/etc/config/wireless

config wifi-device 'radio0'
        option type 'mac80211'
        option path 'platform/10300000.wmac'
        option band '2g'
        option htmode 'HT40'
        option distance 'auto'
        option cell_density '2'
        option channel '2'

config wifi-iface 'ap_1001'
        option device 'radio0'
        option network 'wan'
        option mode 'ap'
        option ifname 'ap1001'
        option encryption 'psk2+ccmp'
        option disassoc_low_ack '0'
        option ssid '****'
        option key '**********'

config wifi-iface 'ap_1003'
        option device 'radio0'
        option network 'guest'
        option mode 'ap'
        option ifname 'ap1003'
        option ssid '***_Guest'
        option encryption 'psk2+ccmp'
        option disassoc_low_ack '0'
        option key '********'
        option isolate '1'

Maybe this will help in finding the cause.

DragonBluep commented 1 year ago

@Linaro1985 It seems that the SSID has already returned to normal state before crontab runs your script. These IEEE 802.11: authenticated logs indicate that the client has reconnected to the AP.

Linaro1985 commented 1 year ago

These IEEE 802.11: authenticated logs indicate that the client has reconnected to the AP.

I don't know why there are so many authenticated messages from one client. The client signal level is normal. And without the script, Wi-Fi no longer works normally. I'll try to update the device to the 23.05 snapshot, but it will take time to make my own build with configs.

DragonBluep commented 1 year ago

@Linaro1985 Not sure if this new firmware will help, but it's worth trying. I cannot reproduce your problem in daily use; perhaps it only appears in complex scenarios. https://github.com/openwrt/mt76/blob/7c57e0f6ff8e94b333a6a117cb8f261bc6ae1a32/firmware/mt7628_e2.bin

dfateyev commented 1 year ago

list ht_capab 'SMPS-STATIC'

Just curious: does this option require a specific hostapd? I haven't found this option here, and setting it manually with the default hostapd doesn't seem to have an effect:

config wifi-device 'radio0'
    option type 'mac80211'
    option band '2g'
    list ht_capab 'SMPS-STATIC'
    option htmode 'HT20'
...

root@OpenWrt:~# cat /var/run/hostapd-phy0.conf | grep capab
ht_capab=[SHORT-GI-20][SHORT-GI-40][TX-STBC][RX-STBC1]
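
The same check can be wrapped in a small helper that tests for one specific flag. This is only a sketch, assuming the generated config lists capabilities as bracketed flags on a single ht_capab= line, as in the output above; the function name has_ht_capab is made up for illustration:

```sh
#!/bin/sh
# Succeed if the generated hostapd config advertises the given HT capability flag.
has_ht_capab() {
        flag="$1"
        conf="${2:-/var/run/hostapd-phy0.conf}"
        # ht_capab is a run of bracketed flags, e.g. [SHORT-GI-20][TX-STBC];
        # -F matches the bracketed flag literally instead of as a regex
        grep '^ht_capab=' "$conf" | grep -qF "[$flag]"
}
```

Example call: has_ht_capab SMPS-STATIC && echo present || echo missing. With the config shown above this would report the flag as missing.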

Dahhyunnee commented 1 year ago

Yes, it does not work anymore. I removed that option and this is my final configuration.

config wifi-device 'radio0'
        option type 'mac80211'
        option path '1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0'
        option band '2g'
        option htmode 'HT40'
        option channel '1'
        option country 'PH'
        option noscan '1'
        option txpower '14'
        option frag '2346'
        option rts '2347'
        option cell_density '3'
        option log_level '4'

Dahhyunnee commented 1 year ago

It was already disabled by default.

    Band 1:
            Capabilities: 0x1fe
                    HT20/HT40
                    SM Power Save disabled
                    RX Greenfield
                    RX HT20 SGI
                    RX HT40 SGI
                    TX STBC
                    RX STBC 1-stream
                    Max AMSDU length: 3839 bytes
                    No DSSS/CCK HT40

Dahhyunnee commented 1 year ago

2G is now stable but 5G is dropping under heavy load.

[299289.533871] mt76x2e 0000:01:00.0: Firmware Version: 0.0.00
[299289.533895] mt76x2e 0000:01:00.0: Build: 1
[299289.533905] mt76x2e 0000:01:00.0: Build Time: 201607111443____
[299289.552599] mt76x2e 0000:01:00.0: Firmware running!
[299289.553670] ieee80211 phy1: Hardware restart was requested

HiGarfield commented 1 year ago

2G is now stable but 5G is dropping under heavy load.

[299289.533871] mt76x2e 0000:01:00.0: Firmware Version: 0.0.00
[299289.533895] mt76x2e 0000:01:00.0: Build: 1
[299289.533905] mt76x2e 0000:01:00.0: Build Time: 201607111443____
[299289.552599] mt76x2e 0000:01:00.0: Firmware running!
[299289.553670] ieee80211 phy1: Hardware restart was requested

Please try to apply this patch https://github.com/openwrt/mt76/pull/816