openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

WI-FI is unstable at 2.4 GHz #793

Closed ShredRum closed 1 year ago

ShredRum commented 1 year ago

Hello, I have a Xiaomi router 4A (R4AC) running OpenWrt SNAPSHOT r23454-01885bc6a3 / LuCI Master git-23.158.78004-23a246e.

From time to time, when the 2.4 GHz Wi-Fi is under load, the network disappears and then reappears after a couple of seconds. The log contains nothing other than the devices disconnecting from and reconnecting to Wi-Fi. I also caught a driver crash once, but I don't think it is related to the problem (it never showed up again).

Disabling WMM mode helps, but the network speed drops below 20 Mbps.

This problem does not appear on a 5 GHz network.

Below I will provide the crash log of the driver, but keep in mind that it is not reproducible, and appeared only 1 time.

Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.656151] ------------[ cut here ]------------
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.660885] WARNING: CPU: 0 PID: 511 at target-mipsel_24kc_musl/linux-ramips_mt76x8/mt76-2023-05-13-969b7b5e/mt7603/mac.c:208 mt7603_filter_tx+0x178/0x180 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.675870] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mt76x2e mt76x2_common mt76x02_lib mt7603e mt76 mac80211 lzo cfg80211 slhc nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 lzo_rle lzo_decompress lzo_compress libcrc32c crc_ccitt compat sha512_generic sha256_generic libsha256 seqiv jitterentropy_rng drbg hmac cmac crypto_acompress leds_gpio gpio_button_hotplug crc32c_generic
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.744421] CPU: 0 PID: 511 Comm: napi/phy0-3 Not tainted 5.15.118 #0
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.750972] Stack : 00000000 00000000 81a39c7c 808e0000 80720000 8066c410 80e33d00 8071de83
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.759499]         808e33b4 000001ff 00000000 80061ae4 80665a7c 00000001 81a39c38 1a20d335
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.768015]         00000000 00000000 8066c410 81a39ad0 ffffefff 00000000 00000000 ffffffea
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.776537]         00000000 81a39adc 000000d7 807242f8 808e0000 00000009 00000000 81a04688
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.785059]         00000009 00000000 00003a98 80000000 00000018 80340db8 00000000 808e0000
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.793577]         ...
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.796060] Call Trace:
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.798535] [<8000702c>] show_stack+0x28/0xf0
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.802998] [<800261c0>] __warn+0x9c/0x124
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.807165] [<800262a4>] warn_slowpath_fmt+0x5c/0xac
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.812230] [<81a04688>] mt7603_filter_tx+0x178/0x180 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.818272] [<81a04818>] mt7603_wtbl_set_ps+0x12c/0x134 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.824492] [<81a01a90>] mt7603_sta_ps+0x38/0x434 [mt7603e]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.830184] [<81a75984>] mt76_rx_poll_complete+0x520/0x638 [mt76]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.836417] [<81a72288>] mt76_dma_rx_poll+0x284/0x4fc [mt76]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.842204] [<803f773c>] __napi_poll+0x70/0x1f8
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.846817] [<803f7a00>] napi_threaded_poll+0x13c/0x188
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.852145] [<8004604c>] kthread+0x140/0x164
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.856505] [<80002478>] ret_from_kernel_thread+0x14/0x1c
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.862005]
Thu Jun 29 18:54:00 2023 kern.warn kernel: [  210.863519] ---[ end trace 64b883a3276bd278 ]---

mrbbbaixue commented 1 year ago

Same issue on GL MT300N v2 with mt7628.

lukasz1992 commented 1 year ago

known issue, because of faulty ethernet driver

nbd168 commented 1 year ago

Please try latest OpenWrt master or 23.05 branch

DragonBluep commented 1 year ago

@nbd168 On MT7628, when I use P2P software to download a big file (such as a Windows image), the watchdog reset is triggered constantly. The trigger source seems to be RESET_CAUSE_RX_PSE_BUSY. I didn't test MT7603, but I remember it always being stable. https://github.com/openwrt/mt76/blob/c19b62fe6b68c3244e150248f250369504d3fd74/mt7603/mt7603.h#L91-L99

My debug patch:

diff --git a/mt7603/mac.c b/mt7603/mac.c
index 99ae0805..c7d8d851 100644
--- a/mt7603/mac.c
+++ b/mt7603/mac.c
@@ -1425,6 +1425,7 @@ static void mt7603_mac_watchdog_reset(struct mt7603_dev *dev)
    u32 mask = dev->mt76.mmio.irqmask;
    int i;

+   dev_err(dev->mt76.dev, "Watchdog Reset\n");
    ieee80211_stop_queues(dev->mt76.hw);
    set_bit(MT76_RESET, &dev->mphy.state);

@@ -1441,6 +1442,7 @@ static void mt7603_mac_watchdog_reset(struct mt7603_dev *dev)

    mt7603_beacon_set_timer(dev, -1, 0);

+   dev_err(dev->mt76.dev, "Reset Cause: %d\n", dev->cur_reset_cause);
    if (dev->reset_cause[RESET_CAUSE_RESET_FAILED] ||
        dev->cur_reset_cause == RESET_CAUSE_RX_PSE_BUSY ||
        dev->cur_reset_cause == RESET_CAUSE_BEACON_STUCK ||

Kernel log:

[  255.724995] mt76_wmac 10300000.wmac: Watchdog Reset
[  255.730047] mt76_wmac 10300000.wmac: Reset Cause: 4
[  256.874921] mt76_wmac 10300000.wmac: Watchdog Reset
[  256.879999] mt76_wmac 10300000.wmac: Reset Cause: 4
[  258.554963] mt76_wmac 10300000.wmac: Watchdog Reset
[  258.559996] mt76_wmac 10300000.wmac: Reset Cause: 4
[  260.204991] mt76_wmac 10300000.wmac: Watchdog Reset
[  260.210027] mt76_wmac 10300000.wmac: Reset Cause: 4
[  266.015052] mt76_wmac 10300000.wmac: Watchdog Reset
[  266.020140] mt76_wmac 10300000.wmac: Reset Cause: 4
[  267.145089] mt76_wmac 10300000.wmac: Watchdog Reset
[  267.150113] mt76_wmac 10300000.wmac: Reset Cause: 4
[  271.357276] mt76_wmac 10300000.wmac: Watchdog Reset
[  271.366310] mt76_wmac 10300000.wmac: Reset Cause: 4
[  273.365168] mt76_wmac 10300000.wmac: Watchdog Reset
[  273.370196] mt76_wmac 10300000.wmac: Reset Cause: 4
[  274.555158] mt76_wmac 10300000.wmac: Watchdog Reset
[  274.560188] mt76_wmac 10300000.wmac: Reset Cause: 4
[  276.566903] mt76_wmac 10300000.wmac: Watchdog Reset
[  276.576365] mt76_wmac 10300000.wmac: Reset Cause: 4
[  278.177274] mt76_wmac 10300000.wmac: Watchdog Reset
[  278.193835] mt76_wmac 10300000.wmac: Reset Cause: 4
[  281.685327] mt76_wmac 10300000.wmac: Watchdog Reset
[  281.690377] mt76_wmac 10300000.wmac: Reset Cause: 4
[  282.875285] mt76_wmac 10300000.wmac: Watchdog Reset
[  282.880320] mt76_wmac 10300000.wmac: Reset Cause: 4
[  285.325253] mt76_wmac 10300000.wmac: Watchdog Reset
[  285.330846] mt76_wmac 10300000.wmac: Reset Cause: 4
[  287.906450] mt76_wmac 10300000.wmac: Watchdog Reset
[  287.926172] mt76_wmac 10300000.wmac: Reset Cause: 4
[  290.345221] mt76_wmac 10300000.wmac: Watchdog Reset
[  290.350239] mt76_wmac 10300000.wmac: Reset Cause: 4
[  555.940101] mt76_wmac 10300000.wmac: Watchdog Reset
[  555.945122] mt76_wmac 10300000.wmac: Reset Cause: 4
[  560.230101] mt76_wmac 10300000.wmac: Watchdog Reset
[  560.235127] mt76_wmac 10300000.wmac: Reset Cause: 4
[  561.360088] mt76_wmac 10300000.wmac: Watchdog Reset
[  561.365111] mt76_wmac 10300000.wmac: Reset Cause: 4
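
For anyone collecting these logs, here is a minimal sketch of how the reset lines could be tallied per cause number. It is a user-space helper, not part of mt76. The names `tally_reset_causes` and `MAX_RESET_CAUSE` are made up for illustration, and the only thing it assumes is the `Reset Cause: %d` text printed by the debug patch above.

```c
#include <stdio.h>
#include <string.h>

#define MAX_RESET_CAUSE 8 /* hypothetical upper bound for this sketch */

/*
 * Scan a kernel log buffer for "Reset Cause: N" and count the
 * occurrences per cause number. counts[] must have MAX_RESET_CAUSE
 * elements. Returns the total number of entries that were counted.
 */
static int tally_reset_causes(const char *log,
			      unsigned int counts[MAX_RESET_CAUSE])
{
	static const char marker[] = "Reset Cause: ";
	const char *p = log;
	int total = 0;

	memset(counts, 0, MAX_RESET_CAUSE * sizeof(counts[0]));

	while ((p = strstr(p, marker)) != NULL) {
		int cause;

		p += sizeof(marker) - 1; /* skip past the marker text */
		if (sscanf(p, "%d", &cause) == 1 &&
		    cause >= 0 && cause < MAX_RESET_CAUSE) {
			counts[cause]++;
			total++;
		}
	}

	return total;
}
```

Every entry in the log above has cause 4, which appears to correspond to RESET_CAUSE_RX_PSE_BUSY as mentioned earlier.
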
DragonBluep commented 1 year ago

This function looks suspicious. https://github.com/openwrt/mt76/blob/c19b62fe6b68c3244e150248f250369504d3fd74/mt7603/mac.c#L1569 The vendor driver will check it 10 times before reset and will reset the pse counter if the 0x4244 register meets some conditions.

MT7603 vendor driver, in mt7603_wifi\common\cmm_data_pci.c:

```
BOOLEAN MonitorRxPse(RTMP_ADAPTER *pAd)
{
    UINT32 RemapBase, RemapOffset;
    UINT32 Value;
    UINT32 RestoreValue;

    if (pAd->RxPseCheckTimes < 10) {
        /* Check RX FIFO if not ready */
        RTMP_IO_WRITE32(pAd, 0x4244, 0x28000000);
        RTMP_IO_READ32(pAd, 0x4244, &Value);
        if ((Value & (1 << 8)) != 0) {
            pAd->RxPseCheckTimes = 0;
            return FALSE;
        } else {
            RTMP_IO_READ32(pAd, MCU_PCIE_REMAP_2, &RestoreValue);
            RemapBase = GET_REMAP_2_BASE(0x800c006c) << 19;
            RemapOffset = GET_REMAP_2_OFFSET(0x800c006c);
            RTMP_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RemapBase);
            RTMP_IO_WRITE32(pAd, 0x80000 + RemapOffset, 3);
            RTMP_IO_READ32(pAd, 0x80000 + RemapOffset, &Value);
            if (((Value & (0x8001 << 16)) == (0x8001 << 16)) ||
                ((Value & (0xe001 << 16)) == (0xe001 << 16))) {
                pAd->RxPseCheckTimes++;
                RTMP_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RestoreValue);
                return FALSE;
            } else {
                pAd->RxPseCheckTimes = 0;
                RTMP_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RestoreValue);
                return FALSE;
            }
        }
    } else {
        pAd->RxPseCheckTimes = 0;
        return TRUE;
    }
}
```

MT7628 vendor driver, in mt7628_wifi\hw_ctrl\cmm_chip_mt.c:

```
BOOLEAN MonitorRxPse(RTMP_ADAPTER *pAd)
{
    UINT32 RemapBase, RemapOffset;
    UINT32 Value;
    UINT32 RestoreValue;

#ifdef DMA_RESET_SUPPORT
    RTMP_IO_READ32(pAd, 0x816c, &Value);
    //AC
    if ((Value & (1 << 2)) == (1 << 2)) {
        //let PSE reset done to clear
        //Value &= ~(1 << 2);
        //RTMP_IO_WRITE32(pAd, 0x816c, Value);
        pAd->ACHitCount++;
        return TRUE;
    }
    if ((Value & (1 << 3)) == (1 << 3)) {
        //let PSE reset done to clear
        //Value &= ~(1 << 3);
        //RTMP_IO_WRITE32(pAd, 0x816c, Value);
        pAd->MgtHitCount++;
        return TRUE;
    }
#endif /* DMA_RESET_SUPPORT */

    if (pAd->RxPseCheckTimes < 10) {
        /* Check RX FIFO if not ready */
        MAC_IO_WRITE32(pAd, 0x4244, 0x98000000);
        MAC_IO_READ32(pAd, 0x4244, &Value);
        if ((Value & (1 << 9)) != 0) {
            pAd->RxPseCheckTimes = 0;
            return FALSE;
        } else {
            MAC_IO_READ32(pAd, MCU_PCIE_REMAP_2, &RestoreValue);
            RemapBase = GET_REMAP_2_BASE(0x800c006c) << 19;
            RemapOffset = GET_REMAP_2_OFFSET(0x800c006c);
            MAC_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RemapBase);
            MAC_IO_WRITE32(pAd, 0x80000 + RemapOffset, 3);
            MAC_IO_READ32(pAd, 0x80000 + RemapOffset, &Value);
            if (((Value & (0x8001 << 16)) == (0x8001 << 16)) ||
                ((Value & (0xe001 << 16)) == (0xe001 << 16)) ||
                ((Value & (0x4001 << 16)) == (0x4001 << 16))) {
                if (((Value & (0x8001 << 16)) == (0x8001 << 16)) ||
                    ((Value & (0xe001 << 16)) == (0xe001 << 16))) {
                    pAd->PSETriggerType1Count++;
                }
                if ((Value & (0x4001 << 16)) == (0x4001 << 16)) {
                    pAd->PSETriggerType2Count++;
                }
                pAd->RxPseCheckTimes++;
                MAC_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RestoreValue);
                return FALSE;
            } else {
                pAd->RxPseCheckTimes = 0;
                MAC_IO_WRITE32(pAd, MCU_PCIE_REMAP_2, RestoreValue);
                return FALSE;
            }
        }
    } else {
        pAd->RxPseCheckTimes = 0;
        return TRUE;
    }
}
```
nbd168 commented 1 year ago

mt76 will also check this multiple times. The function is called via a wrapper that keeps the counter. Could you please test if this patch helps? https://nbd.name/p/e626cc2c

DragonBluep commented 1 year ago

Sadly it doesn't work. Maybe we need an additional check for the 0x4244 register?

[   62.689766] br-lan: port 3(phy1-ap0) entered forwarding state
[  198.718494] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  200.529022] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  204.438592] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
nbd168 commented 1 year ago

Could you please apply this patch for printing register debug values and show me the output around reset? https://nbd.name/p/78569cfa

DragonBluep commented 1 year ago
Sure, this is the log:

```
[  100.853934] PSE debug val = 4001
[  100.962783] PSE debug val = 4001
[  101.072783] PSE debug val = 4001
[  101.182949] PSE debug val = 4001
[  101.292749] PSE debug val = 4001
[  101.402753] PSE debug val = 4001
[  101.512788] PSE debug val = 4001
[  101.622744] PSE debug val = 4001
[  101.732764] PSE debug val = 4001
[  101.842738] PSE debug val = 4001
[  101.846138] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  102.274322] PSE debug val = e001
[  102.384031] PSE debug val = e401
[  102.494494] PSE debug val = e401
[  102.604414] PSE debug val = e001
[  102.714555] PSE debug val = e401
[  102.825737] PSE debug val = e001
[  102.935995] PSE debug val = e401
[  103.043596] PSE debug val = 4001
[  103.152745] PSE debug val = 4001
[  103.262717] PSE debug val = 4001
[  103.266069] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  103.492778] PSE debug val = e001
[  103.603092] PSE debug val = e001
[  103.712767] PSE debug val = e001
[  103.822739] PSE debug val = e001
[  103.932752] PSE debug val = e001
[  104.042745] PSE debug val = e001
[  104.152742] PSE debug val = e001
[  104.262711] PSE debug val = e001
[  104.372809] PSE debug val = e001
[  104.482742] PSE debug val = e001
[  104.486332] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  104.733502] PSE debug val = e401
[  104.845526] PSE debug val = e401
[  104.953593] PSE debug val = e001
[  105.065287] PSE debug val = e001
[  105.173835] PSE debug val = e401
[  105.283227] PSE debug val = 4001
[  105.392734] PSE debug val = 4001
[  105.502740] PSE debug val = 4001
[  105.612763] PSE debug val = 4001
[  105.722741] PSE debug val = 4001
[  105.726088] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  106.065122] PSE debug val = e001
[  106.175199] PSE debug val = e401
[  106.293695] PSE debug val = e001
[  106.405997] PSE debug val = e401
[  106.524875] PSE debug val = e401
[  106.634908] PSE debug val = e401
[  106.746676] PSE debug val = e401
[  106.854736] PSE debug val = e401
[  106.983699] PSE debug val = e401
[  107.094097] PSE debug val = 4401
[  107.122792] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  107.395741] PSE debug val = e401
[  107.505552] PSE debug val = e401
[  107.620204] PSE debug val = e001
[  107.737359] PSE debug val = e401
[  107.844637] PSE debug val = e401
[  107.953569] PSE debug val = 4401
[  108.066664] PSE debug val = e001
[  108.178885] PSE debug val = e401
[  108.283915] PSE debug val = e001
[  108.394367] PSE debug val = 4401
[  108.428654] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  108.624498] PSE debug val = e401
[  108.735373] PSE debug val = e001
[  108.844878] PSE debug val = e401
[  108.958689] PSE debug val = e401
[  109.065765] PSE debug val = e401
[  109.174966] PSE debug val = e401
[  109.289276] PSE debug val = e001
[  109.503474] PSE debug val = e401
[  109.613434] PSE debug val = 4051
[  109.733900] PSE debug val = e001
[  109.843772] PSE debug val = e001
[  110.069388] PSE debug val = e401
[  110.184989] PSE debug val = 8001
[  110.404944] PSE debug val = e001
[  110.631664] PSE debug val = e401
[  110.744427] PSE debug val = e401
[  110.854842] PSE debug val = e001
[  110.963290] PSE debug val = 4001
[  111.072757] PSE debug val = 4001
[  111.182957] PSE debug val = 4001
[  111.292701] PSE debug val = 4001
[  111.402695] PSE debug val = 4001
[  111.512691] PSE debug val = 4001
[  111.622692] PSE debug val = 4001
[  111.626043] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  111.914098] PSE debug val = e401
[  112.043549] PSE debug val = 4051
[  112.157400] PSE debug val = e401
[  112.265426] PSE debug val = e401
[  112.373576] PSE debug val = e001
[  112.493358] PSE debug val = e401
[  112.605042] PSE debug val = e401
[  112.831865] PSE debug val = 4001
[  113.055903] PSE debug val = e001
[  113.167618] PSE debug val = e401
[  113.279334] PSE debug val = e401
[  113.616259] PSE debug val = e001
[  113.843741] PSE debug val = e401
[  113.964444] PSE debug val = 4001
[  114.072777] PSE debug val = 4001
[  114.182783] PSE debug val = 4001
[  114.292769] PSE debug val = 4001
[  114.402684] PSE debug val = 4001
[  114.512686] PSE debug val = 4001
[  114.622687] PSE debug val = 4001
[  114.732686] PSE debug val = 4001
[  114.842698] PSE debug val = 4001
[  114.846052] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  114.992680] PSE debug val = e001
[  115.102665] PSE debug val = e001
[  115.212686] PSE debug val = e001
[  115.322665] PSE debug val = e001
[  115.432689] PSE debug val = e001
[  115.542663] PSE debug val = e001
[  115.652666] PSE debug val = e001
[  115.762664] PSE debug val = e001
[  115.872664] PSE debug val = e001
[  115.982702] PSE debug val = e001
[  115.986176] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
```
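
To make the logged values easier to read, here is a small sketch that applies the bit patterns from the MT7628 vendor MonitorRxPse() quoted earlier to a 16-bit value. It assumes that the logged `PSE debug val` corresponds to the upper 16 bits of the register the vendor code reads, which this thread does not confirm. The function and macro names are invented for this example.

```c
#include <stdint.h>

/* Trigger classes, named after the counters in the MT7628 vendor code. */
#define PSE_TRIGGER_TYPE1 0x1u /* 0x8001 or 0xe001 pattern */
#define PSE_TRIGGER_TYPE2 0x2u /* 0x4001 pattern */

/* Nonzero if every bit of `pattern` is set in `val`. */
static int pse_has_pattern(uint16_t val, uint16_t pattern)
{
	return (val & pattern) == pattern;
}

/*
 * Classify a 16-bit PSE status value using the same masks as the
 * vendor check (assumption: val is the register value shifted right
 * by 16). Returns a bitmask of PSE_TRIGGER_TYPE*, or 0 if no pattern
 * matches.
 */
static unsigned int pse_classify(uint16_t val)
{
	unsigned int flags = 0;

	if (pse_has_pattern(val, 0x8001) || pse_has_pattern(val, 0xe001))
		flags |= PSE_TRIGGER_TYPE1;
	if (pse_has_pattern(val, 0x4001))
		flags |= PSE_TRIGGER_TYPE2;

	return flags;
}
```

Under that assumption, 4001, 4401 and 4051 match only the 0x4001 pattern, which the MT7603 variant of the vendor function does not test for, while e001 and e401 match all three patterns.
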
nbd168 commented 1 year ago

I finally understand the bug now, and this patch should fix it: https://nbd.name/p/4ece22b2

The way it works is this: in the vendor code the function on its own does not detect a rx hang, it only detects if rx is busy, which could also happen due to normal rx activity. The missing part was that in the vendor driver it resets the counter on rx irqs (which indicate real activity), so that it only issues a reset if rx really encountered a hang.

In my patch, I adjusted the mt76 code accordingly.
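
The idea described above can be modelled in a few lines of user-space C. This is only a sketch of the logic, not the actual patch: the struct and function names are hypothetical, and the limit of 10 is borrowed from the vendor code quoted earlier.

```c
#include <stdbool.h>

#define RX_PSE_CHECK_LIMIT 10 /* consecutive busy polls before a reset */

struct rx_pse_watchdog {
	unsigned int busy_checks; /* busy polls seen since the last rx activity */
};

/* Called from the (hypothetical) rx interrupt path: real traffic arrived. */
static void rx_pse_note_activity(struct rx_pse_watchdog *wd)
{
	wd->busy_checks = 0;
}

/*
 * Called from the periodic watchdog poll. `busy` is the result of the
 * hardware busy check. Returns true when a reset should be issued,
 * i.e. after RX_PSE_CHECK_LIMIT busy polls with no rx activity between.
 */
static bool rx_pse_poll(struct rx_pse_watchdog *wd, bool busy)
{
	if (!busy) {
		wd->busy_checks = 0;
		return false;
	}

	if (++wd->busy_checks < RX_PSE_CHECK_LIMIT)
		return false;

	wd->busy_checks = 0;
	return true;
}
```

Without the call to `rx_pse_note_activity()`, ten busy readings in a row under heavy but healthy traffic would be enough to trigger a reset, which is the behaviour the explanation above attributes to the old code.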

DragonBluep commented 1 year ago

The watchdog will still reset the chip. It takes approximately one minute to recover from the hang state. It only took about one second before. My chip version is MT7628 E2, MT7628AN ver:1 eco:2

[  143.792548] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  228.882903] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  585.630345] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  677.851328] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  752.621170] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
DragonBluep commented 1 year ago

I tried to skip the RX PSE reset with your patch https://nbd.name/p/4ece22b2, but it would cause the client to disconnect after reset request.

--- a/mt7603/mac.c
+++ b/mt7603/mac.c
@@ -1425,6 +1425,10 @@ static void mt7603_mac_watchdog_reset(struct mt7603_dev *dev)
    u32 mask = dev->mt76.mmio.irqmask;
    int i;

+   dev_err(dev->mt76.dev, "Watchdog Reset, Reset Cause: %d\n", dev->cur_reset_cause);
+   if (dev->cur_reset_cause == RESET_CAUSE_RX_PSE_BUSY)
+       return;
+
    ieee80211_stop_queues(dev->mt76.hw);
    set_bit(MT76_RESET, &dev->mphy.state);
[  166.539323] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  167.529616] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 3

Therefore, I believe that the reset is the expected behavior. Something else must be causing RX_PSE_BUSY. Perhaps in some functions MT7628 requires a different operation than MT7603.

nbd168 commented 1 year ago

I found a few more 7628 specific things, here's a new combined patch: https://nbd.name/p/883e48cf

DragonBluep commented 1 year ago

Thanks for your hard work. It seems that we still need some fixes. With the patch https://nbd.name/p/883e48cf, when I start downloading something, the WiFi signal/SSID will disappear about 30 seconds after watchdog reset.

[   44.981696] br-lan: port 3(phy1-ap0) entered forwarding state
[  156.061964] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  235.007446] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  298.056256] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
nbd168 commented 1 year ago

Sorry, had a copy&paste bug in there. Fixed version: https://nbd.name/p/40608ec5

DragonBluep commented 1 year ago

Still no luck. The SSID disappears after reset. WAN <--> MT7628 <--> Client

[   92.200741] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  203.249635] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  307.915254] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  377.955877] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
  1. I tried running iperf3 on the MT7628 as a stress test, but this did not trigger a reset.
  2. I also tried using an MT7612 as wwan and then downloading data via the mt7628 wireless, but this still can't trigger a reset. AP <--> MT7612 <--> MT7628 <--> Client

@Linaro1985 suspects that there are some DMA issues with the Ethernet driver of mt7628. https://github.com/openwrt/openwrt/issues/10074#issuecomment-1159794622 Is it possible that there are DMA conflicts between mt76 and the OpenWrt ethernet driver?

DragonBluep commented 1 year ago

I found that the mt7628 vendor driver has some additional DMA-related code in APCheckBcnQHandler() compared to the mt7603.

#define DMA_FQCR0        (WF_DMA_BASE + 0x008)   /* 0x21c08 */
#define DMA_FQCR0_FQ_EN                     BIT31
#define DMA_FQCR0_FQ_STS                    BIT30
#define DMA_FQCR0_FQ_MODE                   BIT29
#define DMA_FQCR0_FQ_DEST_QID_MASK          (0x1f)
#define DMA_FQCR0_FQ_DEST_QID(p)            (((p) & DMA_FQCR0_FQ_DEST_QID_MASK) << 24)
#define DMA_FQCR0_FQ_DEST_PID_MASK          (0x3)
#define DMA_FQCR0_FQ_DEST_PID(p)            (((p) & DMA_FQCR0_FQ_DEST_PID_MASK) << 22)
#define DMA_FQCR0_FQ_TARG_QID_MASK          (0x1f)
#define DMA_FQCR0_FQ_TARG_QID(p)            (((p) & DMA_FQCR0_FQ_TARG_QID_MASK) << 16)
#define DMA_FQCR0_FQ_TARG_OM_MASK           (0x3f)
#define DMA_FQCR0_FQ_TARG_OM(p)             (((p) & DMA_FQCR0_FQ_TARG_OM_MASK) << 8)
#define DMA_FQCR0_FQ_TARG_WIDX_MASK         (0xff)
#define DMA_FQCR0_FQ_TARG_WIDX(p)           (((p) & DMA_FQCR0_FQ_TARG_WIDX_MASK))

#define DMA_FQCR1        (WF_DMA_BASE + 0x00c)   /* 0x21c0c */
#define RXSM_GROUP1_EN  (1 << 11)
#define RXSM_GROUP2_EN  (1 << 12)
#define RXSM_GROUP3_EN  (1 << 13)
nbd168 commented 1 year ago

I did more testing and rework to make the reset and the beacon stuck check more reliable. Please try this patch: https://nbd.name/p/54045fb4

DragonBluep commented 1 year ago

It is still broken. I used top to observe CPU usage and found that every time idle drops to 0%, the PSE watchdog reset is triggered. But if I test with an (MT7628 ethernet +) MT7612 PCIe NIC, it still works even when CPU usage reaches 100%.

Mem: 39184K used, 18068K free, 164K shrd, 0K buff, 13080K cached
CPU:   1% usr  30% sys   0% nic   0% idle   0% io   0% irq  66% sirq
Load average: 0.32 0.31 0.22 2/57 3916
  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
  452     2 root     SW       0   0%  41% [mt76-tx phy0]
   10     2 root     RW       0   0%  31% [ksoftirqd/0]
  444     2 root     SW       0   0%  19% [napi/phy0-3]

------
[  404.894713] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  476.055148] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  671.444612] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  761.674225] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
[  946.692878] mt76_wmac 10300000.wmac: Watchdog Reset, Reset Cause: 4
nbd168 commented 1 year ago

When it triggers, does it break the connection?

DragonBluep commented 1 year ago

> When it triggers, does it break the connection?

Yes, the SSID will disappear for several seconds.

nbd168 commented 1 year ago

Does bumping MT7603_RX_RING_SIZE in mt7603.h to 256 help?

DragonBluep commented 1 year ago

> Does bumping MT7603_RX_RING_SIZE in mt7603.h to 256 help?

Edit: The same behavior as before.

DragonBluep commented 1 year ago

Update: The reset log appears after the SSID disappears. Firstly, the WiFi SSID disappears when the CPU load reaches 100%. After hanging for 30+ seconds, the PSE reset is triggered, and finally the SSID is immediately restored again after reset.

Dahhyunnee commented 1 year ago

Hope this issue will have a fix soon. 5G wifi is the only stable connection as of now.

DragonBluep commented 1 year ago

@nbd168 Hi! Finally, I know what caused the PSE reset. It's the WiFi fragmentation threshold.

Setting any frag value instead of using the default one (off?) avoids the PSE reset. I'm not sure if this is a problem with mt76 or with some OpenWrt core packages.

| Fragmentation Threshold | default/off | 1024 | 2346 | 4096 |
| --- | --- | --- | --- | --- |
| PSE Reset Counter (download ~1.5 GiB data) | 27 | 0 | 0 | 0 |
nbd168 commented 1 year ago

@DragonBluep, wow, nice find. I suspect that the main effect of that knob is that it likely disables A-MSDU tx. Please try removing this line from mac80211.c in mt76: ieee80211_hw_set(hw, TX_AMSDU);

If that makes it work reliably as well, please put the line back in and try gradually reducing the value of hw->max_tx_fragments in the same source file.

Thanks.

DragonBluep commented 1 year ago

@nbd168 Good news, removing ieee80211_hw_set(hw, TX_AMSDU); or changing hw->max_tx_fragments to 1 can avoid watchdog resetting.

These tests are based on patch https://github.com/openwrt/mt76/issues/793#issuecomment-1655785641

| hw->max_tx_fragments | disable TX_AMSDU | 16 | 8 | 4 | 2 | 1 |
| --- | --- | --- | --- | --- | --- | --- |
| Trigger PSE reset? | No | Yes | Yes | Yes | Yes | No |

Summary:

  1. We can work around this PSE watchdog reset issue by changing max_tx_fragments to 1.
  2. The PSE reset is not the cause but a symptom.
  3. With the original mt76 repo, the PSE reset is triggered frequently, and the WiFi SSID/signal sometimes disappears under heavy load.
  4. With this patch https://github.com/openwrt/mt76/issues/793#issuecomment-1655785641, the WiFi SSID/signal always disappears within 20 seconds, after which a PSE reset is triggered. This is not to say that it is an incorrect fix. On the contrary, I believe it touches the key code of the real bug.
nbd168 commented 1 year ago

Please try this patch on top of current mt76: https://nbd.name/p/762e9946

DragonBluep commented 1 year ago

> Please try this patch on top of current mt76: https://nbd.name/p/762e9946

With this patch, I have now downloaded 7 GiB data and everything seems to be working well. As for the issue of SSID disappearing, I am not sure if it has really been fixed. In the previous test, I downloaded a total of 30+ GiB data and it was never triggered again.

I will keep tracking it. Users who want to test can run cat /sys/kernel/debug/ieee80211/phy0/mt76/reset to observe the reset counters.

lukasz1992 commented 1 year ago

I think that mt7603 chips do not hang, only mt7628 is affected.

So I would change: hw->max_tx_fragments = 1; to

if (is_mt7628(dev))
    hw->max_tx_fragments = 1;
nbd168 commented 1 year ago

That would be good. Could somebody verify that?

lukasz1992 commented 1 year ago

I already sold my routers with the mt7603 radio, but they were working fine a year ago.

DragonBluep commented 1 year ago

@dfateyev reported that MT7603E has a similar issue: https://github.com/openwrt/mt76/issues/719#issue-1501595654. From my previous test https://github.com/openwrt/mt76/issues/793#issuecomment-1656512592, CPU load reaching 100% is a potential condition for the watchdog reset. My guess is that MT7603 fares better than MT7628 because the MT7621 (+ MT7603E) is powerful enough to handle these packets. If you are testing MT7603, I suggest running some high-load tasks in the background to consume all CPU resources.

lukasz1992 commented 1 year ago

Hmm.. I checked these threads. There are some differences though:

khanjui commented 1 year ago

I have never experienced disappearing SSID problem on my mt7603.

lister-wrt commented 1 year ago

I've been experiencing issues on mt7603 (u6 lite) that match what others are describing here (crash/recover).

Nothing in the logs but load is high every time I log in to check what's going on.

It started when I began using a Chromecast on 2.4. I can reliably reproduce it by streaming something in 4K. It usually crashes once or twice but stays up after that.

Is there a way to see what's happening? Increase log verbosity or something?

DragonBluep commented 1 year ago

@lister-wrt Hi! You can follow these testing steps.

  1. Run dmesg | grep mt76 and show us the output.
  2. Do something to crash your 2.4 GHz WiFi, then run cat /sys/kernel/debug/ieee80211/phy0/mt76/reset to see if some resets have happened.
  3. Set fragmentation threshold to 2346, then re-plug the power supply to reboot device.
  4. Do step 2 again to see if it has some help.

Notice: If you can build your own firmware, you'd better apply this patch (https://nbd.name/p/762e9946) to mt76.

Update: If you have applied the patch, replace step 3 by installing your custom firmware.

shown19 commented 1 year ago

Hi, I set the fragmentation threshold to 2346 in the hope of no longer experiencing crashes with mt7603e wifi (still observing). My other concern is that the beacon stuck value increases over time. Is this normal?

shown19 commented 1 year ago

Okay, it crashed now, even with the fragmentation threshold set to 2346.

> Please try this patch on top of current mt76: https://nbd.name/p/762e9946

Hi, will this help fix this issue as well? I'm on OpenWrt 22.03.5, and the 2.4 GHz WiFi (mt7603e) causes this issue once enabled.

lukasz1992 commented 1 year ago

try the latest 23.05 snapshot or just snapshot

DragonBluep commented 1 year ago

@shown19 This problem should have been fixed several weeks ago. Please try the snapshot version.

Linaro1985 commented 1 year ago

> Notice: If you can build your own firmware, you'd better apply this patch (https://nbd.name/p/762e9946) to mt76.

@DragonBluep @nbd168 Just tested with this patch (OpenWrt 23.05 snapshot). The transfer speed is about 80-90 Mbps. Now everything is stable. In /sys/kernel/debug/ieee80211/phy0/mt76/reset everything is 0. Great! Thank you very much!

DragonBluep commented 1 year ago

@nbd168 I tested MT7603E (+ MT7621) today. With your patch and @lukasz1992's suggested change https://github.com/openwrt/mt76/issues/793#issuecomment-1677529687, 3 hours of iperf3 stress testing and a total of 230+ GB of data transmitted didn't crash it. It is impressive that all reset counters were 0.

My MT7603E is revision E2.

[    0.000000] SoC Type: MediaTek MT7621 ver:1 eco:3
[   11.758985] mt7621-pci 1e140000.pcie: bus=1 slot=0 irq=22
[   11.770137] mt7603e 0000:01:00.0: enabling device (0000 -> 0002)
[   11.776298] mt7603e 0000:01:00.0: ASIC revision: 76030010
[   12.828596] mt7603e 0000:01:00.0: Firmware Version: ap_pcie
[   12.834174] mt7603e 0000:01:00.0: Build Time: 20160107100755
[   12.877308] mt7603e 0000:01:00.0: firmware init done
shown19 commented 1 year ago

@DragonBluep hi, may I know what device you are using and openwrt version you applied your patch?

DragonBluep commented 1 year ago

> @DragonBluep hi, may I know what device you are using and openwrt version you applied your patch?

It's an unsupported device ZTE E8820 V2. I'm using the OpenWrt v22.03 since my device only has 64 MiB RAM.

DragonBluep commented 1 year ago

@nbd168 During testing, I found a strange bug in MT7603E. If I encrypt WiFi and test with iperf3, the TX speed will significantly decrease. However, MT7628 and MT7612E do not have this issue. Verified on master branch and v22.03.

encryption open wpa wpa2
mt7603 -> client 169 73 69
client -> mt7603 197 175 208
mt7628 -> client 142 144
client -> mt7628 195 200
mt7612 -> client 285 317 322
client -> mt7612 400 406 402
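
As a quick way to quantify the drop, here is a sketch that computes the relative loss between two throughput figures from the table. The function name is hypothetical, and the unit does not matter as long as both inputs use the same one.

```c
/*
 * Relative throughput loss, in percent, of `encrypted` versus `open`.
 * Returns 0 when `open` is not positive, to avoid dividing by zero.
 */
static double throughput_drop_percent(double open, double encrypted)
{
	if (open <= 0.0)
		return 0.0;

	return 100.0 * (open - encrypted) / open;
}
```

For the mt7603 -> client row, 169 open versus 73 with wpa works out to a loss of roughly 57%.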

Edit: On https://www.speedtest.net/, I can get 90+ Mbps (the maximum bandwidth provided by my carrier).

shown19 commented 1 year ago

Thank you, I will try this method as well. Let's see if there's an improvement on my Newifi D2 with mt7603e. Does this also affect the 5 GHz chipset?

shown19 commented 1 year ago

> try the latest 23.05 snapshot or just snapshot

@lukasz1992 @DragonBluep Sorry guys but is the patch unrelated to the issue we experience from the main post? Does it mean, I don't need this patch? or should I still be needing this for stability?

lukasz1992 commented 1 year ago

@DragonBluep Could you reproduce the speed issue with newer firmware? I mean overwrite current mt7603_e2.bin with this: https://raw.githubusercontent.com/ptpt52/mt76/e67f2d76f15cb4120b28d7cb1f566dbff762b89f/firmware/mt7603_e2.bin