openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7603: unstable traffic (stalls/hangs) under load (MT7603EN, MT7628AN) #865

Open rmilecki opened 6 months ago

rmilecki commented 6 months ago

I experience stability issues using chipsets supported by the mt7603 driver. When running iperf client on STA I observe hiccups (traffic temporarily slows down and sometimes stops).

I first reported this back in 2021 in the e-mail thread "Unstable WiFi with mt76 on MT7628AN". It doesn't seem to be a regression, as the issue goes back to 2019 at least. It is also present in the latest mt76 (2024).

For a while there were probably two different issues in mt7603: PSE hangs and traffic hangs. The first problem was hopefully fixed in 2023 with commits baa19b2e4b7b c677dda16523 317620593349 19e4f271d62e c2fcc83b41a6.

Traffic hangs remain unresolved and were observed by multiple people using various devices. See the above e-mail thread for OpenWrt forum reports, and GitHub issues #692, #719 and #841.

rmilecki commented 6 months ago

Netgear R6220 (MT7621ST SoC + MT7603EN Wi-Fi + MT7612EN Wi-Fi)

Example from OpenWrt 23.05.2 (iperf on STA connected to MT7603EN using channel 1 bandwidth 20 MHz):

[  3] 25.0-26.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 26.0-27.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3] 27.0-28.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 28.0-29.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 29.0-30.0 sec  7.50 MBytes  62.9 Mbits/sec
[  3] 30.0-31.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 31.0-32.0 sec  6.75 MBytes  56.6 Mbits/sec
[  3] 32.0-33.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 33.0-34.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 34.0-35.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 35.0-36.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 36.0-37.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 37.0-38.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 38.0-39.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 39.0-40.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3] 40.0-41.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 41.0-42.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 42.0-43.0 sec  2.38 MBytes  19.9 Mbits/sec
[  3] 43.0-44.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 44.0-45.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 45.0-46.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 46.0-47.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 47.0-48.0 sec  3.00 MBytes  25.2 Mbits/sec
[  3] 48.0-49.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 49.0-50.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 50.0-51.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 51.0-52.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 52.0-53.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 53.0-54.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 54.0-55.0 sec  2.75 MBytes  23.1 Mbits/sec
[  3] 55.0-56.0 sec  3.75 MBytes  31.5 Mbits/sec
[  3] 56.0-57.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 57.0-58.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 58.0-59.0 sec  5.62 MBytes  47.2 Mbits/sec
[  3] 59.0-60.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 60.0-61.0 sec  7.50 MBytes  62.9 Mbits/sec
[  3] 61.0-62.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 62.0-63.0 sec  6.38 MBytes  53.5 Mbits/sec

Whenever traffic stops I can see that station's TX bitrate reported by router goes down from 72.2 Mbps to 6.5 Mbps.
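
Stalls like the ones in the log above can be located automatically instead of by eye. Below is a minimal Python sketch (not part of the original report) that parses iperf interval lines in the format shown here and lists the seconds where throughput fell below a threshold. The names `parse_intervals` and `find_stalls` and the 1 Mbit/s threshold are illustrative assumptions.

```python
import re

# Matches iperf interval lines such as:
# "[  3] 45.0-46.0 sec  0.00 Bytes  0.00 bits/sec"
LINE_RE = re.compile(
    r"\[\s*\d+\]\s+(?P<start>[\d.]+)-\s*(?P<end>[\d.]+)\s+sec\s+"
    r"[\d.]+\s+\w+\s+(?P<rate>[\d.]+)\s+(?P<unit>[KMG]?)bits/sec"
)

# Factors that convert the reported unit to Mbit/s.
SCALE = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def parse_intervals(text):
    """Return a list of (start_s, end_s, mbps) tuples, one per interval line."""
    intervals = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            mbps = float(m["rate"]) * SCALE[m["unit"]]
            intervals.append((float(m["start"]), float(m["end"]), mbps))
    return intervals


def find_stalls(intervals, threshold_mbps=1.0):
    """Return the intervals whose throughput is below threshold_mbps."""
    return [iv for iv in intervals if iv[2] < threshold_mbps]


# Three lines copied from the R6220 log above.
sample = """\
[  3] 44.0-45.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 45.0-46.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 46.0-47.0 sec  1.25 MBytes  10.5 Mbits/sec
"""

stalls = find_stalls(parse_intervals(sample))
print(stalls)
```

Raising `threshold_mbps` (for example to 20) would also catch the slowdowns that precede a full stall.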

Switching from HT20 to NOHT results in the rate being limited to 54 Mbps (it varies between 54 Mbps and 48 Mbps). I ran iperf for 8 hours and experienced only one one-second stall/hang over that time. Average iperf speed was 19.5 Mbps and it varied between 15 and 23-24 Mbps most of the time.

Commenting out ieee80211_hw_set(hw, AMPDU_AGGREGATION); in mac80211.c cuts the average speed by about half (down to 29 Mbps) but improves stability too (rate stays at 72.2 Mbps and sometimes drops to 65 Mbps for a second). During the first 1.5-hour iperf session I had a single stall/hang. During the next one, which took 3 hours, I had none. Average speed was 30.4 Mbps (it was mostly 31 Mbps ± 4 Mbps).
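
For reference, the 72.2, 65 and 6.5 Mbps values quoted in this comment match the standard 802.11n HT20 rates for one spatial stream: MCS7 with short guard interval, MCS7 with long guard interval, and MCS0 with long guard interval. Here is a small sketch of that arithmetic, assuming a single spatial stream (the thread does not state the stream count). The function name `ht20_rate_mbps` is illustrative.

```python
# 802.11n HT20 carries 52 data subcarriers per OFDM symbol.
DATA_SUBCARRIERS = 52

# (coded bits per subcarrier, coding rate) for MCS 0..7, one spatial stream.
MCS_TABLE = [
    (1, 1 / 2),  # MCS0: BPSK
    (2, 1 / 2),  # MCS1: QPSK
    (2, 3 / 4),  # MCS2: QPSK
    (4, 1 / 2),  # MCS3: 16-QAM
    (4, 3 / 4),  # MCS4: 16-QAM
    (6, 2 / 3),  # MCS5: 64-QAM
    (6, 3 / 4),  # MCS6: 64-QAM
    (6, 5 / 6),  # MCS7: 64-QAM
]


def ht20_rate_mbps(mcs, short_gi=False):
    """PHY rate in Mbit/s for a single-stream HT20 link at the given MCS."""
    bits, coding = MCS_TABLE[mcs]
    # Symbol duration: 4.0 us with long guard interval, 3.6 us with short.
    symbol_us = 3.6 if short_gi else 4.0
    return DATA_SUBCARRIERS * bits * coding / symbol_us


print(round(ht20_rate_mbps(7, short_gi=True), 1))  # highest rate seen
print(round(ht20_rate_mbps(7), 1))                 # occasional drop
print(round(ht20_rate_mbps(0), 1))                 # rate seen during stalls
```

Read this way, the drop seen during stalls goes from the highest single-stream MCS straight to the lowest one.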

rmilecki commented 6 months ago

Xiaomi Mi Router 4C (MT7628AN Wi-Fi SoC)

Example from OpenWrt 23.05.2 (iperf on STA connected to MT7628AN using channel 1 bandwidth 20 MHz):

[  3] 75.0-76.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 76.0-77.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 77.0-78.0 sec  6.75 MBytes  56.6 Mbits/sec
[  3] 78.0-79.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 79.0-80.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 80.0-81.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 81.0-82.0 sec  5.88 MBytes  49.3 Mbits/sec
[  3] 82.0-83.0 sec  5.25 MBytes  44.0 Mbits/sec
[  3] 83.0-84.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 84.0-85.0 sec  2.50 MBytes  21.0 Mbits/sec
[  3] 85.0-86.0 sec  1.12 MBytes  9.44 Mbits/sec
[  3] 86.0-87.0 sec   896 KBytes  7.34 Mbits/sec
[  3] 87.0-88.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 88.0-89.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3] 89.0-90.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 90.0-91.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 91.0-92.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 92.0-93.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 93.0-94.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 94.0-95.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 95.0-96.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 96.0-97.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 97.0-98.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 98.0-99.0 sec  2.50 MBytes  21.0 Mbits/sec
[  3] 99.0-100.0 sec  6.88 MBytes  57.7 Mbits/sec
[  3] 100.0-101.0 sec  6.75 MBytes  56.6 Mbits/sec
[  3] 101.0-102.0 sec  6.88 MBytes  57.7 Mbits/sec
[  3] 102.0-103.0 sec  6.75 MBytes  56.6 Mbits/sec
[  3] 103.0-104.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 104.0-105.0 sec  6.75 MBytes  56.6 Mbits/sec
[  3] 105.0-106.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 106.0-107.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 107.0-108.0 sec  6.75 MBytes  56.6 Mbits/sec

Whenever traffic stops I can see that station's TX bitrate reported by router goes down from 72.2 Mbps to 6.5 Mbps.

Switching from HT20 to NOHT results in the rate being limited to 54 Mbps (it varies between 54 Mbps and 48 Mbps, sometimes 36 Mbps). I ran iperf for an hour without a single stall/hang. Average iperf speed was 19.2 Mbps and it slowed from 20 Mbps down to 9-10 Mbps a few times but never stalled/hung completely.

rmilecki commented 6 months ago

It seems that all those slowdowns/stalls/hangs happen with high traffic only. Slowing Wi-Fi traffic down (by disabling HT or AMPDU) seems to mitigate them.

It's in sync with what I observed back in 2021 when I tried limiting iperf traffic by using -b 20M and -b 10M.

rmilecki commented 6 months ago

I was wondering if hardware still generates any IRQs during those stalls/hangs/slowdowns. I cooked a very trivial & dirty patch: dbg-rx-irqs.txt. It's terrible quality but maybe it shows something interesting? Following is synced output of client's iperf and router's kernel:

[  3]  5.0- 6.0 sec  6.38 MBytes  53.5 Mbits/sec    [ 1045.490814] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4681 0
[  3]  6.0- 7.0 sec  6.50 MBytes  54.5 Mbits/sec    [ 1046.530712] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4878 0
[  3]  7.0- 8.0 sec  6.25 MBytes  52.4 Mbits/sec    [ 1047.570853] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4832 0
[  3]  8.0- 9.0 sec  7.50 MBytes  62.9 Mbits/sec    [ 1048.610675] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4854 0
[  3]  9.0-10.0 sec  6.25 MBytes  52.4 Mbits/sec    [ 1049.650727] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4911 0
[  3] 10.0-11.0 sec  6.50 MBytes  54.5 Mbits/sec    [ 1050.690658] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4198 2
[  3] 11.0-12.0 sec  3.88 MBytes  32.5 Mbits/sec    [ 1051.730982] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 2251 19
[  3] 12.0-13.0 sec  2.88 MBytes  24.1 Mbits/sec    [ 1052.770621] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1606 17
[  3] 13.0-14.0 sec  1.12 MBytes  9.44 Mbits/sec    [ 1053.810652] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1991 0
[  3] 14.0-15.0 sec  3.50 MBytes  29.4 Mbits/sec    [ 1054.850595] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1669 0
[  3] 15.0-16.0 sec  1.88 MBytes  15.7 Mbits/sec    [ 1055.890584] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1676 0
[  3] 16.0-17.0 sec  1.88 MBytes  15.7 Mbits/sec    [ 1056.930789] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1052 0
[  3] 17.0-18.0 sec   896 KBytes  7.34 Mbits/sec    [ 1057.970606] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1086 0
[  3] 18.0-19.0 sec  1.75 MBytes  14.7 Mbits/sec    [ 1059.010619] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 1297 0
[  3] 19.0-20.0 sec  2.62 MBytes  22.0 Mbits/sec    [ 1060.050603] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4879 0
[  3] 20.0-21.0 sec  6.50 MBytes  54.5 Mbits/sec    [ 1061.090740] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4821 0
[  3] 21.0-22.0 sec  6.25 MBytes  52.4 Mbits/sec    [ 1062.130816] mt7603e 0000:02:00.0: [mt7603_dbg_watchdog] 4682 0

It turns out that whenever slowdowns happen there are some MT_INT_RX_DONE(1) interrupts (MCU interrupts).

I added some debugging to *_mcu_*_send.*() functions and none of them is called after early init phase on MT7603EN. So it seems MCU is generating IRQs and sending those packets on its own (those are not replies to MCU requests).

Some debugging in mt7603_queue_rx_skb() revealed that those are type 0 (PKT_TYPE_TXS?) IRQs and tdx[1] is 0x0080cd00 (which means idx 0) in my case. They refer to my only station connected to the router and result in mt76 calling ieee80211_sta_set_buffered().

Is that any good hint on what may be happening? My STA keeps sending traffic with iperf so it clearly doesn't go to sleep. Can that be some queuing issue?

nachalni commented 6 months ago

Hi @rmilecki, I have checked MediaTek's proprietary driver and it seems that in that driver idx 0 is not used for client STAs; the idx used for STAs starts from index 1. Maybe there is some reason behind that?

rmilecki commented 6 months ago

FWIW using Netgear OEM firmware with their wireless driver seems to make MT7603EN stable. There are slowdowns (I'm wondering if those are also MCU / TX status related) but traffic never stalls/hangs. iperf-netgear-r6220-oem.txt

rmilecki commented 6 months ago

I have checked MediaTek's proprietary driver and it seems that in that driver idx 0 is not used for client STAs; the idx used for STAs starts from index 1. Maybe there is some reason behind that?

As a very quick test I connected another device (smartphone) to R6220's MT7603EN and then my ThinkPad notebook with iperf client. I experienced the same stability issues, only this time tdx[1] is 0x0080cd01 (which means idx 1). I didn't attempt modifying driver code to avoid idx 0 in general.
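
The two values quoted in this thread (0x0080cd00 for idx 0 and 0x0080cd01 for idx 1) differ only in their low byte. The sketch below decodes them under the assumption that the station index occupies the low 8 bits of that word. This assumption is consistent with both values but has not been checked against the mt7603 register definitions, and `station_index` is a hypothetical helper, not a driver function.

```python
def station_index(word, mask=0xFF):
    """Extract the station index from a status word.

    The default mask (low byte) is an assumption that fits the two values
    reported in the thread; it is not taken from the mt7603 headers.
    """
    return word & mask


for value in (0x0080CD00, 0x0080CD01):
    print(f"0x{value:08x} -> idx {station_index(value)}")
```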

nachalni commented 6 months ago

What might be interesting too is that the MT_HIGH_PRIORITY_1 register value is different from the proprietary driver. mt76 has this value set to 0x55555553 in https://github.com/openwrt/mt76/blob/master/mt7603/init.c#L64, while the proprietary driver has this value set to 0x55555555. It would also be interesting to know the reason behind that.

rmilecki commented 6 months ago

I started wondering if MT7603EN may actually stop/slow down on sending packets. I developed another ugly debugging patch for printing max queues lengths over 1 second: mt76_max_queued.txt

Here is my manually (accuracy < 1 s) synced output of station (iperf output) and AP (kernel output):

[  3]  0.0- 1.0 sec  7.75 MBytes  65.0 Mbits/sec    [  303.736585] [mt7603_dbg_watchdog] [IRQs] tx:2166 rx0:4725 rx1:0 [QUEUES] 0 0 17 0 0 1 0
[  3]  1.0- 2.0 sec  6.38 MBytes  53.5 Mbits/sec    [  304.776652] [mt7603_dbg_watchdog] [IRQs] tx:2125 rx0:4708 rx1:0 [QUEUES] 0 0 6 0 0 1 0
[  3]  2.0- 3.0 sec  7.12 MBytes  59.8 Mbits/sec    [  305.816434] [mt7603_dbg_watchdog] [IRQs] tx:2059 rx0:4472 rx1:0 [QUEUES] 0 0 4 0 0 1 0
[  3]  3.0- 4.0 sec  6.50 MBytes  54.5 Mbits/sec    [  306.856286] [mt7603_dbg_watchdog] [IRQs] tx:2181 rx0:4735 rx1:0 [QUEUES] 1 0 10 0 0 1 0
[  3]  4.0- 5.0 sec  6.50 MBytes  54.5 Mbits/sec    [  307.896319] [mt7603_dbg_watchdog] [IRQs] tx:2180 rx0:4784 rx1:0 [QUEUES] 0 0 4 0 0 1 0
[  3]  5.0- 6.0 sec  6.50 MBytes  54.5 Mbits/sec    [  308.936232] [mt7603_dbg_watchdog] [IRQs] tx:1764 rx0:4029 rx1:3 [QUEUES] 0 0 22 0 0 1 0
[  3]  6.0- 7.0 sec  6.62 MBytes  55.6 Mbits/sec    [  309.976276] [mt7603_dbg_watchdog] [IRQs] tx:1330 rx0:3281 rx1:12 [QUEUES] 1 0 63 0 0 1 0 ← q_tx[2] gets longer = traffic slow downs
[  3]  7.0- 8.0 sec  4.00 MBytes  33.6 Mbits/sec    [  311.016331] [mt7603_dbg_watchdog] [IRQs] tx:1371 rx0:3269 rx1:13 [QUEUES] 0 0 57 0 0 1 0
[  3]  8.0- 9.0 sec  4.75 MBytes  39.8 Mbits/sec    [  312.056313] [mt7603_dbg_watchdog] [IRQs] tx:640 rx0:1492 rx1:6 [QUEUES] 0 0 63 0 0 1 0
[  3]  9.0-10.0 sec  1.75 MBytes  14.7 Mbits/sec    [  313.096662] [mt7603_dbg_watchdog] [IRQs] tx:1648 rx0:3656 rx1:0 [QUEUES] 0 0 5 0 0 1 0
[  3] 10.0-11.0 sec  3.75 MBytes  31.5 Mbits/sec    [  314.136306] [mt7603_dbg_watchdog] [IRQs] tx:2151 rx0:4775 rx1:0 [QUEUES] 0 0 8 0 0 1 0
[  3] 11.0-12.0 sec  6.62 MBytes  55.6 Mbits/sec    [  315.176833] [mt7603_dbg_watchdog] [IRQs] tx:2114 rx0:4678 rx1:0 [QUEUES] 0 0 10 0 0 1 0
[  3] 12.0-13.0 sec  6.50 MBytes  54.5 Mbits/sec    [  316.216236] [mt7603_dbg_watchdog] [IRQs] tx:2179 rx0:4758 rx1:0 [QUEUES] 0 0 5 0 0 1 0
[  3] 13.0-14.0 sec  6.50 MBytes  54.5 Mbits/sec    [  317.256330] [mt7603_dbg_watchdog] [IRQs] tx:2132 rx0:4745 rx1:0 [QUEUES] 0 0 7 0 0 1 0
[  3] 14.0-15.0 sec  6.62 MBytes  55.6 Mbits/sec    [  318.296688] [mt7603_dbg_watchdog] [IRQs] tx:2168 rx0:4808 rx1:0 [QUEUES] 0 0 6 0 0 1 0
[  3] 15.0-16.0 sec  6.50 MBytes  54.5 Mbits/sec    [  319.336478] [mt7603_dbg_watchdog] [IRQs] tx:2156 rx0:4788 rx1:0 [QUEUES] 0 0 6 0 0 1 0
[  3] 16.0-17.0 sec  7.38 MBytes  61.9 Mbits/sec    [  320.376263] [mt7603_dbg_watchdog] [IRQs] tx:2236 rx0:4777 rx1:0 [QUEUES] 0 0 7 0 0 1 0
[  3] 17.0-18.0 sec  6.62 MBytes  55.6 Mbits/sec    [  321.416134] [mt7603_dbg_watchdog] [IRQs] tx:1345 rx0:2985 rx1:0 [QUEUES] 0 0 3 0 0 1 0
[  3]  0.0- 1.0 sec  8.38 MBytes  70.3 Mbits/sec    [  439.977705] [mt7603_dbg_watchdog] [IRQs] tx:1608 rx0:4945 rx1:0 [QUEUES] 0 0 22 0 0 1 0
[  3]  1.0- 2.0 sec  7.00 MBytes  58.7 Mbits/sec    [  441.018929] [mt7603_dbg_watchdog] [IRQs] tx:1541 rx0:4946 rx1:0 [QUEUES] 0 0 35 0 0 1 0
[  3]  2.0- 3.0 sec  6.62 MBytes  55.6 Mbits/sec    [  442.057523] [mt7603_dbg_watchdog] [IRQs] tx:1705 rx0:4996 rx1:0 [QUEUES] 0 0 14 0 0 1 0
[  3]  3.0- 4.0 sec  6.62 MBytes  55.6 Mbits/sec    [  443.097840] [mt7603_dbg_watchdog] [IRQs] tx:1577 rx0:4980 rx1:0 [QUEUES] 0 0 31 0 0 1 0
[  3]  4.0- 5.0 sec  6.75 MBytes  56.6 Mbits/sec    [  444.137589] [mt7603_dbg_watchdog] [IRQs] tx:1651 rx0:5030 rx1:0 [QUEUES] 0 0 33 0 0 1 0
[  3]  5.0- 6.0 sec  6.62 MBytes  55.6 Mbits/sec    [  445.178764] [mt7603_dbg_watchdog] [IRQs] tx:1656 rx0:4940 rx1:0 [QUEUES] 0 0 33 0 0 1 0
[  3]  6.0- 7.0 sec  6.62 MBytes  55.6 Mbits/sec    [  446.219212] [mt7603_dbg_watchdog] [IRQs] tx:1650 rx0:4953 rx1:0 [QUEUES] 0 0 31 0 0 1 0
[  3]  7.0- 8.0 sec  6.62 MBytes  55.6 Mbits/sec    [  447.259172] [mt7603_dbg_watchdog] [IRQs] tx:1698 rx0:5006 rx1:0 [QUEUES] 0 0 37 0 0 1 0
[  3]  8.0- 9.0 sec  5.75 MBytes  48.2 Mbits/sec    [  448.297197] [mt7603_dbg_watchdog] [IRQs] tx:1168 rx0:3675 rx1:13 [QUEUES] 1 0 63 0 0 1 0 ← q_tx[2] gets longer = traffic slow downs
[  3]  9.0-10.0 sec  3.00 MBytes  25.2 Mbits/sec    [  449.337174] [mt7603_dbg_watchdog] [IRQs] tx:644 rx0:1982 rx1:16 [QUEUES] 0 0 42 0 0 1 0
[  3] 10.0-11.0 sec  2.12 MBytes  17.8 Mbits/sec    [  450.377171] [mt7603_dbg_watchdog] [IRQs] tx:571 rx0:1793 rx1:6 [QUEUES] 0 0 41 0 0 1 0
[  3] 11.0-12.0 sec  2.12 MBytes  17.8 Mbits/sec    [  451.417169] [mt7603_dbg_watchdog] [IRQs] tx:432 rx0:1613 rx1:9 [QUEUES] 0 0 89 0 0 1 0
[  3] 12.0-13.0 sec  1.00 MBytes  8.39 Mbits/sec    [  452.457172] [mt7603_dbg_watchdog] [IRQs] tx:177 rx0:1102 rx1:5 [QUEUES] 0 0 71 0 0 1 0
[  3] 13.0-14.0 sec  2.12 MBytes  17.8 Mbits/sec    [  453.497707] [mt7603_dbg_watchdog] [IRQs] tx:211 rx0:1008 rx1:7 [QUEUES] 0 0 86 0 0 1 0
[  3] 14.0-15.0 sec  1.00 MBytes  8.39 Mbits/sec    [  454.537117] [mt7603_dbg_watchdog] [IRQs] tx:223 rx0:1191 rx1:6 [QUEUES] 0 0 119 0 0 1 0
[  3] 15.0-16.0 sec  1.88 MBytes  15.7 Mbits/sec    [  455.577135] [mt7603_dbg_watchdog] [IRQs] tx:200 rx0:1194 rx1:3 [QUEUES] 0 0 98 0 0 1 0
[  3] 16.0-17.0 sec  0.00 Bytes  0.00 bits/sec  [  456.617086] [mt7603_dbg_watchdog] [IRQs] tx:116 rx0:233 rx1:4 [QUEUES] 0 0 108 0 0 1 0
[  3] 17.0-18.0 sec  0.00 Bytes  0.00 bits/sec  [  457.657417] [mt7603_dbg_watchdog] [IRQs] tx:40 rx0:294 rx1:1 [QUEUES] 0 0 18 0 0 1 0
[  3] 18.0-19.0 sec  1.25 MBytes  10.5 Mbits/sec    [  458.697068] [mt7603_dbg_watchdog] [IRQs] tx:579 rx0:1806 rx1:9 [QUEUES] 0 0 44 0 0 1 0
[  3] 19.0-20.0 sec  1.88 MBytes  15.7 Mbits/sec    [  459.737067] [mt7603_dbg_watchdog] [IRQs] tx:352 rx0:1874 rx1:6 [QUEUES] 0 0 101 0 0 1 0
[  3] 20.0-21.0 sec  2.00 MBytes  16.8 Mbits/sec    [  460.777078] [mt7603_dbg_watchdog] [IRQs] tx:111 rx0:230 rx1:4 [QUEUES] 0 0 81 0 0 1 0

(scroll those above horizontally for my comments)
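
The correlation in the synced output above (slowdowns coincide with non-zero rx1 counts and a long third queue) can be checked mechanically. Below is a minimal Python sketch that parses the combined iperf/kernel lines in the format produced by the debugging patch and flags the suspicious seconds. The function names and the queue length limit of 40 are illustrative assumptions; queue index 2 follows the q_tx[2] annotation in the log.

```python
import re

# One combined line: iperf rate, kernel timestamp, IRQ counters, queue lengths.
ROW_RE = re.compile(
    r"(?P<rate>[\d.]+)\s+(?P<unit>[KMG]?)bits/sec\s+"
    r"\[\s*(?P<ts>[\d.]+)\]\s+\[mt7603_dbg_watchdog\]\s+\[IRQs\]\s+"
    r"tx:(?P<tx>\d+)\s+rx0:(?P<rx0>\d+)\s+rx1:(?P<rx1>\d+)\s+"
    r"\[QUEUES\]\s+(?P<queues>\d+(?:\s+\d+)*)"
)

# Factors that convert the reported unit to Mbit/s.
SCALE = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}


def parse_rows(text):
    """Parse combined iperf + watchdog lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if not m:
            continue
        rows.append({
            "mbps": float(m["rate"]) * SCALE[m["unit"]],
            "tx": int(m["tx"]),
            "rx0": int(m["rx0"]),
            "rx1": int(m["rx1"]),
            "queues": [int(q) for q in m["queues"].split()],
        })
    return rows


def suspicious(rows, queue_index=2, queue_limit=40):
    """Rows with MCU RX interrupts or an unusually long TX queue."""
    return [
        r for r in rows
        if r["rx1"] > 0 or r["queues"][queue_index] > queue_limit
    ]


# Three lines copied from the first run above.
sample = """\
[  3]  4.0- 5.0 sec  6.50 MBytes  54.5 Mbits/sec    [  307.896319] [mt7603_dbg_watchdog] [IRQs] tx:2180 rx0:4784 rx1:0 [QUEUES] 0 0 4 0 0 1 0
[  3]  6.0- 7.0 sec  6.62 MBytes  55.6 Mbits/sec    [  309.976276] [mt7603_dbg_watchdog] [IRQs] tx:1330 rx0:3281 rx1:12 [QUEUES] 1 0 63 0 0 1 0
[  3]  8.0- 9.0 sec  4.75 MBytes  39.8 Mbits/sec    [  312.056313] [mt7603_dbg_watchdog] [IRQs] tx:640 rx0:1492 rx1:6 [QUEUES] 0 0 63 0 0 1 0
"""

hits = suspicious(parse_rows(sample))
for row in hits:
    print(row["mbps"], row["rx1"], row["queues"][2])
```

Because the two logs were synced by hand with under one second of accuracy, an offset of one row between the iperf rate and the kernel counters is to be expected.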

rmilecki commented 6 months ago

proprietary driver has this value set to 0x55555555.

I changed mt76 to use 0x55555555 but that doesn't help.

nachalni commented 6 months ago

Here is my manually (accuracy < 1 s) synced output of station (iperf output) and AP (kernel output):

(...)

And it seems that interrupts with MT_RXQ_MCU are coming at the time of the slowdown, the client device goes to powersave mode and TX packets are looped back?

rmilecki commented 6 months ago

And it seems that interrupts with MT_RXQ_MCU are coming at the time of the slowdown, the client device goes to powersave mode and TX packets are looped back?

Yeah, that's in sync with what I observed and described earlier in https://github.com/openwrt/mt76/issues/865#issuecomment-1980806588

Linaro1985 commented 6 months ago

Hi @rmilecki, thank you for researching. Also please take a look at https://github.com/openwrt/mt76/commit/a8d9553d8fc4db9c12022451ca1d2368e796c591#commitcomment-130672442. Watchdog functionality was broken. Rolling back this commit restores it.

rmilecki commented 6 months ago

@Linaro1985: FWIW I tried reverting that commit but it didn't help my case:

[  3] 50.0-51.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 51.0-52.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3] 52.0-53.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 53.0-54.0 sec  3.62 MBytes  30.4 Mbits/sec
[  3] 54.0-55.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 55.0-56.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3] 56.0-57.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3] 57.0-58.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3] 58.0-59.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 59.0-60.0 sec  3.00 MBytes  25.2 Mbits/sec
[  3] 60.0-61.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 61.0-62.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3] 62.0-63.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 63.0-64.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 64.0-65.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 65.0-66.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 66.0-67.0 sec  1.75 MBytes  14.7 Mbits/sec
[  3] 67.0-68.0 sec  7.50 MBytes  62.9 Mbits/sec
[  3] 68.0-69.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 69.0-70.0 sec  6.50 MBytes  54.5 Mbits/sec
(...)
[  3] 150.0-151.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 151.0-152.0 sec  3.00 MBytes  25.2 Mbits/sec
[  3] 152.0-153.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 153.0-154.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3] 154.0-155.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 155.0-156.0 sec  2.75 MBytes  23.1 Mbits/sec
[  3] 156.0-157.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 157.0-158.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 158.0-159.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 159.0-160.0 sec  7.38 MBytes  61.9 Mbits/sec
[  3] 160.0-161.0 sec  6.62 MBytes  55.6 Mbits/sec
[  3] 161.0-162.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 162.0-163.0 sec  7.62 MBytes  64.0 Mbits/sec
[  3] 163.0-164.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 164.0-165.0 sec  6.50 MBytes  54.5 Mbits/sec
(...)
[  3] 175.0-176.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 176.0-177.0 sec  7.62 MBytes  64.0 Mbits/sec
[  3] 177.0-178.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 178.0-179.0 sec  7.25 MBytes  60.8 Mbits/sec
[  3] 179.0-180.0 sec  5.62 MBytes  47.2 Mbits/sec
[  3] 180.0-181.0 sec  1.88 MBytes  15.7 Mbits/sec
[  3] 181.0-182.0 sec  2.00 MBytes  16.8 Mbits/sec
[  3] 182.0-183.0 sec  1.00 MBytes  8.39 Mbits/sec
[  3] 183.0-184.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 184.0-185.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 185.0-186.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 186.0-187.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 187.0-188.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 188.0-189.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 189.0-190.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 190.0-191.0 sec  3.75 MBytes  31.5 Mbits/sec
[  3] 191.0-192.0 sec  3.75 MBytes  31.5 Mbits/sec
[  3] 192.0-193.0 sec  4.88 MBytes  40.9 Mbits/sec
[  3] 193.0-194.0 sec  4.88 MBytes  40.9 Mbits/sec
[  3] 194.0-195.0 sec  5.50 MBytes  46.1 Mbits/sec

Please note that my issue goes back to 2021 at least. I guess those are just 2 different problems.

rmilecki commented 6 months ago

I developed a simple workaround that seems to fix stability for me with MT7603 and MT7628: [PATCH] wifi: mt76: mt7603: add debugfs attr for disabling frames buffering

I just pushed that under-review PATCH to OpenWrt, see commit 7236d4f82b57 ("mt76: add mt7603 possible workaround for MT7603EN / MT7628AN stability")

Both my devices seem really stable with mt7603 as soon as I do:

echo N > /sys/kernel/debug/ieee80211/phy0/mt76/frames_buffering

LuisMitaHL commented 5 months ago

The patch for disabling frames buffering seems to have been deleted a few hours ago in https://github.com/openwrt/openwrt/commit/a10a6fbac794b30885d65ec817ebdcfe9f94d78a

Besides that, a new version of mt76 arrived and two commits from there are fixes for mt7603. Would that be the final solution?

Linaro1985 commented 5 months ago

@enmaskarado I think it would be the final solution because of https://github.com/openwrt/mt76/commit/e4de3592c4e3baa82142eff583cb5a761f790709 (see commit description)

By the way, I'm already testing the fixes and so far everything is fine.

everything411 commented 5 months ago

I have not experienced any stability issues after https://github.com/openwrt/mt76/commit/e4de3592c4e3baa82142eff583cb5a761f790709 . For me, mt76 is more stable than the proprietary driver now. There are some slowdowns on a MI 4C MT7628 router using the proprietary drivers, but mt76 doesn't suffer from that now.

biboc commented 5 months ago

@everything411 Which version of OpenWrt are you on? Do you also get the mt76_wmac "MCU timed out" problem (https://github.com/openwrt/mt76/issues/628)? I can't fix it yet. Did you change the eth driver as well?

Based on OpenWrt 23, I patched the mt76 driver with the two commits you mentioned, but I still have the problem.

everything411 commented 5 months ago

@biboc I'm on OpenWrt master. Did you backport b14c235? This commit is not in 23.05.

biboc commented 5 months ago

@everything411 I built OpenWrt and bumped the Makefile https://github.com/openwrt/openwrt/blob/main/package/kernel/mt76/Makefile to 2024-04-03, which includes https://github.com/openwrt/mt76/commit/b14c2351ddb8601c322576d84029e463d456caef, doesn't it?

I got multiple MCU hangs like those described here: https://github.com/openwrt/mt76/issues/628. It may be the cause of my problem.

biboc commented 5 months ago

OK, the MCU hang comes from another program that restarted Wi-Fi. Now I only have "Beacon stuck" and "TX hang":

# cat /sys/kernel/debug/ieee80211/phy0/mt76/reset
             TX hang: 88
   TX DMA busy stuck: 0
   RX DMA busy stuck: 0
        Beacon stuck: 9172
   RX PSE busy stuck: 0
            MCU hang: 0
    PSE reset failed: 0
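
To track how these counters change over time, the debugfs output can be turned into a dictionary. The sketch below parses text in the format shown above. `parse_reset_counters` is a hypothetical helper name, and the sample text is copied from this comment; on a device the text would come from reading /sys/kernel/debug/ieee80211/phy0/mt76/reset.

```python
def parse_reset_counters(text):
    """Parse "name: value" lines from the mt76 reset debugfs file into a dict."""
    counters = {}
    for line in text.splitlines():
        # Split on the last colon so the counter name is kept whole.
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


sample = """\
             TX hang: 88
   TX DMA busy stuck: 0
   RX DMA busy stuck: 0
        Beacon stuck: 9172
   RX PSE busy stuck: 0
            MCU hang: 0
    PSE reset failed: 0
"""

counters = parse_reset_counters(sample)
nonzero = {name: count for name, count in counters.items() if count}
print(nonzero)
```

Taking two snapshots a few minutes apart and subtracting them shows which counters are still increasing.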

And ping times are very long, 4 to 15 seconds!

# ping 10.201.21.88
PING 10.201.21.88 (10.201.21.88): 56 data bytes
64 bytes from 10.201.21.88: seq=0 ttl=64 time=12517.019 ms
64 bytes from 10.201.21.88: seq=1 ttl=64 time=11659.209 ms
64 bytes from 10.201.21.88: seq=2 ttl=64 time=14620.829 ms
64 bytes from 10.201.21.88: seq=3 ttl=64 time=14952.260 ms
64 bytes from 10.201.21.88: seq=4 ttl=64 time=15381.312 ms
64 bytes from 10.201.21.88: seq=5 ttl=64 time=15897.280 ms
64 bytes from 10.201.21.88: seq=6 ttl=64 time=14896.970 ms
64 bytes from 10.201.21.88: seq=7 ttl=64 time=13912.369 ms
64 bytes from 10.201.21.88: seq=8 ttl=64 time=13224.867 ms
64 bytes from 10.201.21.88: seq=16 ttl=64 time=8219.529 ms
64 bytes from 10.201.21.88: seq=17 ttl=64 time=7528.824 ms
64 bytes from 10.201.21.88: seq=18 ttl=64 time=8228.625 ms
64 bytes from 10.201.21.88: seq=21 ttl=64 time=6560.212 ms
64 bytes from 10.201.21.88: seq=22 ttl=64 time=6444.579 ms
64 bytes from 10.201.21.88: seq=23 ttl=64 time=7337.334 ms
64 bytes from 10.201.21.88: seq=24 ttl=64 time=6492.642 ms
64 bytes from 10.201.21.88: seq=25 ttl=64 time=6079.157 ms
64 bytes from 10.201.21.88: seq=26 ttl=64 time=5514.735 ms
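
Besides the long round-trip times, the ping output above has gaps in the sequence numbers, which means some replies never arrived. Below is a minimal Python sketch that extracts both facts from output in this format; `analyze_ping` is an illustrative name, and the sample is a subset of the lines above.

```python
import re

# Matches the "seq=N ttl=N time=X ms" part of each reply line.
PING_RE = re.compile(r"seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s+ms")


def analyze_ping(text):
    """Summarize reply count, missing sequence numbers and RTT range."""
    replies = [(int(seq), float(rtt)) for seq, rtt in PING_RE.findall(text)]
    if not replies:
        return None
    seen = {seq for seq, _ in replies}
    # Sequence numbers between the first and last reply that got no answer.
    missing = [s for s in range(min(seen), max(seen) + 1) if s not in seen]
    rtts = [rtt for _, rtt in replies]
    return {
        "received": len(replies),
        "missing": missing,
        "min_ms": min(rtts),
        "max_ms": max(rtts),
    }


sample = """\
64 bytes from 10.201.21.88: seq=7 ttl=64 time=13912.369 ms
64 bytes from 10.201.21.88: seq=8 ttl=64 time=13224.867 ms
64 bytes from 10.201.21.88: seq=16 ttl=64 time=8219.529 ms
64 bytes from 10.201.21.88: seq=17 ttl=64 time=7528.824 ms
64 bytes from 10.201.21.88: seq=18 ttl=64 time=8228.625 ms
64 bytes from 10.201.21.88: seq=21 ttl=64 time=6560.212 ms
"""

summary = analyze_ping(sample)
print(summary)
```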

The station is close to this one, with an OK metric:

Station --:--:--:--:--:-- (on mesh0)
        signal:         -49 [-49, -87] dBm
        signal avg:     -48 [-48, -86] dBm
        mesh plink:     ESTAB

DEST ADDR         NEXT HOP          IFACE       SN      METRIC  QLEN    EXPTIME DTIM    DRET    FLAGS   HOP_COUNT       PATH_CHANGE
--:--:--:--:--:-- --:--:--:--:--:-- mesh0       1       1366    0       2900    1600    4       0x15    1       1

biboc commented 5 months ago

@everything411 I'm on OpenWrt 23.05.3 + mt76 PKG_SOURCE_DATE:=2024-04-03 PKG_SOURCE_VERSION:=1e336a8582dce2ef32ddd440d423e9afef961e71

biboc commented 5 months ago

@nbd168 Any idea about my problem? Why are ping and connections to other nodes so slow? And what causes "Beacon stuck" and "TX hang"? Thanks.

Linaro1985 commented 5 months ago

I no longer have problems with hangs, but sometimes when I reboot the device I get this:

[   10.722558] pci 0000:00:00.0: enabling device (0000 -> 0003)
[   10.733956] mt7603e 0000:01:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[   10.753158] mt7603e 0000:01:00.0: ASIC revision: 0000
[   10.764095] ------------[ cut here ]------------
[   10.773293] WARNING: CPU: 3 PID: 535 at target-mipsel_24kc_musl/linux-ramips_mt7621/mt76-2024-04-03-1e336a85/mt7603/eeprom.c:27 0x823a7f00 [mt7603e@(ptrval)+0x9980]
[   10.802587] Modules linked in: mt7603e(+) mt76_connac_lib mt76 mac80211 libchacha20poly1305 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_policy xt_multiport xt_mark xt_mac xt_limit xt_esp xt_comment xt_TCPMSS xt_LOG xfrm_interface ts_kmp ts_fsm ts_bm slhc poly1305_mips nfnetlink nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c iptable_mangle iptable_filter ipt_ah ip_tables hwmon crc_ccitt compat chacha_mips asn1_decoder ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_gre ip_gre gre l2tp_netlink l2tp_core udp_tunnel ip6_udp_tunnel ipcomp6 xfrm6_tunnel esp6 ah6 xfrm4_tunnel ipcomp esp4 ah4 ip6_tunnel tunnel6 tunnel4 ip_tunnel xfrm_user xfrm_ipcomp af_key xfrm_algo crypto_user algif_skcipher algif_rng algif_hash algif_aead af_alg sha512_generic sha256_generic libsha256 sha1_generic seqiv jitterentropy_rng drbg md5 kpp crypto_hw_eip93 hmac echainiv ecb des_generic libdes cmac cbc authencesn authenc arc4 leds_gpio
[   10.803332]  gpio_button_hotplug crc32c_generic
[   10.985342] CPU: 3 PID: 535 Comm: kmodloader Not tainted 5.15.150 #0
[   10.997999] Stack : 000f0000 823b0000 00000001 80083bf0 00000000 00000000 00000000 00000000
[   11.014687]         00000000 00000000 00000000 00000000 00000000 00000001 82057ad0 80c7f460
[   11.031371]         82057b68 00000000 00000000 82057978 00000038 8039f0e4 ffffffea 00000000
[   11.048056]         82057984 000000f0 8081cab0 ffffffff 8073ae10 82057ab0 00000000 823a7f00
[   11.064744]         00000009 82860220 000f0000 823b0000 00000018 80411304 0000000c 809d000c
[   11.081431]         ...
[   11.086301] Call Trace:
[   11.086355] [<80083bf0>] 0x80083bf0
[   11.098168] [<8039f0e4>] 0x8039f0e4
[   11.105129] [<823a7f00>] 0x823a7f00 [mt7603e@(ptrval)+0x9980]
[   11.116590] [<80411304>] 0x80411304
[   11.123544] [<80007908>] 0x80007908
[   11.130484] [<80007910>] 0x80007910
[   11.137428] [<823a7f00>] 0x823a7f00 [mt7603e@(ptrval)+0x9980]
[   11.148873] [<803831c4>] 0x803831c4
[   11.155829] [<80720000>] 0x80720000
[   11.162771] [<8002df2c>] 0x8002df2c
[   11.169712] [<823a7f00>] 0x823a7f00 [mt7603e@(ptrval)+0x9980]
[   11.181159] [<8002e010>] 0x8002e010
[   11.188117] [<823a7dd0>] 0x823a7dd0 [mt7603e@(ptrval)+0x9980]
[   11.197962] urngd: v1.0.2 started.
[   11.199609] [<823a7f00>] 0x823a7f00 [mt7603e@(ptrval)+0x9980]
[   11.217800] [<823a1fa8>] 0x823a1fa8 [mt7603e@(ptrval)+0x9980]
[   11.229277] [<8008c864>] 0x8008c864
[   11.236234] [<8041d844>] 0x8041d844
[   11.243193] [<823a0168>] 0x823a0168 [mt7603e@(ptrval)+0x9980]
[   11.254633] [<803d3c98>] 0x803d3c98
[   11.261588] [<803cacd0>] 0x803cacd0
[   11.268527] [<803ca2b8>] 0x803ca2b8
[   11.275475] [<80424474>] 0x80424474
[   11.282416] [<802521dc>] 0x802521dc
[   11.289367] [<804249a8>] 0x804249a8
[   11.296330] [<80425138>] 0x80425138
[   11.303279] [<8042508c>] 0x8042508c
[   11.310217] [<80421e68>] 0x80421e68
[   11.317179] [<80423648>] 0x80423648
[   11.321222] irq 26: nobody cared (try booting with the "irqpoll" option)
[   11.324144] [<80425aa0>] 0x80425aa0
[   11.344367] [<8018dcf0>] 0x8018dcf0
[   11.351321] [<823af048>] 0x823af048 [mt7603e@(ptrval)+0x9980]
[   11.362769] [<823af000>] 0x823af000 [mt7603e@(ptrval)+0x9980]
[   11.374214] [<8000157c>] 0x8000157c
[   11.381180] [<800c5664>] 0x800c5664
[   11.388117] [<802b5d0c>] 0x802b5d0c
[   11.395078] [<800c350c>] 0x800c350c
[   11.402023] [<800c5738>] 0x800c5738
[   11.408984] [<80014550>] 0x80014550

Another reboot fixes mt7603 initialization.

biboc commented 5 months ago

I'll open a new issue

lukasz1992 commented 2 months ago

https://github.com/openwrt/mt76/commit/c3eba20da2c0e77acb171434662b86aedb03cbbf ?