Closed rany2 closed 8 months ago
@littoy Could you try
bb517921433c2ba1235c98f68f94c36ff6600188
?
test use this commit have no lucky, also 'timeout' in about 1hour.
I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce
@ryderlee1110 This is another log prior to a crash, could this be an issue of the driver being out of sync with hw?
[ 8981.175817] ieee80211 phy0: WM: ( 375.863107:48:MQM-W)[WR] BSS usage overflows during removing entry
[ 8985.955163] ieee80211 phy0: WM: ( 380.641306:49:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 8986.956829] ieee80211 phy0: WM: ( 381.642740:50:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 8987.958631] ieee80211 phy0: WM: ( 382.644296:51:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 0
[ 9073.633155] ieee80211 phy0: WM: ( 468.298013:52:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 9074.634493] ieee80211 phy0: WM: ( 469.299112:53:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 9075.636255] ieee80211 phy0: WM: ( 470.300638:54:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 0
[ 9165.284912] ieee80211 phy0: WM: ( 559.927530:55:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0
[ 9165.294330] ieee80211 phy0: WM: ( 559.936929:56:RA-E)set initRateDownMCS[4] old 1ss_m0, new 1ss_m0
[ 9165.303342] ieee80211 phy0: WM: ( 559.937082:57:TXC-E)heACtrlInitStaRec(): prOperMode=401438, u2OperMode=10!
[ 9165.313318] ieee80211 phy0: WM: ( 559.937113:58:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2
[ 9165.345661] ieee80211 phy0: WM: ( 559.988260:59:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
[ 9165.354837] ieee80211 phy0: WM: ( 559.988291:60:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
[ 9165.363943] ieee80211 phy0: WM: ( 559.988321:61:BF-E)BFSetNdpRate() Invalid: 1, WlanIdx: 6, u1Mcs: 9
[ 9165.373213] ieee80211 phy0: WM: ( 559.988352:62:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 7, u1Mcs: 9
[ 9167.534626] ieee80211 phy0: WM: ( 562.176676:63:MQM-W)[WR] BSS usage overflows during removing entry
[ 9187.663866] mt7915e 0000:02:00.0: Message 000025ed (seq 12) timeout
[ 9208.143726] mt7915e 0000:02:00.0: Message 0000aded (seq 13) timeout
[ 9228.623577] mt7915e 0000:02:00.0: Message 00005aed (seq 14) timeout
[ 9249.103402] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout
[ 9269.583234] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout
Edit: logcat:
Thu May 25 20:44:51 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit [554/14213]
Thu May 25 20:44:51 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 1.000000 below 6 limit
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 105 ack=1
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 105 04
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 98 ack=1
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 98 04
Thu May 25 20:45:01 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit
Thu May 25 20:45:01 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 107 ack=1
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 107 04
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 99 ack=1
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 99 04
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 109 ack=1
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 109 04
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 100 ack=1
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 100 04
Thu May 25 20:45:09 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 111 ack=1
Thu May 25 20:45:09 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 111 04
Thu May 25 20:45:10 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 101 ack=1
Thu May 25 20:45:10 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 101 04
Thu May 25 20:45:11 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit
Thu May 25 20:45:11 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit
Thu May 25 20:45:11 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 113 ack=1
Thu May 25 20:45:11 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 113 04
Thu May 25 20:45:12 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 102 ack=1
Thu May 25 20:45:12 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 102 04
Thu May 25 20:45:18 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.284912] ieee80211 phy0: WM: ( 559.927530:55:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.294330] ieee80211 phy0: WM: ( 559.936929:56:RA-E)set initRateDownMCS[4] old 1ss_m0, new 1ss_m0
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.303342] ieee80211 phy0: WM: ( 559.937082:57:TXC-E)heACtrlInitStaRec(): prOperMode=401438, u2OperMode=10!
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.313318] ieee80211 phy0: WM: ( 559.937113:58:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap5ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 1)
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap5ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.345661] ieee80211 phy0: WM: ( 559.988260:59:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.354837] ieee80211 phy0: WM: ( 559.988291:60:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.363943] ieee80211 phy0: WM: ( 559.988321:61:BF-E)BFSetNdpRate() Invalid: 1, WlanIdx: 6, u1Mcs: 9
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.373213] ieee80211 phy0: WM: ( 559.988352:62:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 7, u1Mcs: 9
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap2ghz: Prune association for 04:e5:98:de:e9:f2
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap2ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session 345B1D0219B51657
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)
Thu May 25 20:45:21 2023 kern.info kernel: [ 9167.534626] ieee80211 phy0: WM: ( 562.176676:63:MQM-W)[WR] BSS usage overflows during removing entry
Thu May 25 20:45:41 2023 kern.err kernel: [ 9187.663866] mt7915e 0000:02:00.0: Message 000025ed (seq 12) timeout
Thu May 25 20:46:01 2023 kern.err kernel: [ 9208.143726] mt7915e 0000:02:00.0: Message 0000aded (seq 13) timeout
Thu May 25 20:46:22 2023 kern.err kernel: [ 9228.623577] mt7915e 0000:02:00.0: Message 00005aed (seq 14) timeout
Thu May 25 20:46:42 2023 kern.err kernel: [ 9249.103402] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout
Thu May 25 20:47:03 2023 kern.err kernel: [ 9269.583234] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout
Thu May 25 20:47:23 2023 kern.err kernel: [ 9290.063058] mt7915e 0000:02:00.0: Message 0000aded (seq 2) timeout
Thu May 25 20:47:44 2023 kern.err kernel: [ 9310.542887] mt7915e 0000:02:00.0: Message 00005aed (seq 3) timeout Thu May 25 20:48:04 2023 kern.err kernel: [ 9331.022721] mt7915e 0000:02:00.0: Message 00005aed (seq 4) timeout
Thu May 25 20:48:25 2023 kern.err kernel: [ 9351.502556] mt7915e 0000:02:00.0: Message 00005aed (seq 5) timeout
Just a recap:
@littoy Could you try
bb517921433c2ba1235c98f68f94c36ff6600188
?test use this commit have no lucky, also 'timeout' in about 1hour.
I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce
Which repo? (cannot find this commit)
@littoy Could you try
bb517921433c2ba1235c98f68f94c36ff6600188
?test use this commit have no lucky, also 'timeout' in about 1hour. I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce
Which repo? (cannot find this commit)
your repo at 22/05, commit hash was overwrited looks like.
Found some possible rules, if the network interface configuration is frequently modified in openwrt, timeout is prone to occur. There seems to be an easier way to reproduce the problem I encountered. /etc/init.d/network restart/reload multiple times can be reproduced.
@littoy I think that's a different issue caused by AP being stopped in an improper way. It was fixed in mt7921, I'll make a patch to see if it is fixed by implementing something similar in mt7915.
@littoy On my device I cannot have it fail by spamming "/etc/init.d/network restart", can you try 0f62587b7f0dc9c165823400e7ef736f4fe4a803?
I have no idea what happend, test result:
Fri May 26 20:09:44 2023 daemon.info samba4-server: io_uring module found, enabling VFS io_uring. (also needs Kernel 5.4+ Support)
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb IEEE 802.11: authenticated
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb IEEE 802.11: associated (aid 1)
Fri May 26 20:09:59 2023 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f4:d4:88:61:7e:fb auth_alg=open
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb RADIUS: starting accounting session 006DFC5B6CB176BF
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb WPA: pairwise key handshake completed (RSN)
Fri May 26 20:09:59 2023 daemon.notice hostapd: wlan1: EAPOL-4WAY-HS-COMPLETED f4:d4:88:61:7e:fb
Fri May 26 20:13:35 2023 authpriv.info dropbear[31288]: Child connection from 192.168.127.214:57784
Fri May 26 20:13:35 2023 authpriv.notice dropbear[31288]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:57784
Fri May 26 20:13:52 2023 authpriv.info dropbear[31288]: Exit (root) from <192.168.127.214:57784>: Disconnect received
Fri May 26 20:14:25 2023 authpriv.info dropbear[31769]: Child connection from 192.168.127.214:57877
Fri May 26 20:14:25 2023 authpriv.notice dropbear[31769]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:57877
Fri May 26 20:14:28 2023 authpriv.info dropbear[31769]: Exit (root) from <192.168.127.214:57877>: Disconnect received
Fri May 26 20:27:39 2023 authpriv.info dropbear[6459]: Child connection from 192.168.127.214:60580
Fri May 26 20:27:39 2023 authpriv.notice dropbear[6459]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:60580
Fri May 26 20:27:41 2023 authpriv.info dropbear[6459]: Exit (root) from <192.168.127.214:60580>: Disconnect received
Fri May 26 20:29:45 2023 daemon.info : 12[KNL] interface utun deleted
Fri May 26 20:29:45 2023 daemon.info : 05[KNL] interface utun deleted
Fri May 26 20:29:53 2023 daemon.info : 05[KNL] interface utun deleted
Fri May 26 20:29:53 2023 daemon.info : 07[KNL] interface utun deleted
Fri May 26 20:35:19 2023 kern.err kernel: [ 2134.403004] mt7915e 0000:04:00.0: Message 00005aed (seq 5) timeout
Fri May 26 20:35:39 2023 kern.err kernel: [ 2154.883023] mt7915e 0000:04:00.0: Message 00005aed (seq 6) timeout
Fri May 26 20:36:00 2023 kern.err kernel: [ 2175.363345] mt7915e 0000:04:00.0: Message 00005aed (seq 7) timeout
Fri May 26 20:36:20 2023 kern.err kernel: [ 2195.843122] mt7915e 0000:04:00.0: Message 00005aed (seq 8) timeout
Fri May 26 20:36:41 2023 kern.err kernel: [ 2216.323164] mt7915e 0000:04:00.0: Message 00005aed (seq 9) timeout
Fri May 26 20:37:01 2023 kern.err kernel: [ 2236.803375] mt7915e 0000:04:00.0: Message 00005aed (seq 10) timeout
Fri May 26 20:37:01 2023 authpriv.info dropbear[15618]: Child connection from 192.168.125.32:63664
Fri May 26 20:37:01 2023 authpriv.notice dropbear[15618]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.125.32:63664
Fri May 26 20:37:22 2023 kern.err kernel: [ 2257.283191] mt7915e 0000:04:00.0: Message 000026ed (seq 11) timeout
Fri May 26 20:37:42 2023 kern.err kernel: [ 2277.763267] mt7915e 0000:04:00.0: Message 00005aed (seq 12) timeout
Fri May 26 20:38:01 2023 authpriv.info dropbear[15618]: Exit (root) from <192.168.125.32:63664>: Disconnect received
Fri May 26 20:38:03 2023 kern.err kernel: [ 2298.243353] mt7915e 0000:04:00.0: Message 00005aed (seq 13) timeout
When I upgrade to mt76 from your repo 1b65b9833e957cd71f557b1cfb0e5d40231c3fb3 and sync mac80211+hostapd from openwrt master branch, now looks work stable.
I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.
When I upgrade to mt76 from your repo 1b65b9833e957cd71f557b1cfb0e5d40231c3fb3 and sync mac80211+hostapd from openwrt master branch, now looks work stable.
Same here, been running fine for more than 12 hours now but I was holding off sending a comment here until it lasts for a day. Happy to know you have similar results..
Fantastic news, thank you rany2 very much for your work!
I wonder if everything is still fine with the most recent firmware, but not have the device yet to test it.
@littoy Just to confirm, you're not having any crashes anymore with your MT7916?
@littoy Just to confirm, you're not having any crashes anymore with your MT7916?
yes, without any problem since last 20hours.
I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.
This is what happens now on the router:
Never saw this before. I think I'll revert to the original mt76 code + 20230315163130 WA firmware.
@ggg70 if you could try again, then observing /proc/interrupts would be useful to help troubleshoot. Could you also share your config/scenario?
I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.
This is what happens now on the router:
Never saw this before. I think I'll revert to the original mt76 code + 20230315163130 WA firmware.
That's normal, if throughput and CPU consumption is correlated I wouldn't worry. Its especially an issue if you have a PPP connection (unrelated to mt76)
That's normal, if throughput and CPU consumption is correlated I wouldn't worry. Its especially an issue if you have a PPP connection (unrelated to mt76)
It is not correlated, I wasn't performing heavy transfers with the router during those intervals. With wed enabled I usually see 30-40% CPU utilization for downloads and 60-70% for uploads at around 800Mbps (with PPPoE). I'll keep you informed how's working after using mt76 with just the "good" firmware and some minor modifications.
@ryderlee1110 Just heads up, I think the commits that fixed these issues were:
I think you can post this to linux-wireless https://github.com/rany2/mt76/commit/a1742bcfe3e6770370240ba9309d1fb2661d28ad BTW could you please closed tickets that have been resolved?
@ryderlee1110 I have only opened this issue and don't have any privileges to close anything else. I do agree the issue tracker needs some cleanup. I'll submit that patch later today, thanks for feedback.
P.S do you think this issue should be closed as it is caused by WA firmware and not mt76 itself?
@rany It appears this did fix the crash!!! I tried about 10 times without any issues.
On Wed, May 24, 2023, 5:35 PM rany @.***> wrote:
@Brain2000 https://github.com/Brain2000 I think the WA fw is what fixed it, but I updated both WA fw and ROM patch just in case. WM fw could not be used due to (what I think are) vendor modifications making it incompatible with mt76. However so far it's been smooth sailing. I undid all the patches I applied from mtk-openwrt-feeds and it seems to be going well so far.
While this is not confirmed, there is a chance that the only reason WA fw from TP-Link's EAP->>some numbers here<< work is because they patched it and fixed this issue themselves, which is a possibility. I noticed that the debugging info in the fw show tp-link specific paths so they did build that firmware themselves, and possibly made changes that resolved the issue.
@littoy https://github.com/littoy If you're using my mt76 repo, try updating to 1fdd73655aadfe85e2d145e259e814ccaea75dd1. Having solid uptime so far on it (doesn't crash in under an hour with many clients like it used to before).
— Reply to this email directly, view it on GitHub https://github.com/openwrt/mt76/issues/776#issuecomment-1561956079, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABIAWFVVJYZXPJJ6LPORKRLXHZ5LDANCNFSM6AAAAAAXTQZTDE . You are receiving this because you were mentioned.Message ID: @.***>
@Brain2000 Fantastic, happy to have confirmation! Though it's actually a combination of two things as previously stated:
If you use the version in my repo you should have these issues fixed with no crashes. Haven't had an issue since!
@rany2 I'm asking just to be sure: the cause for this issue is the GET_MIB_INFO method of getting survey data? But I think this method was introduced in mt76 long time ago, why did the timeout problem appear only recently? My OpenWrt build using "official" mt76 + your WA firmware did not fix the timeout problem, so in other words should I also apply this in order to fix it? Thanks.
@ggg70 yes, the WA firmware is only to prevent the out of resource issue when UDP flood occurs.
@ggg70 yes, the WA firmware is only to prevent the out of resource issue when UDP flood occurs.
Sorry, but I am still confused. From your description: "The vendor's mt7915_wm could not be used due to an incompatibility with mt76, however it might be possible to make use of it by reverting changes that made use of GET_MIB_INFO" I infer that this patch is necessary only if you also use the WM firmware from TP-Link, so it would not be necessary for other versions (i. e. 2023041815). So I don't see the connection with that timeout problem... unless this is specifically solved by WM firmware from TP-Link and in this case I am forced to use the GET_MIB_INFO reverting patch.
@ggg70 The commit msg is out of date, initial motivation for that commit was to get it to work with the vendor firmware but as it turns out it ended up fixing the crashes on its own.
tl;dr I fixed the issue by accident while trying to get it work with vendor firmware, getting survey data with that command was the cause and there was no need to switch to vendor WM fw.
@ggg70 if you could try again, then observing /proc/interrupts would be useful to help troubleshoot. Could you also share your config/scenario?
@lukasz1992 After almost 4 days of working fine, the problem with 100% CPU usage appeared again. Here is the interrupts file:
CPU0 CPU1
10: 14125046 18713122 GICv2 30 Level arch_timer
15: 1 0 MT_SYSIRQ 163 Level mt-pmic-pwrap
22: 0 0 mt-eint 0 Edge gpio-keys
75: 1 0 mt-eint 53 Level mt7530
124: 0 0 mt-eint 102 Edge gpio-keys
125: 12 0 MT_SYSIRQ 91 Level ttyS0
128: 0 0 MT_SYSIRQ 118 Level 1100a000.spi
131: 22145 0 MT_SYSIRQ 96 Level mtk-snand
132: 84039 0 MT_SYSIRQ 95 Level mtk-ecc
133: 0 0 MT_SYSIRQ 122 Level 11016000.spi
134: 4144758 0 MT_SYSIRQ 211 Level mt7615e
135: 0 0 MT_SYSIRQ 232 Level xhci-hcd:usb1
138: 0 0 MT_SYSIRQ 219 Level 1b007000.dma-controller
139: 40526140 0 MT_SYSIRQ 214 Level mt7915e
142: 13794927 0 MT_SYSIRQ 224 Level 1b100000.ethernet
143: 290 1852388 MT_SYSIRQ 225 Level 1b100000.ethernet
146: 0 0 dummy 0 Edge PCIe PME
147: 0 0 mt7530 0 Edge mt7530-0:00
148: 0 0 mt7530 1 Edge mt7530-0:01
149: 0 0 mt7530 2 Edge mt7530-0:02
150: 0 0 mt7530 3 Edge mt7530-0:03
151: 0 1 mt7530 4 Edge mt7530-0:04
IPI0: 1039635 2000396 Rescheduling interrupts
IPI1: 13525940 45954702 Function call interrupts
IPI2: 0 0 CPU stop interrupts
IPI3: 0 0 CPU stop (for crash dump) interrupts
IPI4: 0 0 Timer broadcast interrupts
IPI5: 1487744 8024278 IRQ work interrupts
IPI6: 0 0 CPU wake-up interrupts
Err: 0
I am using Belkin RT3200 as a router with PPPoE connection and with mostly WLAN clients, so wed is enabled. All variants I tried have some issues:
What remains to be tested by me is main mt76 + this patch which I hope will fix the timeout issue and not be affected by the 100% CPU usage.
@ggg70 Could you check this file but two times, like second time 10 seconds after first time? My idea was to compare numbers between calls, not just the static value.
rany2's mt76 -> 100% CPU after a while (several days)
what process was responsible for it? ksoftirqd or the tx worker?
what process was responsible for it? ksoftirqd or the tx worker?
ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.
ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.
@rany2 OK, after almost 4 days, back to 100% CPU by ksoftirqd using your mt76 version. Here are some numbers from interrupts if they tell you something:
10: 17854275 15025655 GICv2 30 Level arch_timer
10: 17860521 15031962 GICv2 30 Level arch_timer
131: 21776 0 MT_SYSIRQ 96 Level mtk-snand
131: 21783 0 MT_SYSIRQ 96 Level mtk-snand
132: 82566 0 MT_SYSIRQ 95 Level mtk-ecc
132: 82586 0 MT_SYSIRQ 95 Level mtk-ecc
134: 4144772 0 MT_SYSIRQ 211 Level mt7615e
134: 4145319 0 MT_SYSIRQ 211 Level mt7615e
139: 30191900 0 MT_SYSIRQ 214 Level mt7915e
139: 30197192 0 MT_SYSIRQ 214 Level mt7915e
142: 8407271 0 MT_SYSIRQ 224 Level 1b100000.ethernet
142: 8408206 0 MT_SYSIRQ 224 Level 1b100000.ethernet
143: 244 1344989 MT_SYSIRQ 225 Level 1b100000.ethernet
143: 244 1345620 MT_SYSIRQ 225 Level 1b100000.ethernet
IPI0: 479847 1067299 Rescheduling interrupts
IPI0: 481092 1069887 Rescheduling interrupts
Anything else I can do?
@ggg70 I have a similar issue, it's been occurring with the mainline kernel version as well. When an iperf3 test is carried out there seems to be an onslaught of IRQ activity at around that time (with softirq taking up multiple CPU cores). I think it's the same issue/theme.
When that happens TX takes a hit and instead of having >800Mbit/s it drops to 400Mbit and often worse. Reboot sometimes fixes and it gets to a nice 800+Mbit/s; however this happens immediately and not after many days.
Is this reproducible with latest HEAD?
@mrkiko Yep, seeing that it is not reproducible with TP-Link's WA firmware; it's probably firmware and not driver specific UNLESS the driver could take some steps to prevent this issue from occurring.
Thanks a lot for your reply. Now I remember you explained this to me so my question was kinda useless. Sorry.
This issue doesn't occur for me anymore, so I'm closing it.
I was able to reproduce it again on the latest master, it's still relevant.
@rany2 are you able to reproduce it with this firmware? https://github.com/cmonroe/feed-wifi-master/tree/smartrg-master/mt76/files/firmware
@lukasz1992 I'll try this firmware. Related but I haven't had any success with this patch https://github.com/openwrt/mt76/issues/690#issuecomment-1936970214... maybe I'm imagining things but with this applied it does happen less frequently but I was still able to repro regardless.
@lukasz1992 No luck, that firmware doesn't help.
Either way, it does seem like a firmware issue seeing that it seems like https://github.com/rany2/mt76/commit/07be1c7eadfc26d6f1a241607930f157c317202a still works as a workaround.
It seems like with the broadcast AQL patches, this issue is resolved. I can't trigger it anymore. Hopefully I don't reopen this again, but so far so good!
Issue found courtesy of @Brain2000.
This issue was brought up in #690 but I think it is worth keeping track off in its own issue as it appears there are many different things that could trigger #690.
In essence, the following is a requirement to trigger this issue:
In order to trigger this I will provide you with the following code from @Brain2000.
File
spam_multicast.py
:You could trigger it like so:
I've tried triggering this on MT76x2E, MT7610E and MT7628AN with no luck so I presume this is unique to mt7915e's firmware or kernel driver.
Notes:
All the best