openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688
749 stars 342 forks source link

mt7915e: AP mode: UDP flood kills driver/MCU communication #776

Closed rany2 closed 8 months ago

rany2 commented 1 year ago

Issue found courtesy of @Brain2000.

This issue was brought up in #690 but I think it is worth keeping track off in its own issue as it appears there are many different things that could trigger #690.

In essence, the following is a requirement to trigger this issue:

In order to trigger this I will provide you with the following code from @Brain2000.

File spam_multicast.py:

import socket

UDP_IP = "224.0.0.251"
UDP_PORT = 5353
DEST_PAIR = (UDP_IP, UDP_PORT)

TTL = 2
DATA = b"flajshdflkjashdflkjhasdlkfjhwlueiryluiashdfljhasljkdfhlkajsdhfl ashdfljkashdlfkjhaslkdjfhlaskdfhwhateverandever"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)

while True:
    sock.sendto(DATA, DEST_PAIR)

You could trigger it like so:

$ # Connect to the MT7915E AP
$ nmcli con up mt7915ap
$ # Run the script for a period of 30 seconds and immediately rfkill 
$ timeout 30 python3 ~/spam_multicast.py; rfkill block wifi
$ # Wait 5 seconds
$ sleep 5
$ # You could unblock your radio now

I've tried triggering this on MT76x2E, MT7610E and MT7628AN with no luck so I presume this is unique to mt7915e's firmware or kernel driver.

Notes:

All the best

littoy commented 1 year ago

@littoy Could you try bb517921433c2ba1235c98f68f94c36ff6600188?

test use this commit have no lucky, also 'timeout' in about 1hour.

I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce

rany2 commented 1 year ago

@ryderlee1110 This is another log prior to a crash, could this be an issue of the driver being out of sync with hw?

[ 8981.175817] ieee80211 phy0: WM: ( 375.863107:48:MQM-W)[WR] BSS usage overflows during removing entry
[ 8985.955163] ieee80211 phy0: WM: ( 380.641306:49:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 8986.956829] ieee80211 phy0: WM: ( 381.642740:50:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 8987.958631] ieee80211 phy0: WM: ( 382.644296:51:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 0
[ 9073.633155] ieee80211 phy0: WM: ( 468.298013:52:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 9074.634493] ieee80211 phy0: WM: ( 469.299112:53:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 1
[ 9075.636255] ieee80211 phy0: WM: ( 470.300638:54:TXP-E)rlmTxPwrThermalCompCb: eDbdcBandIdx = 1, i1ThermalCompVal = 0
[ 9165.284912] ieee80211 phy0: WM: ( 559.927530:55:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0
[ 9165.294330] ieee80211 phy0: WM: ( 559.936929:56:RA-E)set initRateDownMCS[4] old 1ss_m0, new 1ss_m0
[ 9165.303342] ieee80211 phy0: WM: ( 559.937082:57:TXC-E)heACtrlInitStaRec(): prOperMode=401438, u2OperMode=10!
[ 9165.313318] ieee80211 phy0: WM: ( 559.937113:58:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2
[ 9165.345661] ieee80211 phy0: WM: ( 559.988260:59:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
[ 9165.354837] ieee80211 phy0: WM: ( 559.988291:60:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9
[ 9165.363943] ieee80211 phy0: WM: ( 559.988321:61:BF-E)BFSetNdpRate() Invalid: 1, WlanIdx: 6, u1Mcs: 9
[ 9165.373213] ieee80211 phy0: WM: ( 559.988352:62:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 7, u1Mcs: 9
[ 9167.534626] ieee80211 phy0: WM: ( 562.176676:63:MQM-W)[WR] BSS usage overflows during removing entry
[ 9187.663866] mt7915e 0000:02:00.0: Message 000025ed (seq 12) timeout
[ 9208.143726] mt7915e 0000:02:00.0: Message 0000aded (seq 13) timeout
[ 9228.623577] mt7915e 0000:02:00.0: Message 00005aed (seq 14) timeout
[ 9249.103402] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout
[ 9269.583234] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout

Edit: logcat:

Thu May 25 20:44:51 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit                                                                          [554/14213]
Thu May 25 20:44:51 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 1.000000 below 6 limit                                                                                     
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 105 ack=1                                                                                                                        
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 105 04                                                                                                                                 
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 98 ack=1                                                                                                                         
Thu May 25 20:44:51 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 98 04                                                       
Thu May 25 20:45:01 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit     
Thu May 25 20:45:01 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit          
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 107 ack=1                                             
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 107 04                                                      
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 99 ack=1                                              
Thu May 25 20:45:06 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 99 04                                                       
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 109 ack=1                                             
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 109 04                                                      
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 100 ack=1                                             
Thu May 25 20:45:08 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 100 04                                                      
Thu May 25 20:45:09 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 111 ack=1                                             
Thu May 25 20:45:09 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 111 04                                                      
Thu May 25 20:45:10 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 101 ack=1                                             
Thu May 25 20:45:10 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 101 04                                                      
Thu May 25 20:45:11 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit          
Thu May 25 20:45:11 2023 daemon.info dawn: Client 48:55:19:5B:F2:16: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit          
Thu May 25 20:45:11 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 113 ack=1                                             
Thu May 25 20:45:11 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 113 04                                                      
Thu May 25 20:45:12 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 102 ack=1                                             
Thu May 25 20:45:12 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 102 04                                                      
Thu May 25 20:45:18 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed                                                             
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.284912] ieee80211 phy0: WM: ( 559.927530:55:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0      
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.294330] ieee80211 phy0: WM: ( 559.936929:56:RA-E)set initRateDownMCS[4] old 1ss_m0, new 1ss_m0      
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.303342] ieee80211 phy0: WM: ( 559.937082:57:TXC-E)heACtrlInitStaRec(): prOperMode=401438, u2OperMode=10!                                                                       
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.313318] ieee80211 phy0: WM: ( 559.937113:58:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2                                                               
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap5ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1                                                  
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 1)                                          
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap5ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft                                               
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.345661] ieee80211 phy0: WM: ( 559.988260:59:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9    
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.354837] ieee80211 phy0: WM: ( 559.988291:60:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 4, u1Mcs: 9    
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.363943] ieee80211 phy0: WM: ( 559.988321:61:BF-E)BFSetNdpRate() Invalid: 1, WlanIdx: 6, u1Mcs: 9    
Thu May 25 20:45:18 2023 kern.info kernel: [ 9165.373213] ieee80211 phy0: WM: ( 559.988352:62:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 7, u1Mcs: 9    
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap2ghz: Prune association for 04:e5:98:de:e9:f2                                                      
Thu May 25 20:45:18 2023 daemon.notice hostapd: eap2ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                        
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session 345B1D0219B51657                     
Thu May 25 20:45:18 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                  
Thu May 25 20:45:21 2023 kern.info kernel: [ 9167.534626] ieee80211 phy0: WM: ( 562.176676:63:MQM-W)[WR] BSS usage overflows during removing entry                                               
Thu May 25 20:45:41 2023 kern.err kernel: [ 9187.663866] mt7915e 0000:02:00.0: Message 000025ed (seq 12) timeout                                                                                 
Thu May 25 20:46:01 2023 kern.err kernel: [ 9208.143726] mt7915e 0000:02:00.0: Message 0000aded (seq 13) timeout                                                                                 
Thu May 25 20:46:22 2023 kern.err kernel: [ 9228.623577] mt7915e 0000:02:00.0: Message 00005aed (seq 14) timeout                                                                                 
Thu May 25 20:46:42 2023 kern.err kernel: [ 9249.103402] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout                                                                                 
Thu May 25 20:47:03 2023 kern.err kernel: [ 9269.583234] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout                                                                                  
Thu May 25 20:47:23 2023 kern.err kernel: [ 9290.063058] mt7915e 0000:02:00.0: Message 0000aded (seq 2) timeout                                                                                  
Thu May 25 20:47:44 2023 kern.err kernel: [ 9310.542887] mt7915e 0000:02:00.0: Message 00005aed (seq 3) timeout                                                                                                                  Thu May 25 20:48:04 2023 kern.err kernel: [ 9331.022721] mt7915e 0000:02:00.0: Message 00005aed (seq 4) timeout                                                                                                                  
Thu May 25 20:48:25 2023 kern.err kernel: [ 9351.502556] mt7915e 0000:02:00.0: Message 00005aed (seq 5) timeout                                                                                                                  
rany2 commented 1 year ago

Just a recap:

rany2 commented 1 year ago

@littoy Could you try bb517921433c2ba1235c98f68f94c36ff6600188?

test use this commit have no lucky, also 'timeout' in about 1hour.

I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce

Which repo? (cannot find this commit)

littoy commented 1 year ago

@littoy Could you try bb517921433c2ba1235c98f68f94c36ff6600188?

test use this commit have no lucky, also 'timeout' in about 1hour. I have test an old version at 22/05 , that's fine for me: PKG_SOURCE_DATE:=2023-05-22 PKG_SOURCE_VERSION:=7cb022d0e0c70007c13225be2b556c48a53accce

Which repo? (cannot find this commit)

your repo at 22/05, commit hash was overwrited looks like.

littoy commented 1 year ago

Found some possible rules, if the network interface configuration is frequently modified in openwrt, timeout is prone to occur. There seems to be an easier way to reproduce the problem I encountered. /etc/init.d/network restart/reload multiple times can be reproduced.

rany2 commented 1 year ago

@littoy I think that's a different issue caused by AP being stopped in an improper way. It was fixed in mt7921, I'll make a patch to see if it is fixed by implementing something similar in mt7915.

rany2 commented 1 year ago

@littoy On my device I cannot have it fail by spamming "/etc/init.d/network restart", can you try 0f62587b7f0dc9c165823400e7ef736f4fe4a803?

littoy commented 1 year ago

I have no idea what happend, test result:


Fri May 26 20:09:44 2023 daemon.info samba4-server: io_uring module found, enabling VFS io_uring. (also needs Kernel 5.4+ Support)
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb IEEE 802.11: authenticated
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb IEEE 802.11: associated (aid 1)
Fri May 26 20:09:59 2023 daemon.notice hostapd: wlan1: AP-STA-CONNECTED f4:d4:88:61:7e:fb auth_alg=open
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb RADIUS: starting accounting session 006DFC5B6CB176BF
Fri May 26 20:09:59 2023 daemon.info hostapd: wlan1: STA f4:d4:88:61:7e:fb WPA: pairwise key handshake completed (RSN)
Fri May 26 20:09:59 2023 daemon.notice hostapd: wlan1: EAPOL-4WAY-HS-COMPLETED f4:d4:88:61:7e:fb
Fri May 26 20:13:35 2023 authpriv.info dropbear[31288]: Child connection from 192.168.127.214:57784
Fri May 26 20:13:35 2023 authpriv.notice dropbear[31288]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:57784
Fri May 26 20:13:52 2023 authpriv.info dropbear[31288]: Exit (root) from <192.168.127.214:57784>: Disconnect received
Fri May 26 20:14:25 2023 authpriv.info dropbear[31769]: Child connection from 192.168.127.214:57877
Fri May 26 20:14:25 2023 authpriv.notice dropbear[31769]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:57877
Fri May 26 20:14:28 2023 authpriv.info dropbear[31769]: Exit (root) from <192.168.127.214:57877>: Disconnect received
Fri May 26 20:27:39 2023 authpriv.info dropbear[6459]: Child connection from 192.168.127.214:60580
Fri May 26 20:27:39 2023 authpriv.notice dropbear[6459]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.127.214:60580
Fri May 26 20:27:41 2023 authpriv.info dropbear[6459]: Exit (root) from <192.168.127.214:60580>: Disconnect received
Fri May 26 20:29:45 2023 daemon.info : 12[KNL] interface utun deleted
Fri May 26 20:29:45 2023 daemon.info : 05[KNL] interface utun deleted
Fri May 26 20:29:53 2023 daemon.info : 05[KNL] interface utun deleted
Fri May 26 20:29:53 2023 daemon.info : 07[KNL] interface utun deleted
Fri May 26 20:35:19 2023 kern.err kernel: [ 2134.403004] mt7915e 0000:04:00.0: Message 00005aed (seq 5) timeout
Fri May 26 20:35:39 2023 kern.err kernel: [ 2154.883023] mt7915e 0000:04:00.0: Message 00005aed (seq 6) timeout
Fri May 26 20:36:00 2023 kern.err kernel: [ 2175.363345] mt7915e 0000:04:00.0: Message 00005aed (seq 7) timeout
Fri May 26 20:36:20 2023 kern.err kernel: [ 2195.843122] mt7915e 0000:04:00.0: Message 00005aed (seq 8) timeout
Fri May 26 20:36:41 2023 kern.err kernel: [ 2216.323164] mt7915e 0000:04:00.0: Message 00005aed (seq 9) timeout
Fri May 26 20:37:01 2023 kern.err kernel: [ 2236.803375] mt7915e 0000:04:00.0: Message 00005aed (seq 10) timeout
Fri May 26 20:37:01 2023 authpriv.info dropbear[15618]: Child connection from 192.168.125.32:63664
Fri May 26 20:37:01 2023 authpriv.notice dropbear[15618]: Pubkey auth succeeded for 'root' with ssh-rsa key SHA256:37RoS1zkDfuYSTf+3aQTaROzT0I6Y68UsJ0ccGWnfio from 192.168.125.32:63664
Fri May 26 20:37:22 2023 kern.err kernel: [ 2257.283191] mt7915e 0000:04:00.0: Message 000026ed (seq 11) timeout
Fri May 26 20:37:42 2023 kern.err kernel: [ 2277.763267] mt7915e 0000:04:00.0: Message 00005aed (seq 12) timeout
Fri May 26 20:38:01 2023 authpriv.info dropbear[15618]: Exit (root) from <192.168.125.32:63664>: Disconnect received
Fri May 26 20:38:03 2023 kern.err kernel: [ 2298.243353] mt7915e 0000:04:00.0: Message 00005aed (seq 13) timeout
littoy commented 1 year ago

When I upgrade to mt76 from your repo 1b65b9833e957cd71f557b1cfb0e5d40231c3fb3 and sync mac80211+hostapd from openwrt master branch, now looks work stable.

ggg70 commented 1 year ago

I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.

rany2 commented 1 year ago

When I upgrade to mt76 from your repo 1b65b9833e957cd71f557b1cfb0e5d40231c3fb3 and sync mac80211+hostapd from openwrt master branch, now looks work stable.

Same here, been running fine for more than 12 hours now but I was holding off sending a comment here until it lasts for a day. Happy to know you have similar results..

lukasz1992 commented 1 year ago

Fantastic news, thank you rany2 very much for your work!

I wonder if everything is still fine with the most recent firmware, but not have the device yet to test it.

rany2 commented 1 year ago

@littoy Just to confirm, you're not having any crashes anymore with your MT7916?

littoy commented 1 year ago

@littoy Just to confirm, you're not having any crashes anymore with your MT7916?

yes, without any problem since last 20hours.

ggg70 commented 1 year ago

I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.

This is what happens now on the router: image

image Never saw this before. I think I'll revert to the original mt76 code + 20230315163130 WA firmware.

lukasz1992 commented 1 year ago

@ggg70 if you could try again, then observing /proc/interrupts would be useful to help troubleshoot. Could you also share your config/scenario?

rany2 commented 1 year ago

I am also testing new builds using rany's mt76 code + Openwrt master on a Belkin RT3200, hoping for better stability (especially against those "Timeout waiting for MCU reset state" problems). However I noticed that wed is not working because of this commit so I reverted it.

This is what happens now on the router: image

image Never saw this before. I think I'll revert to the original mt76 code + 20230315163130 WA firmware.

That's normal, if throughput and CPU consumption is correlated I wouldn't worry. Its especially an issue if you have a PPP connection (unrelated to mt76)

ggg70 commented 1 year ago

That's normal, if throughput and CPU consumption is correlated I wouldn't worry. Its especially an issue if you have a PPP connection (unrelated to mt76)

It is not correlated, I wasn't performing heavy transfers with the router during those intervals. With wed enabled I usually see 30-40% CPU utilization for downloads and 60-70% for uploads at around 800Mbps (with PPPoE). I'll keep you informed how's working after using mt76 with just the "good" firmware and some minor modifications.

rany2 commented 1 year ago

@ryderlee1110 Just heads up, I think the commits that fixed these issues were:

ryderlee1110 commented 1 year ago

I think you can post this to linux-wireless https://github.com/rany2/mt76/commit/a1742bcfe3e6770370240ba9309d1fb2661d28ad BTW could you please closed tickets that have been resolved?

rany2 commented 1 year ago

@ryderlee1110 I have only opened this issue and don't have any privileges to close anything else. I do agree the issue tracker needs some cleanup. I'll submit that patch later today, thanks for feedback.

P.S do you think this issue should be closed as it is caused by WA firmware and not mt76 itself?

Brain2000 commented 1 year ago

@rany It appears this did fix the crash!!! I tried about 10 times without any issues.

On Wed, May 24, 2023, 5:35 PM rany @.***> wrote:

@Brain2000 https://github.com/Brain2000 I think the WA fw is what fixed it, but I updated both WA fw and ROM patch just in case. WM fw could not be used due to (what I think are) vendor modifications making it incompatible with mt76. However so far it's been smooth sailing. I undid all the patches I applied from mtk-openwrt-feeds and it seems to be going well so far.

While this is not confirmed, there is a chance that the only reason WA fw from TP-Link's EAP->>some numbers here<< work is because they patched it and fixed this issue themselves, which is a possibility. I noticed that the debugging info in the fw show tp-link specific paths so they did build that firmware themselves, and possibly made changes that resolved the issue.

@littoy https://github.com/littoy If you're using my mt76 repo, try updating to 1fdd73655aadfe85e2d145e259e814ccaea75dd1. Having solid uptime so far on it (doesn't crash in under an hour with many clients like it used to before).

— Reply to this email directly, view it on GitHub https://github.com/openwrt/mt76/issues/776#issuecomment-1561956079, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABIAWFVVJYZXPJJ6LPORKRLXHZ5LDANCNFSM6AAAAAAXTQZTDE . You are receiving this because you were mentioned.Message ID: @.***>

rany2 commented 1 year ago

@Brain2000 Fantastic, happy to have confirmation! Though it's actually a combination of two things as previously stated:

If you use the version in my repo you should have these issues fixed with no crashes. Haven't had an issue since!

ggg70 commented 1 year ago

@rany2 I'm asking just to be sure: the cause for this issue is the GET_MIB_INFO method of getting survey data? But I think this method was introduced in mt76 long time ago, why did the timeout problem appear only recently? My OpenWrt build using "official" mt76 + your WA firmware did not fix the timeout problem, so in other words should I also apply this in order to fix it? Thanks.

rany2 commented 1 year ago

@ggg70 yes, the WA firmware is only to prevent the out of resource issue when UDP flood occurs.

ggg70 commented 1 year ago

@ggg70 yes, the WA firmware is only to prevent the out of resource issue when UDP flood occurs.

Sorry, but I am still confused. From your description: "The vendor's mt7915_wm could not be used due to an incompatibility with mt76, however it might be possible to make use of it by reverting changes that made use of GET_MIB_INFO" I infer that this patch is necessary only if you also use the WM firmware from TP-Link, so it would not be necessary for other versions (i. e. 2023041815). So I don't see the connection with that timeout problem... unless this is specifically solved by WM firmware from TP-Link and in this case I am forced to use the GET_MIB_INFO reverting patch.

rany2 commented 1 year ago

@ggg70 The commit msg is out of date, initial motivation for that commit was to get it to work with the vendor firmware but as it turns out it ended up fixing the crashes on its own.

tl;dr I fixed the issue by accident while trying to get it work with vendor firmware, getting survey data with that command was the cause and there was no need to switch to vendor WM fw.

ggg70 commented 1 year ago

@ggg70 if you could try again, then observing /proc/interrupts would be useful to help troubleshoot. Could you also share your config/scenario?

@lukasz1992 After almost 4 days of working fine, the problem with 100% CPU usage appeared again. Here is the interrupts file:

           CPU0       CPU1       
 10:   14125046   18713122     GICv2  30 Level     arch_timer
 15:          1          0  MT_SYSIRQ 163 Level     mt-pmic-pwrap
 22:          0          0   mt-eint   0 Edge      gpio-keys
 75:          1          0   mt-eint  53 Level     mt7530
124:          0          0   mt-eint 102 Edge      gpio-keys
125:         12          0  MT_SYSIRQ  91 Level     ttyS0
128:          0          0  MT_SYSIRQ 118 Level     1100a000.spi
131:      22145          0  MT_SYSIRQ  96 Level     mtk-snand
132:      84039          0  MT_SYSIRQ  95 Level     mtk-ecc
133:          0          0  MT_SYSIRQ 122 Level     11016000.spi
134:    4144758          0  MT_SYSIRQ 211 Level     mt7615e
135:          0          0  MT_SYSIRQ 232 Level     xhci-hcd:usb1
138:          0          0  MT_SYSIRQ 219 Level     1b007000.dma-controller
139:   40526140          0  MT_SYSIRQ 214 Level     mt7915e
142:   13794927          0  MT_SYSIRQ 224 Level     1b100000.ethernet
143:        290    1852388  MT_SYSIRQ 225 Level     1b100000.ethernet
146:          0          0     dummy   0 Edge      PCIe PME
147:          0          0    mt7530   0 Edge      mt7530-0:00
148:          0          0    mt7530   1 Edge      mt7530-0:01
149:          0          0    mt7530   2 Edge      mt7530-0:02
150:          0          0    mt7530   3 Edge      mt7530-0:03
151:          0          1    mt7530   4 Edge      mt7530-0:04
IPI0:   1039635    2000396       Rescheduling interrupts
IPI1:  13525940   45954702       Function call interrupts
IPI2:         0          0       CPU stop interrupts
IPI3:         0          0       CPU stop (for crash dump) interrupts
IPI4:         0          0       Timer broadcast interrupts
IPI5:   1487744    8024278       IRQ work interrupts
IPI6:         0          0       CPU wake-up interrupts
Err:          0

I am using Belkin RT3200 as a router with PPPoE connection and with mostly WLAN clients, so wed is enabled. All variants I tried have some issues:

What remains to be tested by me is main mt76 + this patch which I hope will fix the timeout issue and not be affected by the 100% CPU usage.

lukasz1992 commented 1 year ago

@ggg70 Could you check this file but two times, like second time 10 seconds after first time? My idea was to compare numbers between calls, not just the static value.

rany2 commented 1 year ago

rany2's mt76 -> 100% CPU after a while (several days)

what process was responsible for it? ksoftirqd or the tx worker?

ggg70 commented 1 year ago

what process was responsible for it? ksoftirqd or the tx worker?

ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.

ggg70 commented 1 year ago

ksoftirqd. I will try again with your latest mt76 version to see if it still happens... but we might have to wait a few days.

@rany2 OK, after almost 4 days, back to 100% CPU by ksoftirqd using your mt76 version. Here are some numbers from interrupts if they tell you something:

 10:   17854275   15025655     GICv2  30 Level     arch_timer
 10:   17860521   15031962     GICv2  30 Level     arch_timer

131:      21776          0  MT_SYSIRQ  96 Level     mtk-snand
131:      21783          0  MT_SYSIRQ  96 Level     mtk-snand

132:      82566          0  MT_SYSIRQ  95 Level     mtk-ecc
132:      82586          0  MT_SYSIRQ  95 Level     mtk-ecc

134:    4144772          0  MT_SYSIRQ 211 Level     mt7615e
134:    4145319          0  MT_SYSIRQ 211 Level     mt7615e

139:   30191900          0  MT_SYSIRQ 214 Level     mt7915e
139:   30197192          0  MT_SYSIRQ 214 Level     mt7915e

142:    8407271          0  MT_SYSIRQ 224 Level     1b100000.ethernet
142:    8408206          0  MT_SYSIRQ 224 Level     1b100000.ethernet

143:        244    1344989  MT_SYSIRQ 225 Level     1b100000.ethernet
143:        244    1345620  MT_SYSIRQ 225 Level     1b100000.ethernet

IPI0:    479847    1067299       Rescheduling interrupts
IPI0:    481092    1069887       Rescheduling interrupts

Anything else I can do?

rany2 commented 1 year ago

@ggg70 I have a similar issue, it's been occurring with the mainline kernel version as well. When an iperf3 test is carried out there seems to be an onslaught of IRQ activity at around that time (with softirq taking up multiple CPU cores). I think it's the same issue/theme.

When that happens TX takes a hit and instead of having >800Mbit/s it drops to 400Mbit and often worse. Reboot sometimes fixes and it gets to a nice 800+Mbit/s; however this happens immediately and not after many days.

mrkiko commented 1 year ago

Is this reproducible with latest HEAD?

rany2 commented 1 year ago

@mrkiko Yep, seeing that it is not reproducible with TP-Link's WA firmware; it's probably firmware and not driver specific UNLESS the driver could take some steps to prevent this issue from occurring.

mrkiko commented 1 year ago

Thanks a lot for your reply. Now I remember you explained this to me so my question was kinda useless. Sorry.

rany2 commented 9 months ago

This issue doesn't occur for me anymore, so I'm closing it.

rany2 commented 8 months ago

I was able to reproduce it again on the latest master, it's still relevant.

lukasz1992 commented 8 months ago

@rany2 are you able to reproduce it with this firmware? https://github.com/cmonroe/feed-wifi-master/tree/smartrg-master/mt76/files/firmware

rany2 commented 8 months ago

@lukasz1992 I'll try this firmware. Related but I haven't had any success with this patch https://github.com/openwrt/mt76/issues/690#issuecomment-1936970214... maybe I'm imagining things but with this applied it does happen less frequently but I was still able to repro regardless.

rany2 commented 8 months ago

@lukasz1992 No luck, that firmware doesn't help.

rany2 commented 8 months ago

Either way, it does seem like a firmware issue seeing that it seems like https://github.com/rany2/mt76/commit/07be1c7eadfc26d6f1a241607930f157c317202a still works as a workaround.

rany2 commented 8 months ago

It seems like with the broadcast AQL patches, this issue is resolved. I can't trigger it anymore. Hopefully I don't reopen this again, but so far so good!