openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7915e: AP mode: UDP flood kills driver/MCU communication #776

Closed rany2 closed 8 months ago

rany2 commented 1 year ago

Issue found courtesy of @Brain2000.

This issue was brought up in #690, but I think it is worth tracking in its own issue, as it appears that many different things could trigger #690.

In essence, the following is a requirement to trigger this issue:

In order to trigger this I will provide you with the following code from @Brain2000.

File spam_multicast.py:

import socket

UDP_IP = "224.0.0.251"
UDP_PORT = 5353
DEST_PAIR = (UDP_IP, UDP_PORT)

TTL = 2
DATA = b"flajshdflkjashdflkjhasdlkfjhwlueiryluiashdfljhasljkdfhlkajsdhfl ashdfljkashdlfkjhaslkdjfhlaskdfhwhateverandever"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, TTL)

while True:
    sock.sendto(DATA, DEST_PAIR)

You could trigger it like so:

$ # Connect to the MT7915E AP
$ nmcli con up mt7915ap
$ # Run the script for a period of 30 seconds and immediately rfkill 
$ timeout 30 python3 ~/spam_multicast.py; rfkill block wifi
$ # Wait 5 seconds
$ sleep 5
$ # You could unblock your radio now
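For anyone who prefers not to rely on `timeout`, here is a minimal sketch of the same flood bounded by a duration inside Python. The function name `flood_multicast`, its parameters, and the placeholder payload are assumptions for illustration; only the destination, port, and TTL come from the script above.

```python
import socket
import time


def flood_multicast(dest=("224.0.0.251", 5353), duration_s=30.0, ttl=2,
                    payload=b"x" * 110):
    """Send UDP datagrams to dest in a tight loop for duration_s seconds.

    Returns the number of datagrams handed to the kernel. The payload is a
    placeholder; the original script used a fixed junk string of similar size.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sent = 0
    deadline = time.monotonic() + duration_s
    try:
        while time.monotonic() < deadline:
            sock.sendto(payload, dest)
            sent += 1
    finally:
        sock.close()
    return sent
```

Calling `flood_multicast()` and then running `rfkill block wifi` right after it returns should mirror the shell sequence above; the return value gives a rough idea of how many packets the client pushed during the run.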

I've tried triggering this on MT76x2E, MT7610E and MT7628AN with no luck so I presume this is unique to mt7915e's firmware or kernel driver.

Notes:

All the best

rany2 commented 1 year ago

EDIT: wrong, did not fix the issue.

Note: I meant to say "AMPDU", not "AMSDU". I disabled it by applying the following patch:

diff --git a/mt7915/main.c b/mt7915/main.c
index 8ce7b1c5..98e37afa 100644
--- a/mt7915/main.c
+++ b/mt7915/main.c
@@ -777,6 +777,8 @@ mt7915_ampdu_action(struct ieee80211_hw *hw, struct ieee80211_vif *vif,
        struct mt76_txq *mtxq;
        int ret = 0;

+       return -EOPNOTSUPP;
+
        if (!txq)
                return -EINVAL;

rany2 commented 1 year ago

Note: I meant to say "AMPDU", not "AMSDU". I disabled it by applying the following patch:

I would like to make an important update: it turns out there is no relation. This is not the issue.

rany2 commented 1 year ago

Note: I meant to say "AMPDU", not "AMSDU". I disabled it by applying the following patch:

I would like to make an important update: it turns out there is no relation. This is not the issue.

Well, perhaps the connection between this issue and AMPDU is simply the degraded link speed, which makes this issue occur less often; obviously, disabling AMPDU gives you slower speeds. It might just be a memory corruption issue, where corruption builds up over a certain amount of time and leads to this. I legitimately don't know, but I think it is likely, seeing how the behavior after this issue occurs varies tremendously. (Sometimes it recovers, sometimes not, sometimes it doesn't recover automatically, sometimes setting some value to sys_recovery fixes it, etc.)

Edit: I should mention by "it recovers" I mean L1 SER kicks in.

rany2 commented 1 year ago

Also important to note something I mentioned on another issue thread:

This issue also occurs when running that spam_multicast.py script from the ethernet switch itself; HOWEVER, it ends up doing an L1 recovery every time. So perhaps there is some corruption in the RX buffers that over time takes its toll on the system?

Well, perhaps the connection between this issue and AMPDU is simply the degraded link speed, which makes this issue occur less often; obviously, disabling AMPDU gives you slower speeds. It might just be a memory corruption issue, where corruption builds up over a certain amount of time and leads to this. I legitimately don't know, but I think it is likely, seeing how the behavior after this issue occurs varies tremendously. (Sometimes it recovers, sometimes not, sometimes it doesn't recover automatically, sometimes setting some value to sys_recovery fixes it, etc.)

Edit: I should mention by "it recovers" I mean L1 SER kicks in.

rany2 commented 1 year ago

Apologies for the constant corrections, but I tried once more to get it to crash from the ethernet switch and this time around it was not so lucky and was unable to recover itself. This whole ordeal is really inconsistent.

rany2 commented 1 year ago

@ryderlee1110 This node had L1 SER kicking in whenever it was crashing under normal circumstances (and was working OK) but when I ran the spam_multicast script, it tried to recover itself but ended up failing to do so.

Starting from 42764.738366 I intentionally tried to crash it with that script, however it seems like a recovery was triggered but with no impact:

[21428.819377] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[21428.825916] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[21428.832581] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[21428.859071] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[21428.865835] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[21428.875986] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[21494.763591] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[21494.770172] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[21494.776999] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[21494.803319] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[21494.809978] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[21494.820620] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[42764.731797] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000004
[42764.738366] mt7915e 0000:02:00.0: phy0 L1 SER recovery start.
[42764.770530] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000008
[42764.823208] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000010
[42764.829861] mt7915e 0000:02:00.0: phy0 SER recovery state: 0x00000020
[42764.911799] mt7915e 0000:02:00.0: phy0 L1 SER recovery completed.
[42771.059616] mt7915e 0000:02:00.0: Message 00005aed (seq 15) timeout
[42774.099611] mt7915e 0000:02:00.0: Message 00005aed (seq 1) timeout
[42777.139574] mt7915e 0000:02:00.0: Message 00005aed (seq 2) timeout
[42780.179584] mt7915e 0000:02:00.0: Message 00005aed (seq 3) timeout
[42783.219522] mt7915e 0000:02:00.0: Message 00005aed (seq 4) timeout
[42786.259499] mt7915e 0000:02:00.0: Message 00005aed (seq 5) timeout
[42789.299507] mt7915e 0000:02:00.0: Message 00005aed (seq 6) timeout
[42792.349449] mt7915e 0000:02:00.0: Message 00005aed (seq 7) timeout
[42795.379422] mt7915e 0000:02:00.0: Message 00005aed (seq 8) timeout
[42798.419397] mt7915e 0000:02:00.0: Message 000025ed (seq 9) timeout
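To make logs like the one above easier to compare between runs, here is a minimal sketch that pulls the L1 SER recovery windows and the MCU message timeouts out of dmesg text. The function name `summarize_dmesg` and the returned dictionary layout are assumptions; the patterns only match the message wording quoted in this thread and may need adjusting for other driver versions.

```python
import re

# Patterns are written against the dmesg lines quoted in this thread.
TS = r"\[\s*(\d+\.\d+)\]"
SER_START = re.compile(TS + r".*L1 SER recovery start")
SER_DONE = re.compile(TS + r".*L1 SER recovery completed")
MCU_TIMEOUT = re.compile(TS + r".*Message ([0-9a-f]{8}) \(seq (\d+)\) timeout")


def summarize_dmesg(text):
    """Return recovery windows and MCU timeouts found in dmesg text."""
    recoveries = []  # (start timestamp, completed timestamp)
    timeouts = []    # (timestamp, message id, sequence number)
    start = None
    for line in text.splitlines():
        m = SER_START.search(line)
        if m:
            start = float(m.group(1))
            continue
        m = SER_DONE.search(line)
        if m and start is not None:
            recoveries.append((start, float(m.group(1))))
            start = None
            continue
        m = MCU_TIMEOUT.search(line)
        if m:
            timeouts.append((float(m.group(1)), m.group(2), int(m.group(3))))
    # Timeouts logged after the last completed recovery suggest it did not help.
    last_done = recoveries[-1][1] if recoveries else None
    after = [t for t in timeouts if last_done is not None and t[0] > last_done]
    return {
        "recoveries": recoveries,
        "timeouts": timeouts,
        "timeouts_after_last_recovery": after,
    }
```

Feeding it the output of `dmesg` and looking at `timeouts_after_last_recovery` gives a quick way to tell a recovery that "completed" but had no effect, as in the third recovery above, from one that actually restored MCU communication.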

rany2 commented 1 year ago

@ryderlee1110 I think there is an issue with SER with respect to how MT_PCIE1_MAC_INT_ENABLE is used unconditionally.

This didn't fix my issue, but I noticed that in mac_restart the driver isn't checking whether mdev is mt7915, causing it to use MT_PCIE1_MAC_INT_ENABLE when it should have used MT_PCIE1_MAC_INT_ENABLE_MT7916. At least, this is the logic in pci.c.

If that is unclear, see this patch I made, which fixes what I think is a possible issue: https://github.com/rany2/mt76/commit/7cb022d0e0c70007c13225be2b556c48a53accce.

Something else I noticed is that after I do a full chip recovery by writing 7 to sys_recovery, hif no longer receives any interrupts. So I thought the above would fix it but apparently not. Regardless I think it might be worth checking to see if it was intended or if this is indeed a bug.

rany2 commented 1 year ago

Something else I noticed is that after I do a full chip recovery by writing 7 to sys_recovery, hif no longer receives any interrupts. So I thought the above would fix it but apparently not. Regardless I think it might be worth checking to see if it was intended or if this is indeed a bug.

Should have mentioned that writing 7 to sys_recovery until memory consumption goes back to normal and then rmmod and insmod mt7915e fixes it.
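The workaround described here can be sketched as a small script. This is only an illustration under loud assumptions: "memory consumption goes back to normal" is approximated by `MemAvailable` in `/proc/meminfo` reaching a threshold you choose, the `phy0` debugfs path is the one shown elsewhere in this thread, and the function names, retry count, and settle time are all hypothetical.

```python
import subprocess
import time

# debugfs path as it appears in this thread; the phy index may differ.
SYS_RECOVERY = "/sys/kernel/debug/ieee80211/phy0/mt76/sys_recovery"


def mem_available_kb(meminfo_path="/proc/meminfo"):
    """Read MemAvailable (in kB) from a meminfo-style file."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found")


def recover(min_available_kb, sys_recovery=SYS_RECOVERY,
            meminfo_path="/proc/meminfo", max_attempts=5, settle_s=10.0,
            run=subprocess.run):
    """Write 7 to sys_recovery until memory looks normal, then reload mt7915e.

    Returns True if the module was reloaded, False if memory never recovered.
    """
    for _ in range(max_attempts):
        with open(sys_recovery, "w") as f:
            f.write("7")  # 7: trigger system error full recovery
        time.sleep(settle_s)
        if mem_available_kb(meminfo_path) >= min_available_kb:
            break
    else:
        return False
    # Commands as named in the comment above; insmod by module name is
    # assumed to work as it does on OpenWrt.
    run(["rmmod", "mt7915e"], check=True)
    run(["insmod", "mt7915e"], check=True)
    return True
```

The `run` parameter exists only so the reload step can be swapped out when trying the logic off-device; on the router it would be left at its default.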

rany2 commented 1 year ago

Updated the title to reflect the fact that it doesn't matter whether the flood originated from a STA or not: if there is a flood of any kind and the AP has to transmit that traffic to clients, it will die and this will be spammed to dmesg (provided fw wm debug is enabled):

[  471.558903] ieee80211 phy2: WM: ( 116.141511:98:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.568533] ieee80211 phy2: WM: ( 116.141541:99:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.764205] ieee80211 phy2: WM: ( 116.346711:00:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.773806] ieee80211 phy2: WM: ( 116.346772:01:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.860779] ieee80211 phy2: WM: ( 116.443299:02:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  471.947629] ieee80211 phy2: WM: ( 116.530121:03:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  472.066295] ieee80211 phy2: WM: ( 116.648743:04:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)
[  472.075883] ieee80211 phy2: WM: ( 116.648774:05:MEM-W)whFbmWfdmaCmdEventXmit: xmit fail (out of resource)

rany2 commented 1 year ago

Just a heads up, I no longer face this issue when using this WA firmware I sourced from the EAP615 WALL source dump: https://github.com/rany2/mt76/commit/78fd50c3a92bb0649548a933181fd18e2b4f76ce

ryderlee1110 commented 1 year ago

Which version did you downgrade from?

rany2 commented 1 year ago

These firmwares I tried had this issue:

However it still has the random MCU timeout issue like before, just that now UDP flood doesn't kill it and it recovers.

lukasz1992 commented 1 year ago

@rany2 So:

rany2 commented 1 year ago

@lukasz1992 No, it only recovers itself in this specific case (UDP flooding). I still have random driver hangs/MCU timeout.

lukasz1992 commented 1 year ago

:( many thanks anyway

rany2 commented 1 year ago

Looking at another source dump, I noticed that there seems to be a distinction between three variants of the same firmware with MT7915D working fine only in fixing this issue; however OpenWRT/linux-firmware don't make this distinction. Could this be the issue?

littoy commented 1 year ago

Just a heads up, I no longer face this issue when using this WA firmware I sourced from the EAP615 WALL source dump: rany2/mt76@78fd50c

@rany2 Hi, tools and scripts are missing in your source. Is this by design?

rany2 commented 1 year ago

@littoy It's missing because my mt76 is based on wireless-next which has some changes compared to OWRT variant.

rany2 commented 1 year ago

@littoy on your openwrt source tree, you need to make the following changes: https://github.com/rany2/openwrt/commit/f52d443e32897adef6e62a1ab12f4e8a81bafc04

just change PKG_SOURCE_VERSION to dee319231825423a9ac5135591a781e1e398267f

littoy commented 1 year ago

@littoy on your openwrt source tree, you need to make the following changes: rany2/openwrt@f52d443

just change PKG_SOURCE_VERSION to e3578d0be0984451eb4c80dafa5906a969dd4fae

👌, thank you very much.

rany2 commented 1 year ago

@ryderlee1110 hang is with MCUWA now, and the UDP flood is no longer relevant:

root@router:/sys/kernel/debug/ieee80211/phy0/mt76# cat xmit-queues 
     queue | hw-queued |      head |      tail |
      MAIN |         0 |       837 |       837 |
     MCUWM |         0 |        17 |        17 |
     MCUWA |        94 |       122 |        28 |
   MCUFWDL |         0 |       121 |       121 |
root@router:/sys/kernel/debug/ieee80211/phy0/mt76# cat sys_recovery 
Please echo the correct value ...
0: grab firmware transient SER state
1: trigger system error L1 recovery
2: trigger system error L2 recovery
3: trigger system error L3 rx abort
4: trigger system error L3 tx abort
5: trigger system error L3 tx disable
6: trigger system error L3 bf recovery
7: trigger system error full recovery
8: trigger firmware crash

let's dump firmware SER statistics...
::E  R , SER_STATUS        = 0x00000000
::E  R , SER_PLE_ERR       = 0x00000000
::E  R , SER_PLE_ERR_1     = 0x00000000
::E  R , SER_PLE_ERR_AMSDU = 0x00000000
::E  R , SER_PSE_ERR       = 0x00000000
::E  R , SER_PSE_ERR_1     = 0x00000000
::E  R , SER_LMAC_WISR6_B0 = 0x00000000
::E  R , SER_LMAC_WISR6_B1 = 0x00000000
::E  R , SER_LMAC_WISR7_B0 = 0x00000000
::E  R , SER_LMAC_WISR7_B1 = 0x00000000

SYS_RESET_COUNT: WM 0, WA 0
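For spotting a stuck queue in the `xmit-queues` dump without reading the table by eye, a minimal sketch follows. The function names are hypothetical, and treating "hw-queued above zero or head not equal to tail" as a backlog is an assumption drawn from the MCUWA row above, not from driver documentation.

```python
def parse_xmit_queues(text):
    """Parse the debugfs xmit-queues table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows look like: name | hw-queued | head | tail |
        # The header row and shell prompts fail the isdigit() check.
        if len(cells) < 4 or not cells[1].isdigit():
            continue
        rows.append({
            "queue": cells[0],
            "hw_queued": int(cells[1]),
            "head": int(cells[2]),
            "tail": int(cells[3]),
        })
    return rows


def backlogged(rows):
    """Names of queues that still hold entries (assumed backlog heuristic)."""
    return [r["queue"] for r in rows
            if r["hw_queued"] > 0 or r["head"] != r["tail"]]
```

Run against the dump above, the only queue this flags is MCUWA, which matches the observation that the hang is now on the WA side.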

rany2 commented 1 year ago

If I write 7 to sys_recovery, I get this after about a minute:

[ 4787.831468] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4787.837063] rcu:     0-....: (5999 ticks this GP) idle=767/1/0x40000002 softirq=140248/140248 fqs=2999 
[ 4787.846166]  (t=6000 jiffies g=215273 q=2055)
[ 4787.850509] NMI backtrace for cpu 0
[ 4787.853980] CPU: 0 PID: 694 Comm: napi/phy0-9 Tainted: G        W         5.15.112 #0
[ 4787.861782] Stack : 00000000 800859e8 00000000 00000004 00000000 00000000 8140dd14 80a20000
[ 4787.870139]         80860000 80789264 81505078 8085ce03 00000000 00000001 8140dcc0 81452640
[ 4787.878490]         00000000 00000000 80789264 8140db60 ffffefff 00000000 ffffffea 00000000
[ 4787.886841]         8140db6c 00000309 80862ae0 ffffffff 80789264 00000000 00000000 00000000
[ 4787.895193]         00000000 8085a2fc 80860000 8085a300 00000018 8040b510 00000000 80a20000
[ 4787.903546]         ...
[ 4787.905986] Call Trace:
[ 4787.908415] [<8000812c>] show_stack+0x28/0xf0
[ 4787.912788] [<80380234>] dump_stack_lvl+0x60/0x80
[ 4787.917489] [<80387020>] nmi_cpu_backtrace+0x108/0x178
[ 4787.922607] [<803871d8>] nmi_trigger_cpumask_backtrace+0x148/0x178
[ 4787.928764] [<8009f254>] rcu_dump_cpu_stacks+0x158/0x1ac
[ 4787.934072] [<8009fb98>] rcu_sched_clock_irq+0x800/0x9d0
[ 4787.939368] [<800a6270>] update_process_times+0xc8/0x124
[ 4787.944665] [<800bae90>] tick_handle_periodic+0x34/0xc8
[ 4787.949888] [<804d34c8>] gic_compare_interrupt+0x7c/0x9c
[ 4787.955189] [<8008e348>] handle_percpu_devid_irq+0xbc/0x188
[ 4787.960757] [<80087a9c>] generic_handle_domain_irq+0x2c/0x44
[ 4787.966393] [<8039d4a0>] gic_handle_local_int+0xa4/0x110
[ 4787.971690] [<8039d51c>] gic_irq_dispatch+0x10/0x20
[ 4787.976550] [<800879fc>] handle_irq_desc+0x20/0x38
[ 4787.981321] [<806dc538>] do_domain_IRQ+0x3c/0x50
[ 4787.985940] [<8039c7dc>] plat_irq_dispatch+0x98/0xcc
[ 4787.990908] [<80003568>] except_vec_vi_end+0xb8/0xc4
[ 4787.995858] [<82f501a4>] mt76_mmio_init+0xf8/0x144 [mt76]
[ 4788.001271] [<82e6dd38>] mt7915_mac_wtbl_lmac_addr+0x1b0/0x960 [mt7915e]
[ 4788.007980] [<82e70000>] mt7915_mac_reset_work+0x464/0xc88 [mt7915e]

lukasz1992 commented 1 year ago

@rany2 because of this line https://github.com/openwrt/mt76/blob/969b7b5ebd129068ca56e4b0d831593a2f92382f/mmio.c#LL103C23-L103C36

rany2 commented 1 year ago

@lukasz1992 honestly I think the stacktrace is wrong/deceitful... doesn't make sense

rany2 commented 1 year ago

anyway, if it helps the line that seems to trigger that stacktrace is ieee80211_stop_queues:

(gdb) l *mt7915_mac_reset_work+0x464
0x10060 is in mt7915_mac_reset_work (..../mt7915/mac.c:1533).
1528        ext_phy = dev->mt76.phys[MT_BAND1];
1529    
1530        dev->recovery.hw_full_reset = true;
1531    
1532        wake_up(&dev->mt76.mcu.wait);
1533        ieee80211_stop_queues(mt76_hw(dev));
1534        if (ext_phy)
1535            ieee80211_stop_queues(ext_phy->hw);
1536    
1537        cancel_delayed_work_sync(&dev->mphy.mac_work);

rany2 commented 1 year ago

@lukasz1992 Issue I'm having with some stacktraces I'm getting is that they are inconsistent and I get different functions/offsets for seemingly the same issue, so I don't know what's up with that.

ThiloteE commented 1 year ago

@rany2 I am not really a coder, but https://forum.openwrt.org/t/mt76-wireless-driver-debugging/154514/117 sounds like it might be of help?

"I found out last week how to use gdb properly to get the code where a crash occurs. Because openwrt uses -O2 optimization by default during the compile, it will inline a lot of functions, so the function names in the stack dump are often not the real function, especially if the offset is above 0x100.

Here's an example usage, run from the build server:"

(gdb) add-symbol-file ~/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o
add symbol table from file "/home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o"
(y or n) y
Reading symbols from /home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o...

(gdb) l *ieee80211_sta_ps_transition+0x448
0x23a20 is in ieee80211_rx_8023 (/home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/rx.c:4771).
4766                 */
4767                xmit_skb->priority += 256;
4768                xmit_skb->protocol = htons(ETH_P_802_3);
4769                skb_reset_network_header(xmit_skb);
4770                skb_reset_mac_header(xmit_skb);
4771                dev_queue_xmit(xmit_skb);
4772            }
4773    
4774            if (!skb)
4775                return;

rany2 commented 1 year ago

@ThiloteE I see, I guess I have to look everything up with gdb. Though I don't think then that -O2 is the issue, because this issue would pop up with every static function. At least I know why the function names were seemingly inconsistent.

Brain2000 commented 1 year ago

@ThiloteE @rany2 I recompiled openwrt without -O2, which makes the stack traces a little more reliable, but it can still inline some things, so that wasn't the magic bullet. Using gdb on the server on which openwrt was built is the only way to be sure of the exact line a stack trace points to.
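Since every frame has to be looked up in gdb anyway, here is a minimal sketch that turns a pasted kernel call trace into the `l *function+0xoffset` commands used earlier in this thread. The function name `gdb_list_commands` is hypothetical, and the regular expression only covers the frame format shown in the traces above.

```python
import re

# Matches frames such as:
#   [<82e70000>] mt7915_mac_reset_work+0x464/0xc88 [mt7915e]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+([\w.]+)\+(0x[0-9a-f]+)/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def gdb_list_commands(trace):
    """Return (module, gdb command) pairs for each frame in a call trace.

    module is None for frames that carry no [module] tag, i.e. frames that
    belong to the kernel image rather than a loadable module.
    """
    commands = []
    for line in trace.splitlines():
        m = FRAME.search(line)
        if m:
            func, offset, module = m.groups()
            commands.append((module, "l *%s+%s" % (func, offset)))
    return commands
```

The module name tells you which object file to load with `add-symbol-file` before pasting the generated commands into gdb on the build host.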

rany2 commented 1 year ago

@Brain2000 I haven't done that myself; how bad is performance without -O2? I don't care if it's worse in benchmarks, but could you feel that it's much slower?

rany2 commented 1 year ago

@Brain2000 BTW seeing that you're the original reporter, could you confirm that using rom patch/WA I attached fixed this issue for you (it did for me)?

Brain2000 commented 1 year ago

@rany2 The performance wasn't any different that I could tell. I'm sure it's a few microseconds faster here and there. It probably becomes more pronounced if there are lots of connected devices at speeds above 500 Mbps.

I read yesterday about that rom patch, and it piqued my interest. I'm going to try it as soon as I can get a moment. It might be a couple of days though, as I'm booked solid on my schedule at the moment.

rany2 commented 1 year ago

@Brain2000 I think the WA fw is what fixed it, but I updated both WA fw and ROM patch just in case. WM fw could not be used due to (what I think are) vendor modifications making it incompatible with mt76. However so far it's been smooth sailing. I undid all the patches I applied from mtk-openwrt-feeds and it seems to be going well so far.

While this is not confirmed, there is a chance that the only reason the WA fw from TP-Link's EAP->>some numbers here<< works is that they patched it and fixed this issue themselves. I noticed that the debugging info in the fw shows TP-Link-specific paths, so they did build that firmware themselves and possibly made changes that resolved the issue.

@littoy If you're using my mt76 repo, try updating to 1fdd73655aadfe85e2d145e259e814ccaea75dd1. Having solid uptime so far on it (doesn't crash in under an hour with many clients like it used to before).

rany2 commented 1 year ago

Edit: almost forgot:

# cat /sys/kernel/debug/ieee80211/phy0/mt76/sys_recovery 
...
let's dump firmware SER statistics...
::E  R , SER_STATUS        = 0x00000000
::E  R , SER_PLE_ERR       = 0x00000000
::E  R , SER_PLE_ERR_1     = 0x00000000
::E  R , SER_PLE_ERR_AMSDU = 0x00000000
::E  R , SER_PSE_ERR       = 0x00000000
::E  R , SER_PSE_ERR_1     = 0x00000000
::E  R , SER_LMAC_WISR6_B0 = 0x00000000
::E  R , SER_LMAC_WISR6_B1 = 0x00000000
::E  R , SER_LMAC_WISR7_B0 = 0x00000000
::E  R , SER_LMAC_WISR7_B1 = 0x00000000

SYS_RESET_COUNT: WM 0, WA 0

# cat /sys/kernel/debug/ieee80211/phy1/mt76/sys_recovery 
...
let's dump firmware SER statistics...
::E  R , SER_STATUS        = 0x00000000
::E  R , SER_PLE_ERR       = 0x00000000
::E  R , SER_PLE_ERR_1     = 0x00000000
::E  R , SER_PLE_ERR_AMSDU = 0x00000000
::E  R , SER_PSE_ERR       = 0x00000000
::E  R , SER_PSE_ERR_1     = 0x00000000
::E  R , SER_LMAC_WISR6_B0 = 0x00000000
::E  R , SER_LMAC_WISR6_B1 = 0x00000000
::E  R , SER_LMAC_WISR7_B0 = 0x00000000
::E  R , SER_LMAC_WISR7_B1 = 0x00000000

SYS_RESET_COUNT: WM 0, WA 0
# 
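When comparing the SER statistics of several radios, it can help to parse the dump rather than eyeball it. Below is a minimal sketch, assuming the output format shown above; the function names are hypothetical.

```python
import re

# Line formats as printed by sys_recovery in the dumps above.
REG = re.compile(r"::E\s+R\s*,\s*(\w+)\s*=\s*(0x[0-9a-fA-F]+)")
RESET = re.compile(r"SYS_RESET_COUNT:\s*WM\s+(\d+),\s*WA\s+(\d+)")


def parse_ser_stats(text):
    """Return (registers, reset counts) parsed from sys_recovery output."""
    regs = {name: int(value, 16) for name, value in REG.findall(text)}
    m = RESET.search(text)
    resets = {"WM": int(m.group(1)), "WA": int(m.group(2))} if m else None
    return regs, resets


def ser_reports_nothing(regs, resets):
    """True when every SER register and reset counter reads zero."""
    counters_zero = resets is None or all(v == 0 for v in resets.values())
    return all(v == 0 for v in regs.values()) and counters_zero
```

Note that an all-zero dump only means the firmware recorded no SER event; as the dumps above show, the driver can still be hung in that state.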

@ryderlee1110 Here's the log right now with that TP-Link firmware I linked to earlier. It's the same issue of the driver hanging as before, but this time around there are no "xmit fail" messages; it just dies (in the past, the log would almost always be spammed by "xmit fail (out of resource)", but now the fw log is a lot cleaner):

Wed May 24 22:57:49 2023 daemon.notice hostapd: eap2ghz: BEACON-REQ-TX-STATUS 98:50:2e:72:ee:56 113 ack=1                                             
Wed May 24 22:57:49 2023 daemon.notice hostapd: eap2ghz: BEACON-RESP-RX 98:50:2e:72:ee:56 113 04                                                                                                                                 
Wed May 24 22:57:49 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS 04:e5:98:de:e9:f2 13 ack=1                                                                                                                         
Wed May 24 22:57:49 2023 daemon.notice hostapd: eap5ghz: BEACON-REQ-TX-STATUS dc:e5:5b:41:71:9c 14 ack=1                                              
Wed May 24 22:57:49 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX dc:e5:5b:41:71:9c 14 04                                                       
Wed May 24 22:57:50 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX 04:e5:98:de:e9:f2 13 00 510b4c60dd2a010000000000003eff002091b72447013461dd2a01d57380e02a0100000064001110000d4d617a6f75742034204672656501088c129824b048606c03010b0504000200000706504120010d242a01073201ff30180100000fac040100000fac040200000fac01000fac030d000b0502002f00004605720000000036035e9b003b0251002d1aed0917ffff0000000000000000000001000000000000000000003d160b0015000000000000000
000000000000000000000007f0a04000a0a010001400040451102e7070518151c2a000000000000000000bf0cb1018033faff0000faff0000c005000b00fcff02020180               
Wed May 24 22:57:50 2023 daemon.notice hostapd: eap5ghz: BEACON-RESP-RX 04:e5:98:de:e9:f2 13 00 510b4c60dd2a010000000000003eff002091b72447013461dd2a0175c3020048ff1c230500081a441002200e920f01af08000c00fafffaff391cc7711c07ff0724f43f0028fcffff022703ff0e260008a9ff2fa9ff4575ff6575ffdd1a00904c0408bf0cb1018033faff0000faff0000c005000b00fcffdd180050f2020101810003a4000027a4000042435e0062322f0002020101
Wed May 24 22:57:51 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit                                                     
Wed May 24 22:57:51 2023 daemon.notice hostapd: eap5ghz: BSS-TM-RESP 04:e5:98:de:e9:f2 status_code=0 bss_termination_delay=0 target_bssid=00:20:91:b7:24:47                                                                      
Wed May 24 22:57:53 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed                                                             
Wed May 24 22:57:53 2023 daemon.notice hostapd: eap2ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1                                                  
Wed May 24 22:57:53 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 1)                                                                                     
Wed May 24 22:57:53 2023 daemon.notice hostapd: eap2ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft                                                                                          
Wed May 24 22:57:53 2023 kern.info kernel: [ 5031.241725] ieee80211 phy0: WM: ( 721.873586:49:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0                                                 
Wed May 24 22:57:53 2023 kern.info kernel: [ 5031.250766] ieee80211 phy0: WM: ( 721.873952:50:RA-E)set initRateDownMCS[6] old 1ss_m0, new 1ss_m0                                                 
Wed May 24 22:57:53 2023 kern.info kernel: [ 5031.259777] ieee80211 phy0: WM: ( 721.874105:51:TXC-E)heACtrlInitStaRec(): prOperMode=401568, u2OperMode=0!                                        
Wed May 24 22:57:53 2023 kern.info kernel: [ 5031.269570] ieee80211 phy0: WM: ( 721.874135:52:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 0 eBand = 2407000 BSS BW = 0
Wed May 24 22:57:53 2023 kern.info kernel: [ 5031.280142] ieee80211 phy0: WM: ( 721.874196:53:MURU-E)[MuruMumGroupFormation]muruMumGen2MuGrpEntry Fail                                           
Wed May 24 22:57:53 2023 daemon.notice hostapd: eap5ghz: Prune association for 04:e5:98:de:e9:f2                                                                                                 
Wed May 24 22:57:53 2023 daemon.notice hostapd: eap5ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                                                                                                   
Wed May 24 22:57:53 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session EC5AE1D5F54811D4                                                                                                
Wed May 24 22:57:53 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                                                  
Wed May 24 22:57:55 2023 kern.info kernel: [ 5033.485208] ieee80211 phy0: WM: ( 724.116536:54:MQM-W)[WR] BSS usage overflows during removing entry                                                                               
Wed May 24 22:57:55 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed                                                                                                                                        
Wed May 24 22:57:55 2023 daemon.notice hostapd: eap5ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1                                                                                                                             
Wed May 24 22:57:55 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 3)                                                                                                                     
Wed May 24 22:57:55 2023 daemon.notice hostapd: eap5ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft                                                                                                                                                                       
Wed May 24 22:57:55 2023 kern.info kernel: [ 5033.598626] ieee80211 phy0: WM: ( 724.229909:55:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0                                                                                                                              
Wed May 24 22:57:55 2023 kern.info kernel: [ 5033.607656] ieee80211 phy0: WM: ( 724.230275:56:RA-E)set initRateDownMCS[2] old 1ss_m0, new 1ss_m0                                                                                                                              
Wed May 24 22:57:55 2023 kern.info kernel: [ 5033.616677] ieee80211 phy0: WM: ( 724.230428:57:TXC-E)heACtrlInitStaRec(): prOperMode=401308, u2OperMode=10!                                                                                                                    
Wed May 24 22:57:55 2023 kern.info kernel: [ 5033.626559] ieee80211 phy0: WM: ( 724.230458:58:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2                                                                                                            
Wed May 24 22:57:55 2023 daemon.notice hostapd: eap2ghz: Prune association for 04:e5:98:de:e9:f2                                                                                                                                                                              
Wed May 24 22:57:55 2023 daemon.notice hostapd: eap2ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                                                                                                                                                
Wed May 24 22:57:55 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session 191F160DED6332A8                                                                                                                                             
Wed May 24 22:57:55 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                                                                                               
Wed May 24 22:57:57 2023 kern.info kernel: [ 5035.755212] ieee80211 phy0: WM: ( 726.385976:59:MQM-W)[WR] BSS usage overflows during removing entry                       
Wed May 24 22:58:01 2023 daemon.notice hostapd: eap5ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                                           
Wed May 24 22:58:01 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed                                                                                
Wed May 24 22:58:01 2023 daemon.notice hostapd: eap5ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1                                                                     
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.345268] ieee80211 phy0: WM: ( 729.975179:60:MQM-W)[WR] BSS usage overflows during removing entry                       
Wed May 24 22:58:01 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 3)                                                             
Wed May 24 22:58:01 2023 daemon.notice hostapd: eap5ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft                                                                  
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.357124] ieee80211 phy0: WM: ( 729.987020:61:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0                         
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.366147] ieee80211 phy0: WM: ( 729.987386:62:RA-E)set initRateDownMCS[2] old 1ss_m0, new 1ss_m0                         
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.375157] ieee80211 phy0: WM: ( 729.987508:63:TXC-E)heACtrlInitStaRec(): prOperMode=401308, u2OperMode=10!                                                                       
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.385038] ieee80211 phy0: WM: ( 729.987569:64:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 2 eBand = 5000000 BSS BW = 2                                                               
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.395600] ieee80211 phy0: WM: ( 729.987630:65:MURU-E)[muruMumGroupCheck]StaRxNstsSum=2, ApTxNsts=2                       
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.404788] ieee80211 phy0: WM: ( 729.987661:66:BF-E)PFMU ID 0xFFFF is abnormal                                            
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.412107] ieee80211 phy0: WM: ( 729.987691:67:BF-E)BFSEUpdate Failed. u4Status=-1073741823                               
Wed May 24 22:58:01 2023 kern.info kernel: [ 5039.420558] ieee80211 phy0: WM: ( 729.987722:68:BF-E)BFSetNdpRate() Invalid: 3, WlanIdx: 2, u1Mcs: 9                                                                               
Wed May 24 22:58:01 2023 daemon.notice hostapd: eap2ghz: Prune association for 04:e5:98:de:e9:f2                                                                                                                                 
Wed May 24 22:58:01 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session 191F160DED6332A8                                                                                                
Wed May 24 22:58:01 2023 daemon.info hostapd: eap5ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                                                  
Wed May 24 22:58:02 2023 daemon.info dawn: Client 04:E5:98:DE:E9:F2: Kicking due to low active data transfer: RX rate 6.000000 below 6 limit                                                                                     
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap5ghz: BSS-TM-RESP 04:e5:98:de:e9:f2 status_code=0 bss_termination_delay=0 target_bssid=00:20:91:b7:24:47                                                                      
Wed May 24 22:58:02 2023 daemon.err hostapd: nl80211: kernel reports: key addition failed                                                                                                                                        
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap2ghz: STA-OPMODE-N_SS-CHANGED 04:e5:98:de:e9:f2 1                                                                                                                             
Wed May 24 22:58:02 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 IEEE 802.11: associated (aid 1)                                                                                                                     
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap2ghz: AP-STA-CONNECTED 04:e5:98:de:e9:f2 auth_alg=ft                                                                                                                          
Wed May 24 22:58:02 2023 kern.info kernel: [ 5040.439590] ieee80211 phy0: WM: ( 731.069234:69:BSS-E)_whCapSetGeneric_Falcon DW0 = 0x0, DW1 = 0x0                                                                                 
Wed May 24 22:58:02 2023 kern.info kernel: [ 5040.448646] ieee80211 phy0: WM: ( 731.069692:70:RA-E)set initRateDownMCS[6] old 1ss_m0, new 1ss_m0                                                                                 
Wed May 24 22:58:02 2023 kern.info kernel: [ 5040.457660] ieee80211 phy0: WM: ( 731.069814:71:TXC-E)heACtrlInitStaRec(): prOperMode=401568, u2OperMode=0!                                                                        
Wed May 24 22:58:02 2023 kern.info kernel: [ 5040.467461] ieee80211 phy0: WM: ( 731.069844:72:BSS-E)muruAddStaRec, prRuStaRec->u1Bw = 0 eBand = 2407000 BSS BW = 0                                                               
Wed May 24 22:58:02 2023 kern.info kernel: [ 5040.478024] ieee80211 phy0: WM: ( 731.069905:73:MURU-E)[MuruMumGroupFormation]muruMumGen2MuGrpEntry Fail                                                                           
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap5ghz: Prune association for 04:e5:98:de:e9:f2                                                                                                                                 
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap5ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                                                                                                   
Wed May 24 22:58:02 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session EC5AE1D5F54811D4                                                                                                
Wed May 24 22:58:02 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                                                  
Wed May 24 22:58:04 2023 kern.info kernel: [ 5042.755165] ieee80211 phy0: WM: ( 733.384206:74:MQM-W)[WR] BSS usage overflows during removing entry                                                                               
Wed May 24 22:58:06 2023 daemon.warn dawn: Client / BSSID = 04:E5:98:DE:E9:F2 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:07 2023 daemon.warn dawn: Client / BSSID = 98:50:2E:72:EE:56 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:08 2023 daemon.warn dawn: Client / BSSID = 04:E5:98:DE:E9:F2 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:09 2023 daemon.warn dawn: Client / BSSID = DC:E5:5B:41:71:9C / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:10 2023 daemon.warn dawn: Client / BSSID = F4:26:79:E6:EC:59 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:24 2023 kern.err kernel: [ 5062.884380] mt7915e 0000:02:00.0: Message 000025ed (seq 4) timeout                                                                                                                  
Wed May 24 22:58:45 2023 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5                                                                                                                            
Wed May 24 22:58:45 2023 kern.err kernel: [ 5083.364221] mt7915e 0000:02:00.0: Message 0000aded (seq 5) timeout                                                                                                                  
Wed May 24 22:59:05 2023 kern.err kernel: [ 5103.844062] mt7915e 0000:02:00.0: Message 00005aed (seq 6) timeout                                                                                                                  
Wed May 24 22:59:26 2023 daemon.notice hostapd: eap2ghz: STA-OPMODE-SMPS-MODE-CHANGED 04:e5:98:de:e9:f2 off                                                                                                                      
Wed May 24 22:59:26 2023 kern.err kernel: [ 5124.323922] mt7915e 0000:02:00.0: Message 00005aed (seq 7) timeout                                                                                                                  
Wed May 24 22:59:46 2023 kern.err kernel: [ 5144.803750] mt7915e 0000:02:00.0: Message 000025ed (seq 8) timeout                                                                                                                  
Wed May 24 23:00:07 2023 kern.err kernel: [ 5165.283597] mt7915e 0000:02:00.0: Message 00005aed (seq 9) timeout                                                                                                                  
Wed May 24 23:00:27 2023 kern.err kernel: [ 5185.763453] mt7915e 0000:02:00.0: Message 00005aed (seq 10) timeout                                                                                                                 
Wed May 24 23:00:48 2023 kern.err kernel: [ 5206.243299] mt7915e 0000:02:00.0: Message 000025ed (seq 11) timeout                                                                                                                 
Wed May 24 23:01:08 2023 daemon.notice hostapd: Beacon request: 04:e5:98:de:e9:f2 is not connected                                                                                                                               
Wed May 24 23:01:08 2023 kern.err kernel: [ 5226.723148] mt7915e 0000:02:00.0: Message 00005aed (seq 12) timeout                                                                                                                 
Wed May 24 23:01:29 2023 kern.err kernel: [ 5247.203101] mt7915e 0000:02:00.0: Message 000025ed (seq 13) timeout                                                                                                                 
rany2 commented 1 year ago

I can't spot anything interesting in the fw log, though this piqued my interest:

Wed May 24 22:58:02 2023 daemon.notice hostapd: eap5ghz: Prune association for 04:e5:98:de:e9:f2                                                                                                                                 
Wed May 24 22:58:02 2023 daemon.notice hostapd: eap5ghz: AP-STA-DISCONNECTED 04:e5:98:de:e9:f2                                                                                                                                   
Wed May 24 22:58:02 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 RADIUS: starting accounting session EC5AE1D5F54811D4                                                                                                
Wed May 24 22:58:02 2023 daemon.info hostapd: eap2ghz: STA 04:e5:98:de:e9:f2 IEEE 802.1X: authenticated - EAP type: 0 (unknown)                                                                                                  
Wed May 24 22:58:04 2023 kern.info kernel: [ 5042.755165] ieee80211 phy0: WM: ( 733.384206:74:MQM-W)[WR] BSS usage overflows during removing entry                                                                               
Wed May 24 22:58:06 2023 daemon.warn dawn: Client / BSSID = 04:E5:98:DE:E9:F2 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:07 2023 daemon.warn dawn: Client / BSSID = 98:50:2E:72:EE:56 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:08 2023 daemon.warn dawn: Client / BSSID = 04:E5:98:DE:E9:F2 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:09 2023 daemon.warn dawn: Client / BSSID = DC:E5:5B:41:71:9C / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:10 2023 daemon.warn dawn: Client / BSSID = F4:26:79:E6:EC:59 / 00:20:91:15:55:5C: BEACON REQUEST failed                                                                                                         
Wed May 24 22:58:24 2023 kern.err kernel: [ 5062.884380] mt7915e 0000:02:00.0: Message 000025ed (seq 4) timeout                                                                                                                  
Wed May 24 22:58:45 2023 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5                                                                                                                            
Wed May 24 22:58:45 2023 kern.err kernel: [ 5083.364221] mt7915e 0000:02:00.0: Message 0000aded (seq 5) timeout                                                                                                                  
Wed May 24 22:59:05 2023 kern.err kernel: [ 5103.844062] mt7915e 0000:02:00.0: Message 00005aed (seq 6) timeout                                                                                                                 

Could it be that 04:e5:98:de:e9:f2 switched between the two APs so quickly that the firmware crashed? Is the firmware unable to handle the same client appearing on different bands?
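To check how often a client hops between interfaces this quickly, the hostapd lines can be scanned for AP-STA-CONNECTED events of the same MAC on different interfaces. The following is only a rough sketch: the function name `find_band_hops` and the 5-second window are my own assumptions for illustration, and it assumes the syslog line format shown in the logs above.

```python
import re
from datetime import datetime

# Matches hostapd syslog lines in the format seen in this thread, e.g.
# "Wed May 24 22:58:02 2023 daemon.notice hostapd: eap2ghz: AP-STA-CONNECTED 04:e5:..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4}) \S+ hostapd: "
    r"(?P<iface>\S+): AP-STA-CONNECTED (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)


def find_band_hops(lines, window_s=5):
    """Return (mac, from_iface, to_iface, seconds) for each quick interface change.

    window_s is a hypothetical threshold, not a value taken from the driver.
    """
    last_seen = {}  # mac -> (iface, timestamp of last AP-STA-CONNECTED)
    hops = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        mac, iface = m["mac"], m["iface"]
        prev = last_seen.get(mac)
        if prev and prev[0] != iface:
            delta = (ts - prev[1]).total_seconds()
            if delta <= window_s:
                hops.append((mac, prev[0], iface, delta))
        last_seen[mac] = (iface, ts)
    return hops
```

Feeding it the output of `logread` (one line per list element) should list every client that connected on one interface and then on another within the window, which can then be compared against the time of the first MCU timeout.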

rany2 commented 1 year ago

This is a gamble but I'll try setting NEEDS_UNIQUE_STA_ADDR for mt7915e. I think it might be the issue. Edit: here goes nothing: https://github.com/rany2/mt76/commit/42fe310f9363d2430f9b08e675ecc3b11e9e6068

littoy commented 1 year ago

@Brain2000 I think the WA fw is what fixed it, but I updated both WA fw and ROM patch just in case. WM fw could not be used due to (what I think are) vendor modifications making it incompatible with mt76. However so far it's been smooth sailing. I undid all the patches I applied from mtk-openwrt-feeds and it seems to be going well so far.

While this is not confirmed, there is a chance that the only reason the WA fw from TP-Link's EAP->>some numbers here<< works is that they patched it and fixed this issue themselves. I noticed that the debugging info in the fw shows TP-Link-specific paths, so they did build that firmware themselves and possibly made changes that resolved the issue.

@littoy If you're using my mt76 repo, try updating to 1fdd73655aadfe85e2d145e259e814ccaea75dd1. Having solid uptime so far on it (doesn't crash in under an hour with many clients like it used to before).

I'll try later.

@littoy on your openwrt source tree, you need to make the following changes: rany2/openwrt@f52d443

just change PKG_SOURCE_VERSION to dee319231825423a9ac5135591a781e1e398267f

And this version was not stable in my preliminary test; the timeouts appeared after a short time.

rany2 commented 1 year ago

@littoy Could you try bb517921433c2ba1235c98f68f94c36ff6600188?

littoy commented 1 year ago

okay.

littoy commented 1 year ago

@rany2 Is your master branch for testing and mainline_based for stable?

rany2 commented 1 year ago

@littoy master is stable, use it... as for the other ones:

basically both mainline_based and master have the fixes, but I am using master so far and it works OK

littoy commented 1 year ago

Okay, I'll update to the latest master for testing.

rany2 commented 1 year ago

The exact same issue just came up. @ryderlee1110 Setting NEEDS_UNIQUE_STA_ADDR is not enough, but I think the pattern is very clear now, extremely rapid disassoc from one band and reassoc to another band causes the issue. I think it also explains why daemons like DAWN make this issue come up more often: as a roaming assistant, DAWN makes clients switch bands more frequently, which triggers it:

Thu May 25 10:32:03 2023 daemon.notice hostapd: eap5ghz: Prune association for 10:f6:05:27:94:97
Thu May 25 10:32:03 2023 daemon.notice hostapd: eap5ghz: AP-STA-DISCONNECTED 10:f6:05:27:94:97
Thu May 25 10:32:03 2023 daemon.info hostapd: eap2ghz: STA 10:f6:05:27:94:97 RADIUS: starting accounting session D2F444A42FC17282
Thu May 25 10:32:03 2023 daemon.info hostapd: eap2ghz: STA 10:f6:05:27:94:97 IEEE 802.1X: authenticated - EAP type: 0 (unknown)
Thu May 25 10:32:05 2023 kern.info kernel: [ 4218.092345] ieee80211 phy0: WM: (4203.903260:78:MQM-W)[WR] BSS usage overflows during removing entry
Thu May 25 10:32:26 2023 kern.err kernel: [ 4238.571131] mt7915e 0000:02:00.0: Message 000025ed (seq 2) timeout
Thu May 25 10:32:46 2023 daemon.notice hostapd: nl80211: nl80211_recv_beacons->nl_recvmsgs failed: -5
Thu May 25 10:32:46 2023 kern.err kernel: [ 4259.050914] mt7915e 0000:02:00.0: Message 0000aded (seq 3) timeout
Thu May 25 10:33:07 2023 kern.err kernel: [ 4279.530668] mt7915e 0000:02:00.0: Message 00005aed (seq 4) timeout
Thu May 25 10:33:27 2023 kern.err kernel: [ 4300.010484] mt7915e 0000:02:00.0: Message 00005aed (seq 5) timeout
Thu May 25 10:33:47 2023 kern.err kernel: [ 4320.490214] mt7915e 0000:02:00.0: Message 000025ed (seq 6) timeout
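The spacing of the timeout messages can be read off the kernel timestamps. A small sketch follows, assuming the `Message ... (seq N) timeout` line format shown above; the helper name `timeout_gaps` is made up for illustration.

```python
import re

# Kernel timestamp plus the mt7915e MCU timeout message, as printed in the logs above.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)\] mt7915e \S+ "
    r"Message (?P<msg>[0-9a-f]{8}) \(seq (?P<seq>\d+)\) timeout"
)


def timeout_gaps(lines):
    """Return the gaps in seconds (rounded to 2 decimals) between consecutive timeouts."""
    times = [float(m["t"]) for m in map(TIMEOUT_RE.search, lines) if m]
    return [round(later - earlier, 2) for earlier, later in zip(times, times[1:])]
```

On the log above the gaps come out at roughly 20.48 s each, which suggests every command is waiting out a fixed timeout instead of failing immediately. That is an inference from the timestamps, not something confirmed in this thread.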
dhewg commented 1 year ago

I think the pattern is very clear now, extremely rapid disassoc from one band and reassoc to another band causes the issue

That sounds like a timing/locking issue. Any idea if https://github.com/openwrt/openwrt/issues/12661 may be related?

rany2 commented 1 year ago

I think the pattern is very clear now, extremely rapid disassoc from one band and reassoc to another band causes the issue

That sounds like a timing/locking issue. Any idea if openwrt/openwrt#12661 may be related?

Probably, but I have no idea whether it's an issue with the firmware or with this driver.

littoy commented 1 year ago

Okay, I'll update to the latest master for testing.

It looks like the commit hashes on your master were rewritten? I tested commit f0bacafabeae61bd22e0ac216aa839f396331757. The timeouts appeared shortly afterwards, and then Wi-Fi stopped working normally.

[   21.606090] usbcore: registered new interface driver ax88179_178a
[   21.633949] usbcore: registered new interface driver mt76x2u
[   21.727072] mt7915e 0000:04:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230424212203a
[   21.727072] 
[   21.788014] mt7915e 0000:04:00.0: WM Firmware Version: ____000000, Build Time: 20230424212218
[   21.817953] mt7915e 0000:04:00.0: WA Firmware Version: DEV_000000, Build Time: 20230424212310
[   22.003678] PPP generic driver version 2.4.2
[   22.005388] PPP MPPE Compression module registered
[   22.007243] NET: Registered PF_PPPOX protocol family
[   22.010282] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[   22.011453] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[   22.045868] Setting dangerous option enable_guc - tainting kernel
[   22.048215] l2tp_ppp: PPPoL2TP kernel driver, V2.0
[   22.050001] kmodloader: done loading kernel modules from /etc/modules.d/*
[   23.598903] br-lan: port 1(eth2) entered blocking state
[   23.598911] br-lan: port 1(eth2) entered disabled state
[   23.598971] device eth2 entered promiscuous mode
[   23.677765] br-lan: port 2(eth3) entered blocking state
[   23.677774] br-lan: port 2(eth3) entered disabled state
[   23.677845] device eth3 entered promiscuous mode
[   23.679347] 8021q: adding VLAN 0 to HW filter on device eth0
[   23.679441] br-wan: port 1(eth0) entered blocking state
[   23.679445] br-wan: port 1(eth0) entered disabled state
[   23.679529] device eth0 entered promiscuous mode
[   23.679874] br-wan: port 1(eth0) entered blocking state
[   23.679877] br-wan: port 1(eth0) entered forwarding state
[   23.759039] br-wan: port 2(eth1) entered blocking state
[   23.759047] br-wan: port 2(eth1) entered disabled state
[   23.759139] device eth1 entered promiscuous mode
[   24.411544] fast-classifier: starting up
[   25.096442] fast-classifier: registered
[   25.447910] br-lan: port 3(wlan0) entered blocking state
[   25.447926] br-lan: port 3(wlan0) entered disabled state
[   25.448077] device wlan0 entered promiscuous mode
[   25.448141] br-lan: port 3(wlan0) entered blocking state
[   25.448148] br-lan: port 3(wlan0) entered forwarding state
[   25.458282] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[   25.558913] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[   26.144359] br-lan: port 4(wlan1) entered blocking state
[   26.144378] br-lan: port 4(wlan1) entered disabled state
[   26.144581] device wlan1 entered promiscuous mode
[   26.144670] br-lan: port 4(wlan1) entered blocking state
[   26.144678] br-lan: port 4(wlan1) entered forwarding state
[   26.147397] br-lan: port 4(wlan1) entered disabled state
[   26.813849] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[   26.813908] br-lan: port 4(wlan1) entered blocking state
[   26.813913] br-lan: port 4(wlan1) entered forwarding state
[   27.956502] svc: failed to register nfsdv3 RPC service (errno 111).
[   29.257652] igc 0000:01:00.0 eth1: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
[   29.257779] br-wan: port 2(eth1) entered blocking state
[   29.257783] br-wan: port 2(eth1) entered forwarding state
[ 2134.600535] mt7915e 0000:04:00.0: Message 00005aed (seq 12) timeout
[ 2155.080719] mt7915e 0000:04:00.0: Message 000025ed (seq 13) timeout
rany2 commented 1 year ago

[ 21.727072] mt7915e 0000:04:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230424212203a
[ 21.727072]
[ 21.788014] mt7915e 0000:04:00.0: WM Firmware Version: ____000000, Build Time: 20230424212218
[ 21.817953] mt7915e 0000:04:00.0: WA Firmware Version: DEV_000000, Build Time: 20230424212310

That firmware doesn't look right, it should be:

[   13.300612] mt7915e 0000:02:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230330182132a
[   13.635186] mt7915e 0000:02:00.0: WM Firmware Version: ____000000, Build Time: 20230418151336
[   13.673128] mt7915e 0000:02:00.0: WA Firmware Version: DEV_000000, Build Time: 20230315163130

if you got it from my master
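For anyone comparing their own boot log against the build times above, a quick sketch like the following could pull them out of dmesg. The helper names `firmware_builds` and `diff_builds` are hypothetical, and the regex only assumes the three line formats shown in this thread.

```python
import re

# Matches the three version lines mt7915e prints at probe time (format as seen above).
FW_RE = re.compile(
    r"mt7915e \S+ (?P<kind>HW/SW|WM Firmware|WA Firmware) Version: \S+ "
    r"Build Time: (?P<build>\d{14})"
)


def firmware_builds(dmesg_lines):
    """Map each firmware component to its 14-digit build timestamp."""
    return {m["kind"]: m["build"] for m in map(FW_RE.search, dmesg_lines) if m}


def diff_builds(expected, actual):
    """Return {component: (expected, actual)} for every component that differs."""
    keys = expected.keys() | actual.keys()
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }
```

Running `firmware_builds` on both logs and passing the results to `diff_builds` shows at a glance which of the HW/SW, WM and WA images differ. Note that a mismatch is expected if the card is a different chip and therefore loads different firmware files.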

lukasz1992 commented 1 year ago

I wonder how about booting with only 1 cpu active (maxcpus=1), does it change anything?

rany2 commented 1 year ago

I wonder how about booting with only 1 cpu active (maxcpus=1), does it change anything?

I think it's multithreaded irrespective of whether you have multiple cores, so it shouldn't make a difference. You could write 0 to napi_threaded, but I have no idea how useful that is in this case or whether it does what the name suggests.

littoy commented 1 year ago

[ 21.727072] mt7915e 0000:04:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230424212203a
[ 21.727072]
[ 21.788014] mt7915e 0000:04:00.0: WM Firmware Version: ____000000, Build Time: 20230424212218
[ 21.817953] mt7915e 0000:04:00.0: WA Firmware Version: DEV_000000, Build Time: 20230424212310

That firmware doesn't look right, it should be:

[   13.300612] mt7915e 0000:02:00.0: HW/SW Version: 0x8a108a10, Build Time: 20230330182132a
[   13.635186] mt7915e 0000:02:00.0: WM Firmware Version: ____000000, Build Time: 20230418151336
[   13.673128] mt7915e 0000:02:00.0: WA Firmware Version: DEV_000000, Build Time: 20230315163130

if you got it from my master

Oh, my card is a 7916; it uses mt7916_rom_patch.bin.