openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688
736 stars 338 forks source link

MT7612E still firmware crashes on x86_64 laptop #362

Open dangowrt opened 4 years ago

dangowrt commented 4 years ago

Previously wifi was usable in station mode with disable_ht=1 and disable_vht=1 set. Now, after recent commits (i'm on fd892bc033fbfad7b1a5bcba1432709a74327a1e), it crashes immediately after connecting once any traffic goes over the link:

[   97.705962] mt76x2e 0000:02:00.0: MCU message 31 (seq 6) timed out
[   97.756817] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[   97.756823] mt76x2e 0000:02:00.0: Build: 1
[   97.756827] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[   97.779403] mt76x2e 0000:02:00.0: Firmware running!
[   97.780210] ieee80211 phy0: Hardware restart was requested
[   98.799811] mt76x2e 0000:02:00.0: MCU message 2 (seq 8) timed out
[   99.733597] mt76x2e 0000:02:00.0: MCU message 31 (seq 9) timed out
[  100.101205] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  100.101210] mt76x2e 0000:02:00.0: Build: 1
[  100.101213] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  100.123795] mt76x2e 0000:02:00.0: Firmware running!
[  100.124641] ieee80211 phy0: Hardware restart was requested
[  101.147616] mt76x2e 0000:02:00.0: MCU message 2 (seq 11) timed out
[  101.515282] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  101.515288] mt76x2e 0000:02:00.0: Build: 1
[  101.515291] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  101.540124] mt76x2e 0000:02:00.0: Firmware running!
[  101.540776] ieee80211 phy0: Hardware restart was requested
[  102.588311] mt76x2e 0000:02:00.0: MCU message 2 (seq 13) timed out
[  103.065953] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  103.065957] mt76x2e 0000:02:00.0: Build: 1
[  103.065958] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  103.091848] mt76x2e 0000:02:00.0: Firmware running!
[  103.092525] ieee80211 phy0: Hardware restart was requested
[  104.109009] mt76x2e 0000:02:00.0: MCU message 2 (seq 15) timed out
[  104.373208] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  104.373212] mt76x2e 0000:02:00.0: Build: 1
[  104.373214] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  104.395841] mt76x2e 0000:02:00.0: Firmware running!
[  104.396516] ieee80211 phy0: Hardware restart was requested
[  104.396568] NOHZ: local_softirq_pending 08
[  105.416328] mt76x2e 0000:02:00.0: MCU message 31 (seq 1) timed out
[  105.783989] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  105.783993] mt76x2e 0000:02:00.0: Build: 1
[  105.783995] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  105.809537] mt76x2e 0000:02:00.0: Firmware running!
[  105.810229] ieee80211 phy0: Hardware restart was requested
[  106.830350] mt76x2e 0000:02:00.0: MCU message 2 (seq 3) timed out
[  107.097891] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  107.097895] mt76x2e 0000:02:00.0: Build: 1
[  107.097897] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  107.123330] mt76x2e 0000:02:00.0: Firmware running!
[  107.124044] ieee80211 phy0: Hardware restart was requested
[  107.124197] NOHZ: local_softirq_pending 08
[  108.137617] mt76x2e 0000:02:00.0: MCU message 2 (seq 5) timed out
[  108.401902] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  108.401907] mt76x2e 0000:02:00.0: Build: 1
[  108.401910] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  108.421163] kauditd_printk_skb: 8 callbacks suppressed
[  108.424477] mt76x2e 0000:02:00.0: Firmware running!
[  108.425472] ieee80211 phy0: Hardware restart was requested
[  108.425655] NOHZ: local_softirq_pending 08
[  109.444921] mt76x2e 0000:02:00.0: MCU message 2 (seq 7) timed out
[  109.805846] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  109.805850] mt76x2e 0000:02:00.0: Build: 1
[  109.805853] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  109.831296] mt76x2e 0000:02:00.0: Firmware running!
[  109.832028] ieee80211 phy0: Hardware restart was requested
[  110.858982] mt76x2e 0000:02:00.0: MCU message 2 (seq 9) timed out
[  111.330043] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  111.330046] mt76x2e 0000:02:00.0: Build: 1
[  111.330049] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  111.352522] mt76x2e 0000:02:00.0: Firmware running!
[  111.353199] ieee80211 phy0: Hardware restart was requested
[  112.379652] mt76x2e 0000:02:00.0: MCU message 30 (seq 10) timed out
[  112.647357] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  112.647361] mt76x2e 0000:02:00.0: Build: 1
[  112.647363] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  112.666496] mt76x2e 0000:02:00.0: Firmware running!
[  112.667164] ieee80211 phy0: Hardware restart was requested
[  112.667295] NOHZ: local_softirq_pending 08
[  113.686942] mt76x2e 0000:02:00.0: MCU message 2 (seq 12) timed out
[  114.054811] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  114.054815] mt76x2e 0000:02:00.0: Build: 1
[  114.054817] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  114.073825] mt76x2e 0000:02:00.0: Firmware running!
[  114.074497] ieee80211 phy0: Hardware restart was requested
[  115.100932] mt76x2e 0000:02:00.0: MCU message 2 (seq 14) timed out
[  115.361839] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  115.361842] mt76x2e 0000:02:00.0: Build: 1
[  115.361844] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  115.381139] mt76x2e 0000:02:00.0: Firmware running!
[  115.381811] ieee80211 phy0: Hardware restart was requested
[  115.381940] NOHZ: local_softirq_pending 08
[  116.408384] mt76x2e 0000:02:00.0: MCU message 2 (seq 1) timed out
[  116.776075] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  116.776079] mt76x2e 0000:02:00.0: Build: 1
[  116.776082] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  116.795146] mt76x2e 0000:02:00.0: Firmware running!
[  116.795834] ieee80211 phy0: Hardware restart was requested
[  117.822220] mt76x2e 0000:02:00.0: MCU message 2 (seq 3) timed out
[  118.293347] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  118.293351] mt76x2e 0000:02:00.0: Build: 1
[  118.293353] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  118.309199] mt76x2e 0000:02:00.0: Firmware running!
[  118.309873] ieee80211 phy0: Hardware restart was requested
[  119.316248] mt76x2e 0000:02:00.0: MCU message 30 (seq 4) timed out
[  119.694179] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  119.694182] mt76x2e 0000:02:00.0: Build: 1
[  119.694184] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  119.716536] mt76x2e 0000:02:00.0: Firmware running!
[  119.717225] ieee80211 phy0: Hardware restart was requested
[  120.730313] mt76x2e 0000:02:00.0: MCU message 2 (seq 6) timed out
[  121.098394] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  121.098398] mt76x2e 0000:02:00.0: Build: 1
[  121.098400] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  121.117183] mt76x2e 0000:02:00.0: Firmware running!
[  121.117850] ieee80211 phy0: Hardware restart was requested
[  122.144355] mt76x2e 0000:02:00.0: MCU message 2 (seq 8) timed out
[  122.512034] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  122.512036] mt76x2e 0000:02:00.0: Build: 1
[  122.512038] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  122.527800] mt76x2e 0000:02:00.0: Firmware running!
[  122.528358] ieee80211 phy0: Hardware restart was requested
[  123.531649] mt76x2e 0000:02:00.0: MCU message 2 (seq 10) timed out
[  123.796029] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  123.796033] mt76x2e 0000:02:00.0: Build: 1
[  123.796035] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  123.821484] mt76x2e 0000:02:00.0: Firmware running!
[  123.822178] ieee80211 phy0: Hardware restart was requested
[  123.822315] NOHZ: local_softirq_pending 08
[  124.838907] mt76x2e 0000:02:00.0: MCU message 2 (seq 12) timed out
[  125.103620] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  125.103623] mt76x2e 0000:02:00.0: Build: 1
[  125.103626] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  125.129071] mt76x2e 0000:02:00.0: Firmware running!
[  125.129755] ieee80211 phy0: Hardware restart was requested
[  125.129782] NOHZ: local_softirq_pending 08
[  126.146210] mt76x2e 0000:02:00.0: MCU message 31 (seq 13) timed out
[  126.410444] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  126.410447] mt76x2e 0000:02:00.0: Build: 1
[  126.410449] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  126.435874] mt76x2e 0000:02:00.0: Firmware running!
[  126.436572] ieee80211 phy0: Hardware restart was requested
[  126.436763] NOHZ: local_softirq_pending 08
[  127.453455] mt76x2e 0000:02:00.0: MCU message 2 (seq 1) timed out
[  127.821415] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  127.821419] mt76x2e 0000:02:00.0: Build: 1
[  127.821421] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  127.840389] mt76x2e 0000:02:00.0: Firmware running!
[  127.841058] ieee80211 phy0: Hardware restart was requested
[  128.867516] mt76x2e 0000:02:00.0: MCU message 2 (seq 3) timed out
[  129.132228] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  129.132231] mt76x2e 0000:02:00.0: Build: 1
[  129.132234] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  129.157755] mt76x2e 0000:02:00.0: Firmware running!
[  129.158427] ieee80211 phy0: Hardware restart was requested
[  129.158562] NOHZ: local_softirq_pending 08
[  130.174806] mt76x2e 0000:02:00.0: MCU message 2 (seq 5) timed out
[  130.439227] mt76x2e 0000:02:00.0: Firmware Version: 0.0.00
[  130.439231] mt76x2e 0000:02:00.0: Build: 1
[  130.439233] mt76x2e 0000:02:00.0: Build Time: 201507311614____
[  130.464669] mt76x2e 0000:02:00.0: Firmware running!
[  130.465375] ieee80211 phy0: Hardware restart was requested
[  130.465515] NOHZ: local_softirq_pending 08

Running:

Linux 5.5.4-arch1-1 #1 SMP PREEMPT Sat, 15 Feb 2020 00:36:29 +0000 x86_64 GNU/Linux

The same hardware works fine with the iwlwifi card which originally shipped with it and also ath9k-based mPCIe modules work flawlessly.

nbd168 commented 4 years ago

Could you please bisect this? Thanks

dangowrt commented 4 years ago

Turns out to be unrelated to recent commits but was just due to the fast connectivity I enjoyed in a specific environment. It reliably occurs when RX goes beyond ~35MBit/s, sitting right next to the AP, then even with disable_ht=1 and disable_vht=1 802.11a 54MBit/s data rates would kill it. It then ends up in a reset-loop (see logs above), even unloading mt76x2e kernel module doesn't help at this point, only reboot does. Probably Linux doesn't control PCI hardware reset but it's done by firmware.

nbd168 commented 4 years ago

The reset loop should be fixed now, please test

pfjgeraedts commented 4 years ago

I had similar experience as dangowrt. Now from OpenWrt master 0f8c806eb8 (that includes your fix) the reset loops seem to have turned into series of resets (that stop when data transfer is interrupted); I may have encountered a single reset loop, but I cannot confirm this. Some details: I have a remote setup with two MT7612E (one in an x86_32 and one in an x86_64), both running the above OpenWrt state (same 32bit image for both). The x86_64 is setup as AP and the x86_32 as STA. Only when disabling (V)HT modes data transfer is steady (at ~20 Mbit/s as reported by iperf). Higher modes result in a series of resets and (<=) Kbit/s rates. After a series of resets, disabling (V)HT also results in Kbit/s rates; rebooting does help. This behavior occurs in both directions. As a cross check I connected the STA with another AP (ath9k) at HT40 (Netgear WNDR3800 running OpenWrt 19.07.1; ath79). When STA is running iperf as server, reception rates is again at Kbit/s, but when running as client transmission rates is at ~190 Mbit/s. Promising! Please let me know how I can be of assistance to further debug this. Thanks.

pfjgeraedts commented 4 years ago

After updating to OpenWrt master fd0cc72d9c (that includes upto and including mt76 master 85c516081338) the series of resets seem to have turned into a single reset followed by a kernel panic.

Again I connected the STA with another AP (ath9k) at HT40 (Netgear WNDR3800 now running at OpenWrt 19.07.2; ath79). When STA is running iperf as client, transmission is still fine. When STA is running iperf as server, the following kernel messages result.

[ 75.652197] mt76x2e 0000:02:00.0: MCU message 31 (seq 11) timed out [ 75.653745] BUG: kernel NULL pointer dereference, address: 0000000c [ 75.655316] #PF: supervisor read access in kernel mode [ 75.656124] #PF: error_code(0x0000) - not-present page [ 75.656124] pdpt = 0000000000000000 pde = f000eef3f000eef3 [ 75.656124] Oops: 0000 [#1] SMP NOPTI [ 75.656124] CPU: 0 PID: 962 Comm: kworker/u4:6 Not tainted 5.4.45 #0 [ 75.656124] Hardware name: MICRO-STAR INTERNATIONAL CO., LTD U90/U100/U90/U100, BIOS 4.6.3 12/01/2009 [ 75.656124] Workqueue: phy0 mt76x02_wdt_work [mt76x02_lib] [ 75.656124] EIP: hrtimer_active+0xb/0x50 [ 75.656124] Code: 6f 00 29 c6 8b 43 20 19 d7 8b 14 24 8b 00 e8 1c 4c 6f 00 89 f0 89 fa 83 c4 04 5b 5e 5f 5d c3 55 89 e5 8d 74 26 00 90 8b 50 20 <8b> 4a 0c f6 c1 01 75 1d 80 78 24 00 75 27 39 42 10 74 22 39 4a 0c [ 75.656124] EAX: f50f2860 EBX: f50f2860 ECX: f6204038 EDX: 00000000 [ 75.656124] ESI: 00000000 EDI: f5f8c9c0 EBP: f5da9ee0 ESP: f5da9ee0 [ 75.656124] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286 [ 75.656124] CR0: 80050033 CR2: 0000000c CR3: 01bf2000 CR4: 000006f0 [ 75.656124] Call Trace: [ 75.656124] hrtimer_try_to_cancel+0x11/0x80 [ 75.656124] hrtimer_cancel+0xd/0x20 [ 75.656124] napi_disable+0x5c/0x70 [ 75.656124] mt76x02_wdt_work+0x136/0x520 [mt76x02_lib] [ 75.656124] ? mt76x2_register_device+0x2a0/0x2f0 [mt76x2e] [ 75.656124] process_one_work+0x1b2/0x2c0 [ 75.656124] worker_thread+0x37/0x400 [ 75.656124] kthread+0xe6/0x100 [ 75.656124] ? process_one_work+0x2c0/0x2c0 [ 75.656124] ? kthread_park+0x90/0x90 [ 75.656124] ret_from_fork+0x19/0x38 [ 75.656124] Modules linked in: pppoe ppp_async mt76x0u mt76x0e mt76x0_common iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt7915e mt76x2u mt76x2e mt76x2_common mt76x02_usb mt76x02_lib mt7663u mt7615e mt7615_common mt7603e mt7601u mt76_usb mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG via_velocity via_rhine slhc sis900 r8169 pcnet32 nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 ne2k_pci iptable_mangle iptable_filter ip_tables forcedeth e100 crc_ccitt compat 8390 8139too nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 3c59x e1000 vfat fat nls_utf8 nls_iso8859_1 nls_cp437 natsemi button_hotplug tg3 ptp realtek pps_core mii libphy [ 75.656124] CR2: 000000000000000c [ 75.656124] ---[ end trace bc2b9ea75c64c356 ]--- [ 75.656124] EIP: hrtimer_active+0xb/0x50 [ 75.656124] Code: 6f 00 29 c6 8b 43 20 19 d7 8b 14 24 8b 00 e8 1c 4c 6f 00 89 f0 89 fa 83 c4 04 5b 5e 5f 5d c3 55 89 e5 8d 74 26 00 90 8b 50 20 <8b> 4a 0c f6 c1 01 75 1d 80 78 24 00 75 27 39 42 10 74 22 39 4a 0c [ 75.656124] EAX: f50f2860 EBX: f50f2860 ECX: f6204038 EDX: 00000000 [ 75.656124] ESI: 00000000 EDI: f5f8c9c0 EBP: f5da9ee0 ESP: f5da9ee0 [ 75.656124] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286 [ 75.656124] CR0: 80050033 CR2: 0000000c CR3: 01bf2000 CR4: 000006f0 [ 75.656124] Kernel panic - not syncing: Fatal exception [ 75.656124] Kernel Offset: disabled [ 75.656124] Rebooting in 3 seconds.. [ 75.656124] ACPI MEMORY or I/O RESET_REG.

pfjgeraedts commented 3 years ago

After updating to OpenWrt master 4682d4d770 (that includes upto and including mt76 master b22977c2727d) the series of resets seem to have returned; I can even trigger it with (V)HT modes disabled now.

It is as if the series of resets get triggered when the incoming data rate exceeds a certain threshold; The outgoing data rate seems to be unaffected; i.e. I now see the same as what dangowrt already described on Feb, 24.

Also, the MT7612U seems unaffected, while both chips seem to run the same firmware; e.g. when I let the (x86_64) AP switch from the MT7612E to the MT7612U and let the (x86_32) STA run iperf as client transmission rate is up to ~310 Mbit/s!*

It could well be that all this is strictly PCI interface related, like dangowrt already hinted at.

Please let me know when there is anything I can do to help solve this.

*Although the earlier reported ~190 Mbit/s is a steady flow, this ~310 Mbit/s is not; my working assumption is that the accompanying incoming data rate exceeds the described threshold in this latter case.

dangowrt commented 3 years ago

I'm also still seeing this with recent mt76 and developed another theory which I will try to pursue once I have measurement tools for that around: mPCIe of laptops build for other WiFi modules might not provide enough power. according to stackexchange quoting the mPCIe spec: 1100 mA (2750 mA) normal (peak) current on 3.3V. That's quite a lot and certainly more than those centrino modules typically found in laptops would draw... Intuitively I'd also rather expect that to affect TX first of all, but I'm not into the electronic details here and just wanted to share that thought... It'd be interesting to try mt76 on x86 systems with supposedly more spec conforming mPCIe slots (APU board or such) and see if we can reproduce it there as well.

dangowrt commented 3 years ago

In recent versions, resets still happen but the card then in most cases subsequently recovers (at least a bit) and functions again after the reset, just with much lower transfer rates and more frequent repetitions of that reset...

pfjgeraedts commented 3 years ago

Indeed good to keep options open and yes, I would also not expect RX to draw most power. Three laptops and one mPCIe card would also lead me to suspect the card and not the laptops.

If I recall correctly ath10k chips/drivers early on had PCI related quirks. Maybe this case is something similar? I am not aware of what detail can be observed from the kernel side when it comes to the state of the PCI bus. Maybe enough to figure out what is going on?*

I am also curious if the only difference between the MT7612E and MT7612U is the host interface.

*My background in IC design unfortunately limits me here, as I have not yet been able to find any serious time to spend on low-level Linux stuff and gain sufficient experience.