MT7986: crash with wed enabled

zhaojh329 commented 1 year ago

-----------------------------------------------------
 OpenWrt 23.05-SNAPSHOT, r23001+428-38c150612c
 -----------------------------------------------------
root@GL-MT6000:~# 
root@GL-MT6000:~# [  550.126969] Unable to handle kernel paging request at virtual address deacffc037317060
[  550.134896] Mem abort info:
[  550.137694]   ESR = 0x0000000096000004
[  550.141430]   EC = 0x25: DABT (current EL), IL = 32 bits
[  550.146723]   SET = 0, FnV = 0
[  550.149773]   EA = 0, S1PTW = 0
[  550.152901]   FSC = 0x04: level 0 translation fault
[  550.157771] Data abort info:
[  550.160638]   ISV = 0, ISS = 0x00000004
[  550.164457]   CM = 0, WnR = 0
[  550.167409] [deacffc037317060] address between user and kernel address ranges
[  550.174535] Internal error: Oops: 96000004 [#1] SMP
[  550.179401] Modules linked in: kmwan pppoe ppp_async option wireguard usb_wwan rndis_host qmi_wwan pppox ppp_generic nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet mt7915e mt76_connac_lib mt76 mac80211 libchacha20poly1305 ipt_REJECT huawei_cdc_ncm chacha_neon cfg80211 cdc_ncm cdc_ether xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_cgroup xt_addrtype xt_TCPMSS xt_REDIRECT xt_MASQUERADE xt_LOG usbserial usbnet slhc poly1305_neon nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_reject_ipv4 nf_log_syslog nf_flow_table nf_conntrack_netlink libcurve25519_generic libcrc32c libchacha iptable_nat iptable_mangle iptable_filter ipheth ip_tables crc_ccitt compat cdc_wdm cdc_acm br_netfilter
[  550.179553]  arptable_filter arpt_mangle arp_tables crypto_safexcel fuse sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact pwm_fan xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ipmac ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_NPT ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 nfsv4 nfsd nfs ifb ip6_udp_tunnel udp_tunnel rpcsec_gss_krb5 auth_rpcgss tun ntfs lockd sunrpc grace dns_resolver nls_utf8 nls_iso8859_1 nls_cp437 crypto_user algif_skcipher algif_rng algif_hash algif_aead af_alg sha1_generic seqiv md5 des_generic libdes cts authencesn authenc arc4 mtdoops uas usb_storage
[  550.266361]  gl_fan_driver leds_gpio xhci_plat_hcd xhci_pci xhci_mtk_hcd xhci_hcd uhci_hcd ohci_platform ohci_hcd fsl_mph_dr_of ehci_platform ehci_fsl ehci_hcd gpio_button_hotplug gl_sdk4_tertf gl_sdk4_black_white_list vfat fat exfat dm_mirror dm_region_hash dm_log dm_crypt dm_mod dax usbcore usb_common mii cbc encrypted_keys trusted tpm oid_registry asn1_encoder asn1_decoder gl_sdk4_hw_info
[  550.387754] CPU: 3 PID: 1504 Comm: napi/phy0-10 Not tainted 5.15.130 #0
[  550.394351] Hardware name: GL.iNet GL-MT6000 (DT)
[  550.399037] pstate: a0400005 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  550.405978] pc : page_pool_put_page+0x26c/0x34c
[  550.410503] lr : mt76_dma_rx_poll+0x270/0x4f0 [mt76]
[  550.415462] sp : ffffffc00916bcf0
[  550.418760] x29: ffffffc00916bcf0 x28: 0000000000000040 x27: 0000000000000000
[  550.425878] x26: 0000000000000000 x25: ffffff8006259dc0 x24: ffffff8005143cc0
[  550.432994] x23: ffffff8005142020 x22: 0000000000000001 x21: 0000000000000001
[  550.440109] x20: fffffffe001b4408 x19: fffffffe00348980 x18: 0000000000000014
[  550.447225] x17: 000000006421e6a5 x16: 00000000ab3797f2 x15: 000000005db49658
[  550.454340] x14: 0000000000000009 x13: 0000000000000000 x12: ffffff80045e5740
[  550.461456] x11: 0000000000000040 x10: ffffff80045e56d0 x9 : ffffff8005144a30
[  550.468572] x8 : ffffff80045e56f8 x7 : 0000000000000000 x6 : ffffff8004608900
[  550.475687] x5 : 0000000000000228 x4 : fffffffe003b1c87 x3 : 0000000000000001
[  550.482804] x2 : 00000000ffffffff x1 : ffffffc037317000 x0 : deacffc037317060
[  550.489919] Call trace:
[  550.492353]  page_pool_put_page+0x26c/0x34c
[  550.496524]  mt76_dma_rx_poll+0x270/0x4f0 [mt76]
[  550.501132]  __napi_poll+0x54/0x1b0
[  550.504606]  napi_threaded_poll+0x84/0xe4
[  550.508599]  kthread+0x11c/0x130
[  550.511819]  ret_from_fork+0x10/0x20
[  550.515385] Code: d2800035 91006021 d538d080 8b000020 (c85f7c03) 
[  550.521458] ---[ end trace e941b53b4f433d99 ]---
[  550.529925] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  550.536784] SMP: stopping secondary CPUs
[  550.540693] Kernel Offset: disabled
[  550.544165] CPU features: 0x0,00000000,20000802
[  550.548681] Memory Limit: none
[  550.555622] Rebooting in 3 seconds..

F0: 102B 0000
FA: 1040 0000
FA: 1040 0000 [0200]
F9: 103F 0000
F3: 1006 0033 [0200]
F3: 4001 00E0 [0200]
F3: 0000 0000
V0: 0000 0000 [0001]
00: 0000 0000
BP: 2400 0041 [0000]
G0: 1190 0000
EC: 0000 0000 [2000]
T0: 0000 0258 [010F]
Jump to BL
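
As a reading aid for the oops above, here is a minimal userspace sketch that decodes the `ESR` value and checks the faulting address. The struct and function names are made up for illustration; the bit positions follow the Arm ESR_ELx layout for data aborts, and the canonical-address test assumes a virtual address size of at most 48 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of an AArch64 ESR_ELx value, as printed under "Mem abort info". */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    unsigned il;   /* instruction length bit, bit 25 */
    unsigned iss;  /* instruction specific syndrome, bits [24:0] */
    unsigned wnr;  /* data aborts only: 1 = write, 0 = read (bit 6) */
    unsigned fsc;  /* data aborts only: fault status code, bits [5:0] */
};

struct esr_fields esr_decode(uint32_t esr)
{
    struct esr_fields f;

    f.ec  = (esr >> 26) & 0x3f;
    f.il  = (esr >> 25) & 0x1;
    f.iss = esr & 0x1ffffff;
    f.wnr = (f.iss >> 6) & 0x1;
    f.fsc = f.iss & 0x3f;
    return f;
}

/*
 * An address whose top 16 bits are neither all zeros nor all ones cannot be
 * a valid user or kernel virtual address for any VA size up to 48 bits.
 */
bool addr_is_noncanonical(uint64_t addr)
{
    uint64_t top = addr >> 48;

    return top != 0 && top != 0xffff;
}
```

Feeding it `0x96000004` gives EC 0x25, WnR 0 and FSC 0x04, matching the log: a read that faulted at translation level 0. The faulting address `deacffc037317060` (also the value in `x0`) is non-canonical, while `x1` (`ffffffc037317000`) looks like an ordinary kernel address, which is consistent with a corrupted pointer rather than a missing mapping.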

zhaojh329 commented 1 year ago

@nbd168 It's confirmed that this issue occurs with WED enabled.

OpenWrt version: https://github.com/openwrt/openwrt/commit/6577b550df89766d10375aacc87c56a4658c3776

mt76 version: https://github.com/openwrt/mt76/commit/b14c2351ddb8601c322576d84029e463d456caef

patrykk commented 12 months ago

We probably have the same problem here with the Asus TUF AX6000 router: https://github.com/openwrt/openwrt/issues/14019

For 2.4GHz I use WPA2; for 5GHz I use WPA2/3 and 802.11r/v/k.

zhaojh329 commented 11 months ago

@nbd168 @LorenzoBianconi The issue reliably occurs with WED enabled if you follow these steps.

STA1 ---2g---
             \
                MT7986 -- wired LAN ----- PC
             /
STA2 ---5g---

Assume the IP address of PC is 192.168.1.100

Run iperf3 on the PC:
```
iperf -s -p 9000
```
Run a second iperf3 instance on the PC:
```
iperf -s -p 9001
```

Run iperf3 on STA1:
```
iperf -c 192.168.1.100 -p 9000 -P 10 -t 1000
```

Run iperf3 on STA2:
```
iperf -c 192.168.1.100 -p 9001 -P 10 -t 1000
```

PussAzuki commented 11 months ago

I may have run into this problem too. My device is a Redmi AX6000. My clients are connected wirelessly, and the router crashes instantly every time, then goes into TFTP recovery mode. (I don't have a recovery image installed, so I can't get the pstore logs.)

zhaojh329 commented 11 months ago

@nbd168 @LorenzoBianconi

This issue can be resolved with this patch.

```diff
--- a/dma.c
+++ b/dma.c
@@ -902,7 +902,14 @@ int mt76_dma_rx_poll(struct napi_struct
        rcu_read_lock();

        do {
-               cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+               static spinlock_t wed_lock = __SPIN_LOCK_UNLOCKED(wed_lock);
+               if (mtk_wed_device_active(&dev->mmio.wed)) {
+                       spin_lock_bh(&wed_lock);
+                       cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+                       spin_unlock_bh(&wed_lock);
+               } else {
+                       cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+               }
                mt76_rx_poll_complete(dev, qid, napi);
                done += cur;
        } while (cur && done < budget);
```

patrykk commented 11 months ago

static spinlock_t wed_lock = __SPIN_LOCK_UNLOCKED(wed_lock); can be put inside the if statement.
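
A small userspace sketch of what that move would and would not change. In C, a `static` object declared inside a block still has a single instance with program lifetime; only its visibility is narrowed. The function and variable names below are hypothetical, with a plain counter standing in for `wed_lock`.

```c
/*
 * Stand-in for the rx poll loop body: the static object is declared inside
 * the if block, yet there is still exactly one instance, which lives for
 * the whole program and is shared by every call.
 */
int locked_path_calls(int wed_active)
{
    if (wed_active) {
        static int calls;  /* plays the role of wed_lock */

        return ++calls;
    }
    return 0;
}
```

Calling it with a non-zero argument twice returns 1 and then 2, so the state persists across calls. By the same rule, the lock in the patch is one lock shared by every queue that reaches this code, wherever the declaration sits, which is why it serializes rx processing across both bands.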

nbd168 commented 11 months ago

@zhaojh329, what part of the rx processing does the extra spinlock protect? Do you have any idea how exactly the issue triggers?

zhaojh329 commented 11 months ago

> @zhaojh329, what part of the rx processing does the extra spinlock protect? Do you have any idea how exactly the issue triggers?

The issue is triggered when both bands use WED at the same time.

zhaojh329 commented 11 months ago

I tried to configure the CPU to be single-core and this issue did not occur.

ptpt52 commented 11 months ago

@nbd168

CPU0--> mt76_dma_rx_process --> op on q --> op on wed regs
CPU1--> mt76_dma_rx_process --> op on q --> op on wed regs (same wed regs above)

zhaojh329 commented 11 months ago

I'm still a little confused. The issue only occurs with the following data flow:

sta0(2.4G) --> lan1
sta1(5G)   --> lan1

nbd168 commented 11 months ago

@ptpt52 mt76_dma_rx_process should only run simultaneously for different queues (unless there is a different bug). For different queues, the WED regs point to different queues as well. I don't see why concurrent access needs to be prevented there.

@zhaojh329 I agree with you that there is a concurrency problem here. The problem is that I don't see what part of the code in the mt76_dma_rx_process call triggers it, and how. I need to make sure that the change is actually fixing the bug properly instead of accidentally making it disappear, only for it to resurface elsewhere later.

ptpt52 commented 11 months ago

@nbd168

> For different queues, the WED regs point to different queues as well

This does not seem to be true for the MT76_WED_Q_TXFREE queue?

q->wed_regs = q->wed->txfree_ring.reg_base;

= MTK_WED_RING_RX(1);

So maybe the issue is:

CPU0--> mt76_dma_rx_process --> op on q(WED_Q_TXFREE) --> op on wed regs
CPU1--> mt76_dma_rx_process --> op on q(WED_Q_TXFREE) --> op on wed regs

nbd168 commented 11 months ago

@ptpt52 there is only one queue assigned to MT_WED_Q_TXFREE. On MT7915 it is MT_RXQ_MCU_WA, on newer chips it's MT_RXQ_MAIN_WA.

ptpt52 commented 11 months ago

@nbd168 So are two CPUs handling one queue concurrently?

nbd168 commented 11 months ago

I don't see how that's possible. rx processing is serialized through the NAPI poll function.
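
To make the serialization argument concrete, here is a simplified userspace model, not the kernel code. NAPI keeps a per-instance "scheduled" state bit that is claimed atomically, so only the context that wins the bit runs the poll function for that instance until polling completes. The `napi_model` type and its functions are hypothetical names for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for one NAPI instance (one per rx queue). */
struct napi_model {
    atomic_flag scheduled;  /* models the "scheduled" state bit */
};

void napi_model_init(struct napi_model *n)
{
    atomic_flag_clear(&n->scheduled);
}

/* Returns true if the caller won ownership and may run the poll function. */
bool napi_model_try_schedule(struct napi_model *n)
{
    return !atomic_flag_test_and_set(&n->scheduled);
}

/* Called when polling is done; the instance can be scheduled again. */
void napi_model_complete(struct napi_model *n)
{
    atomic_flag_clear(&n->scheduled);
}
```

A second attempt to schedule the same instance fails until the first poll completes, while a different instance (another queue) can be claimed at the same time. That matches the point made earlier in the thread: polls for different queues may run in parallel, but two CPUs should not be inside the poll for one queue.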

nbd168 commented 11 months ago

@zhaojh329 I have a different idea. Could you please test if this patch helps? https://nbd.name/p/79643093

zhaojh329 commented 11 months ago

> @zhaojh329 I have a different idea. Could you please test if this patch helps? https://nbd.name/p/79643093

I tested your patch. It works fine.

ptpt52 commented 11 months ago

Should allow_direct also be checked for the mt76_put_page_pool_buf() call in mt76_add_fragment()?

```c
static void
mt76_add_fragment(struct mt76_dev *dev, struct mt76_queue *q, void *data,
                  int len, bool more, u32 info)
{
        struct sk_buff *skb = q->rx_head;
        struct skb_shared_info *shinfo = skb_shinfo(skb);
        int nr_frags = shinfo->nr_frags;

        if (nr_frags < ARRAY_SIZE(shinfo->frags)) {
                struct page *page = virt_to_head_page(data);
                int offset = data - page_address(page) + q->buf_offset;

                skb_add_rx_frag(skb, nr_frags, page, offset, len, q->buf_size);
        } else {
                mt76_put_page_pool_buf(data, true);
        }

        if (more)
                return;

        q->rx_head = NULL;
        if (nr_frags < ARRAY_SIZE(shinfo->frags))
                dev->drv->rx_skb(dev, q - dev->q_rx, skb, &info);
        else
                dev_kfree_skb(skb);
}
```

nbd168 commented 11 months ago

@zhaojh329, thanks for testing, fix pushed. @ptpt52, you're right, thanks. The fix that I pushed includes your suggestion.
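
The pushed fix itself is not quoted in this thread, so the following is only a sketch of why the allow_direct argument matters, not a reconstruction of the patch. In the kernel page pool, allow_direct lets a returned page go into a lockless per-pool cache, which is only safe when the caller runs in that pool's own NAPI context; otherwise the page has to take a path that tolerates concurrent producers. The model assumes, as the crash in page_pool_put_page suggests, that with WED an rx poll can hand back a buffer belonging to a pool other than the one being polled. All type and function names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 4

/* Hypothetical, heavily simplified model of a page pool. */
struct pool_model {
    int owner_ctx;             /* the only context allowed to touch cache[] */
    void *cache[CACHE_SLOTS];  /* lockless: unprotected against other contexts */
    size_t cache_count;
    size_t ring_count;         /* stands in for the concurrency-safe ring */
};

/* Decide whether the lockless path may be used for this put. */
bool may_recycle_direct(const struct pool_model *pool, int caller_ctx)
{
    return pool->owner_ctx == caller_ctx;
}

/* Return a buffer to its pool; returns true if the lockless cache was used. */
bool pool_model_put(struct pool_model *pool, void *buf, bool allow_direct)
{
    if (allow_direct && pool->cache_count < CACHE_SLOTS) {
        pool->cache[pool->cache_count++] = buf;
        return true;
    }
    pool->ring_count++;  /* slower path, safe from any context */
    return false;
}
```

In this model, passing `true` unconditionally, as the `mt76_put_page_pool_buf(data, true)` call above does, would let a foreign context write to `cache[]` while the owner is using it. Deriving the flag from something like `may_recycle_direct()` keeps foreign puts on the safe path.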

openwrt / mt76

MT7986: crash with wed enabled #830