@nbd168 It's confirmed that this issue occurs with WED enabled.
OpenWrt version: https://github.com/openwrt/openwrt/commit/6577b550df89766d10375aacc87c56a4658c3776
mt76 version: https://github.com/openwrt/mt76/commit/b14c2351ddb8601c322576d84029e463d456caef
We probably have the same problem with the Asus TUF AX6000 router: https://github.com/openwrt/openwrt/issues/14019
For 2.4 GHz I use WPA2; for 5 GHz I use WPA2/3 and 802.11r/v/k.
@nbd168 @LorenzoBianconi
The issue is bound to occur with WED enabled if you follow these steps.
STA1 ---2g---
             \
MT7986 -- wired LAN ----- PC
             /
STA2 ---5g---
Assume the IP address of the PC is 192.168.1.100.

On the PC, start two iperf servers:

iperf -s -p 9000
iperf -s -p 9001

Then run one client on each station (STA1 on 2.4G, STA2 on 5G):

iperf -c 192.168.1.100 -p 9000 -P 10 -t 1000
iperf -c 192.168.1.100 -p 9001 -P 10 -t 1000
I may have run into this problem too. My device is a Redmi AX6000, but my clients are connected wirelessly, and the router crashes instantly every time and then goes into TFTP recovery mode (I don't have a recovery image installed, so I can't pull the pstore logs).
@nbd168 @LorenzoBianconi
This issue can be resolved with the following patch:
--- a/dma.c
+++ b/dma.c
@@ -902,7 +902,14 @@ int mt76_dma_rx_poll(struct napi_struct
 	rcu_read_lock();
 
 	do {
-		cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+		static spinlock_t wed_lock = __SPIN_LOCK_UNLOCKED(wed_lock);
+		if (mtk_wed_device_active(&dev->mmio.wed)) {
+			spin_lock_bh(&wed_lock);
+			cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+			spin_unlock_bh(&wed_lock);
+		} else {
+			cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
+		}
 		mt76_rx_poll_complete(dev, qid, napi);
 		done += cur;
 	} while (cur && done < budget);
The line

static spinlock_t wed_lock = __SPIN_LOCK_UNLOCKED(wed_lock);

can also be put inside the if statement.
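A minimal sketch of that variant, for clarity (same logic as the patch above; since the lock is static, moving the declaration into the block does not change its lifetime or the number of lock instances):

	do {
		if (mtk_wed_device_active(&dev->mmio.wed)) {
			/* static: a single shared lock, despite block scope */
			static spinlock_t wed_lock = __SPIN_LOCK_UNLOCKED(wed_lock);

			spin_lock_bh(&wed_lock);
			cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
			spin_unlock_bh(&wed_lock);
		} else {
			cur = mt76_dma_rx_process(dev, &dev->q_rx[qid], budget - done);
		}
		mt76_rx_poll_complete(dev, qid, napi);
		done += cur;
	} while (cur && done < budget);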
@zhaojh329, what part of the rx processing does the extra spinlock protect? Do you have any idea how exactly the issue triggers?
This issue is triggered when both bands operate the WED at the same time.
I also tried restricting the system to a single CPU core, and the issue did not occur.
@nbd168
CPU0 --> mt76_dma_rx_process --> op on q --> op on wed regs
CPU1 --> mt76_dma_rx_process --> op on q --> op on wed regs (the same wed regs as above)
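If both pollers really did hit the same WED ring registers, the classic lost-update interleaving would look like this (illustrative only; WED_RING_CPU_IDX is a made-up register name, not from mt76):

CPU0: idx = readl(regs + WED_RING_CPU_IDX);      /* reads N */
CPU1: idx = readl(regs + WED_RING_CPU_IDX);      /* also reads N */
CPU0: writel(idx + 1, regs + WED_RING_CPU_IDX);  /* writes N+1 */
CPU1: writel(idx + 1, regs + WED_RING_CPU_IDX);  /* writes N+1 again; one update is lost */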
I'm still a little confused, though. The issue only occurs with this data flow:
sta0 (2.4G) --> lan1
sta1 (5G) --> lan1
@ptpt52 mt76_dma_rx_process should only run simultaneously for different queues (unless there is a different bug). For different queues, the WED regs point to different queues as well. I don't see why concurrent access needs to be prevented there.

@zhaojh329 I agree with you that there is a concurrency problem here. The problem is that I don't see what part of the code in the mt76_dma_rx_process call triggers it, and how. I need to make sure that the change actually fixes the bug properly, instead of accidentally making it disappear only for it to resurface elsewhere later.
@nbd168
For different queues, the WED regs point to different queues as well
this doesn't seem to be true for the MT76_WED_Q_TXFREE queue:

q->wed_regs = q->wed->txfree_ring.reg_base;
            = MTK_WED_RING_RX(1);
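(For context, that assignment sits in the MT76_WED_Q_TXFREE case of mt76_dma_wed_setup() in dma.c; sketched from memory of the tree around this time, so the exact lines may differ between versions:)

case MT76_WED_Q_TXFREE:
	/* WED txfree queue needs the ring initialized before setup */
	q->flags = 0;
	mt76_dma_queue_reset(dev, q);
	mt76_dma_rx_fill(dev, q);
	q->flags = flags;

	ret = mtk_wed_device_txfree_ring_setup(wed, q->regs);
	if (!ret)
		q->wed_regs = wed->txfree_ring.reg_base;
	break;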
so maybe the issue is:
CPU0 --> mt76_dma_rx_process --> op on q (WED_Q_TXFREE) --> op on wed regs
CPU1 --> mt76_dma_rx_process --> op on q (WED_Q_TXFREE) --> op on wed regs
@ptpt52 there is only one queue assigned to MT76_WED_Q_TXFREE. On MT7915 it is MT_RXQ_MCU_WA; on newer chips it's MT_RXQ_MAIN_WA.
@nbd168 So two CPUs handle one queue concurrently?
I don't see how that's possible. RX processing is serialized through the NAPI poll function.
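(Context, hedged: the NAPI core guarantees that a given napi_struct's poll callback never runs concurrently with itself, and mt76 registers one NAPI context per RX queue, roughly as sketched below. Paraphrased from mt76_dma_init(); the exact registration call differs between kernel versions.)

/* one NAPI context per RX queue: mt76_dma_rx_poll() is serialized per
 * queue via NAPI_STATE_SCHED, but different queues may still poll
 * concurrently on different CPUs */
mt76_for_each_q_rx(dev, i) {
	netif_napi_add(&dev->napi_dev, &dev->napi[i], mt76_dma_rx_poll);
	napi_enable(&dev->napi[i]);
}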
@zhaojh329 I have a different idea. Could you please test if this patch helps? https://nbd.name/p/79643093
I tested your patch. It works fine.
Shouldn't allow_direct also be checked in the mt76_add_fragment() call?
static void
mt76_add_fragment(struct mt76_dev *dev, struct mt76_queue *q, void *data,
		  int len, bool more, u32 info)
{
	struct sk_buff *skb = q->rx_head;
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int nr_frags = shinfo->nr_frags;

	if (nr_frags < ARRAY_SIZE(shinfo->frags)) {
		struct page *page = virt_to_head_page(data);
		int offset = data - page_address(page) + q->buf_offset;

		skb_add_rx_frag(skb, nr_frags, page, offset, len, q->buf_size);
	} else {
		/* allow_direct is hardcoded to true here */
		mt76_put_page_pool_buf(data, true);
	}

	if (more)
		return;

	q->rx_head = NULL;
	if (nr_frags < ARRAY_SIZE(shinfo->frags))
		dev->drv->rx_skb(dev, q - dev->q_rx, skb, &info);
	else
		dev_kfree_skb(skb);
}
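A hedged sketch of what that check could look like, threading an allow_direct flag from the caller instead of hardcoding true (the parameter name and the caller-side test are assumptions for illustration, not necessarily the exact fix that was pushed):

static void
mt76_add_fragment(struct mt76_dev *dev, struct mt76_queue *q, void *data,
		  int len, bool more, u32 info, bool allow_direct)
{
	struct sk_buff *skb = q->rx_head;
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int nr_frags = shinfo->nr_frags;

	if (nr_frags < ARRAY_SIZE(shinfo->frags)) {
		struct page *page = virt_to_head_page(data);
		int offset = data - page_address(page) + q->buf_offset;

		skb_add_rx_frag(skb, nr_frags, page, offset, len, q->buf_size);
	} else {
		/* honor the caller's flag instead of a hardcoded true */
		mt76_put_page_pool_buf(data, allow_direct);
	}

	if (more)
		return;

	q->rx_head = NULL;
	if (nr_frags < ARRAY_SIZE(shinfo->frags))
		dev->drv->rx_skb(dev, q - dev->q_rx, skb, &info);
	else
		dev_kfree_skb(skb);
}

The caller in mt76_dma_rx_process() could then pass something like !mt76_queue_is_wed_rx(q) (assumed helper) as allow_direct, so pages belonging to WED RX queues are never recycled directly.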
@zhaojh329, thanks for testing, fix pushed. @ptpt52, you're right, thanks. The fix that I pushed includes your suggestion.