openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688
742 stars 342 forks source link

mt7615e crash with 5Ghz #508

Closed ivanich closed 3 years ago

ivanich commented 3 years ago

Device GL.Inet mt-1300 openwrt master with commit https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=12b5f898f9222d35b80058fee5bd2ae935226070 Except not resolved issue with radio ordering as stated in the https://github.com/openwrt/mt76/issues/474 now it crashes when connected via 5Ghz

<1>[  268.046338] CPU 0 Unable to handle kernel paging request at virtual address 00000000, epc == 8e40289c, ra == 8e44a958
<4>[  268.057063] Oops[#1]:
<4>[  268.059341] CPU: 0 PID: 195 Comm: kworker/u9:0 Not tainted 5.4.105 #0
<4>[  268.070138] $ 0   : 00000000 00000001 8e685ec0 00000000
<4>[  268.080544] $ 8   : 00000000 00000e3e 8e718040 00000033
<4>[  268.090949] $16   : 8a190000 8e685ec0 00800000 00000002
<4>[  268.101355] $24   : 00000010 00000001                  
<4>[  268.111761] Hi    : 00000181
<4>[  268.117522] epc   : 8e40289c mt76_rx+0x124/0x320 [mt76]
<4>[  268.129753] Status: 11008403      KERNEL EXL IE 
<4>[  268.137904] BadVA : 00000000
<4>[  268.144838] Modules linked in: pppoe ppp_async iptable_nat xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD wireguard rndis_host pppox ppp_generic nf_nat nf_flow_table_hw nf_flow_table nf_conntrack_rtcache nf_conntrack mt7615e mt7615_common mt76_connac_lib mt76 mac80211 libchacha20poly1305 libblake2s ipt_REJECT cfg80211 cdc_ether xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG usbnet slhc poly1305_mips nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libblake2s_generic iptable_mangle iptable_filter ip_tables crc_ccitt compat chacha_mips nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_udp_tunnel udp_tunnel vfat fat ntfs nls_utf8 nls_iso8859_1 nls_cp437 usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_mtk xhci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 mbcache jbd2 usbcore nls_base usb_common mii crc32c_generic
<4>[  268.246088]         0045ce95 00004188 00000000 00000080 0000e3e0 00000002 00000000 8e686e50
<4>[  268.262726]         8a1c1000 00000000 00000000 8e686034 8e74f000 00001000 00000000 00000050
<4>[  268.279364]         ...
<4>[  268.284239] [<8e40289c>] mt76_rx+0x124/0x320 [mt76]
<4>[  268.298844] 
<4>[  268.300969] ---[ end trace a30a1c7ed9b93111 ]---

Can be reproduced when running speedtest android app on phone connected via 5Ghz network to mt-1300 router

ryderlee1110 commented 3 years ago

Does this only happen on 5G phy?

ivanich commented 3 years ago

Does this only happen on 5G phy?

Yes, I tried to do the same(speedtest from phone) with 2.4 and it didn't crash Just for reference my config, had to run wifi command in the rc.local in order to have working both radios after reboot

config wifi-device 'radio0'
        option type 'mac80211'
        option hwmode '11g'
        option htmode 'HT40'
        option noscan '1'
        option path '1e140000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
        option legacy_rates '0'
        option channel '6'
        option country 'US'

config wifi-iface 'default_radio0'
        option device 'radio0'
        option network 'lan'
        option mode 'ap'
        option ssid 'wifi2'
        option encryption 'psk2+ccmp'
        option key 'mypassword'

config wifi-device 'radio1'
        option type 'mac80211'
        option channel '149'
        option txpower '23'
        option hwmode '11a'
        option path '1e140000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0+1'
        option htmode 'VHT80'
        option country 'US'
        option noscan '1'

config wifi-iface 'default_radio1'
        option device 'radio1'
        option network 'lan'
        option mode 'ap'
        option ssid 'wifi5'
        option encryption 'psk2+ccmp'
        option key 'mypassword'
        option ieee80211r '1'
        option mobility_domain '4f11'
        option reassociation_deadline '20000'
        option ft_over_ds '1'
        option ft_psk_generate_local '1'
        option pmk_r1_push '1'
ryderlee1110 commented 3 years ago

Can you try to revert this? https://github.com/openwrt/mt76/commit/bfa8d5a6a9a1cd43d4ce4090f53fdc52c6247d66

ivanich commented 3 years ago

If you tell me how to do it. I mean mt76 is a package with Makefile and not a repo which i can easily revert commits

ivanich commented 3 years ago

Reverted https://github.com/openwrt/mt76/commit/bfa8d5a6a9a1cd43d4ce4090f53fdc52c6247d66 and unfortunatelly same crash.

Hold on, it makes no sense, Makefile rewriting mt76-2021-03-15-ca22cc22 directory content EDIT: Was able to build with reverted commit by forking mt76 repo, changed url and last commit hash in the makefile, but router still crashes

<1>[  147.273909] CPU 1 Unable to handle kernel paging request at virtual address 00000000, epc == 8e4f288c, ra == 8e58a968
<4>[  147.284570] Oops[#1]:
<4>[  147.286854] CPU: 1 PID: 3724 Comm: kworker/u9:3 Not tainted 5.4.105 #0
<4>[  147.293403] Workqueue: napi_workq napi_workfn
<4>[  147.297742] $ 0   : 00000000 00000001 8e595ec0 00000000
<4>[  147.302950] $ 4   : 8e595ec0 00000000 8e5ca600 00000000
<4>[  147.308161] $ 8   : 00000000 000004c1 80808080 fefefeff
<4>[  147.313370] $12   : 00000000 00000000 80704f1c 00000040
<4>[  147.323781] $20   : 00000050 00000044 00000011 01ce9502
<4>[  147.334195] $28   : 8e004000 8e005d18 8ddf2020 8e58a968
<4>[  147.342266] Lo    : 00000000
<4>[  147.350397] ra    : 8e58a968 mt7615_queue_rx_skb+0x94c/0xd0c [mt7615_common]
<4>[  147.361580] Cause : 4080000c (ExcCode 03)
<4>[  147.368434] PrId  : 0001992f (MIPS 1004Kc)
<4>[  147.456965] Process kworker/u9:3 (pid: 3724, threadinfo=9ece68d6, task=b16d155f, tls=00000000)
<4>[  147.473857]         0001ce95 00004188 00000000 00000080 00004c10 00000002 8074a2b8 80043dcc
<4>[  147.490505]         00000000 8e4f0110 8069ddf4 806ab148 8dfc0000 00001000 00000000 00000050
<4>[  147.507150]         ...
<4>[  147.512033] [<8e4f288c>] mt76_rx+0x124/0x320 [mt76]
<4>[  147.526633] 
<4>[  147.528512] ---[ end trace 3d137e75bb109149 ]---
MeIsReallyBa commented 3 years ago

This should be a problem with DBDC. MT7615+MT7615 does not have this problem.

ryderlee1110 commented 3 years ago

@ivanich My bad. Can you try this out ? https://github.com/ryderlee1110/mt76/commit/49cc3d6fc79587cb5e90f8e22a3e1d4e5486ffb8

MeIsReallyBa commented 3 years ago

My bad. Can you try this out ? ryderlee1110@49cc3d6

7621+7615.During the iperf3 test, the RX and TX speed of the client can hit 630M with udp . But the client download is only 550M and the upload is still 630M with tcp. Is it because the upload has some offload like rx decap offload while router tx doesn't have ?

ryderlee1110 commented 3 years ago

My bad. Can you try this out ? ryderlee1110@49cc3d6

7621+7615.During the iperf3 test, the RX and TX speed of the client can hit 630M with udp . But the client download is only 550M and the upload is still 630M with tcp. Is it because the upload has some offload like rx decap offload while router tx doesn't have ?

Not sure. How much you can get without Rx decap series? HW can support Tx enacp but I don't have time to add that.

MeIsReallyBa commented 3 years ago

My bad. Can you try this out ? ryderlee1110@49cc3d6

7621+7615.During the iperf3 test, the RX and TX speed of the client can hit 630M with udp . But the client download is only 550M and the upload is still 630M with tcp. Is it because the upload has some offload like rx decap offload while router tx doesn't have ?

Not sure. How much you can get without Rx decap series? HW can support Tx enacp but I don't have time to add that.

I dropped this https://github.com/MeIsReallyBa/mt76/commit/86ce24f182ec7ea2ee70648c0cf5e3478b89d173 and it seems that rx decap didn't make a big diffrence as the upload speed can still hit 800+Mb/S(TCP) in 160MHZ. But the client download speed is even slower in 160mhz than 80mhz and the wireless info shows that tx runs in 80MHZ while rx runs in 160MHZ

1

ryderlee1110 commented 3 years ago

I don't think rc_minstrel can support 160Mhz, and it's better to keep your test w/ w/o entire rx series in BW80.

MeIsReallyBa commented 3 years ago

I don't think rc_minstrel can support 160Mhz, and it's better to keep your test w/ w/o entire rx series in BW80.

No diffrence .

ryderlee1110 commented 3 years ago

@MeIsReallyBa there should be 3 patches. did you drop rx-amsdu as well?

MeIsReallyBa commented 3 years ago

@MeIsReallyBa there should be 3 patches. did you drop rx-amsdu as well?

revert 3 patches like this https://github.com/MeIsReallyBa/mt76/commits/master Now udp upload is still 600+ and tcp upload decreased to 450Mb.

ryderlee1110 commented 3 years ago

@MeIsReallyBa there should be 3 patches. did you drop rx-amsdu as well?

revert 3 patches like this https://github.com/MeIsReallyBa/mt76/commits/master Now udp upload is still 600+ and tcp upload decreased to 450Mb.

It looks like the series improves Rx performance. BTW. this is off-topic .. :)

ivanich commented 3 years ago

@ivanich My bad. Can you try this out ? ryderlee1110@49cc3d6

Tried, the same crash, or should I try it in combination of some reverts?

ryderlee1110 commented 3 years ago

@ivanich can you do git bisect to find out the offending commit?

ivanich commented 3 years ago

@ivanich can you do git bisect to find out the offending commit?

sorry i can't spend whole day doing this

ivanich commented 3 years ago

I checked out to commit https://github.com/ivanich/mt76/commit/760adce291007b44322d233caf9fc1dffce8015d and it fails, looks like one of the these commits causing it

mt76: mt7615: fix memory leak in mt7615_coredump_work
mt76: mt7615: add support for rx decapsulation offload
mt76: mt7615: add rx checksum offload support
mt76: mt7615: enable hw rx-amsdu de-aggregation

Also going to check this one https://github.com/ivanich/mt76/commit/1706fb6c48e8cea6feb9ab5c6d60454aeb5db267

ryderlee1110 commented 3 years ago

I checked out to commit ivanich@760adce and it fails, looks like one of the these commits causing it

mt76: mt7615: add support for rx decapsulation offload
mt76: mt7615: add rx checksum offload support
mt76: mt7615: enable hw rx-amsdu de-aggregation

Does dbdc work after reverting these commits?

ivanich commented 3 years ago

@ryderlee1110 It doesn't crash with those 3 commits reverted, as far as I can say dbdc works because I can see both radios up and speed is more or less fine, getting ~500Mbits with 5Ghz with speedtest(don't want to bother with iperf3 right now)

ryderlee1110 commented 3 years ago

@ivanich Thanks. Can you try to just keep mt76: mt7615: enable hw rx-amsdu de-aggregation?

ivanich commented 3 years ago

@ryderlee1110 Just tried with mt76: mt7615: enable hw rx-amsdu de-aggregation - crash

ryderlee1110 commented 3 years ago

@ivanich So, the latest mt76 works w/o this commit, but I don't see how this affects second phy at the moment. Does this crash happen on sta association or speedtest?

ivanich commented 3 years ago

@ryderlee1110 speedtest and only on upload test, download passes fine

ryderlee1110 commented 3 years ago

@ivanich have to add some logs in mt76_rx(), but unfortunately I don't have dbdc to debug :(

ivanich commented 3 years ago

Drop me a commit and I'll test it

porentak commented 3 years ago

As it looks it crashes in both directions. Just in download (AP->Client) takes more time. Tested with iperf3.

ryderlee1110 commented 3 years ago

As it looks it crashes in both directions. Just in download (AP->Client) takes more time. Tested with iperf3.

Does Tx work fine without that commit in you case? Can you test udp tx?

porentak commented 3 years ago

@ryderlee1110

I can confirm using ca22cc221ae7c87535d6f24b12c0280f6b560b31 as a base and manually revert only changes from c5aca6692c41546084e3f1c920fdde6d5f9b0f36 (mt76: mt7615: enable hw rx-amsdu de-aggregation) it does not crash anymore.

This is test with iperf3 (TCP). Will do UDP tests asap.

porentak commented 3 years ago

@ryderlee1110 Managed to make few tests. Here are the results:

mt76: ca22cc221ae7c87535d6f24b12c0280f6b560b31 TCP client->AP: CRASH AP->client: CRASH UDP client->AP: CRASH AP->client: no crash

mt76: ca22cc221ae7c87535d6f24b12c0280f6b560b31 with reverted commit c5aca6692c41546084e3f1c920fdde6d5f9b0f36 TCP client->AP: OK AP->client: OK UDP client->AP: OK AP->client: OK

porentak commented 3 years ago

@ryderlee1110, it is failing at *dev->rx_amsdu[q].tail = skb, function mt76_rx_release_burst.

dev->rx_amsdu[q].tail is NULL.

ryderlee1110 commented 3 years ago

@porentak can you add log to check if it goes into if (status->first_amsdu) in mt76_rx_release_burst(), and goes into if (status->amsdu). Dump first phy as well.

porentak commented 3 years ago

@ryderlee1110 I hope this is what you asked for. I took a little longer with this patch to crash. But it crashed at the end.

+++ b/mac80211.c if (status->first_amsdu) { + printk("l2\n"); dev->rx_amsdu[q].tail = &skb_shinfo(skb)->frag_list; + if (dev->rx_amsdu[q].tail == NULL) + printk("l3\n"); ` / trailing amsdu subframes / + if (dev->rx_amsdu[q].tail == NULL) + printk("l4\n"); + *dev->rx_amsdu[q].tail = skb; diff --git a/mt7615/mac.c b/mt7615/mac.c index 59aa8f8..ce494aa 100644 +++ b/mt7615/mac.c if (status->amsdu) { + printk("l1: 0x%x\n", phy);`

Logs are:

[ 94.704932] l1: 0x8e6c60b0 [ 94.710340] l2 [ 94.713667] l1: 0x8e6c60b0 [ 94.722628] l1: 0x8e6c60b0 [ 94.728037] l2 [ 94.731365] l1: 0x8e6c60b0 [ 94.743749] l1: 0x8e6c60b0 [ 94.749148] l2 .... [ 105.095552] l2 [ 105.098853] l1: 0x8e6c60b0 [ 105.104265] l1: 0x8e6c60b0 [ 105.109665] l2 [ 105.112995] l1: 0x8e6c60b0 [ 105.118399] l1: 0x8e6c60b0 [ 105.123799] l2 [ 105.128422] l1: 0x8e6c60b0 [ 105.133863] l4

ryderlee1110 commented 3 years ago

@porentak Thanks. can you modify patch a bit? you can remove printk("l3\n");

if (status->amsdu) {
    printk("l1: amsdu_info:%d\n", amsdu_info);
 ......

if (dev->rx_amsdu[q].tail == NULL)
    printk("l4: q%d, is_amsdu:%d, out-of-order%d\n", q, status->amsdu, status->seqno != dev->rx_amsdu[q].seqno);

Suspect it's out of order here.

ptpt52 commented 3 years ago
diff --git a/mt7615/mac.c b/mt7615/mac.c
index 59aa8f8..561d573 100644
--- a/mt7615/mac.c
+++ b/mt7615/mac.c
@@ -469,6 +469,7 @@ static int mt7615_mac_fill_rx(struct mt7615_dev *dev, struct sk_buff *skb)
                status->first_amsdu = amsdu_info == MT_RXD1_FIRST_AMSDU_FRAME;
                status->last_amsdu = amsdu_info == MT_RXD1_LAST_AMSDU_FRAME;
                if (!hdr_trans) {
+                       BUG_ON(skb_tail_pointer(skb) - skb->data < ieee80211_get_hdrlen_from_skb(skb));
                        memmove(skb->data + 2, skb->data,
                                ieee80211_get_hdrlen_from_skb(skb));
                        skb_pull(skb, 2);

@porentak could you try this patch?

ptpt52 commented 3 years ago

very curious if it is intend to remove the last 2 bytes in skb, why not just do it simple skb->len -= 2; there is no need to memove and pull

ryderlee1110 commented 3 years ago

@ivanich @porentak could you try this fixup out https://pastebin.com/GdCCiGtU ?

btw are you guys running both phy at the same time?

porentak commented 3 years ago

@porentak Thanks. can you modify patch a bit? you can remove printk("l3\n");

if (status->amsdu) {
  printk("l1: amsdu_info:%d\n", amsdu_info);
 ......

if (dev->rx_amsdu[q].tail == NULL)
  printk("l4: q%d, is_amsdu:%d, out-of-order%d\n", q, status->amsdu, status->seqno != dev->rx_amsdu[q].seqno);

Suspect it's out of order here.

[ 119.799863] l1: amsdu_info:3 [ 119.805630] l2 [ 119.808984] l1: amsdu_info:1 [ 119.816920] l1: amsdu_info:3 [ 119.822659] l2 [ 119.825995] l1: amsdu_info:1 [ 119.838827] l1: amsdu_info:3 [ 119.844570] l2 .... [ 175.855029] l1: amsdu_info:1 [ 175.860781] l1: amsdu_info:3 [ 175.866515] l2 [ 175.869835] l1: amsdu_info:1 [ 175.875575] l1: amsdu_info:3 [ 175.881313] l2 [ 175.884654] l1: amsdu_info:1 [ 175.891744] l1: amsdu_info:3 [ 175.897526] l2 [ 175.900862] l1: amsdu_info:1 [ 175.906602] l1: amsdu_info:3 [ 175.912335] l2 [ 175.915655] l1: amsdu_info:1 [ 175.921400] l1: amsdu_info:3 [ 175.927132] l2 [ 175.930458] l1: amsdu_info:1 [ 175.936195] l4: q0, is_amsdu:1, out-of-order0

porentak commented 3 years ago

@ivanich @porentak could you try this fixup out https://pastebin.com/GdCCiGtU ?

@ryderlee1110 with this patch I cannot reproduce crash.

On the other hand, I have a feeling throughput is lower than without (mt76: mt7615: enable hw rx-amsdu de-aggregation). But that is just a feeling at the moment. Need to make more tests.

ptpt52 commented 3 years ago

@ivanich @porentak could you try this fixup out https://pastebin.com/GdCCiGtU ?

@ryderlee1110 with this patch I cannot reproduce crash.

On the other hand, I have a feeling throughput is lower than without (mt76: mt7615: enable hw rx-amsdu de-aggregation). But that is just a feeling at the moment. Need to make more tests.

it is likely a wrong work around, it broken the rx-amsdu

ryderlee1110 commented 3 years ago

@ivanich @porentak could you try this fixup out https://pastebin.com/GdCCiGtU ?

@ryderlee1110 with this patch I cannot reproduce crash. On the other hand, I have a feeling throughput is lower than without (mt76: mt7615: enable hw rx-amsdu de-aggregation). But that is just a feeling at the moment. Need to make more tests.

it is likely a wrong work around, it broken the rx-amsdu

we cannot always guarantee first_amsdu arrives first since out of order reported by the fw, I think. maybe this is not right. will do more tests.

ryderlee1110 commented 3 years ago

@ivanich @porentak could you try this fixup out https://pastebin.com/GdCCiGtU ?

@ryderlee1110 with this patch I cannot reproduce crash.

On the other hand, I have a feeling throughput is lower than without (mt76: mt7615: enable hw rx-amsdu de-aggregation). But that is just a feeling at the moment. Need to make more tests.

Have to combine with the further patches.

ptpt52 commented 3 years ago
                        memmove(skb->data + 2, skb->data,
                                ieee80211_get_hdrlen_from_skb(skb));

it seems the memmove may crash the memory please try my BUG_ON patch

ryderlee1110 commented 3 years ago
                        memmove(skb->data + 2, skb->data,
                                ieee80211_get_hdrlen_from_skb(skb));

it seems the memmove may crash the memory please try my BUG_ON patch

Did you run into this?

ptpt52 commented 3 years ago
                        memmove(skb->data + 2, skb->data,
                                ieee80211_get_hdrlen_from_skb(skb));

it seems the memmove may crash the memory please try my BUG_ON patch

Did you run into this?

not really, for now I have no device at hand. but for my understanding, if that case happen, it crash

LorenzoBianconi commented 3 years ago

@porentak @ptpt52 can you please try this patch?

porentak commented 3 years ago

@porentak @ptpt52 can you please try this patch?

With this patch it does not crash, but throughput is 3-10Mb/s in client -> AP direction.

porentak commented 3 years ago

btw are you guys running both phy at the same time?

@ryderlee1110 yes. Issue is only if both phys are enabled.

porentak commented 3 years ago
diff --git a/mt7615/mac.c b/mt7615/mac.c
index 59aa8f8..561d573 100644
--- a/mt7615/mac.c
+++ b/mt7615/mac.c
@@ -469,6 +469,7 @@ static int mt7615_mac_fill_rx(struct mt7615_dev *dev, struct sk_buff *skb)
                status->first_amsdu = amsdu_info == MT_RXD1_FIRST_AMSDU_FRAME;
                status->last_amsdu = amsdu_info == MT_RXD1_LAST_AMSDU_FRAME;
                if (!hdr_trans) {
+                       BUG_ON(skb_tail_pointer(skb) - skb->data < ieee80211_get_hdrlen_from_skb(skb));
                        memmove(skb->data + 2, skb->data,
                                ieee80211_get_hdrlen_from_skb(skb));
                        skb_pull(skb, 2);

@porentak could you try this patch?

It crashes.