mt76 crash issue, and proposed resolution

Brain2000 commented 1 year ago

Greetings,

I've been chasing down an mt76 crash issue on a Belkin RT3200 for the past month. It happens once or twice a day when many devices are roaming.

The stack traces were not always consistent, but here was one that I pulled:

<6>[34253.568878] br-vlan12: port 3(wl1-ap0.4096) entered forwarding state
<4>[34253.603584] wl1-ap0.4102 selects TX queue 0, but real number of TX queues is 0
<1>[34253.603618] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
<7>[34253.841853] Call trace:
<7>[34253.844290]  eth_type_trans+0x44/0x19c
<7>[34253.848032]  ieee80211_rx_list+0x1cc/0xbcc [mac80211]
<7>[34253.853100]  mt76_rx_complete+0x220/0x35c [mt76]
<7>[34253.857718]  mt76_rx_poll_complete+0x2c8/0x4f0 [mt76]
<7>[34253.862767]  mt76_dma_rx_poll+0x2a4/0x4f0 [mt76]
<7>[34253.867383]  __napi_poll+0x54/0x1b0
<7>[34253.870866]  napi_threaded_poll+0x84/0xe4
<7>[34253.874869]  kthread+0x11c/0x130
<7>[34253.878091]  ret_from_fork+0x10/0x20
<0>[34253.881664] Code: 91003880 f9006440 f9419020 f9400083 (f9400000)
<4>[34253.887749] ---[ end trace 85019ca9a65c2328 ]---

I found that after adding a potentially missing spin_lock in a mac_sta_poll( ) function, and moving the sta_remove to the sta_pre_rcu_remove with a synchronize_rcu at the end, I was able to get a consistent stack trace, so, that seemed like good progress. Here is that stack trace:

<1>[  113.323540] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
<7>[  113.561515] Call trace:
<7>[  113.563953]  mt7615_mac_sta_remove+0x100/0x930 [mt7615_common]
<7>[  113.569783]  __mt76_sta_remove+0x68/0xe4 [mt76]
<7>[  113.574316]  mt76_sta_pre_rcu_remove+0x44/0x10c [mt76]
<7>[  113.579452]  ieee80211_find_sta_by_link_addrs+0x430/0x540 [mac80211]
<7>[  113.585829]  sta_info_destroy_addr_bss+0x38/0x70 [mac80211]
<7>[  113.591415]  ieee80211_color_change_finish+0x1278/0x1500 [mac80211]
<7>[  113.597694]  cfg80211_check_station_change+0x1384/0x4720 [cfg80211]
<7>[  113.603970]  genl_family_rcv_msg_doit+0xb4/0x110
<7>[  113.608587]  genl_rcv_msg+0xd0/0x1c0
<7>[  113.612156]  netlink_rcv_skb+0x58/0x120
<7>[  113.615986]  genl_rcv+0x34/0x50
<7>[  113.619121]  netlink_unicast+0x1f0/0x2ec
<7>[  113.623037]  netlink_sendmsg+0x19c/0x3d0
<7>[  113.626954]  ____sys_sendmsg+0x258/0x2a0
<7>[  113.630873]  ___sys_sendmsg+0x78/0xc0
<7>[  113.634528]  __sys_sendmsg+0x54/0xb0
<7>[  113.638096]  __arm64_sys_sendmsg+0x20/0x30
<7>[  113.642183]  invoke_syscall+0x44/0x110
<7>[  113.645927]  el0_svc_common.constprop.0+0x48/0xf0
<7>[  113.650623]  do_el0_svc+0x18/0x20
<7>[  113.653930]  el0_svc+0x14/0x50
<7>[  113.656979]  el0t_64_sync_handler+0xe0/0x110
<7>[  113.661240]  el0t_64_sync+0x158/0x15c
<0>[  113.664899] Code: eb02001f 540000e0 f94276a2 f9427aa1 (f9000441)
<4>[  113.670983] ---[ end trace 6e69b3da17b58560 ]---

After a bit more debugging, I believe I discovered the root cause. mac80211 is calling sta_remove( ) with an already removed 80211_sta pointer. This happens when a wifi device rapidly connects and disconnects over a very short period. Five minutes after the rapid connect/disconnect event (probably the inactivity timeout), mac80211 then calls the sta_remove( ) function a 2nd time with the same 80211_sta pointer. The drv_priv portion of this 80211_sta structure is zeroed out and that's where things go terribly wrong.

Such as a kernel oops due to the poll_list/rc_list lists having an invalid dereference to 0. Or depending on the timing, if it gets past that portion, then idx 0 ends up being targeted, which is needed for the beacon/frame management. This causes all kinds of weird timeout errors in the log followed by a spectacular crash.

Here is a snippet of logs that shows what is happening:

Mar 16 14:01:25 [ 6907.783912] mt7915e start sta_add=0x00000000f53bf1c3 idx=2
Mar 16 14:01:25 [ 6907.795647] mt7915e start sta_remove=0x00000000f53bf1c3 idx=2
Mar 16 14:01:26 [ 6907.902339] mt7915e start sta_add=0x00000000f53bf1c3 idx=2
Mar 16 14:01:26 [ 6907.913964] mt7915e start sta_remove=0x00000000f53bf1c3 idx=2
...
Mar 16 14:07:10 [ 7251.911771] mt7915e start sta_remove=0x00000000f53bf1c3 idx=0 <--- oops

I believe the correct fix here is to check for if (!wcid->idx && !wcid->sta) return; as the first line in the sta_remove( ) function

@Fail-Safe has also been compiling various commits along the way to assist in verifying that this is fixed.

I am offering up a PR to fix this, and I hope we can get it so it meets your level of standards.

nbd168 commented 1 year ago

Did you reproduce that sta_remove with idx=0 without your change that handles removal in sta_pre_rcu_remove? Because treating sta_pre_rcu_remove as the station removal point is wrong, because that makes it out of sync with mac80211's sta->uploaded state. Please run any further tests without that change.

Brain2000 commented 1 year ago

@nbd168 I split and reworked the patches leaving the sta_remove( ) and sta_pre_rcu_remove( ) back in their original functions.

I believe my original assessment of sta_remove being called twice was actually incorrect. Let me explain. If you call kfree( ) then kalloc( ) more than once for the same memory size, it may give you the same pointer back. I believe that's why I thought it was calling sta_remove( ) twice on the same sta, but it was actually a DIFFERENT sta.

Yesterday I found that mac80211 is actually calling just the sta_pre_rcu_remove( ) function when a wireless station flaps quickly, without calling sta_add( ) or sta_remove( ). I discovered a sta pointer value passed to sta_pre_rcu_remove( ) that was never used with sta_add or sta_remove, and that pointer contained the wcid->idx == 0.

Perhaps sta_info_destroy_part1 should not call the sta_pre_rcu_remove( ) function unless sta->uploaded is set? That way sta_pre_rcu_remove( ) and sta_remove( ) are always called together, instead of potentially with split logic.

In the meantime, I am testing the patches I sent you today (and will for a couple of days) to make sure they are solid.

Brain2000 commented 1 year ago

@nbd168 Oh, one more item that I didn't see in sta_remove( ), is removing the sta from the dev->sta_poll_list. I don't think this is causing any crashes, but if so, they might be very rare. Is there any possibility that an sta pointer could get added to the dev->sta_poll_list and then the sta_remove( ) called before the sta is processed from the list? That would potentially lead to either a poll on a different sta device (if a new subsequently allocated pointer was the same as the one freed), or a crash if the memory is no longer valid.

It was just something that popped up in my head today.

nbd168 commented 1 year ago

I don't think there is any race with the dev->sta_poll_list removal, and here's why: Any additions to dev->sta_poll_list can only be in a rcu-protected section after getting the pointer using rcu_dereference. When a station is deleted, it happens in this order:

sta_pre_rcu_remove
rcu_synchronize (called by mac80211)
station free After the sta_pre_rcu_remove, no new references to the wcid are taken, and after rcu_synchronize, all functions that could add a sta to the poll list are guaranteed to be done.

Regarding your other fixes:

extra rcu_read_lock in mt76_release_buffered_frames is unnecessary, because mac80211 passes in the sta pointer and it guarantees that the sta is valid at least until the end of the call. Same applies to other data structures belonging to the sta (e.g. the txq).
spin_lock_bh in mt76_rx_complete() is unnecessary, because that function is only called from callers that already have bh disabled.

I do agree that sta_pre_rcu_remove should only be called if sta->uploaded is set, we should fix that. I would suggest running another test with only these fixes applied:

The sta_poll_lock fix
mac80211 fix for sta_pre_rcu_remove
a check for !wcid->idx in __mt76_sta_remove with a WARN_ON so that we get a stack trace in case a remaining bug hits.

Brain2000 commented 1 year ago

Ok. I will remove those two patches and create a new one for mac80211. I will also modify the !wcid->idx to add a WARN_ON. I will try and have those to you by later tonight (EST).

Brain2000 commented 1 year ago

In case you are interested, I captured more logs showing the idx 0. I caught a station swapping between the 5Ghz and 2.4Ghz bands, with a lone pre_rcu_remove( ) called. This supports the idea of only calling sta_pre_rcu_remove if uploaded is set.

I highlighted what I think are the important lines near the bottom. wl0-ap0 is the 2.4Ghz and wl1-ap0 is the 5Ghz:

Mar 22 16:22:36 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: authenticated Mar 22 16:22:36 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: associated (aid 2) Mar 22 16:22:36 kernel: mt7622-wmac 18000000.wmac: sta_add(0x00000000bd356818):idx=2 ret=0 Mar 22 16:22:36 hostapd: wl0-ap0: CTRL-EVENT-EAP-STARTED 4c:d5:77:e2:2d:c1 Mar 22 16:22:39 hostapd: wl0-ap0: CTRL-EVENT-EAP-RETRANSMIT 4c:d5:77:e2:2d:c1 Mar 22 16:22:40 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:22:42 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:22:45 hostapd: wl0-ap0: CTRL-EVENT-EAP-RETRANSMIT 4c:d5:77:e2:2d:c1 Mar 22 16:22:45 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:22:53 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: did not acknowledge authentication response Mar 22 16:22:54 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: did not acknowledge authentication response Mar 22 16:22:55 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: authenticated Mar 22 16:22:55 hostapd: wl1-ap0: STA-OPMODE-N_SS-CHANGED 4c:d5:77:e2:2d:c1 2 Mar 22 16:22:55 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: associated (aid 6) Mar 22 16:22:55 hostapd: wl0-ap0: Prune association for 4c:d5:77:e2:2d:c1 <-- station switching from 2.4 to 5ghz Mar 22 16:22:55 kernel: mt7915e 0000:01:00.0: sta_add(0x00000000e762ae72):idx=6 ret=0 Mar 22 16:22:55 hostapd: wl1-ap0: CTRL-EVENT-EAP-STARTED 4c:d5:77:e2:2d:c1 Mar 22 16:22:55 hostapd: wl1-ap0: CTRL-EVENT-EAP-STARTED 4c:d5:77:e2:2d:c1 Mar 22 16:22:57 kernel: mt7622-wmac 18000000.wmac: sta_pre_rcu_remove(0x00000000bd356818):idx=2 Mar 22 16:22:57 kernel: mt7622-wmac 18000000.wmac: sta_remove(0x00000000bd356818):idx=2 Mar 22 16:22:58 hostapd: wl1-ap0: CTRL-EVENT-EAP-RETRANSMIT 4c:d5:77:e2:2d:c1 Mar 22 16:22:58 hostapd: wl1-ap0: CTRL-EVENT-EAP-SUCCESS2 4c:d5:77:e2:2d:c1 Mar 22 16:22:58 hostapd: wl1-ap0: EAPOL-4WAY-HS-COMPLETED 4c:d5:77:e2:2d:c1 Mar 22 16:22:58 hostapd: wl1-ap0: AP-STA-CONNECTED 4c:d5:77:e2:2d:c1 auth_alg=open Mar 22 16:22:58 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.1X: authenticated - EAP type: 25 (PEAP) Mar 22 16:23:00 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:02 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:05 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:17 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:18 hostapd: wl1-ap0: AP-STA-DISCONNECTED 4c:d5:77:e2:2d:c1 Mar 22 16:23:18 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: disassociated Mar 22 16:23:18 kernel: mt7915e 0000:01:00.0: sta_pre_rcu_remove(0x00000000e762ae72):idx=6 <-- station switching back to 2.4ghz Mar 22 16:23:18 kernel: mt7915e 0000:01:00.0: sta_remove(0x00000000e762ae72):idx=6 Mar 22 16:23:18 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: did not acknowledge authentication response Mar 22 16:23:19 hostapd: wl1-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) Mar 22 16:23:19 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: did not acknowledge authentication response Mar 22 16:23:20 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:20 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: did not acknowledge authentication response Mar 22 16:23:22 hostapd: Beacon request: 4c:d5:77:e2:2d:c1 is not connected Mar 22 16:23:25 hostapd: wl0-ap0: STA 4c:d5:77:e2:2d:c1 IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) Mar 22 16:23:25 kernel: mt7622-wmac 18000000.wmac: sta_pre_rcu_remove(0x000000009c54be23):idx=0 <-- no sta_add or sta_remove with this sta pointer

Brain2000 commented 1 year ago

@nbd168 Different crash this time. This did not follow an idx==0, however, there were about three idx==0, each about an hour apart, which would have made this crash hard to find.

Going to dig into ieee80211_calc_expected_tx_airtime( ) to see if I can find it.

<3>[209361.647990] mt7622-wmac 18000000.wmac: sta_pre_rcu_remove(0x0000000000d7314c)=idx8 <3>[209361.768048] mt7622-wmac 18000000.wmac: sta_remove(0x0000000000d7314c)=idx8 <3>[209368.813834] mt7622-wmac 18000000.wmac: sta_add(0x0000000015bd9310)=idx8 ret=0 <1>[209390.417998] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000b90 <1>[209390.426920] Mem abort info: <1>[209390.429840] ESR = 0x0000000096000006 <1>[209390.433733] EC = 0x25: DABT (current EL), IL = 32 bits <1>[209390.439186] SET = 0, FnV = 0 <1>[209390.442330] EA = 0, S1PTW = 0 <1>[209390.445550] FSC = 0x06: level 2 translation fault <1>[209390.450524] Data abort info: <1>[209390.453487] ISV = 0, ISS = 0x00000006 <1>[209390.457403] CM = 0, WnR = 0 <1>[209390.460464] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000043beb000 <1>[209390.466989] [0000000000000b90] pgd=0800000040acd003, p4d=0800000040acd003, pud=0800000040acd003, pmd=0000000000000000 <0>[209390.477703] Internal error: Oops: 96000006 [#1] SMP <7>[209390.552886] CPU: 0 PID: 818 Comm: mt76-tx phy0 Tainted: G S 5.15.98 #0 <7>[209390.560797] Hardware name: Linksys E8450 (UBI) (DT) <7>[209390.565752] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) <7>[209390.572795] pc : ieee80211_calc_expected_tx_airtime+0x8/0x1b0 [mac80211] <7>[209390.579625] lr : ieee80211_tx_dequeue+0xc10/0xcd0 [mac80211] <7>[209390.585387] sp : ffffffc008e9bc20 <7>[209390.588779] x29: ffffffc008e9bc20 x28: ffffff8001f10880 x27: ffffff80059d7e28 <7>[209390.595998] x26: ffffff80023c8378 x25: ffffff8006b9e0e8 x24: ffffff8006b9e0a0 <7>[209390.603216] x23: ffffff8006b9e000 x22: ffffffc008b6d010 x21: ffffff80059d7e00 <7>[209390.610435] x20: ffffff80059d7e00 x19: ffffffc008e9bce0 x18: 0000000000000000 <7>[209390.617653] x17: 0000000000000e60 x16: ffffffc008d8f000 x15: 0000000000000730 <7>[209390.624870] x14: 0000000000000000 x13: ffffffc008d8f730 x12: 0000000000000003 <7>[209390.632088] x11: 0000000000000002 x10: ffffff8006e2d910 x9 : 00000000ffffffff <7>[209390.639306] x8 : ffffff8001f10880 x7 : 0000000000000090 x6 : 00000001013f0c94 <7>[209390.646523] x5 : 0000000000000207 x4 : 0000000000000001 x3 : 000000000000006a <7>[209390.653740] x2 : ffffff800a2d0aa8 x1 : 0000000000000840 x0 : ffffff8001f10880 <7>[209390.660958] Call trace: <7>[209390.663483] ieee80211_calc_expected_tx_airtime+0x8/0x1b0 [mac80211] <7>[209390.669940] __mt76_tx_complete_skb+0x454/0x62c [mt76] <7>[209390.675166] mt76_tx_worker_run+0x34/0x60 [mt76] <7>[209390.679869] mt7615_tx_worker+0xc4/0x200 [mt7615_common] <7>[209390.685268] __mt76_worker_fn+0x94/0x1f0 [mt76] <7>[209390.689884] kthread+0x11c/0x130 <7>[209390.693195] ret_from_fork+0x10/0x20 <0>[209390.696854] Code: d65f03c0 d503201f aa0003e8 11009867 (f941a820)

nbd168 commented 1 year ago

Did you get a idx0 case with sta_remove or only with sta_pre_rcu_remove? Regarding the crash, I assume that the vif pointer is NULL for some reason.

Brain2000 commented 1 year ago

I missed one line:

<0>[209390.743377] Kernel panic - not syncing: Oops: Fatal exception in interrupt

I did not see an idx 0 within an hour of this happening.

I'm trying to make heads or tails how the stack went from mt76_tx_complete_skb( ) to mt76_tx_dequeue( ), which is the only place I can find that calls ieee80211_tx_dequeue( )

Brain2000 commented 1 year ago

@nbd I'm working with three of these units and it's possible that one had the previous flash where sta_remove and pre_rcu_remove combined. I just reflashed it to be sure. I'll let you know if it has any further crashes over the next few days.

Brain2000 commented 1 year ago

@nbd168 This happened this morning. It's the third time I've seen this crash in the last few months. I don't know why the device sometimes "unregisters".

void mt7615_unregister_device(struct mt7615_dev *dev)
{
    bool mcu_running;

    mcu_running = mt7615_wait_for_mcu_init(dev);

    mt7615_unregister_ext_phy(dev);
    mt76_unregister_device(&dev->mt76);
    if (mcu_running)
        mt7615_mcu_exit(dev);

    mt7615_tx_token_put(dev);
    mt7615_dma_cleanup(dev);
    tasklet_disable(&dev->irq_tasklet);

    mt76_free_device(&dev->mt76);
}

Do you think the tasklet_disable( ) should be moved up to the top?

<1>[131376.277794] Unable to handle kernel paging request at virtual address 000000641a432109
<1>[131376.285829] Mem abort info:
<1>[131376.288785]   ESR = 0x0000000096000005
<1>[131376.292632]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[131376.298051]   SET = 0, FnV = 0
<1>[131376.301205]   EA = 0, S1PTW = 0
<1>[131376.304433]   FSC = 0x05: level 1 translation fault
<1>[131376.309425] Data abort info:
<1>[131376.312400]   ISV = 0, ISS = 0x00000005
<1>[131376.316317]   CM = 0, WnR = 0
<1>[131376.319385] user pgtable: 4k pages, 39-bit VAs, pgdp=00000000439de000
<1>[131376.325922] [000000641a432109] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
<0>[131376.334744] Internal error: Oops: 96000005 [#1] SMP
<7>[131376.409935] CPU: 0 PID: 808 Comm: napi/phy0-5 Tainted: G S      W 5.15.98 #0
<7>[131376.417760] Hardware name: Linksys E8450 (UBI) (DT)
<7>[131376.422716] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
<7>[131376.429759] pc : ieee80211_sta_ps_transition+0xa08/0x1270 [mac80211]
<7>[131376.436242] lr : ieee80211_sta_ps_transition+0x9d4/0x1270 [mac80211]
<7>[131376.442700] sp : ffffffc008e1ba90
<7>[131376.446092] x29: ffffffc008e1ba90 x28: ffffff80024f3828 x27: 0000000000000001
<7>[131376.453313] x26: ffffff8006cc2200 x25: ffffff800248e852 x24: 0000000000000000
<7>[131376.460531] x23: ffffff800248eec0 x22: ffffffc008e1bc30 x21: ffffffc008e1bb48
<7>[131376.467749] x20: ffffff8007b42280 x19: ffffff8006cc2200 x18: ffffffffffffabd0
<7>[131376.474967] x17: 000000000000c600 x16: ffffff80028f8000 x15: ffffffffffffac00
<7>[131376.482184] x14: 937b1fabd005169f x13: 0000000000003fd0 x12: ffffff800248e85e
<7>[131376.489402] x11: 0000000000000000 x10: ffffff8005e4a735 x9 : ffffff8006cc2250
<7>[131376.496619] x8 : ffffff800248e852 x7 : 000000000000002a x6 : ffffff8005e4a708
<7>[131376.503837] x5 : 000000000000b0a5 x4 : 000000000000002a x3 : 270000641a432101
<7>[131376.511054] x2 : 0000000000000000 x1 : ffffffc01742a000 x0 : 270000a403008101
<7>[131376.518273] Call trace:
<7>[131376.520797]  ieee80211_sta_ps_transition+0xa08/0x1270 [mac80211]
<7>[131376.526908]  ieee80211_rx_list+0x1cc/0xbcc [mac80211]
<7>[131376.532062]  mt76_rx_complete+0x1c8/0x27c [mt76]
<7>[131376.536767]  mt76_rx_poll_complete+0x2c8/0x4f0 [mt76]
<7>[131376.541903]  mt76_dma_rx_poll+0x2a4/0x4f0 [mt76]
<7>[131376.546605]  mt7615_unregister_device+0x404/0x560 [mt7615e]
<7>[131376.552261]  __napi_poll+0x54/0x1b0
<7>[131376.555833]  napi_threaded_poll+0x84/0xe4
<7>[131376.559923]  kthread+0x11c/0x130
<7>[131376.563233]  ret_from_fork+0x10/0x20
<0>[131376.566893] Code: d538d081 b9407264 f9427c00 8b010003 (f9400462)
<4>[131376.573065] ---[ end trace 1de9a020af2bcc73 ]---

Nothing happened before this event (i.e. no idx == 0 or any other warnings)

nbd168 commented 1 year ago

One thing you need to understand about these traces is that symbols are only resolved for non-static functions. Because of that, you can't rely on the function name alone. Offsets are provided from the start of the nearest symbol, but if you resolve the actual line number, you could land in a static function somewhere else. I usually use aarch64-openwrt-linux-gdb from the toolchain bin directory on the relevant kernel module and do this:

(gdb) l *ieee80211_sta_ps_transition+0xa08
0x28b58 is in ieee80211_rx_8023 (./include/linux/netdevice.h:2727).
warning: Source file is more recent than executable.
2722    static inline void dev_sw_netstats_rx_add(struct net_device *dev, unsigned int len)
2723    {
2724            struct pcpu_sw_netstats *tstats = this_cpu_ptr(dev->tstats);
2725
2726            u64_stats_update_begin(&tstats->syncp);
2727            tstats->rx_bytes += len;
2728            tstats->rx_packets++;
2729            u64_stats_update_end(&tstats->syncp);
2730    }

...

(gdb) l *mt7615_unregister_device+0x404
0xd64 is in mt7615_poll_rx (/Users/nbd/builds/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/mt76-2023-03-01-c32d6d84/mt7615/dma.c:98).
93              if (!mt76_connac_pm_ref(&dev->mphy, &dev->pm)) {
94                      napi_complete(napi);
95                      queue_work(dev->mt76.wq, &dev->pm.wake_work);
96                      return 0;
97              }
98              done = mt76_dma_rx_poll(napi, budget);
99              mt76_connac_pm_unref(&dev->mphy, &dev->pm);
100
101             return done;
102     }

So in this case it looks like it might be an instance of sta->sdata being NULL right in the middle of the rx path. I need to do some digging to figure out if this is something that the driver needs to deal with (e.g. rcu related issue), or if it's a mac80211 bug. Without having taken a closer look I'm slightly leaning towards the latter. Either way, this trace was very helpful. I will let you know when I've figured something out.

Brain2000 commented 1 year ago

aarch64-openwrt-linux-gdb! I think that might be the piece I've been looking for. Thank you! I do recall these crashes from a while ago, which is why I originally added the spinlock around all the ieee80211_tx/rx( ) functions. As a test, I'm going to add that back in to see if they subside.

nbd168 commented 1 year ago

Please add this patch instead of the extra spinlocks: https://nbd.name/p/4ece1b4f

Brain2000 commented 1 year ago

@nbd168 Np, I'll do that right now. Noone is in the office over the weekend so they won't crash, but I'll report back on Monday.

Brain2000 commented 1 year ago

@nbd168 Wait a minute, if I add that patch, the sta_remove( ) will return immediately because wcid->sta is now 0. Should I take out the sta_remove( ) !wcid->sta check?

EDIT I did remove the extra check in sta_remove so it will not skip sta_remove( ). I'll report back Monday.

Brain2000 commented 1 year ago

@nbd168 It crashed within an hour this morning. (I also switched the compile optimization flag from -O2 to -Os in hopes that we get an easier stack trace to look at)

<1>[220470.265394] Unable to handle kernel paging request at virtual address 00333a3231203732 <1>[220470.273452] Mem abort info: <1>[220470.276416] ESR = 0x0000000096000004 <1>[220470.280267] EC = 0x25: DABT (current EL), IL = 32 bits <1>[220470.285674] SET = 0, FnV = 0 <1>[220470.288854] EA = 0, S1PTW = 0 <1>[220470.292112] FSC = 0x04: level 0 translation fault <1>[220470.297130] Data abort info: <1>[220470.300128] ISV = 0, ISS = 0x00000004 <1>[220470.304057] CM = 0, WnR = 0 <1>[220470.307149] [00333a3231203732] address between user and kernel address ranges <0>[220470.314406] Internal error: Oops: 96000004 [#1] SMP <7>[220470.390033] CPU: 0 PID: 831 Comm: napi/phy1-7 Tainted: G S 5.15.98 #0 <7>[220470.397877] Hardware name: Linksys E8450 (UBI) (DT) <7>[220470.402845] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) <7>[220470.409910] pc : __dev_queue_xmit+0x134/0x4f8 <7>[220470.414378] lr : __dev_queue_xmit+0x12c/0x4f8 <7>[220470.418835] sp : ffffffc008d0ba70 <7>[220470.422237] x29: ffffffc008d0ba70 x28: 0000000000000000 x27: ffffff8000bc8d76 <7>[220470.429484] x26: ffffffc0008e6060 x25: ffffff80026c0128 x24: ffffff8003fad6c0 <7>[220470.436732] x23: 0000000000000000 x22: ffffff8006ffc000 x21: ffffff8003b2f800 <7>[220470.443979] x20: 32333a3231203732 x19: ffffff8005beff00 x18: 0000000000000000 <7>[220470.451226] x17: 7369207365756575 x16: 7120585420666f20 x15: 7265626d756e206c <7>[220470.465719] x11: 000000000000077a x10: 00000000ffffffea x9 : ffffffc0089822c8 <7>[220470.472965] x8 : ffffffc008d0b790 x7 : ffffffc008d0b798 x6 : 00000000ffffefff <7>[220470.480211] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 <7>[220470.487454] x2 : 0000000000000140 x1 : 0000000000000000 x0 : ffffff8003b2f800 <7>[220470.494701] Call trace: <7>[220470.497235] __dev_queue_xmit+0x134/0x4f8 <7>[220470.501345] dev_queue_xmit+0x10/0x18 <7>[220470.505109] ieee80211_sta_ps_transition+0x448/0x5c8 [mac80211] <7>[220470.511235] ieee80211_rx_list+0x43c/0x854 [mac80211] <7>[220470.516471] mt76_rx_complete+0x15c/0x1f4 [mt76] <7>[220470.521212] mt76_rx_poll_complete+0x70/0x424 [mt76] <7>[220470.526297] mt76_dma_rx_poll+0x2f0/0x458 [mt76] <7>[220470.531031] __napi_poll.constprop.0+0x6c/0x154 <7>[220470.535665] napi_threaded_poll+0x7c/0xac <7>[220470.539777] kthread+0xe0/0xf0 <7>[220470.542933] ret_from_fork+0x10/0x20 <0>[220470.546616] Code: aa1603e0 97ffff85 f9400414 aa0003f5 (f9400280) <4>[220470.552802] ---[ end trace d159cd0974fa10c0 ]---

nbd168 commented 1 year ago

Could you please resolve the line numbers of ieee80211_sta_ps_transition+0x448 and ieee80211_rx_list+0x43c for me?

Brain2000 commented 1 year ago

@nbd168 Let me know if I did this properly:

(gdb) add-symbol-file ~/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o
add symbol table from file "/home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o"
(y or n) y
Reading symbols from /home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/mac80211.o...

(gdb) l *ieee80211_sta_ps_transition+0x448
0x23a20 is in ieee80211_rx_8023 (/home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/rx.c:4771).
4766                 */
4767                xmit_skb->priority += 256;
4768                xmit_skb->protocol = htons(ETH_P_802_3);
4769                skb_reset_network_header(xmit_skb);
4770                skb_reset_mac_header(xmit_skb);
4771                dev_queue_xmit(xmit_skb);
4772            }
4773    
4774            if (!skb)
4775                return;

(gdb) l *ieee80211_rx_list+0x43c
0x26d0c is in ieee80211_rx_list (/home/osboxes/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/backports-6.1-rc8/net/mac80211/rx.c:5048).
5043        fast_rx = rcu_dereference(rx.sta->fast_rx);
5044        if (!fast_rx)
5045            goto drop;
5046    
5047        ieee80211_rx_8023(&rx, fast_rx, skb->len);
5048        return;
5049    
5050    drop:
5051        dev_kfree_skb(skb);
5052    }

Brain2000 commented 1 year ago

@nbd168 I may have spotted a possibility: On or around net/mac80211/sta_info.c line 150:

    for (i = 0; i < ARRAY_SIZE(sta->sta.txq); i++) {
        struct txq_info *txqi;

        if (!sta->sta.txq[i])
            continue;

        txqi = to_txq_info(sta->sta.txq[i]);

        ieee80211_txq_purge(local, txqi);
    }

    for (ac = 0; ac < IEEE80211_NUM_ACS; ac++) {
        local->total_ps_buffered -= skb_queue_len(&sta->ps_tx_buf[ac]);
        ieee80211_purge_tx_queue(&local->hw, &sta->ps_tx_buf[ac]);
        ieee80211_purge_tx_queue(&local->hw, &sta->tx_filtered[ac]);
    }

    if (ieee80211_vif_is_mesh(&sdata->vif))
        mesh_sta_cleanup(sta);

    cancel_work_sync(&sta->drv_deliver_wk);   <---- this cancels calls to release_buffered_frames

There's a lot of txq purging going on before release_buffered_frames( ) is guaranteed to be stopped. mt76_release_buffered_frames makes calls to ieee80211_tx_dequeue( ) Possible race condition between the returned skb from ieee80211_tx_dequeue and it also being free'd?

If this is an issue, looks like a potential fix would be to move the cancel_work_sync( ) up to before the tx queues are purged...

mrkiko commented 1 year ago

mark. Congrats for the interesting work.

nbd168 commented 1 year ago

@Brain2000, the critical parts of ieee80211_txq_purge and ieee80211_tx_dequeue are protected from racing against each other via fq->lock.

Brain2000 commented 1 year ago

@nbd168 Ah, I was confused how the skb was being protected after returned from ieee80211_tx_dequeue( ). I see now that it is removed from a list and returned, so any txq_purge would skip over it because it's no longer in the list.

I'm still trying to figure out this crash. I have two identical wifi units, one compiled with the extra two spinlocks, and one without. The one with the spinlock does not crash. The one without crashes every few hours. Any luck finding what the true cause of this is? The ieee80211 documentation states "do not call ieee80211_tx( ) or ieee80211_rx_list( ) concurrently", but you seem very sure that this is not an issue.

nbd168 commented 1 year ago

Good point regarding concurrency of ieee80211_tx_status and ieee80211_rx_list. Though I can't point at any specific racy codepath, I do agree that we should protect against concurrent status/rx calls. However, I don't think we need a separate spinlock for this. Please try this patch: https://nbd.name/p/090e6ccd

Brain2000 commented 1 year ago

@nbd168 That patch looks good!

I found one more crash underneath all of this in ieee80211_rx_list( ) as it calls the xxxx_8023( ) routines between mac80211 and ethernet. I believe that when a STA disconnects but has a multicast packet in the queue, possibly something in the SKB becomes wiped out before drv_sta_pre_rcu_remove( ) is called. Perhaps at the top of __sta_info_destroy_part1( ), or deeper in.

I'm adding various WARN( ) sanity checks in strategic places within net/mac80211/rx.c and net/ethernet/eth.c until I can pinpoint what is being null'd, then perhaps this can be corrected.

Note the line below where it thinks the departing station has a TX queue size of 0... AFTER the station has left. Followed by an immediate crash. This is the first time I've seen this entry.

br-vlan15: port 7(wl0-ap0.4097) entered disabled state
device wl0-ap0.4097 left promiscuous mode
wl0-ap0.4097 selects TX queue 0, but real number of TX queues is 0   <-- I thought this entry was interesting
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : eth_type_trans+0x44/0x190
lr : ieee80211_sta_ps_transition+0x12a0/0x3090 [mac80211]

Brain2000 commented 1 year ago

@nbd168 Final item:

You mentioned this may also be needed to avoid a race condition:

diff --git a/mt7615/mac.c b/mt7615/mac.c
index da1d17b7..a73a11d2 100644
--- a/mt7615/mac.c
+++ b/mt7615/mac.c
@@ -907,7 +907,6 @@ void mt7615_mac_sta_poll(struct mt7615_dev *dev)
        INIT_LIST_HEAD(&sta_poll_list);
        spin_lock_bh(&dev->sta_poll_lock);
        list_splice_init(&dev->sta_poll_list, &sta_poll_list);
-       spin_unlock_bh(&dev->sta_poll_lock);

        while (!list_empty(&sta_poll_list)) {
                bool clear = false;
@@ -915,6 +914,7 @@ void mt7615_mac_sta_poll(struct mt7615_dev *dev)
                msta = list_first_entry(&sta_poll_list, struct mt7615_sta,
                                        poll_list);
                list_del_init(&msta->poll_list);
+               spin_unlock_bh(&dev->sta_poll_lock);

                addr = mt7615_mac_wtbl_addr(dev, msta->wcid.idx) + 19 * 4;

@@ -937,8 +937,10 @@ void mt7615_mac_sta_poll(struct mt7615_dev *dev)
                        memset(msta->airtime_ac, 0, sizeof(msta->airtime_ac));
                }

-               if (!msta->wcid.sta)
+               if (!msta->wcid.sta) {
+                       spin_lock_bh(&dev->sta_poll_lock);
                        continue;
+               }

                sta = container_of((void *)msta, struct ieee80211_sta,
                                   drv_priv);
@@ -953,7 +955,10 @@ void mt7615_mac_sta_poll(struct mt7615_dev *dev)
                        ieee80211_sta_register_airtime(sta, tid, tx_cur,
                                                       rx_cur);
                }
+
+               spin_lock_bh(&dev->sta_poll_lock);
        }
+       spin_unlock_bh(&dev->sta_poll_lock);
 }
 EXPORT_SYMBOL_GPL(mt7615_mac_sta_poll);

badsmoke commented 1 year ago

has anything happened here yet?

I also have a rt3200 and problems with the vlan, maybe this could solve the problem

rany2 commented 6 months ago

I've started getting this crashlog https://github.com/openwrt/mt76/issues/763#issuecomment-1481548061 after I enabled FT over DS. It appears FT over the Air doesn't have this issue.

Edit: this is with latest mt76/mac80211 from snapshot.

openwrt / mt76

mt76 crash issue, and proposed resolution #763