MT7981 ap_vlan crash on small unicast TX packets (IP packets of 482 bytes or less)

zekica commented 4 months ago

I already made a comment on #866, but I don't think this is related.

I'm experiencing crashes with multi-psk on MT7981.

When connecting to the network using the main WPA2 PSK, everything is stable.
When connecting using a secondary WPA2 PSK, everything is stable as long as the station doesn't move to another ap_vlan.
When connecting using any of the secondary PSKs, if a station moves to a secondary ap_vlan, everything works until a station starts sending traffic to the AP.
Then the chip hangs, and a combination of 00005aed and 000026ed timeouts occurs.

When using an OpenWrt snapshot without any patches to the mt76 driver, the chip completely restarts on its own and the wifi network reappears in a couple of seconds. All clients, including the ones connected via the main PSK, get disconnected.

Then I tried the rany2/openwrt@18cc739 patch: the 0x5a messages stop appearing, but the chip still hangs, and the driver shows a 0x26 timeout and restarts.

I then tried to compile the rany2/openwrt fork and since it applies a bunch of patches, when the chip hangs, it manages to recover without disconnecting clients, but shows the following:

[  447.275349] mt798x-wmac 18000000.wifi: send message 000130ed timeout, try again(1).
[  447.283349] mt798x-wmac 18000000.wifi: 
[  447.283349] phy0 L1 SER recovery completed.
[  447.821897] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000004
[  447.828811] mt798x-wmac 18000000.wifi: 
[  447.828811] phy0 L1 SER recovery start.
[  447.837695] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000008
[  447.854270] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000010
[  447.861219] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000020
[  447.868360] mt798x-wmac 18000000.wifi: 
[  447.868360] phy0 L1 SER recovery completed.

I'm assuming that 0x000130ed is message type 0x30 MCU_EXT_CMD_GET_TX_STAT.
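For readers who want to split the logged message IDs into fields, here is a minimal user-space sketch. It assumes (this layout is not confirmed anywhere in this thread) that the low byte is the command ID, the next byte is the extended command ID, and bit 16 is a query flag. The struct and function names are hypothetical and are not taken from mt76.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical holder for the decoded fields of a logged message ID. */
struct mcu_msg {
    uint8_t id;      /* bits 7..0: command ID (assumed layout) */
    uint8_t ext_id;  /* bits 15..8: extended command ID (assumed layout) */
    int is_query;    /* bit 16: query flag (assumed layout) */
};

/* Split a raw 32-bit message ID, as printed in the timeout logs, into fields. */
struct mcu_msg decode_mcu_msg(uint32_t raw)
{
    struct mcu_msg m;

    m.id = (uint8_t)(raw & 0xffu);
    m.ext_id = (uint8_t)((raw >> 8) & 0xffu);
    m.is_query = (int)((raw >> 16) & 0x1u);
    return m;
}

/* Print one decoded message ID in a compact, human-readable form. */
void print_mcu_msg(uint32_t raw)
{
    struct mcu_msg m = decode_mcu_msg(raw);

    printf("raw=0x%08x id=0x%02x ext_id=0x%02x query=%d\n",
           (unsigned int)raw, (unsigned int)m.id,
           (unsigned int)m.ext_id, m.is_query);
}
```

Under that assumed layout, calling `print_mcu_msg()` on 0x000130ed, 0x00005aed and 0x000026ed yields extended IDs 0x30, 0x5a and 0x26, which matches the 0x5a and 0x26 shorthand used above.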

The same setup works on MT7613, MT7612, MT7615, MT7603, and the client-optimized MT7921k (n, ac, ax). It also appears not to hang on MT7975 (Asus RT-AX53U), even though that chip uses the same mt7915e module.

I'm therefore assuming that this is a firmware bug. I tried all five firmware versions published on mtk-feeds, and the behaviour is similar with all of them, but the crashes don't happen as often with the latest firmware.

Additionally, I assumed that this is related to the GTK (as it is different), so I applied a patch similar to this mt7615 workaround from a few years back, but it didn't do anything.

If possible, can someone explain to me what's the difference between stations connected to the main AP interface vs ones connected to AP_VLAN interface? The GTK is different, but why would it cause it to crash the firmware?

atmospher3 commented 4 months ago

The same problem occurs on 802.1x with VLAN.

zekica commented 3 months ago

I have done further digging into this, and found the following:

In mac80211 framework, there is this snippet in key.c:

    if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN) {
        /*
         * The driver doesn't know anything about VLAN interfaces.
         * Hence, don't send GTKs for VLAN interfaces to the driver.
         */
        if (!(key->conf.flags & IEEE80211_KEY_FLAG_PAIRWISE)) {
            ret = 1;
            goto out_unsupported;
        }
    }

which means that the VLAN-specific GTK is never sent to the driver and never gets KEY_FLAG_UPLOADED_TO_HARDWARE set. This in turn causes all multicast/broadcast packets for VLAN stations to be encrypted in software before being handed to the hardware.

So it looks like the firmware crashes when sending software-encrypted packets and the receive buffer is full.
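The effect of the mac80211 check quoted above can be modelled in user space with a very reduced sketch. It covers only that single branch, and the enum, struct and function names are hypothetical stand-ins rather than mac80211 identifiers.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two interface types discussed here. */
enum iface_type {
    IFACE_AP,
    IFACE_AP_VLAN
};

/* Hypothetical, reduced view of a key: only the pairwise/group distinction. */
struct key_info {
    bool pairwise;
};

/*
 * Returns true if the key would be offered to the driver for hardware
 * crypto, false if it stays in software. Models only the AP_VLAN branch
 * from the quoted snippet; the real function has further conditions.
 */
bool key_goes_to_driver(enum iface_type type, const struct key_info *key)
{
    /* Group keys (GTKs) on AP_VLAN interfaces are never sent to the driver. */
    if (type == IFACE_AP_VLAN && !key->pairwise)
        return false;

    return true;
}
```

In this model only the combination of AP_VLAN and a group key is kept in software; pairwise keys on AP_VLAN and all keys on the main AP interface are still offered to the driver.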

evelyn3648 commented 3 months ago

AP_VLAN shares the same entry as the AP, and hardware encryption is supported even for packets from AP_VLAN.

Does it happen only when security is enabled? Could you check whether it also happens with VLAN + OPEN security?

zekica commented 3 months ago

@evelyn3648 thank you for looking into this.

I have created an open network with one ap_vlan and assigned a single station to it. When that station is sending data, the chip crashes, so it looks like it's not directly related to software-encrypted packets. When a station that's not assigned to ap_vlan sends data, the chip doesn't crash.

Do you have any idea on where to look next?

evelyn3648 commented 3 months ago

@zekica
Could you check whether you hit this if condition? If yes, then it's also the same problem with Multicast-to-Unicast in mac80211. https://elixir.bootlin.com/linux/latest/source/net/mac80211/tx.c#L4517

zekica commented 3 months ago

I now have an open network (same setup as in https://github.com/openwrt/mt76/issues/881#issuecomment-2154449070).

I have added a printk for each multicast TX packet:

netdev_tx_t ieee80211_subif_start_xmit(struct sk_buff *skb,
                       struct net_device *dev)
{
    struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev);
    const struct ethhdr *eth = (void *)skb->data;

    if (likely(!is_multicast_ether_addr(eth->h_dest)))
        goto normal;

    if (unlikely(!ieee80211_sdata_running(sdata))) {
        kfree_skb(skb);
        return NETDEV_TX_OK;
    }
    printk("multicast-packet");

    if (unlikely(ieee80211_multicast_to_unicast(skb, dev))) {
        struct sk_buff_head queue;
        printk("multicast-to-unicast-packet");

        __skb_queue_head_init(&queue);
        ieee80211_convert_to_unicast(skb, dev, &queue);
        while ((skb = __skb_dequeue(&queue)))
            __ieee80211_subif_start_xmit(skb, dev, 0,
                             IEEE80211_TX_CTRL_MLO_LINK_UNSPEC,
                             NULL);
    } else if (ieee80211_vif_is_mld(&sdata->vif) &&
           sdata->vif.type == NL80211_IFTYPE_AP &&
           !ieee80211_hw_check(&sdata->local->hw, MLO_MCAST_MULTI_LINK_TX)) {
        ieee80211_mlo_multicast_tx(dev, skb);
    } else {
normal:
        __ieee80211_subif_start_xmit(skb, dev, 0,
                         IEEE80211_TX_CTRL_MLO_LINK_UNSPEC,
                         NULL);
    }

    return NETDEV_TX_OK;
}

And I can see that the packets are not transmitted as multicast-to-unicast, but I can't be sure, as the second printk doesn't fire even if I set up the bridge as:

br-lan/brif/lan1/multicast_to_unicast:0
br-lan/brif/lan2/multicast_to_unicast:0
br-lan/brif/lan3/multicast_to_unicast:0
br-lan/brif/phy1-ap0-10/multicast_to_unicast:1
br-lan/brif/phy1-ap0/multicast_to_unicast:1

DanielRIOT commented 3 months ago

@zekica I did the same (but used mesh in a bridge with one of the ethernet ports) and also did not see my "printk("multicast-to-unicast-packet");" being called :(

(CUDY WR3000 V1 - MT7981)

zekica commented 3 months ago

@DanielRIOT

I am using the same device for most of my testing, but also Unifi U6+ as it has the same wifi chip, and both have the same issue.

zekica commented 3 months ago

@evelyn3648 I have managed to trigger this without any RX or encryption at all.

The TX packet size makes all the difference:

IP packets with size of 483 bytes or larger don't crash the firmware
IP packets with size of 482 bytes or smaller crash the firmware

RX packets of any size don't cause any crashes.

This only happens with ap_vlan, not with a standard AP.

zekica commented 3 months ago

I have found a workaround that probably still crashes on multicast_to_unicast but doesn't crash on standard unicast small packets and that can't be in any way accepted upstream.

This won't work if you have multiple wifi drivers, but patching mac80211 framework iface.c with the following works around the issue:

 static const struct net_device_ops ieee80211_dataif_ops = {
    .ndo_open       = ieee80211_open,
    .ndo_stop       = ieee80211_stop,
    .ndo_uninit     = ieee80211_uninit,
-   .ndo_start_xmit     = ieee80211_subif_start_xmit,
+   .ndo_start_xmit     = ieee80211_subif_start_xmit_8023,
    .ndo_set_rx_mode    = ieee80211_set_multicast_list,
    .ndo_set_mac_address    = ieee80211_change_mac,
    .ndo_get_stats64    = ieee80211_get_stats64,
    .ndo_setup_tc       = ieee80211_netdev_setup_tc,
 };

Can someone take a look and see why ieee80211_subif_start_xmit_8023 doesn't crash? This looks completely like a firmware issue.

Headcrabed commented 3 months ago

> I have found a workaround that probably still crashes on multicast_to_unicast but doesn't crash on standard unicast small packets and that can't be in any way accepted upstream.
>
> This won't work if you have multiple wifi drivers, but patching mac80211 framework iface.c with the following works around the issue:
>
>  static const struct net_device_ops ieee80211_dataif_ops = {
>     .ndo_open       = ieee80211_open,
>     .ndo_stop       = ieee80211_stop,
>     .ndo_uninit     = ieee80211_uninit,
> -   .ndo_start_xmit     = ieee80211_subif_start_xmit,
> +   .ndo_start_xmit     = ieee80211_subif_start_xmit_8023,
>     .ndo_set_rx_mode    = ieee80211_set_multicast_list,
>     .ndo_set_mac_address    = ieee80211_change_mac,
>     .ndo_get_stats64    = ieee80211_get_stats64,
>     .ndo_setup_tc       = ieee80211_netdev_setup_tc,
>  };
>
> Can someone take a look and see why ieee80211_subif_start_xmit_8023 doesn't crash? This looks completely like a firmware issue.

@nbd168 Can you have a look at this issue?

steveej commented 1 month ago

> I have found a workaround that probably still crashes on multicast_to_unicast but doesn't crash on standard unicast small packets and that can't be in any way accepted upstream.
>
> This won't work if you have multiple wifi drivers, but patching mac80211 framework iface.c with the following works around the issue:
>
>  static const struct net_device_ops ieee80211_dataif_ops = {
>     .ndo_open       = ieee80211_open,
>     .ndo_stop       = ieee80211_stop,
>     .ndo_uninit     = ieee80211_uninit,
> -   .ndo_start_xmit     = ieee80211_subif_start_xmit,
> +   .ndo_start_xmit     = ieee80211_subif_start_xmit_8023,
>     .ndo_set_rx_mode    = ieee80211_set_multicast_list,
>     .ndo_set_mac_address    = ieee80211_change_mac,
>     .ndo_get_stats64    = ieee80211_get_stats64,
>     .ndo_setup_tc       = ieee80211_netdev_setup_tc,
>  };
>
> Can someone take a look and see why ieee80211_subif_start_xmit_8023 doesn't crash? This looks completely like a firmware issue.

I applied this patch and am testing it on the Sinovoip BananaPi BPI-R3 using mainline Linux 6.10, where I had severe wifi dropouts accompanied by these messages:

kernel: mt798x-wmac 18000000.wifi: Message 000026ed (seq 3) timeout

kernel: mt798x-wmac 18000000.wifi: Message 00005aed (seq 7) timeout

After half a day of testing, the patch seems to help in my situation: I'm not seeing any of these messages and the wifi has been stable so far.

That said, I wasn't able to reproduce the issue at will before this patch. Is there a simple shell command that would let me send a problematic packet on purpose instead of waiting for an eventual yet random occurrence?

zekica commented 1 month ago

The way I tested was to set a small mtu and do iperf or just nc. The direction should be that the AP transmits.

On the sending pc do:

ip link set enxxx mtu 482
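As an alternative to lowering the MTU, a small sender can emit packets of a fixed IP size directly. This is a minimal sketch, assuming IPv4 without options (20-byte header) and UDP (8-byte header), so a 482-byte IP packet carries 454 bytes of payload. The function names and the choice of UDP are illustrative assumptions, and whether this triggers the crash has not been verified in this thread.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define IPV4_HDR_LEN 20 /* IPv4 header without options */
#define UDP_HDR_LEN   8

/* UDP payload size that yields an IPv4 packet of ip_len bytes; 0 if too small. */
size_t udp_payload_for_ip_len(size_t ip_len)
{
    if (ip_len <= IPV4_HDR_LEN + UDP_HDR_LEN)
        return 0;
    return ip_len - IPV4_HDR_LEN - UDP_HDR_LEN;
}

/*
 * Send `count` UDP datagrams whose IPv4 length is `ip_len` to dst_ip:port.
 * Returns the number of datagrams handed to the kernel, or -1 on a setup error.
 */
long send_small_udp(const char *dst_ip, uint16_t port, size_t ip_len, long count)
{
    char buf[1472]; /* largest UDP payload that fits a 1500-byte MTU */
    struct sockaddr_in dst;
    size_t payload = udp_payload_for_ip_len(ip_len);
    long sent = 0;
    int fd;

    if (payload == 0 || payload > sizeof(buf))
        return -1;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
        return -1;

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    memset(buf, 0xa5, payload); /* arbitrary filler bytes */
    for (long i = 0; i < count; i++) {
        if (sendto(fd, buf, payload, 0,
                   (struct sockaddr *)&dst, sizeof(dst)) == (ssize_t)payload)
            sent++;
    }

    close(fd);
    return sent;
}
```

To match the direction described above, this would be called from a wired host towards a station on the ap_vlan, so that the AP is the one transmitting over wifi, for example `send_small_udp("<station ip>", 9, 482, 100000)`. Port 9 is an arbitrary choice here.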

rany2 commented 1 month ago

MT7915 (Belkin RT3200) doesn't seem to have this issue. I've attempted to reproduce with ping -4 -M do -s $((482 - 28)) -i0 -f <wifi ipv4> (running from the AP) but no firmware crash occurred (i.e., it didn't even attempt to recover from a crash and continued functioning normally). As MT7915 and MT7981 use the same drivers, I think it's likely a firmware problem. I'm also using an AP_VLAN interface via the per-STA VIF option, so that remains the same in my case.

I've noticed that there are some fixes for mt7915 related to error recovery in the latest mt76, so maybe it handles a firmware crash better and could recover reliably now on MT7981?

xize commented 1 month ago

I have tested this patch just now but with my ayaneo (wifi chip: ax210, br-lan.90, 10.87.32.4) I get these messages:

Thu Aug 22 14:25:05 2024 kern.err kernel: [  622.611990] mt798x-wmac 18000000.wifi: Retry message 00000010 (seq 5)
Thu Aug 22 14:25:26 2024 kern.err kernel: [  643.069199] mt798x-wmac 18000000.wifi: Retry message 00000010 (seq 5)
Thu Aug 22 14:25:46 2024 kern.err kernel: [  663.525705] mt798x-wmac 18000000.wifi: Retry message 00000010 (seq 5)

Then the whole Ethernet also crashes; however, that is not due to this change. It seems this patch likely suppressed other things, so I also tested without it.

Without it I ran into this full stack trace (excuse me if it is a long trace, but it shows some interesting pointers in my case).

https://pastebin.com/DWgz71b7

It threw something about radar detection and RCU, and then it completely locked up my router. I was able to save the log, as it survived thanks to TCP, but no communication was possible; commands also failed because there were no resources left to complete any task.

lukasz1992 commented 1 month ago

It would be good to check whether the new MTK firmware fixes the issue.

xize commented 1 month ago

For me it fixed it. My previous error was due to a bad commit, which I believe was fixed in openwrt/openwrt@580ad3e6bb57216706dfb9bc44875cfc4ca41feb. I'm now running the latest firmware without the patch by @zekica, and I have already played GTA Online on my ayaneo for 2 days, more than 4 hours per day, without a single crash :)

There was a time when it even crashed on regular Windows updates too, but this also hasn't happened with this update.

I can't speak for others, though, since the issue also showed similar timeout messages for other things.

My router uses MT7986 (GL-MT6000).

zekica commented 1 month ago

I have also tested this, and with an MTU of 482 or less it hasn't crashed yet with the new firmware, whereas it always crashed with all previous versions. I have transferred over 200GB with small packets and it didn't crash once; with previous versions of the firmware, it would crash with less than 30MB transmitted 100% of the time.

I'll close this issue, but I was wondering what the best way to backport this firmware to OpenWrt 23.05.x would be, as it works fine on both snapshot and 23.05 branch without any additional patches.
