openwrt / mt76

mac80211 driver for MediaTek MT76x0e, MT76x2e, MT7603, MT7615, MT7628 and MT7688

mt7915e 0000:00:0b.0: Message 00004eed (seq 12) timeout #852

Open matzehh84 opened 10 months ago

matzehh84 commented 10 months ago

Upgrading from OpenWrt 22.03.5 to 22.03.6 breaks the AP. After reverting to 22.03.5, the AP works again.

System-Log 22.03.05 OpenWrt 22.03.5 r20134-5f15225c1e.txt

System-Log 22.03.06 OpenWrt 22.03.6 r20265-f85a79bcb4.txt

Djfe commented 10 months ago

Judging by the logs, this is QEMU combined with mt7915e.

matzehh84 commented 10 months ago

That is true. The module is an AW7915-BMD (AsiaRF). I have a second QEMU VM with an mt7915e chip but a different antenna configuration, an AW7915-NP1 (AsiaRF): 4x4 dual band selectable instead of 2x2 dual band dual concurrent. The second VM is not affected by this issue.

cristian-ciobanu commented 9 months ago

@matzehh84 are you using this AW7915-BMD (AsiaRF) module in OpenWrt as an AP in DBDC mode?

I have a similar card from AsiaRF (AW7916-AED) that I want to use as an AP in OpenWrt: https://asiarf.com/product/wi-fi-6e-m-2-ae-key-module-mt7916-aw7916-aed/

Some time ago I compiled an OpenWrt trunk image and ran it as an x86 VM in Proxmox. The card was detected, but I could see only one radio instead of two. I think both cards use the same driver, mt7915e, but probably different firmware.

Are you able to run the AW7915-BMD card as an access point in DBDC mode (two radios) with OpenWrt 22.03?

matzehh84 commented 7 months ago

Yes, the AW7915-BMD is one of the devices I am using. It works up to OpenWrt 22.03.5: the guest shows two Wi-Fi interfaces, and both can be used at the same time. I had to compile the host's kernel with a PCI quirk because PCI device reset seems to be broken on this device:

linux-6.6.8/drivers/pci/quirks.c

<<<

/*
 * Mediatek MT7915, disable pci bus reset in order to
 * prevent host freeze on VM shutdown / restart when
 * using VFIO.
 */
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MEDIATEK, 0x7915, quirk_no_bus_reset);

>>>

Without this patch, cat /sys/class/vfio-dev/vfio0/device/reset_method shows "bus", but writing 1 to /sys/class/vfio-dev/vfio0/device/reset freezes the host (this happens automatically each time you shut down and restart the QEMU guest).

With the patch, the file /sys/class/vfio-dev/vfio0/device/reset_method disappears and I can stop and start the VM without a host freeze.
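
The reset_method check described above can be wrapped in a small helper. This is a minimal sketch, not part of the original report: the function name show_reset_method is hypothetical, and the device directory is passed as an argument because the vfio0 path will differ between hosts.

```shell
#!/bin/sh
# show_reset_method DEV_DIR
# Prints the reset methods the kernel advertises for a PCI device, or a
# note when the reset_method attribute is absent (as it is once the
# quirk_no_bus_reset patch leaves the device with no usable reset).
# DEV_DIR is a sysfs device directory such as
# /sys/class/vfio-dev/vfio0/device (path taken from the comment above).
show_reset_method() {
    dev_dir="$1"
    if [ -r "$dev_dir/reset_method" ]; then
        printf 'reset methods: %s\n' "$(cat "$dev_dir/reset_method")"
    else
        printf 'no reset_method attribute\n'
    fi
}
```

Usage on the host would look like: show_reset_method /sys/class/vfio-dev/vfio0/device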

I am using the AW7915-BMD in M.2 B-Key slot on this mainboard: https://www.mitacmct.com/IndustrialMotherboard_PD10EHI_PD10EHI

Headcrabed commented 4 months ago

Possibly a duplicate of https://github.com/openwrt/mt76/issues/690 ?

cristian-ciobanu commented 4 months ago

@matzehh84 how can I apply this PCI quirk patch to the OpenWrt kernel?

Is it enough to create a new patch file with the content you pasted above, put it into target/linux/x86/patches-6.6, and compile again?

I'm running an OpenWrt trunk snapshot in an x86 VM.

matzehh84 commented 4 months ago

The patch is for the hypervisor/host kernel (5.15+), not for OpenWrt (the guest). The first start of the guest should always work even without the patch; if it does not, you are most likely facing a different problem. The patch comes into play on guest shutdown/restart.

cristian-ciobanu commented 4 months ago

I'm currently using Proxmox 8.2.2 with kernel 6.8.4. Whenever I boot the system with the AsiaRF (AW7916-AED) card passed through to the OpenWrt VM via PCI passthrough, I see these messages repeated many times on the Proxmox console:

vfio-pci 0000:02:00.0: Unable to change power state from D3cold to D0, device inaccessible

vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway

vfio-pci 0000:02:00.0: not ready 32767ms after FLR; waiting

vfio-pci 0000:02:00.0: not ready 65535ms after FLR; giving up

The OpenWrt VM fails to start when the AsiaRF (AW7916-AED) card is passed through to it.

matzehh84 commented 4 months ago

Sounds familiar...

  1. Try adding the Proxmox kernel command line option pcie_aspm=off

  2. Or try adding the Proxmox kernel command line option vfio_pci.ids=14c3:7915

  3. Or try adding both kernel options together: vfio_pci.ids=14c3:7915 pcie_aspm=off

  4. Or try disabling PCIe ASPM in the computer's BIOS
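
For reference, a hedged sketch of how the options listed above are usually applied on a Proxmox host. It assumes the host boots via GRUB; hosts booted with systemd-boot keep the command line in /etc/kernel/cmdline and apply it with proxmox-boot-tool refresh instead. The helper has_cmdline_opt is a hypothetical name and only checks that an option reached the running kernel after the reboot.

```shell
#!/bin/sh
# Assumed GRUB setup: append the options to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, for example
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet vfio_pci.ids=14c3:7915 pcie_aspm=off"
# then run update-grub and reboot the host.

# has_cmdline_opt OPTION [FILE]
# Succeeds if OPTION appears as a whole word in the kernel command line.
# FILE defaults to /proc/cmdline; it is a parameter only to allow dry runs.
has_cmdline_opt() {
    opt="$1"
    file="${2:-/proc/cmdline}"
    tr ' ' '\n' < "$file" | grep -Fxq -- "$opt"
}
```

After the reboot, has_cmdline_opt pcie_aspm=off && echo "option active" confirms that the option was picked up.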

matzehh84 commented 4 months ago

Your device may have a different PCI device ID (14c3:7916 instead of 14c3:7915). Please check the output of lspci and lspci -n.
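
A small sketch for picking the relevant ID out of the lspci -n output, assuming the usual "slot class: vendor:device" line format. The function name mediatek_ids is hypothetical; 14c3 is the vendor ID used in the options above.

```shell
#!/bin/sh
# mediatek_ids: read `lspci -n` output on stdin and print the slot and
# vendor:device ID of every device whose vendor ID is 14c3.
# `lspci -n` lines look like: "02:00.0 0280: 14c3:7915"
mediatek_ids() {
    awk '$3 ~ /^14c3:/ { print $1, $3 }'
}
```

On the host this would be run as lspci -n | mediatek_ids, and the printed ID is the one to use in vfio_pci.ids.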

cristian-ciobanu commented 4 months ago

I tested the different options and succeeded twice when I set the vfio_pci.ids=14c3:7906 option on the Proxmox kernel command line and had ASPM disabled in the BIOS for the PCI Express port.

The VM booted successfully, the card was detected, and I saw the SSID being advertised to clients. I did a reload and a manual power-on, and it still worked.

Then I shut down the system. When I turned it on again some hours later, the messages above started appearing on the Proxmox console again. I do not understand why it works only intermittently.

matzehh84 commented 4 months ago

I think the device has broken reset support, or reports unsupported reset methods to the kernel. The only thing that made my system stable was disabling PCI device reset completely at the hypervisor/host level by adding "DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MEDIATEK, 0x7915, quirk_no_bus_reset);" to linux-6.6.8/drivers/pci/quirks.c and building a custom kernel.

But in your case it is probably DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_MEDIATEK, 0x7906, quirk_no_bus_reset);

cristian-ciobanu commented 4 months ago

OK, thank you. Can you give me some quick guidance on how to apply this patch and compile the custom kernel?

DebdutBiswas commented 1 week ago

I was able to build my own custom Proxmox kernel with this patch. As stated earlier, it solves the host freeze on OpenWrt VM shutdown or restart. But after a complete power cycle of the host, the timeout issue still pops up randomly. After digging through similar bugs reported for AMD graphics cards, I learned that it is a power management issue with the D3cold state.

I am not sure whether disabling the D3cold state would be a workaround here.

cat /sys/bus/pci/devices/0000:01:00.0/d3cold_allowed returns: 1
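
To try the D3cold idea, the d3cold_allowed attribute shown above can be set to 0. The following is an untested sketch, not a confirmed fix for this issue; the function name disable_d3cold is hypothetical, and the device path is passed in because the PCI address differs per host.

```shell
#!/bin/sh
# disable_d3cold DEV_DIR
# Writes 0 to the d3cold_allowed attribute of a PCI device so the kernel
# no longer allows the device to enter D3cold. Needs root on a real sysfs
# path, and the setting does not persist across reboots.
disable_d3cold() {
    attr="$1/d3cold_allowed"
    [ -w "$attr" ] || { echo "cannot write $attr" >&2; return 1; }
    echo 0 > "$attr"
}
```

On the host this would be called as disable_d3cold /sys/bus/pci/devices/0000:01:00.0, ideally before the VM is started.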

Also after using this custom proxmox kernel I checked:

cat /sys/bus/pci/devices/0000:01:00.0/reset_method returns: cat: '/sys/bus/pci/devices/0000:05:00.0/reset_method': No such file or directory

This should be the behavior we want here.

DebdutBiswas commented 1 week ago

While looking into pve-kernel/submodules/ubuntu-kernel/drivers/pci/quirks.c, I found an interesting function at line 3802.

static void quirk_no_pm_reset(struct pci_dev *dev)
{
    /*
     * We can't do a bus reset on root bus devices, but an ineffective
     * PM reset may be better than nothing.
     */
    if (!pci_is_root_bus(dev->bus))
        dev->dev_flags |= PCI_DEV_FLAGS_NO_PM_RESET;
}

The function "quirk_no_pm_reset" is similar to "quirk_no_bus_reset", but it disables the PM (power management) reset instead of the bus reset, and only for devices that are not on the root bus.

I will try to add MT7915, MT7916 and MT7906 to this "quirk_no_pm_reset" and rebuild the kernel with it.