quo / ithc-linux

Linux driver for Intel Touch Host Controller

Fix irq source-id error #2

Closed quo closed 1 month ago

quo commented 2 years ago

The IOMMU gives the following error when trying to use the irq:

DMAR: DRHD: handling fault status reg 2
DMAR: [INTR-REMAP] Request device [01:05.0] fault index 0x2f [fault reason 0x26] Blocked an interrupt request due to source-id verification failure

No clue how to fix this. Maybe an ACPI bug?

quo commented 2 years ago

It says 01:05.0, but the actual address of the THC is 00:10.6. I guess that's the problem? But I don't really know how any of this IOMMU stuff works. According to lspci there isn't even anything at 01:05.0...
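
For reference, the request id in that error is just the 16-bit bus/device/function number of the requester. A minimal sketch of the encoding in plain C follows; the helper name is made up for illustration and is not kernel code:

```c
#include <stdint.h>

/* Hypothetical helper: pack bus:device.function into a 16-bit request id. */
static uint16_t bdf_to_request_id(uint8_t bus, uint8_t dev, uint8_t fn)
{
	/* bits 15:8 = bus, bits 7:3 = device, bits 2:0 = function */
	return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x07));
}
```

Since lspci prints these numbers in hex, 00:10.6 corresponds to bdf_to_request_id(0x00, 0x10, 0x6), which is 0x0086, while 01:05.0 corresponds to bdf_to_request_id(0x01, 0x05, 0x0), which is 0x0128. The two ids differ in the bus byte as well as the device bits.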

Should probably compile a kernel with INTEL_IOMMU_DEBUGFS to get some more info.

quo commented 2 years ago

The source/request ID check is set up in the kernel by set_msi_sid(). This checks for DMA alias ids with pci_for_each_dma_alias(). As far as I can tell, the only ways you can have aliases are:

  1. If the device is on a VMD bus (via pci_real_dma_dev()), which I don't think it is,
  2. If an alias was set up with pci_add_dma_alias(), but this can only create aliases on the same bus, or
  3. If the device is behind a PCI bridge, which I also don't think it is.

So since there are no aliases, set_msi_sid() creates a strict check for the 00:10.6 sid. And then for some reason the interrupt has request id 01:05.0.
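
To illustrate why the strict check blocks the interrupt, here is a rough user-space model of the verification, covering only the two modes relevant here. The constant names and values are the ones used in the kernel's Intel irq remapping code; the struct and the function are hypothetical simplifications, not the real hardware or kernel logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same names and values as in the kernel's Intel irq remapping code. */
#define SVT_NO_VERIFY     0x0 /* no verification */
#define SVT_VERIFY_SID_SQ 0x1 /* verify using the SID and SQ fields */
#define SQ_ALL_16         0x0 /* compare all 16 bits of the request id */

/* Hypothetical, simplified view of the sid-related fields of an IRTE. */
struct irte_sid_model {
	uint8_t svt;
	uint8_t sq;
	uint16_t sid;
};

/* Returns true if an interrupt with this request id would pass the check. */
static bool request_id_allowed(const struct irte_sid_model *e,
			       uint16_t request_id)
{
	if (e->svt == SVT_NO_VERIFY)
		return true;
	if (e->svt == SVT_VERIFY_SID_SQ && e->sq == SQ_ALL_16)
		return e->sid == request_id;
	return false; /* other modes are not modelled in this sketch */
}
```

With an entry of { SVT_VERIFY_SID_SQ, SQ_ALL_16, 0x0086 } (the strict check for 00:10.6), a request id of 0x0128 (01:05.0) is rejected. With SVT_NO_VERIFY it is accepted regardless of the sid.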

We can't use a quirk to add an alias since the bus number differs. We can add a hack to set_msi_sid() to disable id checking for the THC device only (instead of disabling it for all devices with intremap=nosid). We could also leave the check enabled and hardcode the sid to 01:05.0 for the THC, but since I don't understand where that value comes from, I don't know whether it might change.

Patch to disable the sid check for ITHC device only:

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index a67319597884..9f9322a17810 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -396,6 +396,22 @@ static int set_msi_sid(struct irte *irte, struct pci_dev *dev)
    data.busmatch_count = 0;
    pci_for_each_dma_alias(dev, set_msi_sid_cb, &data);

+   /*
+    * The Intel Touch Host Controller is at 00:10.6, but for some reason
+    * the MSI interrupts have request id 01:05.0.
+    * Disable id verification to work around this.
+    * FIXME Find proper fix or turn this into a quirk.
+    */
+   if (dev->vendor == PCI_VENDOR_ID_INTEL && (dev->class >> 8) == PCI_CLASS_INPUT_PEN) {
+       switch(dev->device) {
+       case 0x98d0: case 0x98d1: // LKF
+       case 0xa0d0: case 0xa0d1: // TGL LP
+       case 0x43d0: case 0x43d1: // TGL H
+           set_irte_sid(irte, SVT_NO_VERIFY, SQ_ALL_16, 0);
+           return 0;
+       }
+   }
+
    /*
     * DMA alias provides us with a PCI device and alias.  The only case
     * where the it will return an alias on a different bus than the

Headcrabed commented 1 year ago

Any news on this? The firmware seems to have been updated since then; did the ACPI table change?

quo commented 1 year ago

It looks like intremap=nosid may no longer be required for the new Alder Lake devices, so I suspect it was a hardware bug on Tiger Lake.

I don't know if something could be done on the ACPI side to fix/workaround the problem (I don't really know how ACPI interacts with the iommu).

I believe the above patch will be added to the Surface kernel, which will remove the need for using nosid with that kernel at least.

I think a proper fix will involve adding support to the kernel for DMA aliases with different bus numbers, then I could add an alias in the ithc driver. But someone who really understands how PCI MSI and the Intel iommu work should have a look at this.

Edit: I could also add some code to detect if the irq is working, and automatically switch to polling mode if it isn't. Not optimal, but maybe the easiest fix.
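
A rough sketch of what that fallback could look like, as a plain C model rather than actual driver code. All names and the threshold are hypothetical, and it assumes the driver can check for pending data without relying on the irq:

```c
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary threshold: how many missed events before giving up on the irq. */
#define MISSED_EVENT_LIMIT 3

/* Hypothetical state for choosing between irq and polling mode. */
struct irq_watchdog {
	uint32_t irqs_seen;     /* incremented by the interrupt handler */
	uint32_t missed_events; /* data was pending but no irq ever arrived */
	bool polling;           /* true once we have fallen back to polling */
};

/* Call periodically; data_pending says whether the device has unread data. */
static void irq_watchdog_check(struct irq_watchdog *w, bool data_pending)
{
	/* Already polling, or the irq has fired at least once: nothing to do. */
	if (w->polling || w->irqs_seen > 0)
		return;
	if (!data_pending)
		return;
	if (++w->missed_events >= MISSED_EVENT_LIMIT)
		w->polling = true;
}
```

The idea is that if data shows up several times without a single interrupt ever having been delivered, the irq is presumably being blocked and the driver switches to polling.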

quo commented 1 month ago

https://www.intel.com/content/www/us/en/content-details/630747/intel-500-series-chipset-family-on-package-platform-controller-hub-specification-update.html

29 Incorrect MSI BDF Returned By Touch Host Controller. Problem: When a hypervisor is enabled, the Message Signaled Interrupt for the Touch Host Controller (THC) returns an incorrect bus device function number.

So I guess that confirms it's a Tiger Lake hardware bug.

Closing this as the iommu patch seems to do the job.