Wrong MSI detection for S1 in the reference model leads to wrong OS driver quirk in the recent Linux kernel

zetalog commented 2 months ago

In iommu_translate.c, the reference model is implemented in this way:

    // Count misses in TLB
    count_events(PV, PID, PSCV, PSCID, DID, GV, GSCID, IOATC_TLB_MISS);
    if ( two_stage_address_translation(req->tr.iova, check_access_perms, DID, is_read, is_write, is_exec,
                                        PV, PID, PSCV, PSCID, iosatp, priv, SUM, DC.tc.SADE,
                                        GV, GSCID, iohgatp, DC.tc.GADE, DC.tc.SXL,
                                        &cause, &iotval2, &gpa, &page_sz, &vs_pte) )
        goto stop_and_report_fault;

    // 18. If MSI address translations using MSI page tables is enabled
    //     (i.e., `DC.msiptp.MODE != Off`) then the MSI address translation process
    //     specified in <<MSI_TRANS>> is invoked. If the GPA `A` is not determined to be
    //     the address of a virtual interrupt file then the process continues at step 19.
    //     If a fault is detected by the MSI address translation process then stop and
    //     report the fault else the process continues at step 20.
    if ( msi_address_translation(gpa, is_exec, &DC, &is_msi, &is_mrif, &mrif_nid, &dest_mrif_addr,
                                 &cause, &iotval2, &pa, &gst_page_sz, &g_pte, check_access_perms) )
        goto stop_and_report_fault;
    if ( is_msi == 1 ) goto skip_gpa_trans;

That means that for an S1-only translation, an address is translated before it is checked for being an MSI address.

In the specification, however, the MSI configuration resides in the DC, which is independent of S1. QEMU also implements a different model: https://gitlab.com/danielhb/qemu/-/blob/riscv_iommu_v5_rc1/hw/riscv/riscv-iommu.c?ref_type=heads It detects MSIs before performing S1:

    static int riscv_iommu_spa_fetch(RISCVIOMMUState *s, RISCVIOMMUContext *ctx,
        IOMMUTLBEntry *iotlb)
    {
        ....

        /* Early check for MSI address match when IOVA == GPA */
        if ((iotlb->perm & IOMMU_WO) &&
            riscv_iommu_msi_check(s, ctx, iotlb->iova)) {
            iotlb->target_as = &s->trap_as;
            iotlb->translated_addr = iotlb->iova;
            iotlb->addr_mask = ~TARGET_PAGE_MASK;
            return 0;
        }
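For readers unfamiliar with what an "MSI address match" means here, the following is a minimal sketch of the match rule as the IOMMU specification describes it: a GPA is treated as an MSI address when its 4 KiB page number equals `DC.msi_addr_pattern` in every bit position not set in `DC.msi_addr_mask`. The struct and function names are hypothetical and are not taken from QEMU or the reference model; whether `riscv_iommu_msi_check` is implemented exactly like this is not confirmed by this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two device-context fields used by the match. */
struct msi_match_cfg {
    uint64_t msi_addr_mask;    /* DC.msi_addr_mask: bit positions ignored by the match */
    uint64_t msi_addr_pattern; /* DC.msi_addr_pattern: page-number pattern to match */
};

/* Returns true when the 4 KiB page number of gpa equals the pattern
 * in every bit position that is not masked out. */
static bool gpa_is_msi(const struct msi_match_cfg *cfg, uint64_t gpa)
{
    uint64_t ppn = gpa >> 12;

    return (ppn & ~cfg->msi_addr_mask) ==
           (cfg->msi_addr_pattern & ~cfg->msi_addr_mask);
}
```

Note that the rule is defined on a GPA. Applying it directly to an IOVA gives the same answer only when the first stage is Bare or identity mapped.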

The wrong reference model requires a special quirk to be introduced in the recent Linux kernel IOMMU driver: a VA=PA mapping has to be created for the MSI table. This is now known to be the significant difference between Linux 6.6 and Linux 6.10.

    imsic_global = imsic_get_global_config();
    if (!imsic_global || !imsic_global->nr_ids)
        return 0;

    base = imsic_global->base_addr;
    stride = IMSIC_MMIO_PAGE_SZ << imsic_global->guest_index_bits;
    for (i = 0; i < BIT(imsic_global->hart_index_bits); i++) {
        if (riscv_iommu_map_pages(&domain->domain, base, base,
                      IMSIC_MMIO_PAGE_SZ, 1, prot,
                      GFP_KERNEL_ACCOUNT, &mapped)) {
            /* unroll mapping */
            do {
                riscv_iommu_unmap_pages(&domain->domain, base,
                            IMSIC_MMIO_PAGE_SZ, 1,
                            NULL);
                base -= stride;
            } while (i-- > 0);

            return -ENOMEM;
        }
        base += stride;
    }
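To make the effect of this loop concrete, here is a small sketch that lists the addresses it would identity-map: one IMSIC page per hart index, spaced `IMSIC_MMIO_PAGE_SZ << guest_index_bits` apart. The function name and the 4 KiB page size constant are assumptions made for illustration, and the error-unwinding path of the kernel code is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IMSIC_PAGE_SZ 4096ULL /* assumed 4 KiB IMSIC MMIO page size */

/* Hypothetical helper: writes the IOVA == PA addresses visited by the loop
 * above into out[] (up to out_len entries) and returns how many addresses
 * the loop visits in total. */
static size_t imsic_identity_pages(uint64_t base, unsigned guest_index_bits,
                                   unsigned hart_index_bits,
                                   uint64_t *out, size_t out_len)
{
    uint64_t stride = SKETCH_IMSIC_PAGE_SZ << guest_index_bits;
    size_t count = (size_t)1 << hart_index_bits;

    for (size_t i = 0; i < count && i < out_len; i++)
        out[i] = base + i * stride;

    return count;
}
```

With `guest_index_bits = 1` and `hart_index_bits = 2`, for example, four pages are mapped at a stride of 8 KiB starting from `base`.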

ved-rivos commented 2 months ago

Two-stage address translation is always active in the IOMMU and there is no option to disable it. However, any stage can be effectively disabled by programming the mode for that stage as Bare. If the VS-stage is not Bare for a transaction, then the transaction carries a VA and not a GPA. MSI detection is performed on a GPA and not a VA, and the VS-stage address translation must be performed even if the VA is identity mapped in the VS-stage (which the IOMMU cannot guess). So all of this is as per the specification. Please take it up on LKML if there are questions about the Linux kernel.

zetalog commented 2 months ago

The usage model is not related to 2-stage address translation. The problem is seen in a supervisor Linux kernel to which a 6.6-based IOMMU driver patchset is applied. This version of the IOMMU patchset lacks the above-mentioned MSI table identity mapping: ... The same Linux kernel runs fine on QEMU.

zetalog commented 2 months ago

It looks weird that in an S1-only environment without the MSI PT feature, only S1 PT translation is required for MSI writes, while with the MSI PT feature, both S1 PT translation and MSI PT translation are required for MSI writes.

ved-rivos commented 2 months ago

> It looks weird that in an S1-only environment without the MSI PT feature, only S1 PT translation is required for MSI writes, while with the MSI PT feature, both S1 PT translation and MSI PT translation are required for MSI writes.

If software has configured the IOMMU to do VS-stage address translations, then the IOMMU does VS-stage address translations. If, further, software has configured MSI page tables to translate MSIs in Basic or MRIF mode, then the IOMMU translates further using the MSI page table.

zetalog commented 2 months ago

Yes, OSen like Linux only provide MSI-specific operations in the vfiommu framework, and the feature looks tightly related to S2. But let me try to express the data flow; please correct me if it is wrong.

My understanding is:

1. MSI PT unmapped with DMA remapped: When a guest OS (VS) uses the MSI feature, it does not need to map the MSI address to a virtual address, yet it can still perform DMA remapping for the IO addresses used by DMA devices; such devices send MSIs by treating the MSI address as a physical address. The hypervisor OS (HS) distinguishes MSI writes from other remapped DMA transactions by matching them against MSI PT entries. NOTE that the physical address is used only by the device, so it is not accessed by the CPU and is not remapped into the virtual address space. When the hypervisor OS itself uses the MSI feature, it can likewise leave the MSI address unmapped while still performing DMA remapping for the IO addresses used by DMA devices, allowing devices to send MSIs by treating the MSI address as a physical address.
2. MSI PT remapped with DMA remapped: When a guest OS uses the MSI feature, it can map the MSI address to a virtual address while still performing S1 DMA remapping for the IO addresses used by DMA devices; devices then send MSIs by treating the MSI address as a virtual address. The hypervisor OS distinguishes MSI writes from other remapped DMA transactions by matching them against MSI PT entries after performing the DMA translations. NOTE that the virtual address is not used by the CPU, and there might not be a virtual address space region available to identity map the physical MSI address. When the hypervisor OS itself uses the MSI feature, it can likewise map the MSI address to a virtual address while still performing S1 DMA remapping for the IO addresses used by DMA devices, allowing devices to send MSIs by treating the MSI address as a virtual address.

Either way, the MSI PT acts like an ITS (ARM world feature) filtering MSI writes from the DMA transactions. Among the above scenarios, which one is the better practice in OSen?

The QEMU model allows both usage models; the RIVOS model only allows the 2nd usage model. Is there any concern about restricting the MSI PT to being used only in the vfiommu framework?

zetalog commented 2 months ago

NOTE that in an SoC design, device addresses, including MSI write addresses, should be placed in the higher physical address space, while the DMA remapped zone should be placed in the lower virtual address space. Such detection should then be safe, without worrying about conflicts between DMA transactions and MSI writes.
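The layout assumption in this note can be stated as a simple check: the IOVA window handed out for DMA must not overlap the physical window that holds MSI targets. The sketch below shows only that check; the type and function names are hypothetical, and passing the check does not by itself make early MSI detection compliant with the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct addr_range {
    uint64_t start;
    uint64_t end;
};

/* Returns true when the two half-open ranges share at least one address. */
static bool ranges_overlap(struct addr_range a, struct addr_range b)
{
    return a.start < b.end && b.start < a.end;
}
```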

ved-rivos commented 2 months ago

Please see IOMMU specification section "Process to translate IOVA".

step 17, the IOVA is first translated to a GPA. If iosatp.MODE is Bare then the IOVA is the same as the GPA, else the GPA is as determined by walking the VS-stage page tables.
step 18, the GPA is used to determine whether it is an MSI if MSI address translation is enabled. The VS-stage cannot be skipped before doing MSI detection.
step 19, if the GPA is not translated by the MSI page tables then the GPA is translated by the G-stage page tables.
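The ordering of steps 17 to 19 can be sketched as follows. This is a toy model under stated assumptions, not the reference model's code: the VS-stage page table is replaced by a hypothetical single-entry mapping, all names are invented for illustration, and only the routing decision (MSI page table versus G-stage) is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_target { TARGET_MSI_PT, TARGET_G_STAGE };

/* Hypothetical per-device state; a one-entry stand-in for the VS-stage table. */
struct sketch_ctx {
    bool     s1_bare;      /* iosatp.MODE == Bare */
    uint64_t s1_iova_page; /* the single mapped IOVA page (4 KiB aligned) */
    uint64_t s1_gpa_page;  /* the GPA page it maps to (4 KiB aligned) */
    bool     msi_enabled;  /* DC.msiptp.MODE != Off */
    uint64_t msi_mask;     /* DC.msi_addr_mask */
    uint64_t msi_pattern;  /* DC.msi_addr_pattern */
};

/* Step 17: IOVA -> GPA. Returns false on a (toy) VS-stage page fault. */
static bool sketch_stage1(const struct sketch_ctx *c, uint64_t iova, uint64_t *gpa)
{
    if (c->s1_bare) {
        *gpa = iova; /* Bare: the IOVA is the GPA */
        return true;
    }
    if ((iova & ~0xFFFULL) != c->s1_iova_page)
        return false; /* unmapped IOVA */
    *gpa = c->s1_gpa_page | (iova & 0xFFFULL);
    return true;
}

/* Steps 17 to 19: the MSI match is evaluated on the GPA, after the VS-stage. */
static bool sketch_route(const struct sketch_ctx *c, uint64_t iova,
                         enum sketch_target *target)
{
    uint64_t gpa;

    if (!sketch_stage1(c, iova, &gpa))
        return false; /* stop and report fault */

    bool is_msi = c->msi_enabled &&
                  (((gpa >> 12) & ~c->msi_mask) == (c->msi_pattern & ~c->msi_mask));

    *target = is_msi ? TARGET_MSI_PT : TARGET_G_STAGE;
    return true;
}
```

In this model, a device write to the MSI address itself faults when the VS-stage is not Bare and has no mapping for that address, which is the situation the identity mapping in the Linux driver avoids.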

I briefly looked through the QEMU patches and see that somewhere along the line en_s was dropped from the patch. This change makes it non-compliant with the RISC-V IOMMU specification: when VS-stage address translation is enabled, whether the IOVA is identity mapped cannot be inferred, i.e. the IOVA may not be the same as the GPA when iosatp.MODE != Bare.

         /* Early check for MSI address match when IOVA == GPA */
    -    if (!en_s && (iotlb->perm & IOMMU_WO) &&
    +    if ((iotlb->perm & IOMMU_WO) &&

The GPA corresponding to the virtual IMSIC is mapped into the guest. It is also mapped into the virtual address space of the guest OS since that mapping is required for the OS to do IPIs.

> The QEMU model allows both usage models; the RIVOS model only allows the 2nd usage model.

What is the RIVOS model?

ved-rivos commented 1 month ago

Please ask if there are further questions/comments.

riscv-non-isa / riscv-iommu

Wrong MSI detection for S1 in the reference model leads to wrong OS driver quirk in the recent Linux kernel #392