Size and permissions returned in the Completion of PCIe ATS translation request

just-for-fun-too commented 6 months ago

Hi

The specification in Section 2.1.3 says:

"If the EN_ATS bit is 1 and the T2GPA bit is set to 1 the IOMMU performs the two-stage address translation to determine the permissions and the size of the translation to be provided in the completion of a PCIe ATS Translation Request from the device."

I just want to confirm that the permissions and the (page) size returned in the response correspond to the second-stage leaf entry and not the first-stage leaf entry. We are providing, in the completion of the PCIe ATS Translation request, the GPA from the leaf entry of the first-stage translation tables, but the permissions and size from the second-stage (against which we checked the request permissions because of doing two-stage translation). Is this correct?

If it is correct, then when we do a translation later on, from GPA to SPA, we will be comparing the permissions that came with the request and were from the second stage translation against the actual second stage permissions at that time. What is the benefit or reason to do this?

Thanks

ved-rivos commented 6 months ago

When T2GPA is on, the IOMMU does the complete address translation and permission checking. However, the response to the address translation request provides the GPA to the device and not the SPA. When the device performs a memory access using the GPA, it is then translated to an SPA and the permissions are verified again as part of that process.

> What is the benefit or reason to do this?

Please see the discussion in Section 2.1.3.

just-for-fun-too commented 6 months ago

Hi, thanks. I had read that section. I also read a number of other questions and responses logged here in the past relating to T2GPA, so that part of the issue is pretty clear now.

However, my primary question remains unanswered. The question is, what are the permissions and (page) size that get returned with the GPA in the completion message of the address translation request? Is it from the first-stage or the second-stage when T2GPA is on.

ved-rivos commented 6 months ago

> The question is, what are the permissions and (page) size that get returned with the GPA in the completion message of the address translation request? Is it from the first-stage or the second-stage when T2GPA is on.

The permissions and page size returned are always the final permissions and page size, i.e. the most restrictive of either stage. The determination of which permissions to provide does not depend on the T2GPA configuration.

just-for-fun-too commented 6 months ago

Okay. Let me restate the question.

As the specification says: "...If the EN_ATS bit is 1 and the T2GPA bit is set to 1 the IOMMU performs the two-stage address translation to determine the permissions and the size of the translation." So we:
1. Perform two-stage translation.
2. Check the requested permissions against the first-stage translation and second-stage translation.
3. Return the first-stage address (GPA).
4. The question is: which permissions and (page) size do we return?

You answered above that the "permissions and page sizes are always the final permissions and page sizes", which to me implies that in the scenario listed above it's from the second-stage translation, because that is the final stage. But then you also said that it is "the most restrictive of either stages", which implies it can be from either of the two stages, whichever is most restrictive.

Sorry but this is what I am confused about.

ved-rivos commented 6 months ago

When two-stage address translation is performed, the permissions and page size output by the process are always the most restrictive permissions and page size. For instance, if the page is writable in VS-stage but not writable in G-stage, or vice versa, then the accumulated permission is not-writable. If the page size in VS-stage is 2 MiB and in G-stage is 4 KiB, or vice versa, then the page size is 4 KiB, and so on. So while the GPA is returned to the device, the permissions and page size returned are the output of the two-stage address translation.

> The question is: Which permissions and (page) size do we return?

We return the permissions and page size that result from the two-stage address translation: the most restrictive of the permissions provided by either stage and the smallest page size encoded in either stage.
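The combination rule described above can be sketched as follows. This is a minimal illustration only: `stage_result_t` and `combine_stages` are hypothetical names, not taken from the IOMMU specification or the reference model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type for illustration: the outcome of one translation stage. */
typedef struct {
    bool r, w, x;        /* permissions granted by that stage's leaf PTE */
    uint64_t page_size;  /* leaf page size in bytes, e.g. 4096 or 2 MiB  */
} stage_result_t;

/* Combine VS-stage and G-stage results: a permission survives only if both
 * stages grant it, and the reported size is the smaller of the two. */
static stage_result_t combine_stages(stage_result_t vs, stage_result_t g)
{
    stage_result_t out;
    out.r = vs.r && g.r;
    out.w = vs.w && g.w;
    out.x = vs.x && g.x;
    out.page_size = (vs.page_size < g.page_size) ? vs.page_size : g.page_size;
    return out;
}
```

With a VS-stage leaf that is writable and 2 MiB and a G-stage leaf that is read-only and 4 KiB, the combined result is read-only and 4 KiB, matching the examples in the comment above.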

just-for-fun-too commented 6 months ago

Thank you very much for the examples! This answers my questions very clearly.

BTW I did check the RISC-V Instruction Set Manual Volume II: Privileged Architecture also just in case I missed this. But even there I haven't been able to find this concept of restrictive permissions and the smallest page size explicitly called out anywhere. Neither did I find it anywhere in the IOMMU specification.

yanhe234 commented 1 month ago

Hello, why is the permission check conditional here? Why is the permission check not performed for ATS? When translating both the first and second stages, is a permission check missing for one stage? Please correct me if I am mistaken. I look forward to your reply.

    check_access_perms = ( TTYP != PCIE_ATS_TRANSLATION_REQUEST ) ? 1 : 0;
    if ( (ioatc_status = lookup_ioatc_iotlb(req->tr.iova, check_access_perms, priv, is_read, is_write,
                  is_exec, SUM, PSCV, PSCID, GV, GSCID, &cause, &pa, &page_sz,
                  &vs_pte, &g_pte, &is_msi)) == IOATC_FAULT )
        goto stop_and_report_fault;

ved-rivos commented 1 month ago

This follows the ATS protocol.

If the IOMMU was unable to translate an address because of an error in the IOMMU, then it returns a CA.
If the IOMMU was unable to translate an address due to an error that the OS can correct, but until it is corrected the IOMMU cannot provide any translation completions, then it returns a UR.
If the IOMMU could not find a valid translation, it returns success with R=W=0.
If a valid translation could be found, then it returns the permissions available. No fault is caused. This is to allow for recovery from missing permissions. If the device requests write permission but write permission was not available, then it's not a fault: the IOMMU just returns W=0 in the completion. In this case the device may generate a Page Request to request the additional permissions.
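The four cases above can be sketched as a small decision function. The enum and struct names here are hypothetical and chosen only for illustration; CA and UR are the PCIe Completer Abort and Unsupported Request completion statuses.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not from the reference model. */
typedef enum { CPL_SUCCESS, CPL_UR, CPL_CA } cpl_status_t;

typedef enum {
    XLATE_OK,           /* a valid translation was found               */
    XLATE_NOT_PRESENT,  /* no valid translation exists                 */
    XLATE_SW_FIXABLE,   /* error the OS can correct (e.g. bad config)  */
    XLATE_IOMMU_ERROR   /* error inside the IOMMU itself               */
} xlate_outcome_t;

typedef struct {
    cpl_status_t status;
    bool r, w;          /* permissions reported in the completion */
} ats_cpl_t;

/* Map a translation outcome to an ATS completion. Missing permissions are
 * never a fault here: they are simply reported as R=0 and/or W=0. */
static ats_cpl_t ats_completion(xlate_outcome_t outcome, bool perm_r, bool perm_w)
{
    ats_cpl_t cpl = { CPL_SUCCESS, false, false };
    switch (outcome) {
    case XLATE_IOMMU_ERROR: cpl.status = CPL_CA; break;
    case XLATE_SW_FIXABLE:  cpl.status = CPL_UR; break;
    case XLATE_NOT_PRESENT: /* success with R=W=0 */ break;
    case XLATE_OK:          cpl.r = perm_r; cpl.w = perm_w; break;
    }
    return cpl;
}
```

A request for write access to a read-only page would therefore complete as `CPL_SUCCESS` with `r = true, w = false`, which is why the access-permission check is skipped for ATS translation requests in the quoted lookup.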

yanhe234 commented 1 month ago

Many thanks. Can you also explain the following code, which relates to the fourth scenario in your answer? Why are all the PTEs read here the same, and why is amo_gpte.W checked? I don't think it's much different from gpte.W.

    gpte_changed = (amo_gpte.raw == gpte->raw) ? 0 : 1;

    if ( gpte_changed == 0 ) {
        amo_gpte.A = 1;
        // The case for is_write == 1 && pte.W == 0 is to address ATS translation
        // requests that may request write permission when write permission does not
        // exist. If write permission exists then the D bit is set else D bit is not
        // set and the write permission is returned in responses as 0.
        if ( (is_write == 1) && (amo_gpte.W == 1) ) amo_gpte.D = 1;
    }

Another question: if what I said above is correct, why is there a comment about the case is_write == 1 && pte.W == 0?

ved-rivos commented 1 month ago

> I don't think it's much different from gpte.W?

Between the first load and the amo-load, the PTE may have been changed by SW or it may have been updated by another IOMMU or CPU MMU.

> Another question is, if what I said above is correct, why is there a comment about the case is_write == 1 && pte.W == 0?

The IOMMU cannot grant write permission to the device without setting the D bit. This is because once write permission has been granted, the device may use translated requests to modify memory, and if the D bit is not set then the OS cannot know that the page was modified. The IOMMU does not set the D bit unless the device indicates intent to write by setting no-write to 0. For more details, see section 10.2.2.5 of the PCIe 6.0 specification and also the discussion in section 10.1.2 of the PCIe 6.0 specification about conservative dirty-on-write-permission-grant behavior.
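A minimal sketch of this rule, under the assumption that W may be reported whenever the PTE is writable and its D bit is set, either already or by this request. `pte_wd_t` and `ats_grant_write` are hypothetical names, not the reference model's code.

```c
#include <stdbool.h>

/* Hypothetical view of the two PTE bits that matter here. */
typedef struct {
    bool w;  /* PTE write permission */
    bool d;  /* PTE dirty bit        */
} pte_wd_t;

/* Decide the W bit for an ATS completion. D is set only when the page is
 * writable and the device signalled intent to write (no_write == 0).
 * Assumption: W is reported only if D ends up set, so the OS can always
 * tell that a page the device may write has been marked dirty. */
static bool ats_grant_write(pte_wd_t *pte, bool no_write)
{
    if (pte->w && !no_write)
        pte->d = true;
    return pte->w && pte->d;
}
```

When the page is not writable, D is left untouched and W=0 is returned, which is the is_write == 1 && pte.W == 0 case mentioned in the code comment.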

yanhe234 commented 1 month ago

> Between the first load and the amo-load, the PTE may have been changed by SW or it may have been updated by another IOMMU or CPU MMU.

But the premise here is that gpte_changed == 0, that is, amo_gpte.raw == gpte->raw. That is to say, there has been no change.

    gpte_changed = (amo_gpte.raw == gpte->raw) ? 0 : 1;

ved-rivos commented 1 month ago

If amo_gpte.raw == gpte->raw is 1, then gpte_changed is 0 => gpte did not change.
If amo_gpte.raw == gpte->raw is 0, then gpte_changed is 1 => gpte changed.

If gpte did not change then the A and/or D bit is updated, else the process retries by going back to step 2.
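The compare-then-update step can be sketched with a C11 compare-and-swap standing in for the atomic sequence. This is an illustration under that assumption, not the reference model's code; `update_ad` is a hypothetical helper, and the bit positions are the standard RISC-V PTE W, A and D bits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_W (1ull << 2)  /* write permission */
#define PTE_A (1ull << 6)  /* accessed         */
#define PTE_D (1ull << 7)  /* dirty            */

/* Try to set A (and D for a write to a writable page) in the PTE in memory.
 * `walked` is the value the page walk read earlier. Returns false if the PTE
 * in memory no longer matches it, in which case nothing is written and the
 * caller is expected to retry the walk from the earlier step. */
static bool update_ad(_Atomic uint64_t *pte_mem, uint64_t walked, bool is_write)
{
    uint64_t desired = walked | PTE_A;
    if (is_write && (walked & PTE_W))
        desired |= PTE_D;
    return atomic_compare_exchange_strong(pte_mem, &walked, desired);
}
```

The comparison exists because software, another IOMMU, or a CPU MMU may modify the PTE between the first load and the atomic update; checking W on the freshly compared value ensures D is never set on a page that is not writable.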

riscv-non-isa / riscv-iommu

Size and permissions returned in the Completion of PCIe ATS translation request #335