riscv-non-isa / riscv-iommu

RISC-V IOMMU Specification
https://jira.riscv.org/browse/RVG-55
Creative Commons Attribution 4.0 International

How should IOMMU recover the faulty request? #261

Closed 18772820305 closed 10 months ago

18772820305 commented 11 months ago

After receiving a request from a device, the IOMMU reported a fault during the translation process. If the fault is recoverable, how should the faulting request be recovered? The RISC-V IOMMU spec does not seem to define a stall-and-resume mechanism like the one in the ARM SMMU spec, where software sends a CMD_RESUME command to resume stalled transactions.

ved-rivos commented 11 months ago

The DMA transaction that encountered the fault is not recoverable unless the device is using the PCIe ATS protocol. The DMA transaction receives an error response appropriate to the protocol in use (e.g., UR/CA, SLVERR, etc.). The model of stalling transactions and later resuming them is not supported. Only the terminate-on-fault model is supported.

18772820305 commented 11 months ago

Thank you for your reply. Does this mean that the faulting request needs to be cleared from the IOMMU, and that software then resends the request after receiving the fault notification?

ved-rivos commented 11 months ago

A memory read or write from a device may hit a fault condition. When this happens, two things are triggered:

  1. An error response is sent back to the device. This varies depending on the IO protocol. For PCIe, this could be a UR or CA response; for AXI, a SLVERR response.
  2. A fault record is generated and an interrupt is fired off to the IOMMU driver.
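The two steps above can be sketched as a toy software model. This is only an illustration of the terminate-on-fault idea, not the spec's data structures: `ToyIommu`, `FaultRecord`, the flat page-table dictionary, and the choice of UR (rather than CA) for PCIe are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    PCIE = "pcie"
    AXI = "axi"


# Error responses named in the discussion; picking UR over CA is arbitrary here.
ERROR_RESPONSE = {Protocol.PCIE: "UR", Protocol.AXI: "SLVERR"}


@dataclass
class FaultRecord:
    # Hypothetical fields, not the fault-queue record layout from the spec.
    device_id: int
    iova: int
    is_write: bool
    reason: str


class ToyIommu:
    def __init__(self):
        # (device_id, IOVA page number) -> (physical page number, writable)
        self.page_table = {}
        self.fault_queue = deque()
        self.interrupt_pending = False

    def map(self, device_id, iova, pa, writable):
        self.page_table[(device_id, iova >> 12)] = (pa >> 12, writable)

    def access(self, device_id, iova, is_write, protocol):
        """Translate one DMA access; on a fault, terminate it immediately."""
        entry = self.page_table.get((device_id, iova >> 12))
        if entry is None:
            return self._fault(device_id, iova, is_write, "not mapped", protocol)
        ppn, writable = entry
        if is_write and not writable:
            return self._fault(device_id, iova, is_write, "no write permission", protocol)
        return ("OKAY", (ppn << 12) | (iova & 0xFFF))

    def _fault(self, device_id, iova, is_write, reason, protocol):
        # Step 2: log a fault record and signal the driver.
        self.fault_queue.append(FaultRecord(device_id, iova, is_write, reason))
        self.interrupt_pending = True
        # Step 1: the device gets an error response; nothing is stalled or retried.
        return (ERROR_RESPONSE[protocol], None)
```

Note that the model keeps no copy of the failed transaction, so there is nothing for software to resume later; the fault queue only describes what happened.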

Device-specific behavior in response to this error response varies. A NIC might drop the packet, while an NVMe controller might mark the command as failed. In more severe cases, like a DMA read failure of a command descriptor, the device might become non-functional.
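As a rough illustration of how the outcome depends on the device and on what the failed DMA was fetching, the examples in the paragraph above can be written as a small lookup. The device kinds, the `DmaTarget` names, and the `device_reaction` function are hypothetical; real devices define their own error handling.

```python
from enum import Enum


class DmaTarget(Enum):
    PACKET_BUFFER = "packet buffer"
    DATA_BUFFER = "data buffer"
    COMMAND_DESCRIPTOR = "command descriptor"


def device_reaction(device_kind: str, target: DmaTarget) -> str:
    """Return an illustrative reaction to a DMA error response."""
    # Losing a command descriptor is the severe case mentioned above.
    if target is DmaTarget.COMMAND_DESCRIPTOR:
        return "device may become non-functional"
    if device_kind == "nic":
        return "drop the packet"
    if device_kind == "nvme":
        return "mark the command as failed"
    return "device-specific"
```

For example, `device_reaction("nic", DmaTarget.PACKET_BUFFER)` yields the packet-drop case, while a failed descriptor read on any device kind falls into the severe case.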

The fault report sent to the IOMMU driver is essentially a post-mortem; the faulting transaction itself can't be undone. The device has already been notified via the error response.

Faults like these should be rare and generally point to issues in the device driver or the device itself. Normally, the device driver should use the OS DMA APIs to ensure that memory addresses submitted to the device are both resident (pinned) and DMA-accessible. A fault triggered by absent pages in the page tables or insufficient permissions likely indicates a bug in this process. Similarly, if the device is buggy or misbehaving, unauthorized memory access attempts could also trigger faults. The IOMMU fault report serves as a diagnostic tool for identifying such issues.
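To show what using the fault report as a diagnostic tool might look like on the driver side, here is a minimal sketch that drains a queue of fault records and turns each one into a debugging hint. The record fields (`page_present` in particular), the function names, and the hint wording are assumptions for illustration; they are not the spec's fault-queue format or any OS driver's API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FaultRecord:
    # Hypothetical fields chosen for the example.
    device_id: int
    iova: int
    is_write: bool
    page_present: bool  # False: no mapping; True: mapping lacked permission


def diagnose(record: FaultRecord) -> str:
    """Turn one fault record into a human-readable hint for debugging."""
    access = "write" if record.is_write else "read"
    if not record.page_present:
        hint = "no mapping for this address; was the buffer mapped via the OS DMA API?"
    else:
        hint = "mapping forbids this access; check the permissions used when mapping"
    return f"device {record.device_id:#x}: {access} to IOVA {record.iova:#x} faulted: {hint}"


def drain_fault_queue(queue: deque) -> list:
    """Consume all pending records; the failed transactions are not retried."""
    reports = []
    while queue:
        reports.append(diagnose(queue.popleft()))
    return reports
```

The handler only reports; any recovery, such as resubmitting a command after fixing the mapping, would have to be driven by the device driver.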

ved-rivos commented 10 months ago

Thanks for raising this question, @18772820305. Hope that was helpful. If you have any more questions or concerns in the future, please don't hesitate to ask. Closing this issue now.