RAS exception codes may need to change

jhauser-us commented 1 year ago

The Architecture Review Committee will probably need to discuss the IOMMU's new use of RAS exception codes as a result of commit 4f6fa21. In particular, when page tables are discovered to be corrupted, it's not obvious to me that the type of the "data corruption" exception code to report should be for the original access type. Reporting page table corruption as "instruction data corruption" or "store data corruption" seems misleading to me.

This GitHub issue is essentially a "hold" pending the ARC's assessment of the matter.

jhauser-us commented 1 year ago

(Before despairing, be sure to read this entire comment.)

Yesterday, the Architecture Review Committee discussed the exception codes that harts may use for memory corruption traps. Although a proposal for these exception codes

| 32 | Instruction corruption | | 33 | Load data corruption | | 35 | Store/AMO data corruption |

was earlier presented to the ARC, on further review the committee realized yesterday it wasn't certain whether this really was the best set to define for RISC-V harts. In particular, there were concerns that software may want page table corruption to be distinguished from corruption of the "actual" memory access, and may want guest page tables distinguished from regular page tables (much as page fault exceptions are already distinct from access faults, and guest-page faults are separate from regular page faults). On the other hand, it wasn't obvious there was a need to distinguish corruption detected during loads versus that detected during stores/AMOs.

The ARC was concerned also that the contents of caches may be "poisoned" for reasons other than accidental corruption, and that some hardware might be unable to separate, after the fact, cache poisonings caused by corruption versus for other reasons. Consequently, the word corrupt may not adequately describe the reason for this fault in all cases.

The ARC definitely wishes to continue the policy that the IOMMU's use of standard exception codes in the range 0 through 63 matches the RISC-V Privileged Architecture's definition of those codes. Unfortunately, the ARC had more questions than answers about what the Privileged Architecture's exception codes should be, and fully resolving the matter might take many months.

In the meantime, to prevent the IOMMU specification from being blocked, the ARC is proposing a compromise. The IOMMU can allocate a single exception code > 256 to indicate a generic memory poisoned or corrupt fault. A standard-conforming implementation of a RISC-V IOMMU will be allowed to report poisoned or corrupt memory either with this single memory poisoned or corrupt exception code, or with a more specific standard exception code for poisoned or corrupt memory in the range 0 to 63, defined for traps on RISC-V harts. However, in the near term, until such codes in the range 0 to 63 have been standardized, only the IOMMU's single generic exception code for poisoned or corrupt memory will actually be usable.

Of course, a fault record with this generic memory poisoned or corrupt cause code should have the iotval field set to the original IOVA for the fault. If the IOMMU TG believes it is crucial to supply additional information with this "generic" fault, the ARC proposes that the TG standardize use of iotval2 for this purpose.

ved-rivos commented 1 year ago

Some thoughts on the choice of using the error codes indicating the original access type:

The RAS architecture typically augments these faults with additional information in RAS error records to provide additional information such as the physical address (if available and appropriate), set/way where the error may have been encountered (if appropriate), etc.
Such corrupted data may not always originate from a cache so it may not be appropriate to consider these as only cache corruptions. For example, the load may be to uncacheable memory or performed with a NC/IO PBMT. Eventually, it is this supplementary information that is available in the RAS error records of the consumer component - e.g. the MMU or IOMMU or the load/store unit or the instruction fetch unit - and the error record of the component that deferred an uncorrected error - e.g. a device, memory, register, etc. - that would help guide it.
The intent of the data corruption exception being to precisely identify the load/store or instruction that attempted to consume the corrupted data and to contain further propagation of that corrupted data. This cannot be achieved reliably with an asynchronous RAS interrupt. The error handler may not access any PTE entries corresponding to the tval or attempt to read/decode any instructions at the epc. It may offline all pages that hold these PTEs and the eventual mapped page. However the further actions may be guided on knowing whether the original access was an instruction fetch - for example, such error can be recovered by reloading the code a non-volatile store, read-only data may perhaps be reloaded, etc.
Some implementations, due to lack of a cause code defined, tend to report such data corruptions as access violations of the original access type - including access faults on accessing G-stage PTEs.

Specifically for this version of the IOMMU specification:

Define a cause code beyond 256 as - page-table-corruption. The TTYP field of the fault record identifies the type of the transaction. The fault record includes the IOVA and the device/process-id for the transaction. The use of iotval2 is not required.
Add a non-normative note that if a future standardized cause code is defined in the RISC-V privileged specification to report attempted consumption of corrupted data then such error code may be also used.

jhauser-us commented 1 year ago

The RAS architecture typically augments these faults with additional information in RAS error records to provide additional information such as the physical address (if available and appropriate), set/way where the error may have been encountered (if appropriate), etc.

If you believe that all available information about a corruption will be provided elsewhere, then I interpret that to mean you won't object to relying on a single poisoned-or-corrupt cause code for such faults, as suggested.

Such corrupted data may not always originate from a cache so it may not be appropriate to consider these as only cache corruptions.

That doesn't matter. The point is that the fault might in some cases be indicated by a poison bit in a cache, and such poison bits can be set for other reasons too. This situation will lead to memory corruption potentially being indistinguishable from other reasons for the poison bit to get set.

The fact that this loss of information doesn't occur in certain circumstances (even if it's most circumstances) doesn't negate that it may occur.

ved-rivos commented 1 year ago

That doesn't matter. The point is that the fault might in some cases be indicated by a poison bit in a cache, and such poison bits can be set for other reasons too. This situation will lead to memory corruption potentially being indistinguishable from other reasons for the poison bit to get set.

Yes, it is possible that a poison indicator may be used for other reasons but the intent is the same to prevent consumption of that data. It is also not required that in all cases where data is poisoned the backing memory is indeed corrupted - it may very well not be. Was the reason to use the term "data corruption" and not "memory corruption". The intent of the data poisoning should be indicate that the data is not fit for consumption and its attempted consumption should cause a fault. A single bit can only encode one bit of information. So I am not sure how an implementation that uses of one bit of information in a cache block can expect to provide a reason for setting that bit in the cache block when that cache block is accessed, likely a long time in future after the bit was set, without also holding along with that cache block additional bits of information. However as you noted we should discuss this outside the scope of this issue.

ved-rivos commented 1 year ago

PR #184 to replace the codes 32, 33, and 35 - with a single code.

riscv-non-isa / riscv-iommu

RAS exception codes may need to change #180