Closed jhauser-us closed 1 year ago
(Before despairing, be sure to read this entire comment.)
Yesterday, the Architecture Review Committee discussed the exception codes that harts may use for memory corruption traps. Although a proposal for these exception codes
| 32 | Instruction corruption | | 33 | Load data corruption | | 35 | Store/AMO data corruption |
was earlier presented to the ARC, on further review the committee realized yesterday it wasn't certain whether this really was the best set to define for RISC-V harts. In particular, there were concerns that software may want page table corruption to be distinguished from corruption of the "actual" memory access, and may want guest page tables distinguished from regular page tables (much as page fault exceptions are already distinct from access faults, and guest-page faults are separate from regular page faults). On the other hand, it wasn't obvious there was a need to distinguish corruption detected during loads versus that detected during stores/AMOs.
The ARC was concerned also that the contents of caches may be "poisoned" for reasons other than accidental corruption, and that some hardware might be unable to separate, after the fact, cache poisonings caused by corruption versus for other reasons. Consequently, the word corrupt may not adequately describe the reason for this fault in all cases.
The ARC definitely wishes to continue the policy that the IOMMU's use of standard exception codes in the range 0 through 63 matches the RISC-V Privileged Architecture's definition of those codes. Unfortunately, the ARC had more questions than answers about what the Privileged Architecture's exception codes should be, and fully resolving the matter might take many months.
In the meantime, to prevent the IOMMU specification from being blocked, the ARC is proposing a compromise. The IOMMU can allocate a single exception code > 256 to indicate a generic memory poisoned or corrupt fault. A standard-conforming implementation of a RISC-V IOMMU will be allowed to report poisoned or corrupt memory either with this single memory poisoned or corrupt exception code, or with a more specific standard exception code for poisoned or corrupt memory in the range 0 to 63, defined for traps on RISC-V harts. However, in the near term, until such codes in the range 0 to 63 have been standardized, only the IOMMU's single generic exception code for poisoned or corrupt memory will actually be usable.
Of course, a fault record with this generic memory poisoned or corrupt cause code should have the iotval field set to the original IOVA for the fault. If the IOMMU TG believes it is crucial to supply additional information with this "generic" fault, the ARC proposes that the TG standardize use of iotval2 for this purpose.
Some thoughts on the choice of using the error codes indicating the original access type:
Specifically for this version of the IOMMU specification:
TTYP
field of the fault record identifies the type of the transaction. The fault record includes the IOVA
and the device/process-id for the transaction. The use of iotval2
is not required. The RAS architecture typically augments these faults with additional information in RAS error records to provide additional information such as the physical address (if available and appropriate), set/way where the error may have been encountered (if appropriate), etc.
If you believe that all available information about a corruption will be provided elsewhere, then I interpret that to mean you won't object to relying on a single poisoned-or-corrupt cause code for such faults, as suggested.
Such corrupted data may not always originate from a cache so it may not be appropriate to consider these as only cache corruptions.
That doesn't matter. The point is that the fault might in some cases be indicated by a poison bit in a cache, and such poison bits can be set for other reasons too. This situation will lead to memory corruption potentially being indistinguishable from other reasons for the poison bit to get set.
The fact that this loss of information doesn't occur in certain circumstances (even if it's most circumstances) doesn't negate that it may occur.
That doesn't matter. The point is that the fault might in some cases be indicated by a poison bit in a cache, and such poison bits can be set for other reasons too. This situation will lead to memory corruption potentially being indistinguishable from other reasons for the poison bit to get set.
Yes, it is possible that a poison indicator may be used for other reasons but the intent is the same to prevent consumption of that data. It is also not required that in all cases where data is poisoned the backing memory is indeed corrupted - it may very well not be. Was the reason to use the term "data corruption" and not "memory corruption". The intent of the data poisoning should be indicate that the data is not fit for consumption and its attempted consumption should cause a fault. A single bit can only encode one bit of information. So I am not sure how an implementation that uses of one bit of information in a cache block can expect to provide a reason for setting that bit in the cache block when that cache block is accessed, likely a long time in future after the bit was set, without also holding along with that cache block additional bits of information. However as you noted we should discuss this outside the scope of this issue.
PR #184 to replace the codes 32, 33, and 35 - with a single code.
The Architecture Review Committee will probably need to discuss the IOMMU's new use of RAS exception codes as a result of commit 4f6fa21. In particular, when page tables are discovered to be corrupted, it's not obvious to me that the type of the "data corruption" exception code to report should be for the original access type. Reporting page table corruption as "instruction data corruption" or "store data corruption" seems misleading to me.
This GitHub issue is essentially a "hold" pending the ARC's assessment of the matter.