
RISC-V Double Trap Fast-Track Extension

struggles with the premise #17

Closed sorear closed 5 months ago

sorear commented 6 months ago
  1. I am concerned about the testing impact of having two ways for S-mode or M-mode to receive a hardware error exception, one of which will occur much more rarely than the other. We would achieve much more confidence in the correctness of hardware error handling if we picked a single delivery method.
  2. In all previous cases, exception delegation primarily supports speed. Do we expect hardware error exceptions to be frequent enough to require sub-microsecond delivery latency?
  3. If a supervisor execution environment is delivering synchronous exceptions other than those authorized by the ratified privileged architecture, running without double trap is not a theoretical reliability problem but an exploitable security vulnerability. Malicious U-mode code can put a supervisor address into whatever register the S-mode trap handler initially swaps with sscratch, then spin on a fast syscall until a hardware error happens immediately after the swap, resulting in an attacker-controlled write. This suggests that a SEE needs to deliver hardware errors via some other mechanism until the supervisor has set up the double-trap extension. But if we're doing that, why not use the safe mechanism exclusively and not set up double-traps at all? (A sketch of the attack window follows below.)
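
A minimal sketch of the window described in item 3, loosely modeled on a Linux-style entry.S prologue. The labels and the TASK_TI_* offsets are illustrative placeholders, not the kernel's actual code:

    .equ TASK_TI_USER_SP,   0x10    # placeholder thread_info offsets
    .equ TASK_TI_KERNEL_SP, 0x18

    handle_exception:
        csrrw   tp, sscratch, tp          # swap: tp <- sscratch, sscratch <- old tp
        bnez    tp, .Lsave_context        # non-zero tp => trap came from user mode
                                          # a hardware error delivered at this point
                                          # re-enters handle_exception with sscratch
                                          # still holding the user-chosen value
        csrr    tp, sscratch              # trap from kernel: recover the kernel tp
    .Lsave_context:
        sd      sp, TASK_TI_USER_SP(tp)   # on re-entry tp is attacker-controlled, so
                                          # this store lands at an attacker-chosen
                                          # supervisor address
        ld      sp, TASK_TI_KERNEL_SP(tp) # switch to the kernel stack
        # ... save the rest of the context ...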

This was proposed by Andrew Lutomirski in 2016 and briefly considered as riscv/riscv-isa-manual#3 before being rejected as unnecessary, although discussion at the time focused on use cases other than hardware error exceptions.

I'd like to see discussion somewhere about the benefits of the proposed extension over a strategy where hardware errors are delivered to M-mode as normal exceptions or RNMIs, and forwarded to S-mode via the SBI Supervisor Software Event extension or something functionally equivalent.

ved-rivos commented 6 months ago

I am concerned about the testing impact of having two ways for S-mode or M-mode to receive a hardware error exception, one of which will occur much more rarely than the other. We would achieve much more confidence in the correctness of hardware error handling if we picked a single delivery method.

As with anything related to hardware errors, it is very hard to cause a real hardware error on demand. The FIT rates on most systems will be low enough that waiting for a hardware error to occur naturally is not a viable test strategy. To address the need for test stimulus, hardware provides injection capabilities that trigger the error-reporting events without requiring real errors. Hardware errors are by nature unpredictable and can occur at any point in program execution. If they happen in the critical phase of trap handling, they can lead to the double trap condition, which is what this extension addresses.

In all previous cases, exception delegation primarily supports speed. Do we expect hardware error exceptions to be frequent enough to require sub-microsecond delivery latency?

There are traditionally two ways of handling RAS errors - OS first and firmware first. While for some applications the latency requirement for handling a RAS error may not be very strict, for applications focused on safety - industrial, automotive, avionics, etc. - the latency may be critical. The architecture should not preclude the use of either choice for RAS handling.

If a supervisor execution environment is delivering synchronous exceptions other than those authorized by the ratified privileged architecture, running without double trap is not a theoretical reliability problem but an exploitable security vulnerability. Malicious U-mode code can put a supervisor address into whatever register the S-mode trap handler initially swaps with sscratch, then spin on a fast syscall until a hardware error happens immediately after the swap, resulting in an attacker-controlled write.

Such a supervisor design does not seem robust. The Linux kernel determines whether a trap originated in user mode or in supervisor mode based on sscratch in entry.S: sscratch is non-zero for traps originating in user mode and 0 for traps originating in supervisor mode, and in the supervisor-mode case the handler does not use the swapped-in scratch value.

This suggests that a SEE needs to deliver hardware errors via some other mechanism until the supervisor has set up the double-trap extension. But if we're doing that, why not use the safe mechanism exclusively and not set up double-traps at all?

Setting up the double trap mechanism should happen early - before the OS launches, and no later than the first transition to user-mode execution. We would want the boot loader to enable the double trap extension before the OS launches. It is enabled by default at reset for machine mode; the option to not have it enabled by default is there to provide backward compatibility.

This was proposed by Andrew Lutomirski in 2016 and briefly considered as https://github.com/riscv/riscv-isa-manual/issues/3 before being rejected as unnecessary, although discussion at the time focused on use cases other than hardware error exceptions.

We did consider overloading SIE as a way to indicate that exceptions are unexpected. But that assumption does not always hold: parts of the Linux kernel, such as guest memory read/write, may disable interrupts but still expect exceptions to occur and be handled.

sorear commented 6 months ago

To address the need for test stimulus, hardware provides injection capabilities that trigger the error-reporting events without requiring real errors.

I am aware of this, and was arguing that keeping one mechanism tested via injection would be easier than keeping two tested.

There are traditionally two ways of handling RAS errors - OS first and firmware first. While for some applications the latency requirement for handling a RAS error may not be very strict, for applications focused on safety - industrial, automotive, avionics, etc. - the latency may be critical. The architecture should not preclude the use of either choice for RAS handling.

Do "safety" applications want low average latency or low worst-case latency? Does this proposal improve worst case latency over a primarily firmware approach?

Such a supervisor design does not seem robust. The Linux kernel determines whether a trap originated in user mode or in supervisor mode based on sscratch in entry.S: sscratch is non-zero for traps originating in user mode and 0 for traps originating in supervisor mode, and in the supervisor-mode case the handler does not use the swapped-in scratch value.

I'm confused by this response. If a user syscall occurs, and a hardware error occurs on the second instruction of handle_exception, then handle_exception will be reentered with a user-provided value in sscratch, leading three instructions later to a user-controlled write of supervisor memory at an address of TASK_TI_USER_SP(tp). The supervisor design I was describing is not hypothetical.
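
Concretely, against a prologue like the one sketched earlier in this thread (a hypothetical trace; register contents are illustrative):

    # 1st entry (user ecall), the attacker having set tp = EVIL_ADDR beforehand:
    #   csrrw tp, sscratch, tp         -> tp = kernel task ptr, sscratch = EVIL_ADDR
    #   bnez  tp, .Lsave_context       -> hardware error exception delivered here
    # 2nd entry (hardware error), sscratch still holds EVIL_ADDR:
    #   csrrw tp, sscratch, tp         -> tp = EVIL_ADDR (non-zero)
    #   bnez  tp, .Lsave_context       -> taken: looks like a trap from user mode
    #   sd    sp, TASK_TI_USER_SP(tp)  -> supervisor store to EVIL_ADDR + offset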

Setting up the double trap mechanism should happen early - before the OS launches, and no later than the first transition to user-mode execution. We would want the boot loader to enable the double trap extension before the OS launches. It is enabled by default at reset for machine mode; the option to not have it enabled by default is there to provide backward compatibility.

Do you intend for the boot loader behavior you are describing to be incompatible with the riscv-profiles and server-soc specifications? If this is intended for embedded systems at roughly CLIC's level of bespokeness, I guess it is less of a problem.

ved-rivos commented 6 months ago

An operating system can choose to use only the firmware-first or only the OS-first approach. Unless the OS decides to support both methods, there is no need to support two - the architecture does not force either choice. For safety applications it is usually the worst-case latency that matters. Some implementations may also prefer an OS-first model out of a desire to keep the firmware slim rather than locate all RAS handling frameworks in it.

I understood the initial case you called out to be one where a second exception/interrupt happens after the trap handler has become re-entrant. The case you describe now, where a second exception/interrupt occurs while the trap handler is not yet re-entrant, is exactly what the double trap extension addresses. Such an issue was also highlighted here.

Double trap handling requires the OS to be capable of managing the SDT bit, i.e. clearing it once the trap handler can be reentered. The extension thus needs to be enabled by the OS - an opt-in. An older OS image that does not have the capability to handle the SDT bit will not opt in. This opt-in could be done in the boot loader - based perhaps on some properties of the image or on user configuration - or by the kernel itself as part of its startup sequence.

sorear commented 6 months ago

I think we agree re. opt-in.

If a supervisor hasn't opted in to double traps, it's not safe for the supervisor to see pseudo-asynchronous hardware error exceptions, but hardware errors might still happen.

Would it be appropriate to say that the SEE will not set [mh]edeleg[19/HWE] to 1 unless [mh]envcfg.DTE is also 1? (Either non-normatively or somewhere out of band.)

Would it be appropriate to say that if hardware errors are not delegated, the SEE will report all hardware errors via the same mechanism as is used to report double traps?

Would it be appropriate to say that the mechanism used for non-delegated hardware errors is the same mechanism used for "firmware first" hardware errors?
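
For concreteness, the first question amounts to asking the SEE to follow a policy along the lines of the sketch below (a hypothetical M-mode fragment; cause 19/HWE is as quoted above, while the menvcfg.DTE bit position is an assumption taken from the draft extension):

    .equ CAUSE_HW_ERROR,  19     # hardware error exception cause, as quoted above
    .equ MENVCFG_DTE_BIT, 59     # assumed bit position of menvcfg.DTE

    delegate_hw_error_if_dte:
        csrr    t0, menvcfg
        srli    t0, t0, MENVCFG_DTE_BIT
        andi    t0, t0, 1
        beqz    t0, 1f                    # DTE off: keep hardware errors in M-mode
        li      t0, (1 << CAUSE_HW_ERROR)
        csrs    medeleg, t0               # DTE on: delegate hardware errors to S-mode
    1:
        ret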

ved-rivos commented 6 months ago

A supervisor that does not opt in to double trap may also be exposed to other bugs, such as memory corruption that corrupts page tables, or a kernel stack overflow. A supervisor would want to opt in to double traps whenever the extension is available, regardless of whether it also opts in to handling hardware error exceptions. Opting in to the extension comes at no cost, at least for the Linux kernel. The leading paragraph of the specification already notes explicitly that without this extension managing hardware errors is problematic for the kernel. This should address the concern that a supervisor may opt in to handling hardware errors without the extension being available.

sorear commented 6 months ago

Earlier in this thread you were arguing that firmware first and OS first were both valid strategies to support. Has your position changed?

If an uncorrectable error is detected in a page table and the supervisor hasn't opted in to sub-µs RAS events, the error will be delivered to machine mode and ultimately some action will be taken depending on platform policy. I don't see the problem there. If you mean undetected errors, trying to bound the possible effects of those is an impossible task.

What evidence do you have to support the zero cost in Linux claim? AFAICT, none of the existing sstatus writes are in the necessary places to bracket the critical sections where a new trap would overwrite data in CSRs.

ved-rivos commented 6 months ago

Earlier in this thread you were arguing that firmware first and OS first were both valid strategies to support. Has your position changed?

They are both policies and both are valid. Some segments may prefer OS first. For instance.

If an uncorrectable error is detected in a page table

I did not mean a RAS error - just a bad write, for instance through the identity-mapped range, that corrupts a PTE and leads to a page fault.

What evidence do you have to support the zero cost in Linux claim?

Please see. Line 62 already writes sstatus to clear the SUM, FS, and VS fields, and the SDT clearing can happen alongside it - see line 88.
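
A sketch of what that could look like (hypothetical, not the actual entry.S lines; the SUM/FS/VS positions are the standard sstatus fields, while the SDT position is an assumption from the draft extension):

    .equ SR_SUM, (1 << 18)    # supervisor access to user memory
    .equ SR_FS,  (3 << 13)    # floating-point unit status
    .equ SR_VS,  (3 << 9)     # vector unit status
    .equ SR_SDT, (1 << 24)    # assumed sstatus.SDT (double-trap armed) position

        # ... sepc, scause, sstatus and the GPRs have been saved to pt_regs, so a
        # nested trap can no longer destroy unsaved state ...
        li      t0, SR_SUM | SR_FS | SR_VS | SR_SDT
        csrc    sstatus, t0   # the existing SUM/FS/VS clear, with SDT cleared
                              # alongside: the handler is re-entrant again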

sorear commented 6 months ago

They are both policies and both are valid. Some segments may prefer OS first. For instance.

This appears to be an x86 focused document. x86 "OS first" passes through microcode, so it can be argued to be equivalent to a RISC-V strategy which passes through M-mode.

Something comparing delivery strategies on POWER or Arm would be much more compelling.

Please see. Line 62 already writes sstatus to clear the SUM, FS, and VS fields, and the SDT clearing can happen alongside it - see line 88.

Thanks. (You need to pull the sstatus write in ret_from_exception above the sscratch write, minding the conditional branch, otherwise you have a window where an asynchronous trap will think it's running on the user stack and corrupt the pt_regs on the kernel stack.)
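
In sketch form (hypothetical and heavily simplified; the pt_regs offsets and register choices are placeholders, and the exact SDT save/restore discipline is an assumption - the essential point is only the ordering of the two CSR writes):

    .equ PT_STATUS, 0x00     # placeholder pt_regs offsets
    .equ PT_EPC,    0x08
    .equ PT_SP,     0x10

    ret_to_user:
        ld      a0, PT_STATUS(sp)    # saved sstatus for the return to user mode
        csrw    sstatus, a0          # (1) sstatus write first, so SDT covers the
                                     #     remaining window
        csrw    sscratch, tp         # (2) only now mark "executing in user mode"; a
                                     #     trap after this point should escalate as a
                                     #     double trap rather than re-enter
                                     #     handle_exception on top of the live pt_regs
        ld      a1, PT_EPC(sp)
        csrw    sepc, a1
        # ... restore the remaining GPRs from pt_regs ...
        ld      sp, PT_SP(sp)
        sret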

I did not mean a RAS error - just a bad write, for instance through the identity-mapped range, that corrupts a PTE and leads to a page fault.

Is best-effort trapping of software errors really in scope here? How will you know when to stop? Plenty of software errors could affect handling of double faults.

ved-rivos commented 6 months ago

The Open Compute Project is not solely x86-focused - it is largely architecture-agnostic. On Arm, a synchronous external abort (SEA) is used to signal hardware errors - such as poison consumption - and it is handled by do_sea.

The detection of faults inside an exception handler helps improve the robustness - at least be able to get to a crash dump - similar to a fault outside the critical part of the exception handler. Besides the detection may help prevent such from being used as enabler like here. Support for detection of faulting behavior within an exception handler has also been part of architectures such as MIPS (status.EXL) and SPARC (PSR.ET).