[PAL/Linux-SGX] AEX-Notify 5/5: Add AEX-Notify flows in exception handling

dimakuv commented 1 month ago

Description of the changes

Part 5 in AEX-Notify series.

This PR adds the AEX-Notify flows inside the enclave.

When AEX-Notify is enabled, the stage-1 signal handler is augmented as follows: it manually restores the SSA[0] context, invokes the EDECCSSA instruction instead of EEXIT (to go from SSA[1] to SSA[0] without exiting the enclave), and finally jumps to SSA[0].GPRSGX.RIP to resume enclave execution (which resumes in the stage-2 signal handler).

The stage-2 signal handler is augmented as follows: it sets bit 0 of SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for this thread), then applies AEX-Notify mitigations, and finally restores regular enclave execution.
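The two-stage flow above can be sketched as a toy state model. This is only an illustration, not Gramine's code: the class and field names (`SsaFrame`, `ToyThread`, `resume_rip`) are hypothetical, register restoration is omitted, and the model assumes that notifications stay disarmed between stage 1 and stage 2 and that the saved RIP is redirected to the stage-2 handler.

```python
from dataclasses import dataclass, field


@dataclass
class SsaFrame:
    rip: str = "app_code"        # stands in for SSA[0].GPRSGX.RIP
    aexnotify_bit0: bool = True  # stands in for bit 0 of SSA[0].GPRSGX.AEXNOTIFY


@dataclass
class ToyThread:
    cssa: int = 0                # current SSA index
    rip: str = "app_code"        # where the thread is currently executing
    ssa0: SsaFrame = field(default_factory=SsaFrame)
    resume_rip: str = ""         # hypothetical slot: where stage 2 returns to
    trace: list = field(default_factory=list)


def aex(t: ToyThread) -> None:
    """Asynchronous exit: state is saved into SSA[0] and CSSA becomes 1."""
    t.ssa0.rip = t.rip
    t.cssa = 1
    t.trace.append("aex")


def stage1(t: ToyThread) -> None:
    """Runs on SSA[1] after the enclave is re-entered."""
    # Assumption of this model: notifications are off until stage 2 re-arms
    # them, and the saved RIP is redirected to the stage-2 handler.
    t.ssa0.aexnotify_bit0 = False
    t.resume_rip = t.ssa0.rip
    t.ssa0.rip = "stage2_handler"
    t.cssa -= 1                  # models EDECCSSA (used instead of EEXIT)
    t.rip = t.ssa0.rip           # models the jump to SSA[0].GPRSGX.RIP
    t.trace.append("stage1")


def stage2(t: ToyThread) -> None:
    """Runs on SSA[0], still inside the enclave."""
    t.ssa0.aexnotify_bit0 = True   # re-arm AEX-Notify for this thread
    t.trace.append("mitigations")  # placeholder: no real mitigations here
    t.rip = t.resume_rip           # restore regular enclave execution
    t.trace.append("stage2")


def run_once() -> ToyThread:
    """Drive one AEX through both stages and return the final state."""
    t = ToyThread()
    aex(t)
    stage1(t)
    stage2(t)
    return t
```

In this model the thread never leaves the enclave between `stage1` and `stage2`: the CSSA index drops back to 0 and execution continues at whatever address stage 1 wrote into the saved RIP.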

This PR does not add any real AEX-Notify mitigations. Instead, we count the number of AEX events reported inside the SGX enclave and print this number on enclave termination (if the log level is at least "warning").
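A minimal sketch of this counting behavior, assuming a log-level ordering of none < error < warning < debug < trace < all. The `AexCounter` class and the message text are hypothetical and are not taken from the PR.

```python
from enum import IntEnum
from typing import Optional


class LogLevel(IntEnum):
    # Assumed ordering; higher values are more verbose.
    NONE = 0
    ERROR = 1
    WARNING = 2
    DEBUG = 3
    TRACE = 4
    ALL = 5


class AexCounter:
    """Counts AEX events in place of running a real mitigation."""

    def __init__(self) -> None:
        self.count = 0

    def on_aex(self) -> None:
        # Called once per AEX event reported inside the enclave.
        self.count += 1

    def report(self, log_level: LogLevel) -> Optional[str]:
        # Emit the total only if the log level is at least "warning".
        if log_level >= LogLevel.WARNING:
            return f"AEX events reported inside the enclave: {self.count}"
        return None
```

For example, after three calls to `on_aex()`, `report(LogLevel.ERROR)` returns `None`, while `report(LogLevel.WARNING)` returns the message with a count of 3.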

Note that the current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows coalescing multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.

How to test this PR?

AEX-Notify is enabled in all LibOS/PAL test manifests if the AEXNOTIFY=1 environment variable is set.


scottconstable commented 1 month ago

Reviewable status: 0 of 61 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners

a discussion (no related file): Quick perf numbers for Gramine built in Release mode on Ubuntu 24.04 with Linux v6.11.

Not sure if they are useful, just wanted to post them here. They show that the current AEX-Notify implementation (with a dummy mitigation) has a small overhead.

make clean; AEXNOTIFY=0 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 200.36s

make clean; AEXNOTIFY=0 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 138.00s

make clean; AEXNOTIFY=1 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 208.25s

make clean; AEXNOTIFY=1 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation' -- done in 141.50s
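For reference, the relative overhead implied by these four timings can be worked out as follows. This is a small sketch using only the numbers quoted above. They are single-run wall-clock times, so the percentages are indicative at best.

```python
# Wall-clock times in seconds, keyed by (EDMM, AEXNOTIFY), from the runs above.
timings_s = {
    (0, 0): 200.36,
    (1, 0): 138.00,
    (0, 1): 208.25,
    (1, 1): 141.50,
}


def overhead_pct(baseline: float, measured: float) -> float:
    """Relative slowdown of `measured` versus `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


for edmm in (0, 1):
    pct = overhead_pct(timings_s[(edmm, 0)], timings_s[(edmm, 1)])
    print(f"EDMM={edmm}: AEX-Notify overhead {pct:.2f}%")
# prints:
# EDMM=0: AEX-Notify overhead 3.94%
# EDMM=1: AEX-Notify overhead 2.54%
```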

Do you have any intuition as to why AEX-Notify introduces overhead for these tests? In the Intel SGX SDK implementation, we never observed overheads introduced by AEX-Notify, except in the microbenchmarks that repeatedly enter and exit the enclave and do nothing else.

scottconstable commented 1 month ago

Note that the current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows coalescing multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.

IMO the checkpoint mechanism is not strictly an optimization. In very specific circumstances, if the enclave repeatedly takes interrupts in the AEX-Notify handler that prevent the stack from unwinding, the enclave thread could run out of stack. This is a different user-observable behavior that would be introduced by this PR, and therefore the checkpoint mechanism, which prevents stack overflow, is not merely an optimization.
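The stack-exhaustion concern can be illustrated with a deliberately simplified model. It assumes that every AEX arriving before the handler unwinds pushes one more fixed-size handler frame, and that coalescing keeps the handler depth at one. The function names, frame size, and stack size are hypothetical and are not measurements from Gramine.

```python
def peak_stack_bytes(nested_aex_events: int, frame_bytes: int, coalesce: bool) -> int:
    """Peak signal-stack usage for one AEX plus `nested_aex_events` further
    AEX events that arrive before the handler has unwound."""
    depth = 1  # the first handler invocation
    peak = depth
    for _ in range(nested_aex_events):
        if not coalesce:
            depth += 1  # each nested AEX pushes another handler frame
        # With coalescing, the model keeps a single frame, so depth stays 1.
        peak = max(peak, depth)
    return peak * frame_bytes


def overflows(nested_aex_events: int, frame_bytes: int,
              stack_bytes: int, coalesce: bool) -> bool:
    """True if the modeled peak usage exceeds the available stack."""
    return peak_stack_bytes(nested_aex_events, frame_bytes, coalesce) > stack_bytes


# Hypothetical numbers: 1 KiB per handler frame, 64 KiB signal stack.
FRAME = 1024
STACK = 64 * 1024
print(overflows(100, FRAME, STACK, coalesce=False))  # True
print(overflows(100, FRAME, STACK, coalesce=True))   # False
```

Under these assumptions, stack usage without coalescing grows linearly with the number of back-to-back interrupts, whereas with coalescing it stays constant. That is why the difference can be observable as a crash rather than only as extra cycles.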

gramineproject / gramine

[PAL/Linux-SGX] AEX-Notify 5/5: Add AEX-Notify flows in exception handling #2037
