Open dimakuv opened 1 month ago
Reviewable status: 0 of 61 files reviewed, 4 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners
a discussion (no related file): Quick perf numbers for Gramine built in Release mode on Ubuntu 24.04 with Linux v6.11.
Not sure if they are useful, just wanted to post here. They show that current AEX-Notify (with dummy mitigation) has small overhead.
make clean; AEXNOTIFY=0 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 200.36smake clean; AEXNOTIFY=0 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 138.00smake clean; AEXNOTIFY=1 EDMM=0 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 208.25smake clean; AEXNOTIFY=1 EDMM=1 SGX=1 gramine-test pytest -k 'not TC_04_Attestation'
-- done in 141.50s
Do you have any intuition as to why AEX-Notify introduces overhead for these tests? In the Intel SGX SDK implementation, we never observed overheads introduced by AEX-Notify, except in the microbenchmarks that repeatedly enter and exit the enclave and do nothing else.
Note that current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows to coalesce multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.
IMO the checkpoint mechanism is not strictly an optimization. In very specific circumstances, if the enclave repeatedly takes interrupts in the AEX-Notify handler that prevent the stack from unwinding, the enclave thread could run out of stack. This is a different user-observable behavior that would be introduced by this PR, and therefore the checkpoint mechanism--which prevents stack overflow--is not merely an optimization.
Description of the changes
Part 5 in AEX-Notify series.
This PR adds the AEX-Notify flows inside the enclave.
The stage-1 signal handler is augmented as follows when AEX-Notify is enabled: manually restore SSA[0] context, invoke the EDECCSSA instruction instead of EEXIT (to go from SSA[1] to SSA[0] without exiting the enclave) and finally jump to SSA[0].GPRSGX.RIP to resume enclave execution (it will resume in stage-2 signal handler).
The stage-2 signal handler is augmented as follows: set bit 0 of SSA[0].GPRSGX.AEXNOTIFY (so that AEX-Notify starts working again for this thread), then apply AEX-Notify mitigations and finally restore regular enclave execution.
This PR does not add any real AEX-Notify mitigations. Instead, we count the number of AEX events reported inside the SGX enclave and print this number on enclave termination (if log level is at least "warning").
Note that current implementation of AEX-Notify does not use the checkpoint mechanism described in the official AEX-Notify whitepaper. That checkpoint mechanism allows to coalesce multiple AEX events that occur during the execution of mitigations. This saves some CPU cycles and some signal-handling stack space, but we leave implementing this optimization as future work.
See also related PRs and discussions:
1530
1531
1948
Related documentation:
Closes #1530 Closes #1531
How to test this PR?
AEX-Notify is enabled in all LibOS/PAL test manifests if
AEXNOTIFY=1
environment variable is set.This change is