Open dimakuv opened 10 months ago
Why can't we start its execution in ring3? ld.so does many things we rather shouldn't be doing in vm-ring0 if it's possible to do them in vm-ring3.
> Why can't we start its execution in ring3?
Because our LibOS code doesn't currently have a place where it could give control to PAL to do additional things (like switching from ring-0 to ring-3).
More specifically, I think the LibOS should leave the exact way of jumping into the application to the PAL (i.e., `libos_elf_entry.nasm` should be PAL-specific code).
But I didn't want to modify the LibOS component at all, because this would be a rather intrusive change. So I left fixing this problem for later, when we have the code open-sourced and everyone agreeing on the general direction and design of TDX.
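To make the idea of leaving the jump to the PAL more concrete, here is a minimal user-space sketch in C. It is only a simulation: the privilege ring is tracked as a plain integer, and all names (`pal_jump_to_app_t`, `ring_unaware_jump_to_app`, `vm_jump_to_app`, `libos_start_app`) are hypothetical and do not exist in Gramine.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated privilege ring (assumption for illustration only);
 * real code would change rings with instructions such as iretq/sysretq. */
static int g_current_ring = 0;

typedef void (*app_entry_t)(void);

/* Hypothetical hook type: the PAL decides how control reaches the app. */
typedef void (*pal_jump_to_app_t)(app_entry_t entry);

/* Ring-unaware behavior: enter the app in whatever ring we are in now. */
static void ring_unaware_jump_to_app(app_entry_t entry) {
    entry();
}

/* Hypothetical VM-based PAL hook: drop to ring 3 (simulated), then enter. */
static void vm_jump_to_app(app_entry_t entry) {
    g_current_ring = 3;
    entry();
}

/* LibOS side: knows nothing about rings, only about the hook. */
static void libos_start_app(pal_jump_to_app_t hook, app_entry_t entry) {
    assert(hook != NULL && entry != NULL);
    hook(entry);
}

/* Stand-in for the application entry point: records the ring it starts in. */
static int g_ring_seen_by_app = -1;
static void fake_app_entry(void) {
    g_ring_seen_by_app = g_current_ring;
}
```

With such a hook, the LibOS code stays identical for all PALs, and only the hook implementation differs between ring-3-only PALs and VM-based PALs.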
There is also a problem around LibOS (1) delivering signals to the application and (2) performing `rt_sigreturn()` to the previously saved app context.
The problem stems from the fact that LibOS simply assumes it always executes in ring-3. So LibOS saves the app context as-is and then restores it as-is, immediately jumping to it. LibOS is not aware of the ring-0/ring-3 wrapper that we introduce in VM-based PALs.
For a hacky partial solution, see https://github.com/gramineproject/gramine-tdx/pull/36.
But ideally we still need some reasonable way to convey to the LibOS that it can't just `jmp *app_context_rip`. Instead, LibOS needs to invoke PAL's wrappers to exit from the syscall into the app context.
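The difference between a direct jump and going through a PAL wrapper can be sketched roughly as below. This is a user-space simulation under stated assumptions: the CPU state and app context are reduced to toy structs, and `pal_vm_resume_app`, `libos_direct_resume`, and `libos_sigreturn` are hypothetical names, not existing Gramine functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced app context; a real one holds all registers. */
struct app_context {
    uint64_t rip;
};

/* Simulated CPU state (assumption for illustration only). */
struct cpu_state {
    int ring;
    uint64_t rip;
};

/* Hypothetical hook type: how the final transfer to the app is performed. */
typedef void (*pal_resume_app_t)(struct cpu_state*, const struct app_context*);

/* Ring-unaware resume: restore the saved RIP and leave the ring unchanged,
 * which models a plain `jmp *app_context_rip`. */
static void libos_direct_resume(struct cpu_state* cpu,
                                const struct app_context* ctx) {
    cpu->rip = ctx->rip;
}

/* Hypothetical wrapper for VM-based PALs: lower the privilege level to
 * ring 3 (simulated) before resuming at the saved RIP. */
static void pal_vm_resume_app(struct cpu_state* cpu,
                              const struct app_context* ctx) {
    cpu->ring = 3;
    cpu->rip = ctx->rip;
}

/* Hypothetical tail of rt_sigreturn(): LibOS delegates the transfer. */
static void libos_sigreturn(struct cpu_state* cpu,
                            const struct app_context* saved,
                            pal_resume_app_t resume) {
    assert(cpu && saved && resume);
    resume(cpu, saved);
}
```

In this sketch the direct resume leaves the simulated CPU in ring 0 while executing app code, whereas the PAL wrapper ends up in ring 3 at the same RIP.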
Previous Gramine (`gramine-direct` and `gramine-sgx`) always executes in ring-3 and thus doesn't have a hook to add a ring0 -> ring3 transition before jumping from LibOS init phase to the application executable. These are the particular places where this jump happens:
Ideally, we want to introduce some hook / callback to PAL so that it can do its own "jump to userspace executable" logic. Or we can add a macro that will be something like:
How does it work now? Well, the executable (which is typically `ld.so`) starts in ring-0, and only after the first syscall invocation is finished, the executable will run in ring-3 (in case of `ld.so`, the first syscall is `brk()`). That's because our VM/TDX wrapper around syscalls is like this: https://github.com/gramineproject/gramine-tdx/blob/f4405d38d1a3b5e45146e25d07f589ab31d4e006/pal/src/host/vm-common/kernel_events.S#L145-L148
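The behavior described above can be modeled with a toy C simulation. It assumes only what is stated here, namely that the return path of the syscall wrapper always lands in ring 3; it does not reproduce the linked assembly, and `vm_syscall` and `fake_brk` are hypothetical names.

```c
#include <assert.h>

/* Simulated privilege ring: the executable starts in ring 0. */
static int g_ring = 0;

/* Simulated syscall wrapper: the handler runs in ring 0, and the return
 * path always drops to ring 3, regardless of the ring the caller was in. */
static long vm_syscall(long (*handler)(void)) {
    g_ring = 0;
    long ret = handler();
    g_ring = 3;
    return ret;
}

/* Stand-in for the first syscall issued by ld.so. */
static long fake_brk(void) {
    return 0;
}
```

Before the first `vm_syscall()` call the simulated ring is 0, and after it returns the ring is 3, which mirrors why the executable only leaves ring-0 once its first syscall completes.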