Closed dimakuv closed 1 week ago
@vijaydhanraj You're working on this, right? Let me assign you now, and if this is wrong, please reply and I'll remove the assignment.
UPDATE: I forgot that I can't assign non-maintainers. @vijaydhanraj Please write something in this issue, so that GitHub allows me to assign you.
Hi @dimakuv, please assign this to me.
I'm not quite familiar w/ the context and was just looking into this requirement. I have the following assumptions and would like to have more discussions/clarifications.
- Support mmap(..., MAP_NORESERVE) only for Linux-SGX PAL and when EDMM is enabled, i.e., Linux PAL should work as-is, which ignores MAP_NORESERVE.
- Extend filter_saved_flags(): https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/src/bookkeep/libos_vma.c#L35-L39 so that the MAP_NORESERVE flag can be saved in VMAs.
- Pass the MAP_NORESERVE flag to PAL by extending https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/pal/include/pal/pal.h#L189-L194 and https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/include/libos_flags_conv.h#L22.
- In _PalVirtualMemoryAlloc() of Linux-SGX PAL: https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/pal/src/host/linux-sgx/pal_memory.c#L28-L29, we then defer page pre-accepts (probably by an early return) when EDMM is enabled and MAP_NORESERVE is set.
- In memfault_upcall() https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/src/bookkeep/libos_signal.c#L337, we first check whether the #PF can be handled (e.g., the faulting address is contained in a VMA w/ MAP_NORESERVE; this is now also possible when RIP is in LibOS, which we previously considered an internal memory fault, because it can happen when some syscalls try to access user buffers). Then for the valid #PF, we leverage PalVirtualMemoryAlloc() w/ MAP_NORESERVE cleared and in page granularity to accept the pages (which should only happen for Linux-SGX PAL). Note that this would require sgx.require_exinfo=true to retrieve the faulting address.
- We may want to introduce a manifest option for this lazy allocation on MAP_NORESERVE. We can check the prerequisite of sgx.require_exinfo=true together w/ this option or simply fall back to the pre-accept mode when it's set to false.
Pls correct me if anything is incorrect. Thanks!
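The #PF step of the flow proposed above can be modelled in a few lines. This Python sketch is only an illustration under stated assumptions: the (start, end, flags) tuple layout for VMAs and the name page_to_accept() are hypothetical and do not correspond to Gramine code; MAP_NORESERVE uses its Linux x86-64 value.

```python
PAGE_SIZE = 4096          # assumed 4 KiB enclave pages
MAP_NORESERVE = 0x4000    # Linux x86-64 value of the flag


def page_to_accept(vmas, fault_addr):
    """Return (start, end) of the single page to accept lazily, or None.

    vmas is a list of hypothetical (start, end, saved_flags) tuples; the
    fault is only handled if it lands in a VMA that saved MAP_NORESERVE.
    """
    for start, end, flags in vmas:
        if start <= fault_addr < end and flags & MAP_NORESERVE:
            page = fault_addr & ~(PAGE_SIZE - 1)  # round down to page boundary
            return page, page + PAGE_SIZE
    return None


vmas = [(0x10000, 0x20000, MAP_NORESERVE), (0x20000, 0x30000, 0)]
print(page_to_accept(vmas, 0x12345))  # page containing the faulting address
print(page_to_accept(vmas, 0x20010))  # None: VMA was mapped without MAP_NORESERVE
```

Accepting in page granularity means a fault only ever commits the one page that was touched, not the whole VMA.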
@kailun-qin The proposed changes and flow look correct to me. Below are some notes.
- I see no reason to exclude the Linux PAL from handling MAP_NORESERVE. I think you can simply propagate this flag to the host as-is. IIUC, then the host Linux kernel will deal with #PF exceptions (i.e., no signal will be delivered to LibOS).
- [...] PROT_NONE. I think many applications use mmap(PROT_NONE) to achieve the same behavior as MAP_NORESERVE.
- [...] PAL_PROT_ value. We could just rely on PAL_PROT_NONE (== 0). In other words, both mmap(PROT_NONE) and mmap(MAP_NORESERVE) will end up calling PalVirtualMemoryAlloc(PAL_PROT_NONE), which will trigger the lazy-alloc behavior.
- [...] MAP_NORESERVE to the host, the mapping will be created on the host with PAL_PROT_NONE. Which I think will be semantically equivalent anyway.
- [...] mmap(PROT_READ) on such a prot-none mapping. In this case, the Linux-SGX PAL will need to allocate + accept the pages.
Actually, sorry, the more I think about the PROT_NONE approach suggested above, the more I don't like it. I think we should start with the MAP_NORESERVE flag only, as Kailun described in his proposal. We could extend it in the future, if the need arises.
We may want to introduce a manifest option for this lazy allocation on MAP_NORESERVE. We can check the prerequisite of sgx.require_exinfo=true together w/ this option or simply fall back to the pre-accept mode when it's set to false.
I don't like the new manifest option. Way too many options, and the "lazy acception" optimization shouldn't be controversial -- it seems to be always beneficial.
I don't think we need to require sgx.require_exinfo=true for this? I mean, 99.9% of the time the mappings with MAP_NORESERVE will not be accessed by the application at all. So the #PF-exception path will not be triggered at all. And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.
@dimakuv Thanks for the feedback! Pls kindly see my comments below.
I see no reason to exclude the Linux PAL from handling MAP_NORESERVE. I think you can simply propagate this flag to the host as-is. IIUC, then the host Linux kernel will deal with #PF exceptions (i.e., no signal will be delivered to LibOS).
Yeah, make sense to me.
I don't like the new manifest option. Way too many options, and the "lazy acception" optimization shouldn't be controversial -- it seems to be always beneficial. I don't think we need to require sgx.require_exinfo=true for this? I mean, 99.9% of the time the mappings with MAP_NORESERVE will not be accessed by the application at all. So the #PF-exception path will not be triggered at all.
Yes, I agree.
And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.
But LibOS cannot tell whether the zero faulting address is actually due to the app itself or the MAP_NORESERVE-caused #PF exception? (since in such case, the corresponding VMA shouldn't be found during the lookup process and the saved flag is hence unknown to LibOS).
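To complete the proposal, there are two other points that're not covered yet:
- File-backed mapping: For valid #PF exceptions, similarly, we may handle them here: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/bookkeep/libos_signal.c#L362, by leveraging libos mmap fs ops: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/include/libos_fs.h#L124-L125 w/ MAP_NORESERVE cleared and in page granularity to accept the pages.
- Unmapping: This can be a bit problematic.
  - First, as we have no PAL_PROT_* flags passed to PAL API PalVirtualMemoryFree(): https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/pal/include/pal/pal.h#L230 (and we probably don't want to extend it?), the only way seems to be skipping this deallocation on VMAs w/ MAP_NORESERVE flags during e.g., munmap() https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/sys/libos_mmap.c#L363-L365. Note that this is a common pattern for workloads using MAP_NORESERVE lazy allocation, where they usually first allocate w/ mmap(..., MAP_NORESERVE) and subsequently release/unmap the parts that're not aligned.
  - Second, for the pages committed on demand on #PF exceptions, I don't have a simple and good solution for when and how they should be released. I assume a (bit vector based) page tracker to track the commit status of pages (which may also cooperate w/ our VMA subsystem) would do the trick. But this can lead to a quite different design and I'm not sure if the MAP_NORESERVE support alone is worth this introduced complexity.
Any thoughts? Thanks!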
And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.
But LibOS cannot tell whether the zero faulting address is actually due to the app itself or the MAP_NORESERVE-caused #PF exception? (since in such case, the corresponding VMA shouldn't be found during the lookup process and the saved flag is hence unknown to LibOS).
I don't understand your point. If sgx.require_exinfo = false, then LibOS can't/shouldn't handle faulting addresses at all! Because in this case, no matter what enclave address the application tried to access, SGX will always report address 0, which is totally wrong.
If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.
To complete the proposal, there are two other points that're not covered yet:
File-backed mapping: For valid #PF exceptions, similarly, we may handle them here: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/bookkeep/libos_signal.c#L362, by leveraging libos mmap fs ops: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/include/libos_fs.h#L124-L125 w/ MAP_NORESERVE cleared and in page granularity to accept the pages.
I am not sure what you mean by this. Do you mean that file-backed mappings can also be with MAP_NORESERVE? I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.
- Unmapping: This can be a bit problematic.
  - First, as we have no PAL_PROT_* flags passed to PAL API PalVirtualMemoryFree(): https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/pal/include/pal/pal.h#L230 (and we probably don't want to extend it?), the only way seems to be skipping this deallocation on VMAs w/ MAP_NORESERVE flags during e.g., munmap() https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/sys/libos_mmap.c#L363-L365. Note that this is a common pattern for workloads using MAP_NORESERVE lazy allocation, where they usually first allocate w/ mmap(..., MAP_NORESERVE) and subsequently release/unmap the parts that're not aligned.
  - Second, for the pages committed on demand on #PF exceptions, I don't have a simple and good solution for when and how they should be released. I assume a (bit vector based) page tracker to track the commit status of pages (which may also cooperate w/ our VMA subsystem) would do the trick. But this can lead to a quite different design and I'm not sure if the MAP_NORESERVE support alone is worth this introduced complexity.
Yes, here I see the problem. I vote against both suggestions:
- [...] PalVirtualMemoryFree(), as this is an SGX EDMM-only specific logic, and it shouldn't be solved by special-casing the PAL API.
Can we use any additional interfaces of the SGX driver API, or can we use some properties of the SGX instructions to "get information" about the state of the to-be-freed enclave pages? E.g., maybe EMODPE instruction on an unaccepted page returns some special value in RAX, and that's how we can learn that the page was never EACCEPTed in the first place? Or maybe we can ask the SGX driver via some IOCTL "was this enclave page range ever allocated?".
If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.
Ah OK, I don't think we check it and we should probably check this in _PalExceptionHandler() of Linux-SGX PAL: https://github.com/gramineproject/gramine/blob/3b5c88b116a3fd42e8305eb8f8973ecc648afa7c/pal/src/host/linux-sgx/pal_exception.c#L323-L325.
Do you mean that file-backed mappings can also be with MAP_NORESERVE?
Yes, this was what I meant.
I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.
I agree that applications rarely use file-backed mappings with MAP_NORESERVE. I'm fine w/ not covering this case.
E.g., maybe EMODPE instruction on an unaccepted page returns some special value in RAX, and that's how we can learn that the page was never EACCEPTed in the first place?
I don't think EMODPE'll return anything, pls see https://www.felixcloutier.com/x86/emodpe.
Or maybe we can ask the SGX driver via some IOCTL "was this enclave page range ever allocated?".
SGX_IOC_ENCLAVE_MODIFY_TYPES IOCTL will return -EFAULT if an enclave page range was never allocated. I also checked other SGX driver provided IOCTLs (pls see https://docs.kernel.org/6.3/x86/sgx.html), but unfortunately no luck. Anyway, I'll ask around and see if there is any possibility.
I vote against both suggestions
What about we try to pre-fault all to-be-freed pages at the very beginning of sgx_edmm_remove_pages() https://github.com/gramineproject/gramine/blob/3b5c88b116a3fd42e8305eb8f8973ecc648afa7c/pal/src/host/linux-sgx/enclave_edmm.c#L86, by EACCEPT (using initial page permissions), regardless of whether they've been accepted or not.
If it's already committed, then we get SGX_PAGE_ATTRIBUTES_MISMATCH immediately and it can be simply ignored; if it's not, we trigger a #PF and SGX driver should handle it. Then the original flow of sgx_edmm_remove_pages() should not be impacted -- by first changing the pages' type to PT_TRIM etc.
I think this should introduce similar overhead if there is an SGX driver API / SGX instruction that we can rely on to "get information" about the state and handle accordingly, because this approach basically leverages EACCEPT to do that.
If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.
Ah OK, I don't think we check it and we should probably check this in _PalExceptionHandler() of Linux-SGX PAL:
Yep. Will you create such a PR?
I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.
I agree that applications rarely use file-backed mappings with MAP_NORESERVE. I'm fine w/ not covering this case.
Yes, let's not cover this case. Just put a FIXME comment in the code, that we currently ignore this case because we consider it rare and not worth optimizing for performance.
What about we try to pre-fault all to-be-freed pages at the very beginning of sgx_edmm_remove_pages(), by EACCEPT (using initial page permissions), regardless of whether they've been accepted or not. If it's already committed, then we get SGX_PAGE_ATTRIBUTES_MISMATCH immediately and it can be simply ignored; if it's not, we trigger a #PF and SGX driver should handle it. Then the original flow of sgx_edmm_remove_pages() should not be impacted -- by first changing the pages' type to PT_TRIM etc.
Yes, this approach looks ok-ish. I'm afraid we won't come up with anything better than this, at least not currently.
I think this should introduce similar overhead if there is an SGX driver API / SGX instruction that we can rely on to "get information" about the state and handle accordingly, because this approach basically leverages EACCEPT to do that.
I think the overhead will be significant, because for such MAP_NORESERVE pages we will have: EACCEPT + #PF + kernel handler + SGX handler + EACCEPT.
I was hoping for an overhead of ESOMEINSTRUCTION only, without exception handling. But looks like we can't do that. Oh well. At least for the initial PR, your approach should be good enough. It will kinda move the performance overhead from the mmap() time to the unmap() time.
Also, maybe this is a good point to ask SGX driver developers whether they can suggest a better solution, or even introduce a new IOCTL or something specially for us?
Yep. Will you create such a PR?
I created https://github.com/gramineproject/gramine/pull/1502 for this.
I think the overhead will be significant, because for such MAP_NORESERVE pages we will have: EACCEPT + #PF + kernel handler + SGX handler + EACCEPT.
Yes - for the pages that're not faulted/accessed at all. While for those that're allocated lazily (though can be very few w/ mmap(MAP_NORESERVE)), I suppose it'll be an overhead of EACCEPT only.
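To make the two cost cases above concrete, here is a rough Python model of the pre-fault-on-unmap idea. It is only a counting sketch under assumptions: prefault_cost() and its dictionary keys are hypothetical names, and real costs depend on the SGX driver and hardware.

```python
def prefault_cost(committed):
    """Count events when every to-be-freed page is EACCEPTed before removal.

    committed is a list of booleans, one per page: True if the page was
    already accepted (eagerly or on an earlier #PF), False otherwise.
    """
    cost = {"eaccept": 0, "page_fault": 0}
    for is_committed in committed:
        cost["eaccept"] += 1         # first EACCEPT attempt on every page
        if not is_committed:
            cost["page_fault"] += 1  # #PF + kernel handler + SGX handler
            cost["eaccept"] += 1     # EACCEPT retried once the page exists
    return cost


# One committed page and two never-touched pages.
print(prefault_cost([True, False, False]))
```

Already-committed pages pay for a single EACCEPT (which reports SGX_PAGE_ATTRIBUTES_MISMATCH and is ignored), while never-touched pages pay for the full fault path at unmap time.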
It will kinda move the performance overhead from the mmap() time to the unmap() time.
Right, exactly.
Also, maybe this is a good point to ask SGX driver developers whether they can suggest a better solution, or even introduce a new IOCTL or something specially for us?
Sure, I'll approach Haitao et al. to see if any better option.
I have two new notes:
On page fault handling, specifically in memfault_upcall() ...
The original design by Kailun uses the LibOS's memory-fault handler. I overlooked this design choice, but now I'm certain that this is wrong. LibOS is arch-agnostic and must not even know that things like (minor) page faults exist. Also, calling PalVirtualMemoryAlloc() on a piece of memory that LibOS already considers allocated is definitely an incorrect design.
So we actually need to intercept minor page faults in the SGX PAL: https://github.com/gramineproject/gramine/blob/a8edb2e17d7b9e95ff46e8ee4dcee47447808f33/pal/src/host/linux-sgx/pal_exception.c#L236
This is also correct from the other PAL's view: the Linux PAL never generates such minor page faults (because they are done completely by the underlying Linux host). Thus, the proposed memfault_upcall() modification would be SGX-specific, which goes against the Gramine LibOS vs PAL separation philosophy.
This bitmap vector will simply span the whole sgx.enclave_size, with each bit representing "enclave page X was eaccepted" (i.e. enclave page X is usable). E.g., for a 1GB enclave, the bitmap will contain 1024*1024*1024 / 4096 / 8 = 32768 bytes, or 32KB. For a 1TB enclave, the bitmap will contain 32MB. So, the memory overhead for a bitmap is 0.003%.
I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages). The real bulk of the memory bookkeeping is still on the LibOS VMA subsystem, including page permissions. Unfortunately, there seems to be no way to have an implementation of this lazy EDMM feature with a completely stateless code...
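The sizing figures above can be reproduced with a short calculation. This is a sketch assuming 4 KiB pages and one bit per page; bitmap_bytes() is a hypothetical helper name, not a Gramine function.

```python
PAGE_SIZE = 4096  # assumed 4 KiB enclave pages


def bitmap_bytes(enclave_size):
    """Bytes needed for a bitmap with one bit per enclave page."""
    pages = enclave_size // PAGE_SIZE
    return (pages + 7) // 8  # round up to whole bytes


GIB = 1 << 30
TIB = 1 << 40

print(bitmap_bytes(GIB))                    # 32768 bytes (32 KiB)
print(bitmap_bytes(TIB) // (1 << 20))       # 32 (MiB)
print(100.0 * bitmap_bytes(GIB) / GIB)      # overhead in percent, about 0.003
```

The ratio is independent of the enclave size: one bit per 4096 bytes is 1 / (4096 * 8) of the enclave, i.e. roughly 0.003%.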
The way I see it, the EDMM sub-component tracks EPCM attributes, which are SGX-specific and should not be in conflict with regular VMA attributes. The EACCEPT bitmap is one of them; even EPCM.R/W/X could be out of sync from libOS VMA record. I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags. But I'm not deep into gramine use cases, just something to consider.
Intel SDK and MS open enclave SDK (probably other runtimes too, given people are sending PRs and issues) are using the sgx-emm implementation, which has separate tracking for all those, and we didn't find it to be much overhead.
You can find the rationale for storing EPC states here: https://github.com/intel/sgx-emm/blob/main/design_docs/SGX_EDMM_driver_interface.md#enclave-handling-of-faults.
@haitaohuang Thanks for your inputs! Some comments below.
even EPCM.R/W/X could be out of sync from libOS VMA record
Why would the EPCM attributes be out of sync from the LibOS VMA page permissions? I see no real-world scenario when this can happen/is beneficial.
I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags
I think you're confusing Gramine's purpose here. We are not reusing the flags for our own purposes, instead we want to emulate the lazy-allocation behavior of the Linux x86 kernel. One of the simple cases is this MAP_NORESERVE case, which we are discussing in this issue. There will be more cases probably, and we'll evaluate whether SGX EDMM lazy allocation is performance-beneficial for each case, and if yes, we'll add these cases too. Typically, such cases are identified by mmap flags and/or by mprotect flags.
I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages).
hmm... But shouldn't we also update this bitmap vector every time we add/remove enclave pages? Otherwise, during #PF handling, how can we know whether an enclave page X was actually eaccepted (so that we can add it if it was not)?
I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages).
hmm... But shouldn't we also update this bitmap vector every time we add/remove enclave pages? Otherwise, during #PF handling, how can we know whether an enclave page X was actually eaccepted (so that we can add it if it was not)?
Yes, of course, sorry for the confusion. The bitmap must be updated on actual add/remove of enclave pages. What I was trying to say is that the only rationale for introducing this bitmap is to track lazy allocation of enclave pages via the #PF exceptions.
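A minimal Python model of such a bitmap, updated on every add/remove and consulted on #PF, could look as follows. The class AcceptedPages and its method names are hypothetical and only illustrate the bookkeeping; they are not taken from the Gramine sources.

```python
PAGE_SIZE = 4096  # assumed 4 KiB enclave pages


class AcceptedPages:
    """One bit per enclave page: set means the page was eaccepted."""

    def __init__(self, enclave_base, enclave_size):
        self.base = enclave_base
        self.bits = bytearray((enclave_size // PAGE_SIZE + 7) // 8)

    def _index(self, addr):
        return (addr - self.base) // PAGE_SIZE

    def mark(self, addr, accepted):
        """Record that the page containing addr was added or removed."""
        i = self._index(addr)
        if accepted:
            self.bits[i // 8] |= 1 << (i % 8)
        else:
            self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def is_accepted(self, addr):
        i = self._index(addr)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def needs_lazy_accept(self, fault_addr):
        """On #PF: the page must be added only if it was never accepted."""
        return not self.is_accepted(fault_addr)


tracker = AcceptedPages(0x100000, 16 * PAGE_SIZE)
print(tracker.needs_lazy_accept(0x103010))  # True: page not yet accepted
tracker.mark(0x103010, True)                # page added on the #PF path
print(tracker.needs_lazy_accept(0x103fff))  # False: same page, now accepted
```

The same structure answers the unmap question discussed earlier: a page whose bit is clear was never committed and needs no removal.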
even EPCM.R/W/X could be out of sync from libOS VMA record
Why would the EPCM attributes be out of sync from the LibOS VMA page permissions? I see no real-world scenario when this can happen/is beneficial.
This of course depends on the use case and may not be applicable to gramine. In the multithreading case, you may have one thread that changes permissions and records the target permission in the VMA, but the EPCM is not changed yet. Say originally you have RW in both VMA and EPCM. After mprotect to change VMA to RX, before EMODPR, EMODPE, EACCEPT are done to finish change EPCM, another thread may come in and execute the code. In this window, EPCM=RW, but VMA=RX, #PF may happen. To handle the #PF, you need to track EPCM. In the linked reference, we documented this kind of scenario.
I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags
I think you're confusing Gramine's purpose here. We are not reusing the flags for our own purposes, instead we want to emulate the lazy-allocation behavior of the Linux x86 kernel. One of the simple cases is this MAP_NORESERVE case, which we are discussing in this issue. There will be more cases probably, and we'll evaluate whether SGX EDMM lazy allocation is performance-beneficial for each case, and if yes, we'll add these cases too. Typically, such cases are identified by mmap flags and/or by mprotect flags.
Yeah, I misspoke when I said "reuse" those flags. So for MAP_NORESERVE, and similarly for other flags, it will always be EAUG on #PF once you decide this is the way to go. I wonder if you would later use other criteria to determine this, e.g. the size of the area, or special situations like stack/heap where you may commit a portion on demand. If you plan to support those, then in the end some kind of flags needed in PAL to track which range is on demand which are eagerly committed, and some more explicit indicator passed to PAL for the mode of EPC committing.
BTW Linux kernel does not seem to do much for MAP_NORESERVE other than not reserving/accounting for swap. IIUC, it makes not much difference in terms of whether RAM is committed or not. "MAP_POPULATE do eager allocation, otherwise do lazy" seems to be a better heuristic. Not sure if it was considered.
@haitaohuang Thanks again for more insights!
After mprotect to change VMA to RX, before EMODPR, EMODPE, EACCEPT are done to finish change EPCM, another thread may come in and execute the code. In this window, EPCM=RW, but VMA=RX, #PF may happen.
This race should be impossible in normal execution of Gramine. Gramine's LibOS VMA subsystem synchronizes mprotect requests internally.
And if the application itself does it (one app thread performs mprotect, and another app thread accesses this same page), then it is a bug in the application, and Gramine is not supposed to "try to fix" bad behavior of the app.
If you plan to support those, then in the end some kind of flags needed in PAL to track which range is on demand which are eagerly committed, and some more explicit indicator passed to PAL for the mode of EPC committing.
Yes. After some more discussions with @kailun-qin, we currently think that we'll get away with the following metadata:
BTW Linux kernel does not seem to do much for MAP_NORESERVE other than not reserving/accounting for swap. IIUC, it makes not much difference in terms of whether RAM is committed or not. "MAP_POPULATE do eager allocation, otherwise do lazy" seems to be a better heuristic. Not sure if it was considered.
MAP_POPULATE does the opposite of what we want. We want in Gramine to commit enclave pages by default, unlike Linux which postpones committing memory pages by default. So for Linux, MAP_POPULATE is the flag to revert the default policy and to commit pages eagerly. However, in Gramine this is already the default policy, thus MAP_POPULATE is a no-op.
Note that we also don't want to change Gramine policy to the Linux one (always postpone committing pages, unless instructed otherwise). This would introduce tremendous performance overhead, due to the additional #PF flow, which is very expensive in SGX.
That's why in Gramine, we kinda have a reverse logic -- we try to find the flags that hint at "this memory range will probably never be needed". One good hinting flag that we observed in several workloads (most notably in Java) is MAP_NORESERVE. That's why this issue discusses this flag exactly.
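The "reverse logic" described above can be summarised as a small policy function. This Python sketch assumes the Linux x86-64 flag values and uses a hypothetical name commit_mode(); it models only the policy stated in this comment, not an actual Gramine code path.

```python
MAP_NORESERVE = 0x4000  # Linux x86-64 value
MAP_POPULATE = 0x8000   # Linux x86-64 value


def commit_mode(mmap_flags):
    """Eager commit is the default; MAP_NORESERVE hints at lazy commit.

    MAP_POPULATE is deliberately not inspected: eager commit is already
    the default policy, so the flag changes nothing here.
    """
    return "lazy" if mmap_flags & MAP_NORESERVE else "eager"


print(commit_mode(0))              # eager
print(commit_mode(MAP_POPULATE))   # eager (flag is effectively a no-op)
print(commit_mode(MAP_NORESERVE))  # lazy
```

How a request carrying both flags should behave is not covered in this discussion, so the sketch makes no claim about that combination beyond what the code literally does.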
Description of the feature
PR #1054 implemented an initial version of EDMM support.
In particular, every mmap(addr, size) request ends up allocating the range of enclave pages [addr, addr+size), via the call to pal_memory.c: sgx_edmm_add_pages() which in turn does a loop of sgx_eaccept + restrict/expand permissions over all pages in the range. In other words, the whole mmapped region is pre-accepted.
In some cases, applications may rely on lazy allocation of pages, where the VMAs are reserved but not actually committed to physical memory. In particular, mmap(..., MAP_NORESERVE) requests are used in such cases -- to mmap a huge chunk of memory (possibly never used in the future) at once and then commit pages on demand on page fault events.
So, our initial implementation of EDMM support doesn't have this concept of lazy allocation. Ideally, we would pre-accept the mmapped range only on some mmap() requests, and defer page accepts to page-fault events on other mmap() requests. One obvious heuristic to defer page accepts is when Gramine notices the MAP_NORESERVE flag in the mmap request.
Why Gramine should implement it?
Performance reasons, as well as to decrease the amount of required EPC (physical SGX memory). E.g., Java runtime may issue mmap(64GB, MAP_NORESERVE) -- current EDMM implementation will spend a lot of time and physical memory on allocating + accepting all 64GB of enclave pages. The improved implementation would not allocate these 64GB enclave pages at all (only the actually required subset of pages will be allocated + accepted during page fault handling).
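To give a feel for the scale of the Java example, the following Python sketch counts page accepts under both strategies. It assumes 4 KiB pages and reads "64GB" as 64 GiB; the function names are hypothetical and the touched addresses are made up purely for illustration.

```python
PAGE_SIZE = 4096  # assumed 4 KiB enclave pages


def eager_accepts(mapping_size):
    """Pages accepted at mmap() time when the whole range is pre-accepted."""
    return mapping_size // PAGE_SIZE


def lazy_accepts(touched_addresses):
    """Pages accepted on demand: one per distinct page actually touched."""
    return len({addr // PAGE_SIZE for addr in touched_addresses})


size = 64 << 30  # "64GB" from the example, read here as 64 GiB
print(eager_accepts(size))                # 16777216 pages accepted up front
print(lazy_accepts([0, 8, 4096, 5000]))   # 2 pages accepted on demand
```

With pre-accept, the cost scales with the size of the mapping; with lazy allocation, it scales with the number of pages the application really uses.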