[LibOS,PAL/Linux-SGX] EDMM: Introduce lazy allocation

gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support

GNU Lesser General Public License v3.0


[LibOS,PAL/Linux-SGX] EDMM: Introduce lazy allocation #1099

Closed dimakuv closed 1 week ago

dimakuv commented 1 year ago

Description of the feature

PR #1054 implemented an initial version of EDMM support.

In particular, every mmap(addr, size) request ends up allocating the range of enclave pages [addr, addr+size), via the call to pal_memory.c: sgx_edmm_add_pages() which in turn does a loop sgx_eaccept + restrict/expand permissions over all pages in the range. In other words, the whole mmapped region is pre-accepted.

In some cases, applications may rely on lazy allocation of pages, where the VMAs are reserved but not actually committed to physical memory. In particular, mmap(..., MAP_NORESERVE) requests are used in such cases -- to mmap a huge chunk of memory (possibly never used in the future) at once and then commit pages on demand on page fault events.
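For reference, the application-side pattern described above can be sketched as follows on a plain Linux host. This is only an illustration: the function name and the sizes used are made up, and on native Linux the kernel (not the application) commits pages on first touch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve `size` bytes of address space with MAP_NORESERVE, write to only
 * `touch_count` pages spaced `stride` bytes apart, then unmap everything.
 * Returns 0 on success and -1 on failure. */
int reserve_and_touch(size_t size, size_t stride, size_t touch_count) {
    assert(stride > 0 && touch_count <= size / stride);

    char* base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Only these writes cause pages to be committed (via page faults). */
    for (size_t i = 0; i < touch_count; i++)
        base[i * stride] = 1;

    /* Read the values back to confirm the touched pages are usable. */
    int ok = 1;
    for (size_t i = 0; i < touch_count; i++)
        if (base[i * stride] != 1)
            ok = 0;

    if (munmap(base, size) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

For example, `reserve_and_touch((size_t)1 << 30, (size_t)1 << 20, 4)` reserves 1 GiB but touches only four pages.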

So, our initial implementation of EDMM support doesn't have this concept of lazy allocation. Ideally, we would pre-accept the mmapped range only on some mmap() requests, and defer page accepts to page-fault events on other mmap() requests. One obvious heuristic to defer page accepts is when Gramine notices MAP_NORESERVE flag in the mmap request.

Why Gramine should implement it?

Performance reasons, as well as to decrease the amount of required EPC (physical SGX memory). E.g., Java runtime may issue mmap(64GB, MAP_NORESERVE) -- current EDMM implementation will spend a lot of time and physical memory on allocating + accepting all 64GB of enclave pages. The improved implementation would not allocate these 64GB enclave pages at all (only the actually required subset of pages will be allocated + accepted during page fault handling).
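To put a number on the 64GB example, here is a small arithmetic sketch, assuming 4 KiB enclave pages and one accept per page; the helper names are hypothetical and not part of Gramine.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE UINT64_C(4096) /* assumed enclave page size: 4 KiB */

/* Pages accepted at mmap() time when the whole range is pre-accepted. */
uint64_t eager_accept_count(uint64_t size) {
    assert(size % SKETCH_PAGE_SIZE == 0);
    return size / SKETCH_PAGE_SIZE;
}

/* Pages accepted with lazy allocation: only the pages that actually fault. */
uint64_t lazy_accept_count(uint64_t size, uint64_t touched_pages) {
    assert(touched_pages <= eager_accept_count(size));
    return touched_pages;
}
```

With these assumptions, a 64 GiB mapping means 16,777,216 page accepts up front, whereas the lazy variant pays only for the pages the application really touches.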

dimakuv commented 1 year ago

@vijaydhanraj You're working on this, right? Let me assign you now, and if this is wrong, please reply and I'll remove the assignment.

UPDATE: I forgot that I can't assign non-maintainers. @vijaydhanraj Please write something in this issue, so that GitHub allows me to assign you.

vijaydhanraj commented 1 year ago

Hi @dimakuv, please assign this to me.

kailun-qin commented 1 year ago

I'm not quite familiar w/ the context and was just looking into this requirement. I have the following assumptions and would like to have more discussions/clarifications.

We'd like to support mmap(..., MAP_NORESERVE) only for Linux-SGX PAL and when EDMM is enabled, i.e., Linux PAL should work as-is which ignores MAP_NORESERVE.
I'd imagine some potential changes, listed below:
- Extend filter_saved_flags(): https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/src/bookkeep/libos_vma.c#L35-L39 so that MAP_NORESERVE flag can be saved in VMAs.
- Also propagate the MAP_NORESERVE flag to PAL by extending https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/pal/include/pal/pal.h#L189-L194 and https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/include/libos_flags_conv.h#L22.
- In _PalVirtualMemoryAlloc() of Linux-SGX PAL: https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/pal/src/host/linux-sgx/pal_memory.c#L28-L29, we then defer page pre-accepts (by probably an early return) when EDMM is enabled and MAP_NORESERVE is set.
- On page fault handling, specifically in memfault_upcall() https://github.com/gramineproject/gramine/blob/51d915ded30aa2ca55ad5c2e86075e95fe938ca2/libos/src/bookkeep/libos_signal.c#L337, we first check whether the #PF can be handled (e.g., the faulting address is contained in a VMA w/ MAP_NORESERVE; this is now also possible when RIP is in LibOS, which we previously considered an internal memory fault, because it can happen when some syscalls try to access user buffers). Then for a valid #PF, we leverage PalVirtualMemoryAlloc() w/ MAP_NORESERVE cleared and at page granularity to accept the pages (which should only happen for the Linux-SGX PAL). Note that this would require sgx.require_exinfo=true to retrieve the faulting address.
- We may want to introduce a manifest option for this lazy allocation on MAP_NORESERVE. We can check the prerequisite of sgx.require_exinfo=true together w/ this option or simply fallback to the pre-accept mode when it's set to false.
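The allocation-side part of the list above can be sketched as a small simulation. All names here (`SK_FLAG_NORESERVE`, `sketch_virtual_memory_alloc()`, the page counter) are hypothetical stand-ins; this is not the actual _PalVirtualMemoryAlloc() code, only the shape of the proposed early return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE      4096u
#define SK_FLAG_NORESERVE 0x1u /* hypothetical PAL-level flag mirroring MAP_NORESERVE */

static size_t g_accepted_pages; /* counts pages that would be added + accepted */

/* Stand-in for the per-page add/accept loop; here it only counts pages. */
static void sketch_add_pages(uintptr_t addr, size_t count) {
    (void)addr;
    g_accepted_pages += count;
}

/* Sketch of the allocation path: pre-accept unless the mapping asked for
 * lazy allocation and EDMM is enabled. Returns 0 on success. */
int sketch_virtual_memory_alloc(uintptr_t addr, size_t size, unsigned flags,
                                bool edmm_enabled) {
    assert(addr % SK_PAGE_SIZE == 0 && size % SK_PAGE_SIZE == 0);

    if (!edmm_enabled)
        return 0; /* assumption: without EDMM there is nothing to accept here */

    if (flags & SK_FLAG_NORESERVE)
        return 0; /* early return: defer accepts to page-fault time */

    sketch_add_pages(addr, size / SK_PAGE_SIZE);
    return 0;
}

size_t sketch_accepted_pages(void) {
    return g_accepted_pages;
}
```

In this sketch an ordinary 8-page request accepts 8 pages immediately, while a 1024-page request with the no-reserve flag accepts none until a fault arrives.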

Pls correct me if anything incorrect. Thanks!

dimakuv commented 1 year ago

@kailun-qin The proposed changes and flow look correct to me. Below are some notes.

I see no reason to exclude the Linux PAL from handling MAP_NORESERVE. I think you can simply propagate this flag to the host as-is. IIUC, then the host Linux kernel will deal with #PF exceptions (i.e., no signal will be delivered to LibOS).
Ideally, I would like to also expand this "lazy alloc" behavior to mappings with PROT_NONE. I think many applications use mmap(PROT_NONE) to achieve the same behavior as MAP_NORESERVE.
- If this indeed would work, then we will not need to introduce a new PAL_PROT_ value. We could just rely on PAL_PROT_NONE (== 0). In other words, both mmap(PROT_NONE) and mmap(MAP_NORESERVE) will end up calling PalVirtualMemoryAlloc(PAL_PROT_NONE), which will trigger the lazy-alloc behavior.
- If this would work, this would also affect the Linux PAL a bit: instead of propagating MAP_NORESERVE to the host, the mapping will be created on the host with PAL_PROT_NONE. Which I think will be semantically equivalent anyway.
- There is a problem with this PROT_NONE though: the Linux-SGX PAL will not have a good hint from the LibOS if the app later calls mmap(PROT_READ) on such a prot-none mapping. In this case, the Linux-SGX PAL will need to allocate + accept the pages.

Actually, sorry, the more I think about PROT_NONE approach suggested above, the more I don't like it. I think we should start with the MAP_NORESERVE flag only, as Kailun described in his proposal. We could extend it in the future, if need arises.

We may want to introduce a manifest option for this lazy allocation on MAP_NORESERVE. We can check the prerequisite of sgx.require_exinfo=true together w/ this option or simply fallback to the pre-accept mode when it's set to false.

I don't like the new manifest option. Way too many options, and the "lazy acception" optimization shouldn't be controversial -- it seems to be always beneficial.

I don't think we need to require sgx.require_exinfo=true for this? I mean, 99.9% of the time the mappings with MAP_NORESERVE will not be accessed by the application at all. So the #PF-exception path will not be triggered at all. And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.

kailun-qin commented 1 year ago

@dimakuv Thanks for the feedback! Pls kindly see my comments below.

I see no reason to exclude the Linux PAL from handling MAP_NORESERVE. I think you can simply propagate this flag to the host as-is. IIUC, then the host Linux kernel will deal with #PF exceptions (i.e., no signal will be delivered to LibOS).

Yeah, makes sense to me.

I don't like the new manifest option. Way too many options, and the "lazy acception" optimization shouldn't be controversial -- it seems to be always beneficial. I don't think we need to require sgx.require_exinfo=true for this? I mean, 99.9% of the time the mappings with MAP_NORESERVE will not be accessed by the application at all. So the #PF-exception path will not be triggered at all.

Yes, I agree.

And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.

But LibOS cannot tell whether the zero faulting address is actually due to the app itself or the MAP_NORESERVE-caused #PF exception? (since in such case, the corresponding VMA shouldn't be found during the lookup process and the saved flag is hence unknown to LibOS).

To complete the proposal, there are two other points that're not covered yet:

File-backed mapping: For valid #PF exceptions, similarly, we may handle them here: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/bookkeep/libos_signal.c#L362, by leveraging libos mmap fs ops: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/include/libos_fs.h#L124-L125 w/ MAP_NORESERVE cleared and at page granularity to accept the pages.
Unmapping: This can be a bit problematic.
- First, as we have no PAL_PROT_* flags passed to PAL API PalVirtualMemoryFree(): https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/pal/include/pal/pal.h#L230 (and we probably don't want to extend it?), the only way seems to be skipping this deallocation on VMAs w/ MAP_NORESERVE flags during e.g., munmap() https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/sys/libos_mmap.c#L363-L365. Note that this is a common pattern for workloads using MAP_NORESERVE lazy allocation, where they usually first allocate w/ mmap(..., MAP_NORESERVE) and subsequently release/unmap the parts that're not aligned.
- Second, for the pages committed on demand on #PF exceptions, I don't have a simple and good solution for when and how they should be released. I assume a (bit vector based) page tracker to track the commit status of pages (which may also cooperate w/ our VMA subsystem) would do the trick. But this can lead to a quite different design and I'm not sure if the MAP_NORESERVE support alone is worth this introduced complexity.

Any thoughts? Thanks!

dimakuv commented 1 year ago

And only in cases where this path will be triggered, we can immediately check whether the faulting address is zero, and if it is then we loudly fail with "please enable sgx.require_exinfo". I think this stop-gap will be enough.

But LibOS cannot tell whether the zero faulting address is actually due to the app itself or the MAP_NORESERVE-caused #PF exception? (since in such case, the corresponding VMA shouldn't be found during the lookup process and the saved flag is hence unknown to LibOS).

I don't understand your point. If sgx.require_exinfo = false, then LibOS can't/shouldn't handle faulting addresses at all! Because in this case, no matter what enclave address the application tried to access, SGX will always report address 0, which is totally wrong.

If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.

To complete the proposal, there are two other points that're not covered yet:

File-backed mapping: For valid #PF exceptions, similarly, we may handle them here: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/bookkeep/libos_signal.c#L362, by leveraging libos mmap fs ops: https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/include/libos_fs.h#L124-L125 w/ MAP_NORESERVE cleared and at page granularity to accept the pages.

I am not sure what you mean by this. Do you mean that file-backed mappings can also be with MAP_NORESERVE? I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.

Unmapping: This can be a bit problematic.

First, as we have no PAL_PROT_* flags passed to PAL API PalVirtualMemoryFree(): https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/pal/include/pal/pal.h#L230 (and we probably don't want to extend it?), the only way seems to be skipping this deallocation on VMAs w/ MAP_NORESERVE flags during e.g., munmap() https://github.com/gramineproject/gramine/blob/7485c90b1cb5b54695a7dcab1cdcb6361d179777/libos/src/sys/libos_mmap.c#L363-L365 . Note that this is a common pattern for workloads using MAP_NORESERVE lazy allocation, where they usually first allocate w/ mmap(..., MAP_NORESERVE) and subsequently release/unmap the parts that're not aligned.

Second, for the pages committed on demand on #PF exceptions, I don't have a simple and good solution for when and how they should be released. I assume a (bit vector based) page tracker to track the commit status of pages (which may also cooperate w/ our VMA subsystem) would do the trick. But this can lead to a quite different design and I'm not sure if the MAP_NORESERVE support alone is worth this introduced complexity.

Yes, here I see the problem. I vote against both suggestions:

- I don't want to introduce any new flags to PalVirtualMemoryFree(), as this is an SGX EDMM-only specific logic, and it shouldn't be solved by special-casing the PAL API.
- I don't want to introduce a bitvector page tracker, since this will introduce significant complexity. (In fact, the very first design of EDMM had such a bit vector page tracker, but we came up with a second design that was much simpler and cleaner.)

Can we use any additional interfaces of the SGX driver API, or can we use some properties of the SGX instructions to "get information" about the state of the to-be-freed enclave pages? E.g., maybe EMODPE instruction on an unaccepted page returns some special value in RAX, and that's how we can learn that the page was never EACCEPTed in the first place? Or maybe we can ask the SGX driver via some IOCTL "was this enclave page range ever allocated?".

kailun-qin commented 1 year ago

If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.

Ah OK, I don't think we check it and we should probably check this in _PalExceptionHandler() of Linux-SGX PAL: https://github.com/gramineproject/gramine/blob/3b5c88b116a3fd42e8305eb8f8973ecc648afa7c/pal/src/host/linux-sgx/pal_exception.c#L323-L325.

Do you mean that file-backed mappings can also be with MAP_NORESERVE?

Yes, this was what I meant.

I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.

I agree that applications rarely use file-backed mappings with MAP_NORESERVE. I'm fine w/ not covering this case.

E.g., maybe EMODPE instruction on an unaccepted page returns some special value in RAX, and that's how we can learn that the page was never EACCEPTed in the first place?

I don't think EMODPE'll return anything, pls see https://www.felixcloutier.com/x86/emodpe.

Or maybe we can ask the SGX driver via some IOCTL "was this enclave page range ever allocated?".

SGX_IOC_ENCLAVE_MODIFY_TYPES IOCTL will return -EFAULT if an enclave page range was never allocated. I also checked other SGX driver provided IOCTLs (pls see https://docs.kernel.org/6.3/x86/sgx.html), but unfortunately no luck. Anyway, I'll ask around and see if any possibility.

I vote against both suggestions

How about we try to pre-fault all to-be-freed pages at the very beginning of sgx_edmm_remove_pages() https://github.com/gramineproject/gramine/blob/3b5c88b116a3fd42e8305eb8f8973ecc648afa7c/pal/src/host/linux-sgx/enclave_edmm.c#L86, by EACCEPT (using initial page permissions), regardless of whether they've been accepted or not?

If it's already committed, then we get SGX_PAGE_ATTRIBUTES_MISMATCH immediately and it can be simply ignored; if it's not, we trigger a #PF and SGX driver should handle it. Then the original flow of sgx_edmm_remove_pages() should not be impacted -- by first changing the pages' type to PT_TRIM etc.

I think this should introduce overhead similar to what we'd have if there were an SGX driver API / SGX instruction that we could rely on to "get information" about the state and handle accordingly, because this approach basically leverages EACCEPT to do that.
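The pre-fault-before-remove idea can be modeled with a toy simulation. Nothing below executes real SGX instructions: the page states, `sim_eaccept()`, and the simulated driver fault handler are hypothetical stand-ins that only mirror the behavior described above (a mismatch result for already-committed pages, a #PF handled by the driver otherwise).

```c
#include <assert.h>
#include <stddef.h>

#define SIM_NPAGES 16

enum sim_page_state { SIM_UNALLOCATED, SIM_PENDING, SIM_COMMITTED };
enum sim_eaccept_result { SIM_ACCEPT_OK, SIM_ACCEPT_MISMATCH, SIM_ACCEPT_FAULT };

static enum sim_page_state g_pages[SIM_NPAGES]; /* zero-initialized: all unallocated */
static unsigned g_faults;

/* Simulated EACCEPT: mismatch if already committed, fault if never allocated. */
static enum sim_eaccept_result sim_eaccept(size_t idx) {
    switch (g_pages[idx]) {
        case SIM_COMMITTED:
            return SIM_ACCEPT_MISMATCH;
        case SIM_PENDING:
            g_pages[idx] = SIM_COMMITTED;
            return SIM_ACCEPT_OK;
        default:
            return SIM_ACCEPT_FAULT;
    }
}

/* Simulated driver #PF handler: allocates the page so that it can be accepted. */
static void sim_driver_handle_fault(size_t idx) {
    g_pages[idx] = SIM_PENDING;
    g_faults++;
}

/* Marks a page as already committed (e.g., lazily allocated earlier). */
void sim_commit_page(size_t idx) {
    assert(idx < SIM_NPAGES);
    g_pages[idx] = SIM_COMMITTED;
}

/* Pre-faults every page in the range, then runs the unchanged removal flow.
 * Returns the number of removed pages. */
size_t sim_remove_pages(size_t start, size_t count) {
    assert(start <= SIM_NPAGES && count <= SIM_NPAGES - start);

    for (size_t i = start; i < start + count; i++) {
        enum sim_eaccept_result res = sim_eaccept(i);
        if (res == SIM_ACCEPT_FAULT) {
            sim_driver_handle_fault(i);
            res = sim_eaccept(i); /* retry now that the page exists */
            assert(res == SIM_ACCEPT_OK);
        }
        /* SIM_ACCEPT_MISMATCH means "already committed" and is ignored. */
    }

    size_t removed = 0;
    for (size_t i = start; i < start + count; i++) {
        g_pages[i] = SIM_UNALLOCATED; /* stands in for the trim + remove steps */
        removed++;
    }
    return removed;
}

unsigned sim_fault_count(void) {
    return g_faults;
}
```

In the model, removing 8 pages of which 2 were committed costs 6 simulated faults, which illustrates where the extra unmap-time overhead would come from.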

dimakuv commented 1 year ago

If our current LibOS #PF handler doesn't check sgx.require_exinfo and propagates the address to the app anyway, then we have a bug.

Ah OK, I don't think we check it and we should probably check this in _PalExceptionHandler() of Linux-SGX PAL:

Yep. Will you create such a PR?

I would actually argue EDMM lazy allocation should not cover this case. In particular, I think you can just clear the MAP_NORESERVE flag in file-backed mappings. I am certain that no applications use file-backed mappings with MAP_NORESERVE.

I agree that applications rarely use file-backed mappings with MAP_NORESERVE. I'm fine w/ not covering this case.

Yes, let's not cover this case. Just put a FIXME comment in the code, that we currently ignore this case because we consider it rare and not worth optimizing for performance.

How about we try to pre-fault all to-be-freed pages at the very beginning of sgx_edmm_remove_pages(), by EACCEPT (using initial page permissions), regardless of whether they've been accepted or not? If it's already committed, then we get SGX_PAGE_ATTRIBUTES_MISMATCH immediately and it can be simply ignored; if it's not, we trigger a #PF and SGX driver should handle it. Then the original flow of sgx_edmm_remove_pages() should not be impacted -- by first changing the pages' type to PT_TRIM etc.

Yes, this approach looks ok-ish. I'm afraid we won't come up with anything better than this, at least not currently.

I think this should introduce overhead similar to what we'd have if there were an SGX driver API / SGX instruction that we could rely on to "get information" about the state and handle accordingly, because this approach basically leverages EACCEPT to do that.

I think the overhead will be significant, because for such MAP_NORESERVE pages we will have: EACCEPT + #PF + kernel handler + SGX handler + EACCEPT.

I was hoping for an overhead of ESOMEINSTRUCTION only, without exception handling. But looks like we can't do that. Oh well. At least for the initial PR, your approach should be good enough. It will kinda move the performance overhead from the mmap() time to the unmap() time.

Also, maybe this is a good point to ask SGX driver developers whether they can suggest a better solution, or even introduce a new IOCTL or something specially for us?

kailun-qin commented 1 year ago

Yep. Will you create such a PR?

I created https://github.com/gramineproject/gramine/pull/1502 for this.

I think the overhead will be significant, because for such MAP_NORESERVE pages we will have: EACCEPT + #PF + kernel handler + SGX handler + EACCEPT.

Yes - for the pages that're not faulted/accessed at all. While for those that're allocated lazily (though can be very few w/ mmap(MAP_NORESERVE)), I suppose it'll be an overhead of EACCEPT only.

It will kinda move the performance overhead from the mmap() time to the unmap() time.

Right, exactly.

Also, maybe this is a good point to ask SGX driver developers whether they can suggest a better solution, or even introduce a new IOCTL or something specially for us?

Sure, I'll approach Haitao et al. to see if any better option.

dimakuv commented 1 year ago

I have two new notes:

All #PF handling must be inside the SGX PAL

On page fault handling, specifically in memfault_upcall() ...

The original design by Kailun uses the LibOS's memory-fault handler. I overlooked this design choice, but now I'm certain that this is wrong. LibOS is arch-agnostic and must not even know that things like (minor) page faults exist. Also, calling PalVirtualMemoryAlloc() on a piece of memory that LibOS already considers allocated is definitely an incorrect design.

So we actually need to intercept minor page faults in the SGX PAL: https://github.com/gramineproject/gramine/blob/a8edb2e17d7b9e95ff46e8ee4dcee47447808f33/pal/src/host/linux-sgx/pal_exception.c#L236

This is also correct from the other PAL's view: the Linux PAL never generates such minor page faults (because they are done completely by the underlying Linux host). Thus, the proposed memfault_upcall() modification would be SGX-specific, which goes against the Gramine LibOS vs PAL separation philosophy.

We must introduce a bitmap vector

This bitmap vector will simply span the whole sgx.enclave_size, with each bit representing "enclave page X was eaccepted" (i.e. enclave page X is usable). E.g., for a 1GB enclave, the bitmap will contain 1024*1024*1024 / 4096 / 8 = 32768 bytes, or 32KB. For a 1TB enclave, the bitmap will contain 32MB. So, the memory overhead for a bitmap is 0.003%.
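The sizing arithmetic above can be double-checked with a tiny helper. The function name is hypothetical; the only assumptions are 4 KiB pages and one bit per page.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a one-bit-per-page bitmap covering `enclave_size` bytes,
 * assuming 4 KiB enclave pages. */
uint64_t bitmap_bytes(uint64_t enclave_size) {
    const uint64_t page_size = 4096;
    assert(enclave_size % page_size == 0);
    uint64_t pages = enclave_size / page_size;
    return (pages + 7) / 8; /* round up to whole bytes */
}
```

This gives 32768 bytes (32KB) for a 1GB enclave and 33554432 bytes (32MB) for a 1TB enclave, i.e. a ratio of 1/32768, or roughly 0.003%.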

I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages). The real bulk of the memory bookkeeping is still on the LibOS VMA subsystem, including page permissions. Unfortunately, there seems to be no way to have an implementation of this lazy EDMM feature with a completely stateless code...

haitaohuang commented 1 year ago

The way I see it, the EDMM sub-component tracks EPCM attributes, which are SGX-specific and should not be in conflict with regular VMA attributes. The EACCEPT bitmap is one of them; even EPCM.R/W/X could be out of sync from libOS VMA record. I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags. But I'm not deep into Gramine use cases, so this is just something to consider.

Intel SDK and MS Open Enclave SDK (probably other runtimes too, given that people are sending PRs and issues) are using the sgx-emm implementation, which has separate tracking for all of those, and we didn't find it to be much overhead.

You can find the rationale for storing EPC states here: https://github.com/intel/sgx-emm/blob/main/design_docs/SGX_EDMM_driver_interface.md#enclave-handling-of-faults.

dimakuv commented 1 year ago

@haitaohuang Thanks for your inputs! Some comments below.

even EPCM.R/W/X could be out of sync from libOS VMA record

Why would the EPCM attributes be out of sync from the LibOS VMA page permissions? I see no real-world scenario when this can happen/is beneficial.

I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags

I think you're confusing Gramine's purpose here. We are not reusing the flags for our own purposes, instead we want to emulate the lazy-allocation behavior of the Linux x86 kernel. One of the simple cases is this MAP_NORESERVE case, which we are discussing in this issue. There will be more cases probably, and we'll evaluate whether SGX EDMM lazy allocation is performance-beneficial for each case, and if yes, we'll add these cases too. Typically, such cases are identified by mmap flags and/or by mprotect flags.

kailun-qin commented 1 year ago

I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages).

hmm... But shouldn't we also update this bitmap vector every time we add/remove enclave pages? Otherwise, during #PF handling, how can we know whether an enclave page X was actually eaccepted (so that we can add it if it was not)?

dimakuv commented 1 year ago

I'd like to stress that this bitmap vector is introduced purely for the #PF handling (minor page faults due to not-yet-committed enclave pages).

hmm... But shouldn't we also update this bitmap vector every time we add/remove enclave pages? Otherwise, during #PF handling, how can we know whether an enclave page X was actually eaccepted (so that we can add it if it was not)?

Yes, of course, sorry for the confusion. The bitmap must be updated on actual add/remove of enclave pages. What I was trying to say is that the only rationale for introducing this bitmap is to track lazy allocation of enclave pages via the #PF exceptions.
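A minimal sketch of such a bitmap, assuming a tiny 64-page enclave and hypothetical `trk_*` helper names: the bits are updated on every add/remove of pages, and the fault path consults them to decide whether a page still needs to be committed. A real handler would also have to verify that the address belongs to a lazily allocated mapping, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRK_NPAGES 64u /* hypothetical tiny enclave: 64 pages */

static uint8_t g_committed[TRK_NPAGES / 8]; /* one bit per enclave page */

bool trk_is_committed(size_t page) {
    assert(page < TRK_NPAGES);
    return (g_committed[page / 8] >> (page % 8)) & 1u;
}

/* Called whenever enclave pages are actually added (true) or removed (false). */
void trk_mark(size_t start, size_t count, bool committed) {
    assert(start <= TRK_NPAGES && count <= TRK_NPAGES - start);
    for (size_t page = start; page < start + count; page++) {
        uint8_t mask = (uint8_t)(1u << (page % 8));
        if (committed)
            g_committed[page / 8] |= mask;
        else
            g_committed[page / 8] &= (uint8_t)~mask;
    }
}

/* Fault path: returns true if the fault was a "minor" one that is resolved by
 * committing the page now; false means it must be reported as a real fault. */
bool trk_handle_fault(size_t page) {
    if (page >= TRK_NPAGES || trk_is_committed(page))
        return false;
    trk_mark(page, 1, true); /* a real implementation would add + accept here */
    return true;
}
```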

haitaohuang commented 1 year ago

even EPCM.R/W/X could be out of sync from libOS VMA record

Why would the EPCM attributes be out of sync from the LibOS VMA page permissions? I see no real-world scenario when this can happen/is beneficial.

This of course depends on the use case and may not be applicable to Gramine. In the multithreading case, one thread may change permissions and record the target permissions in the VMA while the EPCM is not changed yet. Say originally you have RW in both VMA and EPCM. After mprotect to change VMA to RX, before EMODPR, EMODPE, EACCEPT are done to finish change EPCM, another thread may come in and execute the code. In this window, EPCM=RW, but VMA=RX, #PF may happen. To handle the #PF, you need to track the EPCM. In the linked reference, we documented this kind of scenario.

I'm not sure also reusing standard mmap flags to signal whether a page is EAUG on #PF is reasonable because you may have situations when EAUG on #PF is also needed for other flags

I think you're confusing Gramine's purpose here. We are not reusing the flags for our own purposes, instead we want to emulate the lazy-allocation behavior of the Linux x86 kernel. One of the simple cases is this MAP_NORESERVE case, which we are discussing in this issue. There will be more cases probably, and we'll evaluate whether SGX EDMM lazy allocation is performance-beneficial for each case, and if yes, we'll add these cases too. Typically, such cases are identified by mmap flags and/or by mprotect flags.

Yeah, I misspoke when I said "reuse" those flags. So for MAP_NORESERVE, and similarly for other flags, it will always be EAUG on #PF once you decide this is the way to go. I wonder if you would later use other criteria to decide, e.g., the size of the area, or special situations like stack/heap where you may commit a portion on demand. If you plan to support those, then in the end some kind of flags needed in PAL to track which range is on demand which are eagerly committed, and some more explicit indicator passed to PAL for the mode of EPC committing.

BTW Linux kernel does not seem to do much for MAP_NORESERVE other than not reserving/accounting for swap. IIUC, it makes not much difference in terms of whether RAM is committed or not. "MAP_POPULATE do eager allocation, otherwise do lazy" seems to be a better heuristic. Not sure if it was considered.

dimakuv commented 1 year ago

@haitaohuang Thanks again for more insights!

After mprotect to change VMA to RX, before EMODPR, EMODPE, EACCEPT are done to finish change EPCM, another thread may come in and execute the code. In this window, EPCM=RW, but VMA=RX, #PF may happen.

This race should be impossible in normal execution of Gramine. Gramine's LibOS VMA subsystem synchronizes mprotect requests internally.

And if the application itself does it (one app thread performs mprotect, and another app thread accesses this same page), then it is a bug in the application, and Gramine is not supposed to "try to fix" bad behavior of the app.

If you plan to support those, then in the end some kind of flags needed in PAL to track which range is on demand which are eagerly committed, and some more explicit indicator passed to PAL for the mode of EPC committing.

Yes. After some more discussions with @kailun-qin, we currently think that we'll get away with the following metadata:

- a bit vector that, for each enclave page, indicates whether it was already committed or not yet.
- an upcall from the SGX EDMM backend into the LibOS, to ask for additional info on a particular page (LibOS finds this page address in its list of VMAs and returns info like map flags, protections, etc.).
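The upcall part could look roughly like the sketch below. Every name in it (`sk_vma_info`, `sk_register_vma_query()`, the toy LibOS lookup with its single hard-coded VMA) is a hypothetical illustration of the idea, not an existing Gramine interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Info the LibOS would report about the VMA that contains an address. */
struct sk_vma_info {
    int  prot; /* protections recorded by the LibOS */
    bool lazy; /* true if the mapping was created with MAP_NORESERVE */
};

/* Upcall type: returns false if no VMA contains `addr`. */
typedef bool (*sk_vma_query_fn)(uintptr_t addr, struct sk_vma_info* out);

enum sk_fault_result { SK_FAULT_COMMIT_PAGE, SK_FAULT_FORWARD };

static sk_vma_query_fn g_vma_query;

void sk_register_vma_query(sk_vma_query_fn fn) {
    g_vma_query = fn;
}

/* PAL-side decision on a #PF: commit the page only if it is not committed yet
 * and the LibOS confirms it lies in a lazily allocated VMA. */
enum sk_fault_result sk_on_page_fault(uintptr_t addr, bool already_committed,
                                      int* out_prot) {
    struct sk_vma_info info;
    assert(out_prot);

    if (already_committed || !g_vma_query || !g_vma_query(addr, &info) || !info.lazy)
        return SK_FAULT_FORWARD; /* a genuine fault: deliver it as usual */

    *out_prot = info.prot; /* permissions to apply when committing the page */
    return SK_FAULT_COMMIT_PAGE;
}

/* Toy LibOS side: a single lazily allocated RW VMA at [0x200000, 0x300000). */
bool toy_libos_vma_query(uintptr_t addr, struct sk_vma_info* out) {
    const uintptr_t begin = 0x200000, end = 0x300000;
    if (addr < begin || addr >= end)
        return false;
    out->prot = 0x3; /* read + write in this toy encoding */
    out->lazy = true;
    return true;
}
```

The `already_committed` argument is where the bit vector from the first item would plug in.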

BTW Linux kernel does not seem to do much for MAP_NORESERVE other than not reserving/accounting for swap. IIUC, it makes not much difference in terms of whether RAM is committed or not. "MAP_POPULATE do eager allocation, otherwise do lazy" seems to be a better heuristic. Not sure if it was considered.

MAP_POPULATE does the opposite of what we want. In Gramine, we want to commit enclave pages by default, unlike Linux, which postpones committing memory pages by default. So for Linux, MAP_POPULATE is the flag to revert the default policy and to commit pages eagerly. However, in Gramine this is already the default policy, thus MAP_POPULATE is a no-op.

Note that we also don't want to change Gramine policy to the Linux one (always postpone committing pages, unless instructed otherwise). This would introduce tremendous performance overhead, due to the additional #PF flow, which is very expensive in SGX.

That's why in Gramine, we kinda have a reverse logic -- we try to find the flags that hint at "this memory range will probably never be needed". One good hinting flag that we observed in several workloads (most notably in Java) is MAP_NORESERVE. That's why this issue discusses this flag exactly.
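The resulting policy can be summarized in a few lines. This is only a sketch of the heuristic as discussed in this thread (eager by default, lazy for anonymous MAP_NORESERVE mappings, file-backed mappings not covered); the function name is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

static_assert(MAP_NORESERVE != 0 && MAP_ANONYMOUS != 0,
              "the flags are expected to be non-zero bit masks");

/* Returns true if committing enclave pages should be deferred to #PF time. */
bool should_defer_commit(int mmap_flags) {
    /* File-backed mappings are not covered: treat MAP_NORESERVE as cleared. */
    if (!(mmap_flags & MAP_ANONYMOUS))
        return false;

    /* Default policy is eager commit; MAP_NORESERVE is the only lazy hint. */
    return (mmap_flags & MAP_NORESERVE) != 0;
}
```

MAP_POPULATE is deliberately not consulted here, since eager commit is already the default.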