gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

mmap() doesn't support lazy allocation of physical memory #2014

Closed · llly closed this issue 3 years ago

llly commented 3 years ago

Description of the problem

According to the GLIBC manual's Memory Protection section, regarding the mmap syscall:

PROT_NONE For anonymous mappings, the kernel will not reserve any physical memory for the allocation at the time the mapping is created.

For example, mmap(0x0,134217728,PROT_NONE,MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,-1,0) allocates only virtual memory, not physical memory. It always succeeds unless virtual memory runs out, and on a native 64-bit OS the per-process virtual address space is vast (about 128 TB of user space on x86-64 Linux). Most programs, including Java, therefore do not treat virtual memory as a scarce resource.

However, Graphene uses the EPC as the program's virtual memory space, and its size is fixed by sgx.enclave_size in the manifest. Graphene also allocates physical EPC on mmap() with the PROT_NONE and MAP_ANONYMOUS flags, so such an mmap() fails when the total mapped size exceeds sgx.enclave_size.

Although sgx.enclave_size can exceed the physical EPC size (EPC pages can be swapped in and out), it can never be as large as 128 TB, and performance drops considerably if we increase sgx.enclave_size only to gain more virtual memory.
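
For context, the pattern that relies on this behavior is to reserve a large PROT_NONE region up front and commit small pieces later with mprotect(). A minimal sketch of that idiom (illustrative, not taken from the issue; the 128 MB / 132 KB sizes mirror the debug log further below):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    // Reserve 128 MB of address space with no access rights. On native
    // Linux this consumes no physical memory; under Graphene-SGX it
    // immediately consumes 128 MB of the sgx.enclave_size budget.
    size_t reserve = 128 * 1024 * 1024;
    void* base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    // Commit a small prefix for actual use; on native Linux physical
    // pages are allocated only from this point on (and lazily, on touch).
    if (mprotect(base, 132 * 1024, PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect");
        return 1;
    }
    ((char*)base)[0] = 1; // touch the committed page
    return 0;
}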

Steps to reproduce

C program:

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#define M (1024LL * 1024)

int main(int argc, char** argv) {
    unsigned long i = 0;
    while (1) {
        void* addr = mmap(NULL, 16 * M, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (addr == MAP_FAILED) {
            printf("mmap failed. errno = %d, time = %lu\n", errno, i);
            return 1;
        }
        i++;
    }
    return 0;
}

Run with manifest item sgx.enclave_size = "4G" on a machine with 256 MB of EPC.
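
A hypothetical minimal manifest fragment for the repro; only sgx.enclave_size is taken from this issue, while the remaining keys and the $(GRAPHENE_DIR) variable are illustrative Graphene manifest conventions:

loader.preload = "file:$(GRAPHENE_DIR)/Runtime/libsysdb.so"
sgx.enclave_size = "4G"
sgx.thread_num = 4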

Expected results

On native 64-bit Ubuntu 18.04: mmap failed. errno = 12, time = 8388606 — that is, 8388606 × 16 MB ≈ 128 TB of reservations, essentially the entire user virtual address space.

Actual results

On Graphene: mmap failed. errno = 12, time = 247 — that is, 247 × 16 MB = 3952 MB, nearly the entire 4 GB enclave.

Additional information

This issue prevents Java from running workloads for long periods. The Java -Xmx option seems to limit only physical memory usage, not virtual address space usage. Below is a debug log snippet with my comments, showing Java reporting Out Of Memory even when -Xmx is smaller than sgx.enclave_size.

1.  [P4663:T2:java] ---- shim_mmap(0x0,134217728,PROT_NONE,MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,-1,0) ... mmap
2.  [P4663:T2:java] ---- return from shim_mmap(...) = 0x7c3622000
//Graphene: Reserves a contiguous 128 MB of EPC inside the enclave.
//Native OS: Reserves a contiguous 128 MB of virtual memory only; no physical memory is committed.
3.  [P4663:T2:java] ---- shim_munmap(0x7c3622000,10346496) ... munmap
4.  [P4663:T2:java] ---- return from shim_munmap(...) = 0
5.  [P4663:T2:java] ---- shim_munmap(0x7c8000000,56762368) ... munmap
6.  [P4663:T2:java] ---- return from shim_munmap(...) = 0
//Graphene: Keeps the 64 MB-aligned portion of EPC and releases the unaligned portions; two EPC fragments appear (see the sketch after this log).
//Native OS: Keeps the aligned 64 MB of virtual memory and releases the unaligned portions.
7.  [P4663:T2:java] ---- shim_mprotect(0x7c4000000,135168,PROT_READ|PROT_WRITE) ... mprotect
8.  [P4663:T2:java] ---- return from shim_mprotect(...) = 0
//Graphene: Changes the first 132 KB of EPC to read/write so the program can use it.
//Native OS: Changes the first 132 KB of virtual memory to read/write; only these 132 KB get mapped to physical memory for the program to use.
9.  [P4663:T2:java] ---- shim_mprotect(0x7c4021000,4096,PROT_READ|PROT_WRITE) ... mprotect
10. [P4663:T2:java] ---- return from shim_mprotect(...) = 0
//Subsequent 4K. Same as above.
11. [P4663:T2:java] ---- shim_mprotect(0x7c4022000,4096,PROT_READ|PROT_WRITE) ... mprotect
12. [P4663:T2:java] ---- return from shim_mprotect(...) = 0
//Subsequent 4K. Same as above.
13. [P4663:T2:java] ---- shim_mprotect(0x7c4023000,4096,PROT_READ|PROT_WRITE) ... mprotect
14. [P4663:T2:java] ---- return from shim_mprotect(...) = 0
//Subsequent 4K. Same as above.
15. [P4663:T2:java] ---- shim_mprotect(0x7c4024000,8192,PROT_READ|PROT_WRITE) ... mprotect
16. [P4663:T2:java] ---- return from shim_mprotect(...) = 0
//Subsequent 8K. Same as above.
17. ... (more mmap, munmap, etc. in different threads)
18. [P4663:T196:java] ---- shim_mmap(0x0,134217728,PROT_NONE,MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,-1,0) ... mmap
19. [P4663:T196:java] ---- return from shim_mmap(...) = -12
20. [P4663:T196:java] ---- shim_mmap(0x0,67108864,PROT_NONE,MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,-1,0) ... mmap
21. [P4663:T196:java] ---- return from shim_mmap(...) = -12
//Graphene: Fails to reserve a contiguous 128 MB or 64 MB of EPC inside the enclave. Java then runs GC, but under Graphene the virtual memory/EPC can never become contiguous again while any of it remains allocated, so Java eventually reports Out of Memory.
//Native OS: This never happens; the virtual address space is big enough.
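
For reference, the munmap pair at log lines 3–6 is the classic over-allocate-and-trim idiom for obtaining an aligned reservation. A rough C sketch of what the JVM is doing there (illustrative; the function name is mine, not Graphene or HotSpot code):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

// Obtain a 64 MB region aligned to 64 MB by reserving 128 MB and
// trimming the unaligned head and tail (cf. log lines 3-6). On native
// Linux the trimmed parts cost nothing; under Graphene the releases
// leave the enclave address space fragmented.
void* reserve_aligned_64m(void) {
    const size_t align = 64 * 1024 * 1024;
    char* base = mmap(NULL, 2 * align, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    char* aligned = (char*)(((uintptr_t)base + align - 1) & ~(uintptr_t)(align - 1));
    size_t head = (size_t)(aligned - base);
    if (head)
        munmap(base, head);                    // trim the unaligned head
    if (align - head)
        munmap(aligned + align, align - head); // trim the tail
    return aligned;
}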
llly commented 3 years ago

This issue is probably the root cause of #1924.

dimakuv commented 3 years ago

@llly Thanks for debugging this!

Yes, we are painfully aware of the inadequacy of mmap(.., PROT_NONE) under current Intel SGX (and therefore Graphene). Unfortunately, we currently don't see any reasonable fix to this issue. Would you have an idea how to fix this?

The problem is: Intel SGX version 1 doesn't allow dynamic enclave memory management. The virtual-space limit must be specified per enclave (via sgx.enclave_size). All this virtual enclave memory is allocated at enclave startup. There is no notion of "allocating enclave pages on demand".

This is fixed in Intel SGX version 2 with a feature called Enclave Dynamic Memory Management (EDMM). Unfortunately, this feature is not yet supported by the upstream Intel SGX driver, so it is also not supported in Graphene; support will arrive only sometime in 2021.

TLDR: The correct way to fix this Java issue is to wait for EDMM support in SGX driver and Graphene. I don't know of any other correct way.

llly commented 3 years ago

@dimakuv You are right, EDMM is the final fix, but we need to find a workaround in the meantime.

mkow commented 3 years ago

According to GLIBC manual Memory Protection about mmap syscall

We implement the Linux syscall API, not libc (they are different and often have different semantics, despite sharing function names). But in this case Linux also does lazy mappings.

I don't think we can do anything about this; it's a hardware limitation, so we just need to wait for SGX2 support.

AI-Memory commented 3 years ago

This is actually one of the limitations of SGX1. There is an SGX2 patch contributed by @rainfld about 2 years ago (PR #234); you can try it before SGX2 is fully supported. The following workarounds can also be considered for special cases:

1. Do mmap on the host memory space instead of the EPC, conditionally (note: no SGX security benefits at all).
2. Patch the SGX driver to handle memory reservation according to the particular case.
3. Some code logic is actually smart enough to reduce memory consumption when it fails to reserve memory space, so fail fast from the Graphene-SGX libOS (this fits some cases, e.g. pre-allocating scenarios).

In the Java case, please also try pre-allocation/prefetching options (one concrete example follows).
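
On that last point: HotSpot's pre-touch option is one concrete way to try pre-allocation. With a fixed heap (-Xms equal to -Xmx), -XX:+AlwaysPreTouch touches every heap page at JVM startup, so the commit cost is paid once up front rather than on demand; the sizes here are purely illustrative:

java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar app.jar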

dimakuv commented 3 years ago

@llly Actually, I'm curious whether you have tried huge enclaves, like sgx.enclave_size = "1024G"? I know it may take minutes (hours? days?) to start, but it would be quite interesting to know.

llly commented 3 years ago

@bigdata-memory We are trying No. 3: looking for a Java GC that can relocate objects and reduce memory fragmentation.

@dimakuv I tried sgx.enclave_size = "64G" with Java -Xmx50g on a 128 GB EPC machine: it takes 40 s to start the enclave and 70 s to finish the first 32 GB mmap. With sgx.enclave_size = "128G" it takes 100 s to start the enclave. Also, the Java method ProcessBuilder.start() is used a lot to run native commands such as setsid, rm, and chmod; it calls the fork() syscall first and then execve(). fork() from Java under Graphene costs a lot when enclave_size is big: it takes 10 min to finish one ProcessBuilder.start("chmod") with sgx.enclave_size = "128G" in the Java manifest. That is the problem with a large enclave_size.
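
For context on that cost: ProcessBuilder.start() ultimately issues the classic fork-then-execve pair. A rough C equivalent of what the JVM asks the libOS to do (the chmod arguments are illustrative):

#include <sys/wait.h>
#include <unistd.h>

// fork() duplicates the whole process; under Graphene-SGX that means
// creating and initializing a second enclave and migrating state into
// it, which is why it slows down as enclave_size grows. execve() then
// replaces the child's image with the native command.
int run_chmod(const char* path) {
    pid_t pid = fork();
    if (pid == 0) {
        char* const argv[] = { "chmod", "u+x", (char*)path, NULL };
        execve("/bin/chmod", argv, NULL);
        _exit(127); // reached only if execve() fails
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return status;
}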

dimakuv commented 3 years ago

@llly Thanks for the information!

So the problem is not only the large enclave sizes, but also the fork/execve pattern. We know about this problem as well, and we have had some ideas for optimizing fork (for example, a pool of pre-initialized enclaves waiting for a parent to fork). But this has been low priority for us...

mkow commented 3 years ago

Closing, as I believe this isn't something which we can fix (it's a hardware limitation, not an issue with Graphene).