gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support
GNU Lesser General Public License v3.0

Extremely slow Rust Rocket webservice #342

Closed kvinwang closed 2 years ago

kvinwang commented 2 years ago

Description of the problem

After temporarily patching the settimeout issue, Rocket now works, but it often responds very slowly.

Steps to reproduce

git clone https://github.com/kvinwang/gramine-show-slow-http.git
cd gramine-show-slow-http/gramine-build
make
gramine-sgx pruntime

In another terminal:

$ time curl localhost:8000/hello >/dev/null 2>&1

real    0m0.012s
user    0m0.000s
sys 0m0.010s

$ time curl localhost:8000/hello >/dev/null 2>&1

real    0m0.187s
user    0m0.000s
sys 0m0.010s

$ time curl localhost:8000/hello >/dev/null 2>&1

real    0m2.258s
user    0m0.007s
sys 0m0.004s

$ time curl localhost:8000/hello >/dev/null 2>&1

real    0m2.081s
user    0m0.007s
sys 0m0.004s

$ time curl localhost:8000/hello >/dev/null 2>&1

real    0m2.074s
user    0m0.005s
sys 0m0.006s

Expected results

The HTTP request should be answered within milliseconds.

Actual results

The HTTP request often takes several seconds.

Additional information

perf shows that most of the time is spent in memory allocation (see attached screenshot).

Adding a log statement shows that it frequently allocates 64MB of memory (about 100 times during a single request); see attached screenshots.

I tried to use GDB to inspect the backtrace, but it failed to stop at the breakpoint and reported an error.

Finally, I tried rejecting the 64MB allocations by returning NOMEM; the HTTP responses then became fast, without any error being reported (see attached screenshot).

dimakuv commented 2 years ago
  1. Do you observe the same behavior with gramine-direct (non-SGX mode)?
  2. Did you try sgx.preheat_enclave = true?
  3. Did you try sys.brk.max_size = "1G"? (try to play with this size; for example, try 1G, 2G, 4G, etc.; see the combined manifest sketch right after this list)
  4. What is the SGX machine you're using? How much EPC (Enclave Page Cache) does this machine have?
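To make suggestions 2 and 3 concrete, a minimal manifest sketch combining them could look like this (both are Gramine manifest options; the "1G" value is just a starting point to experiment with):

sgx.preheat_enclave = true
sys.brk.max_size = "1G"
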
kvinwang commented 2 years ago
  • Do you observe the same behavior with gramine-direct (non-SGX mode)?

Tried many times, no problem with gramine-direct

  • Did you try sgx.preheat_enclave = true?

No help, still slow.

  • Did you try sys.brk.max_size = "1G"? (try to play with this size; for example, try 1G, 2G, 4G, etc.)

It failed to start with sys.brk.max_size = "1G" and larger values:

[P1:T1:pruntime] error: Error during shim_init() in init_elf_objects (-12)

Tried 128M and 256M; no help, still slow.

  • What is the SGX machine you're using? How much EPC (Enclave Page Cache) does this machine have?

My CPU is an i7-9700K.

$ is-sgx-available
SGX supported by CPU: true
SGX1 (ECREATE, EENTER, ...): true
SGX2 (EAUG, EACCEPT, EMODPR, ...): false
Flexible Launch Control (IA32_SGXPUBKEYHASH{0..3} MSRs): true
SGX extensions for virtualizers (EINCVIRTCHILD, EDECVIRTCHILD, ESETCONTEXT): false
Extensions for concurrent memory management (ETRACKC, ELDBC, ELDUC, ERDINFO): false
CET enclave attributes support (See Table 37-5 in the SDM): false
Key separation and sharing (KSS) support (CONFIGID, CONFIGSVN, ISVEXTPRODID, ISVFAMILYID report fields): false
Max enclave size (32-bit): 0x80000000
Max enclave size (64-bit): 0x1000000000
EPC size: 0x5d80000
SGX driver loaded: true
AESMD installed: true
SGX PSW/libsgx installed: true
dimakuv commented 2 years ago

EPC size: 0x5d80000

That's not good. You only have 94MB of EPC. Given that your workload tries to allocate 64MB of memory for every request, this leads to a huge amount of EPC swapping (there is not enough free memory in EPC, so the least recently used EPC pages are swapped out to RAM, which is a very expensive operation in Intel SGX).

To be honest, I think the only way to improve your Rocket workload is to run it on a newer Intel CPU (with more EPC)...

kvinwang commented 2 years ago

I don't think it's Rocket allocating the memory. I logged Rust's low-level memory allocation API, and the 64MB allocations are not coming from Rust. So I tried to use GDB to inspect the backtrace, but it failed to stop at the breakpoint. Is there a way to inspect who is allocating the memory?

dimakuv commented 2 years ago

So, I tried to use gdb to inspect the backtrace, but failed to stop at the breakpoint.

Did you add sgx.debug = 1 to your manifest file? Otherwise GDB won't be able to peek inside the SGX enclave.

Is there a way to inspect who is allocating the memory?

GDB is your best tool. See https://gramine.readthedocs.io/en/latest/devel/debugging.html.

kvinwang commented 2 years ago

Did you add sgx.debug = 1 to your manifest file?

Yes, but it quit when the breakpoint was hit.

error: Unexpected illegal instruction (SIGILL) occurred inside untrusted PAL (pruntime+0x10551f38)
[LWP 457651 exited]
[LWP 457652 exited]
[LWP 457650 exited]
[LWP 457649 exited]
[LWP 457648 exited]
[LWP 457647 exited]
[LWP 457646 exited]
[LWP 457645 exited]
[LWP 457644 exited]
[LWP 457643 exited]
[LWP 457642 exited]
[LWP 457641 exited]
[LWP 457640 exited]
[LWP 457639 exited]
[LWP 457638 exited]
[LWP 457637 exited]
[LWP 457632 exited]
[LWP 457631 exited]
[LWP 457628 exited]
[Inferior 1 (process 457628) exited with code 01]
dimakuv commented 2 years ago

error: Unexpected illegal instruction (SIGILL) occurred inside untrusted PAL (pruntime+0x10551f38)

This is not good. Does it also happen without GDB, in normal gramine-sgx ... runs? You should first identify why this happens.

kvinwang commented 2 years ago

It does not happen without GDB. It only happens under GDB with gramine-sgx, and only when the breakpoint is hit. The program is a really simple hello-world written in Rust, so could there be some issue in the gramine-sgx environment?

boryspoplawski commented 2 years ago

@kvinwang Try adding loader.env.M_ARENA_MAX = 1 to the manifest file, or change the enclave size to something greater than number_of_threads * 64MB + some_flat_memory_for_graphene (let's say 256MB).

dimakuv commented 2 years ago

@boryspoplawski I guess you assume that this Rust application spawns a new thread for each new request? Have you observed this in Rust apps? (I only remember observing this for Python.)

But yes, it is a good suggestion.

boryspoplawski commented 2 years ago

I guess you assume that this Rust application spawns a new thread for each new request?

No, but it spawns some threads (~20 in this example; notice the GDB output).

This is an issue I've seen before. The problem is that glibc's malloc (@kvinwang that's why you didn't see these allocations coming from Rust) has per-thread arenas (think of them as memory pools) that are 64MB in size and are lazily allocated on the first malloc in a given thread. If the enclave doesn't have enough memory, some of these arena allocations will fail and the allocation will just fall back to the global arena/mmaps (I don't remember if the global arena is shared or just given to the first thread). Note that because of the laziness of arena allocation this happens on every malloc, hence the observed slowdown. Increasing the enclave size will just let these arenas be created successfully, and limiting the number of arenas will just make malloc use shared state + locking (which might be slower than a per-thread cache, but nowhere near as bad as this).
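
As a rough sketch of the sizing suggested above (treating all numbers as illustrative and assuming the ~20 threads mentioned in the previous comment): 20 threads * 64MB of arenas is about 1.25GB, plus roughly 256MB of slack for Gramine itself, gives about 1.5GB. In the manifest that would be something like:

sgx.enclave_size = "2G"

(sgx.enclave_size must be a power of two, hence rounding up to "2G" here.)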

kvinwang commented 2 years ago

Try adding loader.env.M_ARENA_MAX = 1 to the manifest file, or change the enclave size to something greater than number_of_threads * 64MB + some_flat_memory_for_graphene (let's say 256MB).

Adding the env var produced an error:

error: Invalid environment variable in manifest: 'M_ARENA_MAX'
error: PAL failed at ../Pal/src/db_main.c:pal_main:499 (exitcode = 16, reason=Building the final environment based on the original environment and the manifest failed)

However, after increasing the enclave_size, the problem is gone. Cool!

Could you explain a bit more on what happened under the hood?

And should I open a new issue for the GDB problem? It seems to only have trouble with the breakpoint on the 64MB allocation.

kvinwang commented 2 years ago

This is an issue I've seen before. The problem is that glibc's malloc (@kvinwang that's why you didn't see these allocations coming from Rust) has per-thread arenas (think of them as memory pools) that are 64MB in size and are lazily allocated on the first malloc in a given thread. If the enclave doesn't have enough memory, some of these arena allocations will fail and the allocation will just fall back to the global arena/mmaps (I don't remember if the global arena is shared or just given to the first thread). Note that because of the laziness of arena allocation this happens on every malloc, hence the observed slowdown. Increasing the enclave size will just let these arenas be created successfully, and limiting the number of arenas will just make malloc use shared state + locking (which might be slower than a per-thread cache, but nowhere near as bad as this).

That explains a lot. Thank you so much.

dimakuv commented 2 years ago

error: Invalid environment variable in manifest: 'M_ARENA_MAX'

That's because Borys gave a slightly wrong string to copy. It should be: loader.env.M_ARENA_MAX = "1" -- note the double-quotes. This is because all environment variables are strings, not numbers. Try this and it will work.
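
For reference, the two workarounds as manifest lines (values are illustrative):

loader.env.M_ARENA_MAX = "1"  # limit glibc malloc to a single arena
sgx.enclave_size = "2G"       # alternatively, make room for the per-thread arenas

Note that glibc itself documents the environment variable as MALLOC_ARENA_MAX (M_ARENA_MAX is the corresponding mallopt() constant), so if the name above has no visible effect, that spelling may be worth trying too.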

boryspoplawski commented 2 years ago

It should be: loader.env.M_ARENA_MAX = "1" -- note the double-quotes.

Ah, yes, I was writing off the top of my head, sorry.

Could you explain a bit more on what happened under the hood?

I already did, in the previous message (I guess a race condition happened).

And should I open a new issue for the GDB problem? It seems to only have trouble with the breakpoint on the 64MB allocation.

No need; there are tons of issues with GDB, half probably caused by our integration and the other half just by GDB bugs. It's not related to this specific workload or Rust.

kvinwang commented 2 years ago

That's because Borys gave a slightly wrong string to copy. It should be: loader.env.M_ARENA_MAX = "1" -- note the double-quotes. This is because all environment variables are strings, not numbers. Try this and it will work.

Yes, this works as well. Thanks very much again!