Workloads (Redis, Curl, R) failing with Out of memory PAL error after new manifest syntax to define lists of SGX trusted files.

gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support

https://grapheneproject.io

GNU Lesser General Public License v3.0


Issue #2680

Closed jinengandhi-intel closed 3 years ago

jinengandhi-intel commented 3 years ago

Description of the problem

On some systems (tried on RHEL and CentOS servers) we are seeing a regression with the workloads mentioned in the bug title. We do not see the same issue on Ubuntu clients or servers. The regression was introduced by the recent commit "Define SGX allowed/trusted/protected files as TOML arrays": https://github.com/oscarlab/graphene/commit/ddc01ba844207bb3c6dadb067e4c1276776f221a

We have tried raising loader.pal_internal_mem_size to as high as 16G, but the tests still fail.

Logs are attached to this report: R_example_trace_log_RHEL.txt redis_trace_log_RHEL.txt Curl_trace_log_RHEL.txt

Steps to reproduce

Take an SGX-enabled system, build Graphene, and run any of the above workloads.

Expected results

Workloads should PASS.

Actual results

dimakuv commented 3 years ago

@jinengandhi-intel Could you attach a redis-server.manifest.sgx final manifest file? I wonder what is so special about RHEL/CentOS.

I assume that the difference is in how this final manifest file is generated on Ubuntu vs on RHEL/CentOS.

dimakuv commented 3 years ago

@jinengandhi-intel Could you also attach the redis-server.manifest.sgx file from Ubuntu (where it doesn't fail)?

I want to look at two versions of this redis-server.manifest.sgx file: one in Ubuntu, one in RHEL. Comparing them side by side, we may spot the difference, which will be the root cause of this failure.

jinengandhi-intel commented 3 years ago

The manifest.sgx files for Ubuntu are attached here. For RHEL, I am awaiting the manifest files from my colleague. R.manifest.sgx.txt curl.manifest.sgx.txt

aniket-intelx commented 3 years ago

Please find the manifest.sgx files for RHEL attached. curl.manifest.sgx.txt R.manifest.sgx.txt redis-server.manifest.sgx.txt

dimakuv commented 3 years ago

RHEL manifest files are ~10MB in size... This feels like way too much for the initial 64MB pre-allocated by Graphene.

@aniket-intelx @jinengandhi-intel Can any of you run the failing workload (e.g., redis-server) on RHEL under GDB and find the exact place where the out of PAL memory error happens? I would assume that it happens here: https://github.com/oscarlab/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L683

anjalirai-intel commented 3 years ago

@dimakuv Before the mentioned commit, the RHEL manifest file for curl was ~16MB in size and it did work.

@aniket-intelx will share the rest of the details soon.

dimakuv commented 3 years ago

We debugged and the problem is in Graphene's pre-allocated internal PAL memory pool of 64MB.

We fail on toml_parse(): https://github.com/gramineproject/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L683

But we read loader.pal_internal_mem_size (which increases the internal PAL memory) only after parsing: https://github.com/gramineproject/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L703

So we get a chicken-and-egg problem.
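To make the ordering problem concrete, here is a toy model in C. It is only a sketch: `toy_pool`, `toy_startup` and the `parse_cost` parameter are hypothetical names invented for illustration and are not Graphene's actual allocator or startup code. It assumes that parsing consumes some amount of pool memory and that the configured pool size is applied only after parsing succeeds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-size pool; NOT Graphene's real PAL allocator. */
struct toy_pool {
    uint64_t capacity; /* total bytes available */
    uint64_t used;     /* bytes handed out so far */
};

/* Take `size` bytes from the pool; fails when the pool is exhausted. */
static bool toy_pool_alloc(struct toy_pool* pool, uint64_t size) {
    if (size > pool->capacity - pool->used)
        return false;
    pool->used += size;
    return true;
}

/*
 * Models the startup order described above:
 *   1. parse the manifest using the initial pool,
 *   2. only then apply the configured pool size.
 * `parse_cost` is an assumed number of bytes the parser needs.
 */
static bool toy_startup(uint64_t initial_pool, uint64_t parse_cost,
                        uint64_t configured_pool) {
    struct toy_pool pool = { .capacity = initial_pool, .used = 0 };

    /* Step 1: parsing draws from the initial pool. */
    if (!toy_pool_alloc(&pool, parse_cost))
        return false; /* out of memory before the setting is ever read */

    /* Step 2: the configured size becomes known only after parsing. */
    if (configured_pool > pool.capacity)
        pool.capacity = configured_pool;

    return true;
}
```

In this model, a large configured size cannot rescue a parse that already ran out of the initial pool, which would explain why raising loader.pal_internal_mem_size to 16G made no difference.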

Easy solution: if we detect that the manifest size is greater than some threshold (I recommend 1MB), then we immediately increase the internal PAL memory by an additional 64MB, before parsing.
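A minimal sketch of that heuristic in C, under stated assumptions: the function name `pool_size_before_parse` and the macro names are hypothetical, and the values (64MB initial pool, 1MB threshold, 64MB extra) are taken from the comment above. The actual fix in Graphene may look different.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Values from the discussion; the names are illustrative only. */
#define INITIAL_POOL_SIZE        (64 * MB) /* pre-allocated internal PAL pool */
#define LARGE_MANIFEST_THRESHOLD (1 * MB)  /* suggested threshold */
#define EXTRA_POOL_SIZE          (64 * MB) /* extra memory for big manifests */

/*
 * Decide how much internal memory to have available before the manifest
 * is parsed, based only on the manifest's size in bytes (which is known
 * without parsing it).
 */
static uint64_t pool_size_before_parse(uint64_t manifest_size) {
    uint64_t pool_size = INITIAL_POOL_SIZE;

    /* Strictly greater than the threshold, as worded in the proposal. */
    if (manifest_size > LARGE_MANIFEST_THRESHOLD)
        pool_size += EXTRA_POOL_SIZE;

    return pool_size;
}
```

The point of keying on the raw manifest size is that it is available before parsing, so the decision does not depend on any setting stored inside the manifest itself.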