Closed jinengandhi-intel closed 3 years ago
@jinengandhi-intel Could you attach the final redis-server.manifest.sgx manifest file? I wonder what is so special about RHEL/CentOS.
I assume that the difference is in how this final manifest file is generated on Ubuntu vs on RHEL/CentOS.
@jinengandhi-intel Also, could you attach the redis-server.manifest.sgx file from Ubuntu (where it doesn't fail)?
I want to look at two versions of this redis-server.manifest.sgx file: one from Ubuntu, one from RHEL. Comparing them side by side, we may spot the difference, which will be the root cause of this failure.
The manifest SGX files for Ubuntu are attached here. For RHEL, I am awaiting the manifest files from my colleague. R.manifest.sgx.txt
Please find the manifest.sgx files for RHEL attached. curl.manifest.sgx.txt R.manifest.sgx.txt redis-server.manifest.sgx.txt
RHEL manifest files are ~10MB in size... This feels like way too much for the initial 64MB pre-allocated by Graphene.
@aniket-intelx @jinengandhi-intel Can either of you run the failing workload (e.g., redis-server) on RHEL under GDB and find the exact place where the "out of PAL memory" error happens? I would assume that it happens here: https://github.com/oscarlab/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L683
@dimakuv Before the mentioned commit, the RHEL manifest file for curl was ~16MB in size, and it did work.
@aniket-intelx will be sharing the rest of the details soon
We debugged and the problem is in Graphene's pre-allocated internal PAL memory pool of 64MB.
We fail on toml_parse(): https://github.com/gramineproject/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L683
But we read loader.pal_internal_mem_size (which increases the internal PAL memory) only after parsing: https://github.com/gramineproject/graphene/blob/33a68bc4302891e8c591570942bc39394acefd23/Pal/src/host/Linux-SGX/db_main.c#L703
So we get a chicken-and-egg problem.
Easy solution: if we detect that the manifest size is greater than some threshold (I recommend 1MB), then we immediately increase the internal PAL memory by an additional 64MB.
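To make the proposal concrete, here is a minimal sketch in C of the size check described above. It is not the actual Graphene code: the function name `initial_pal_pool_size` and the macro names are hypothetical, and only the 64MB pool size, the 1MB threshold, and the extra 64MB come from this thread. The point is that the decision depends only on the manifest's byte size, so it can be made before toml_parse() runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; the values are the ones suggested in this thread. */
#define DEFAULT_PAL_POOL_SIZE   ((size_t)64 * 1024 * 1024) /* initial pre-allocated pool */
#define MANIFEST_SIZE_THRESHOLD ((size_t)1 * 1024 * 1024)  /* proposed 1MB threshold */
#define EXTRA_PAL_POOL_SIZE     ((size_t)64 * 1024 * 1024) /* proposed additional 64MB */

/*
 * Pick the internal pool size using only the manifest's size in bytes.
 * Nothing here needs the parsed manifest, so the check can run before
 * parsing and avoids the chicken-and-egg problem.
 */
size_t initial_pal_pool_size(size_t base_pool_size, size_t manifest_size) {
    /* Guard against size_t overflow when adding the extra memory. */
    assert(base_pool_size <= SIZE_MAX - EXTRA_PAL_POOL_SIZE);

    if (manifest_size > MANIFEST_SIZE_THRESHOLD)
        return base_pool_size + EXTRA_PAL_POOL_SIZE;
    return base_pool_size;
}
```

With these assumed values, a ~10MB manifest like the RHEL ones would start with a 128MB pool, while a small manifest would keep the default 64MB. Whether an extra 64MB is enough for every large manifest is not established here; loader.pal_internal_mem_size would still be applied after parsing as before.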
Description of the problem
On some systems (tried with RHEL and CentOS servers) we are seeing a regression with some of the workloads mentioned in the bug title. We do not see the same issue on Ubuntu clients or servers. This regression was introduced with the recent commit:
Define SGX allowed/trusted/protected files as TOML arrays
https://github.com/oscarlab/graphene/commit/ddc01ba844207bb3c6dadb067e4c1276776f221a
We have tried setting loader.pal_internal_mem_size as high as 16G, but the test still fails.
Logs for the same are attached to the report here. R_example_trace_log_RHEL.txt redis_trace_log_RHEL.txt Curl_trace_log_RHEL.txt
Steps to reproduce
Take an SGX-enabled system, build Graphene, and run any of the above workloads.
Expected results
Workloads should PASS.
Actual results