gramineproject / examples

Sample applications configs for Gramine
BSD 3-Clause "New" or "Revised" License
29 stars 22 forks source link

Mongodb workloads fails intermittently with unable to launch server, failing during _initAndListen #93

Closed anjalirai-intel closed 5 months ago

anjalirai-intel commented 5 months ago

Summary: Mongodb workloads fails intermittently with unable to launch server, failing during _initAndListen

We tried to repro this issue on Ubuntu 22.04 and we were able to repro it 1 out of 10 times,

I have attached the debug logs for the server start.

Console logs for when Server Started:

{"t":{"$date":"2024-01-29T18:48:00.866+00:00"},"s":"W",  "c":"FTDC",     "id":23913,   "ctx":"initandlisten","msg":"Error checking directory '{sysBlockPathStr}': {ec_message}","attr":{"sysBlockPathStr":"/sys/block","ec_message":"No such file or directory"}}
{"t":{"$date":"2024-01-29T18:48:00.866+00:00"},"s":"I",  "c":"FTDC",     "id":20625,   "ctx":"initandlisten","msg":"Initializing full-time diagnostic data capture","attr":{"dataDirectory":"/data/db/diagnostic.data"}}
{"t":{"$date":"2024-01-29T18:48:00.867+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"ftdc","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
{"t":{"$date":"2024-01-29T18:48:00.867+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"DeadlineMonitor","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
{"t":{"$date":"2024-01-29T18:48:00.870+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"AuthorizationManager-0","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
{"t":{"$date":"2024-01-29T18:48:00.870+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"AuthorizationManager-0","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
.....
.....
.....
{"t":{"$date":"2024-01-29T18:48:00.871+00:00"},"s":"I",  "c":"STORAGE",  "id":20320,   "ctx":"initandlisten","msg":"createCollection","attr":{"namespace":"local.startup_log","uuidDisposition":"generated","uuid":{"uuid":{"$uuid":"aac99f36-3fb9-4443-a5db-cb544f695e29"}},"options":{"capped":true,"size":10485760}}}
{"t":{"$date":"2024-01-29T18:48:00.882+00:00"},"s":"I",  "c":"INDEX",    "id":20345,   "ctx":"initandlisten","msg":"Index build: done building","attr":{"buildUUID":null,"collectionUUID":{"uuid":{"$uuid":"aac99f36-3fb9-4443-a5db-cb544f695e29"}},"namespace":"local.startup_log","index":"_id_","ident":"index-3-4902319850581555852","collectionIdent":"collection-2-4902319850581555852","commitTimestamp":null}}
{"t":{"$date":"2024-01-29T18:48:00.882+00:00"},"s":"I",  "c":"REPL",     "id":6015317, "ctx":"initandlisten","msg":"Setting new configuration state","attr":{"newState":"ConfigReplicationDisabled","oldState":"ConfigPreStart"}}
{"t":{"$date":"2024-01-29T18:48:00.883+00:00"},"s":"I",  "c":"STORAGE",  "id":22262,   "ctx":"initandlisten","msg":"Timestamp monitor starting"}
......
......
......
......
{"t":{"$date":"2024-01-29T18:48:00.888+00:00"},"s":"I",  "c":"CONTROL",  "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":true,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Create storage engine":"1190 ms","Write current PID to file":"0 ms","Write a new metadata for storage engine":"0 ms","Initialize FCV before rebuilding indexes":"0 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"0 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","Build user and roles graph":"0 ms","Set up the background thread pool responsible for waiting for opTimes to be majority committed":"1 ms","Initialize information needed to make a mongod instance shard aware":"0 ms","Start up the replication coordinator":"0 ms","Start transport layer":"1 ms","_initAndListen total elapsed time":"1228 ms"}}}}

Console log for when Server didn't start

{"t":{"$date":"2024-01-29T08:24:28.691+00:00"},"s":"W",  "c":"FTDC",     "id":23913,   "ctx":"initandlisten","msg":"Error checking directory '{sysBlockPathStr}': {ec_message}","attr":{"sysBlockPathStr":"/sys/block","ec_message":"No such file or directory"}}
{"t":{"$date":"2024-01-29T08:24:28.691+00:00"},"s":"I",  "c":"FTDC",     "id":20625,   "ctx":"initandlisten","msg":"Initializing full-time diagnostic data capture","attr":{"dataDirectory":"/data/db/diagnostic.data"}}
{"t":{"$date":"2024-01-29T08:24:28.693+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"ftdc","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
{"t":{"$date":"2024-01-29T08:24:28.696+00:00"},"s":"I",  "c":"CONTROL",  "id":23103,   "ctx":"DeadlineMonitor","msg":"Ignoring error from setting thread name","attr":{"error":"Function not implemented"}}
{"t":{"$date":"2024-01-29T08:24:28.701+00:00"},"s":"I",  "c":"CONTROL",  "id":8423403, "ctx":"initandlisten","msg":"mongod startup complete","attr":{"Summary of time elapsed":{"Startup from clean shutdown?":false,"Statistics":{"Transport layer setup":"0 ms","Run initial syncer crash recovery":"0 ms","Create storage engine lock file in the data directory":"0 ms","Get metadata describing storage engine":"0 ms","Validate options in metadata against current startup options":"0 ms","Create storage engine":"813 ms","Write current PID to file":"0 ms","Initialize FCV before rebuilding indexes":"3 ms","Drop abandoned idents and get back indexes that need to be rebuilt or builds that need to be restarted":"0 ms","Rebuild indexes for collections":"0 ms","Load cluster parameters from disk for a standalone":"0 ms","_initAndListen total elapsed time":"845 ms"}}}}

mongodb_trace.log

dimakuv commented 5 months ago

I'm confused, where is the problem/error/failure in these logs? Both logs don't show any catastrophic errors and both end up with a "ctx":"initandlisten","msg":"mongod startup complete" message.

anjalirai-intel commented 5 months ago

Sorry, I have updated the logs for failed scenario, In first section, there are lot of error messages b/w /sys/block and server starting, whereas in the case of failure, there are no extra logs and also you can observe this line Startup from clean shutdown?":false

dimakuv commented 5 months ago

Hm, yeah, this requires some investigation. Did you look at Gramine logs? Does that give you any additional hints?

anjalirai-intel commented 5 months ago

Yes, I have looked and attached trace logs as well, there was no error b/w this section

dimakuv commented 5 months ago

Ok, I finally found some hint in the failing log:

{"t":{"$date":"2024-01-29T08:24:28.701+00:00"},"s":"E",  "c":"CONTROL",  "id":20557,  
    "ctx":"initandlisten","msg":"DBException in initAndListen, terminating",
    "attr":{"error":"JSInterpreterFailure: Failed to JS_Init()"}}

A bit before this line in the log I see this:

[P1:T1:mongod] trace: ---- mmap(0x2adbf6b48000, 0x7fc00000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x0) ...
[P1:T1:mongod] trace: ---- return from mmap(...) = -12

I.e., we can't allocate 2GB of memory. Can this be the reason?

Moreover, I see that ASLR (Address Space Layout Randomization) is enabled, based on this line in the log:

debug: ASLR top address adjusted to 0x89447000

TLDR:

  1. Looks like you don't specify enough enclave memory in the manifest file.
  2. The workload fails only periodically because of the probabilistic nature of ASLR.
    • ASLR effectively reduces the available-to-application enclave memory, each time by a different amount.
    • Try disabling ASLR in the manifest to see if the error disappears/becomes persistent: loader.insecure__disable_aslr = true

@anjalirai-intel Can you try these (increasing enclave size; disabling ASLR)?

anjalirai-intel commented 5 months ago

@dimakuv I tried by adding loader.insecure__disable_aslr = true with and without EDMM, it is working fine. No issues there

dimakuv commented 5 months ago

Ok, closing this issue.

dimakuv commented 5 months ago

Sorry, I guess I wrongly closed this issue.

@anjalirai-intel So the action now is to increase the enclave size for our MongoDB example?

Note that we don't want to have loader.insecure__disable_aslr = true (insecure options) in our manifest files.

anjalirai-intel commented 5 months ago

@dimakuv We have already tried with enclave size as 8GB and with that issue is not seen without EDMM

dimakuv commented 5 months ago

@anjalirai-intel Do you want to submit such a PR, with increasing enclave size to 8G?

anjalirai-intel commented 5 months ago

Raised #94 for the same