gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

Stress-ng fork, vfork, clone tests fail with Graphene #2589

Closed · jinengandhi-intel closed this issue 3 years ago

jinengandhi-intel commented 3 years ago

Description of the problem

For the fork test, 2 out of 5 runs show the following error message during fork creation and termination.

[P69528::] error: failed restoring checkpoint at vma (-13)
[P69528::] error: Error during shim_init() in receive_checkpoint_and_restore (-13)
[P66477:T1019:stress-ng] error: process creation failed

Logs are attached here: fork_stress-ng_error_log2.txt, fork_stress-ng_error_log1.txt

For the vfork test, the same error as above occurs every time, but only during termination. Logs for vfork are also attached here: vfork_stress-ng_error_log2.txt, vfork_stress-ng_error_log1.txt

The issue is also reproducible with the clone test: clone_stress-ng_error_log2.txt, clone_stress-ng_error.txt

Steps to reproduce

graphene-direct stress-ng --verbose --timeout 60s --fork 0

graphene-direct stress-ng --verbose --timeout 60s --vfork 0

graphene-direct stress-ng --verbose --timeout 60s --clone 0

System configuration:
ICX Server
160 cores
128GB Memory
5.12 kernel
Ubuntu 18.04

Expected results

The stress-ng fork, vfork, and clone tests run to completion without errors.

Actual results

The tests fail intermittently with "failed restoring checkpoint at vma (-13)" and "process creation failed" errors, as shown in the logs above.

dimakuv commented 3 years ago

Someone needs to root cause this. It doesn't seem to be related to https://github.com/oscarlab/graphene/issues/2514.

dimakuv commented 3 years ago

This was interesting. The root cause is as follows:

The child process's main executable (libpal.so) may be mapped at a random address, and the parent has no idea where. This is because we don't use fork (which would preserve the libpal.so memory mapping) but exec (which allows the Linux host to perform ASLR on libpal.so).
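To illustrate the difference, here is a standalone demo (not Graphene code): a PIE binary that fork()s keeps the parent's mappings, but once the child exec()s, the host kernel picks a fresh randomized base for the executable, so the parent can no longer predict where it lives.

```c
/* Standalone ASLR demo (not Graphene code). The process prints the address
 * of main(), then re-executes itself. With ASLR enabled and a PIE binary,
 * the re-exec'ed child usually reports a different address, while a plain
 * fork()ed child would report the same one. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
    printf("pid %d: main() is at %p\n", getpid(), (void *)&main);

    if (argc > 1)  /* we are the re-exec'ed child: stop here */
        return 0;

    pid_t pid = fork();
    if (pid == 0) {
        /* exec replaces the image; the kernel chooses a new (randomized)
         * load address for the PIE executable, unlike fork alone. */
        execl("/proc/self/exe", argv[0], "child", (char *)NULL);
        perror("execl");
        _exit(1);
    }
    waitpid(pid, NULL, 0);
    return 0;
}
```

Build it with a modern gcc (PIE is the default) and compare the two printed addresses.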

Changing our behavior from exec to fork doesn't sound reasonable. So I chose to simply restrict the memory range accessible to LibOS for creating memory regions. Since Linux allocates from the top of the x86-64 address range (approximately the 0x7f... range), I chose to restrict the upper bound of the LibOS memory range to 0x555555554000 (this constant is taken from https://stackoverflow.com/questions/61561331/why-does-linux-favor-0x7f-mappings).
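A minimal sketch of that idea (function and variable names are illustrative, not the actual Graphene source): cap the top of the LibOS-managed range at 0x555555554000 so LibOS-created regions can never land in the high 0x7f... area where the host places ASLR'ed mappings such as libpal.so.

```c
/* Illustrative sketch only; names are hypothetical, not Graphene's source.
 * The idea: clamp the highest address LibOS will hand out for its own
 * memory regions below the host's ASLR area (~0x7f...), so a child's
 * randomly placed libpal.so cannot collide with checkpoint-restored VMAs. */
#include <stdint.h>
#include <stdio.h>

#define LIBOS_MMAP_TOP 0x555555554000UL  /* constant from the comment above */

static uintptr_t clamp_libos_top(uintptr_t host_provided_top) {
    /* Never exceed the hard cap, even if the host allows higher addresses. */
    return host_provided_top < LIBOS_MMAP_TOP ? host_provided_top : LIBOS_MMAP_TOP;
}

int main(void) {
    /* e.g. a host limit in the 0x7f... range gets clamped below libpal.so */
    uintptr_t host_top = 0x7ffff7ffe000UL;
    printf("LibOS range top: 0x%lx\n", (unsigned long)clamp_libos_top(host_top));
    return 0;
}
```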

Now it works again. I'll submit a PR.

dimakuv commented 3 years ago

@jinengandhi-intel It would be great if you could try #2595 on your side. It works for me.

jinengandhi-intel commented 3 years ago

Sure, will test with the PR.

jinengandhi-intel commented 3 years ago

The issue is not seen with the PR.

jinengandhi-intel commented 3 years ago

The issue is also seen intermittently in the LibOS Bootstrap_401_exit_group test.

dimakuv commented 3 years ago

We understand the root cause, but we have at least two different proposals to fix it: #2595 (restricting the upper bound of the LibOS memory range, as described above) and Borys's #2597.

We got a bit stuck on which proposal to choose (I prefer Borys's #2597). But we need to expedite this, since it even fails Bootstrap_401_exit_group periodically.

dimakuv commented 3 years ago

Pinging @boryspoplawski about #2597