Someone needs to root cause this. It doesn't seem to be related to https://github.com/oscarlab/graphene/issues/2514.
This was interesting. The root cause is like this:

1. checkpoint -> fork -> execve("libpal.so") -> restore
2. execve() allows the Linux host to apply ASLR and map libpal.so at some rather random high address (e.g. 0x7fc8f4167000).
3. restore in the child process tries to mmap exactly the same memory ranges as in the parent, but the parent may have had a higher libpal.so address, so the parent created memory ranges that overlap with the child's libpal.so address range.

The root cause is basically that the child process's main executable (libpal.so) may be mapped randomly, and the parent has no idea. This is because we don't use fork (which would preserve the libpal.so memory mapping) but exec (which allows the Linux host to perform ASLR on libpal.so).
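To see the collision concretely, here is a minimal standalone C sketch (not Graphene's actual restore code; the addresses and the use of MAP_FIXED_NOREPLACE are illustrative assumptions):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Simulate the child's libpal.so being placed by ASLR at a high
     * 0x7f... address (the concrete value here is illustrative). */
    void* pal = mmap((void*)0x7fc8f4167000, 0x10000, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (pal == MAP_FAILED) {
        perror("mmap (simulated libpal.so placement)");
        return 1;
    }

    /* Simulate restore: the parent checkpointed a region that overlaps
     * the child's libpal.so range, so re-creating it at the same fixed
     * address collides. MAP_FIXED_NOREPLACE makes the kernel report the
     * clash instead of silently clobbering the existing mapping. */
    void* restored = mmap((void*)0x7fc8f4160000, 0x20000, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (restored == MAP_FAILED)
        printf("restore collides with libpal.so: %s\n", strerror(errno));
    return 0;
}
```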
Changing our behavior from exec to fork doesn't sound reasonable. So I chose to simply restrict the memory range accessible to LibOS for creating memory regions. Since Linux allocates from the top of the x86-64 address range (the approx. 0x7f... range), I chose to restrict the upper bound of the LibOS memory range to 0x555555554000 (this constant is taken from https://stackoverflow.com/questions/61561331/why-does-linux-favor-0x7f-mappings).
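The idea can be sketched roughly like this (hypothetical names, not the code of the actual PR):

```c
#include <stdint.h>

/* Hypothetical illustration of the fix, not Graphene's actual code.
 * Keep all LibOS-created regions below the 0x7f... area that the Linux
 * kernel favors for ASLR'd mappings such as the libpal.so executable. */
#define LIBOS_MMAP_TOP 0x555555554000UL /* constant from the linked SO answer */

static uint64_t clamp_libos_top(uint64_t requested_top) {
    /* Any address at or above LIBOS_MMAP_TOP risks colliding with the
     * child's randomly placed libpal.so, so clamp the usable range. */
    return requested_top > LIBOS_MMAP_TOP ? LIBOS_MMAP_TOP : requested_top;
}
```

With the top clamped below the 0x7f... region, no address the LibOS hands out can overlap wherever the host kernel randomly placed the child's libpal.so.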
Now it works again. I'll submit a PR.
@jinengandhi-intel If you could try #2595 on your side, that would be great. It works for me.
Sure, will test with the PR.
The issue is not seen with the PR.
The issue is also seen intermittently in LibOS Bootstrap_401_exit_group.
We understand the root cause, but we have at least two different proposals to fix it: #2595 and #2597.
We got a bit stuck on which proposal to choose (I prefer Borys's #2597). But we need to expedite this, since it even fails Bootstrap_401_exit_group periodically.
Pinging @boryspoplawski about #2597
Description of the problem
For the fork test, we see the following error message during fork creation and termination in 2 out of 5 runs.
Logs are attached: fork_stress-ng_error_log2.txt, fork_stress-ng_error_log1.txt
For the vfork test, the same error as above occurs 100% of the time, but only during termination. Logs for vfork are also attached: vfork_stress-ng_error_log2.txt, vfork_stress-ng_error_log1.txt
The issue is also reproducible with the clone test: clone_stress-ng_error_log2.txt, clone_stress-ng_error.txt
Steps to reproduce
```
graphene-direct stress-ng --verbose --timeout 60s --fork 0
graphene-direct stress-ng --verbose --timeout 60s --vfork 0
graphene-direct stress-ng --verbose --timeout 60s --clone 0
```
Expected results
Actual results