includeos / IncludeOS

A minimal, resource efficient unikernel for cloud services
https://includeos.github.io/
Apache License 2.0

Early boot memory corruption sometimes causes chain crashes #2252

Open alfreb opened 5 months ago

alfreb commented 5 months ago

The best repro case was found with https://github.com/includeos/IncludeOS/pull/2251 and is preserved, until this is fixed, in the branch https://github.com/alfreb/IncludeOS/tree/memory-ghost-repro . On that branch, starting at commit e81fb7c7da96b8cae8b43d406b6d868b7d09b66e, reproduce with:

nix-shell --argstr unikernel ./test/net/integration/tcp/ --run "./test.py"

( Requires https://github.com/includeos/vmrunner )

The backtrace below was captured in gdb after rebuilding musl with debug symbols and seeing the same issue:

#0  0x0000000000329bc2 in a_crash ()
#1  0x000000000032895e in enframe ()
#2  0x0000000000329840 in alloc_group ()
#3  0x0000000000328853 in alloc_slot ()
#4  0x00000000003297df in alloc_group ()
#5  0x0000000000328853 in alloc_slot ()
#6  0x00000000003297df in alloc_group ()
#7  0x0000000000328853 in alloc_slot ()
#8  0x00000000003285eb in __libc_malloc_impl ()
#9  0x00000000003267a5 in malloc ()
#10 0x000000000023f36b in strdup ()
#11 0x0000000000246f1d in x86::init_libc (magic=<optimized out>, addr=<optimized out>) at /build/source/src/platform/x86_pc/init_libc.cpp:107
#12 0x000000000024769a in long_mode ()
#13 0x0000000000000000 in ?? ()

The call to strdup in init_libc causes a crash in libc during malloc. Our heap should be ready at that time, since this is after init_heap.
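
For readers unfamiliar with why a plain strdup ends up deep inside the allocator (frames #10 down to #0 above), here is a minimal sketch of what strdup conceptually does. The name sketch_strdup is hypothetical and this is not musl's actual source; it only illustrates that the first thing strdup does is call malloc, so any earlier corruption of allocator state can surface at this call even though the string being copied is harmless.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for strdup, for illustration only.
// It is not the musl implementation.
char* sketch_strdup(const char* s) {
  // Length including the terminating NUL byte.
  std::size_t len = std::strlen(s) + 1;

  // This is the step that fails in the backtrace: the allocator is
  // entered here, before any byte of the string is copied.
  char* copy = static_cast<char*>(std::malloc(len));
  if (copy == nullptr) {
    return nullptr;
  }

  // Copy the string, including the terminator, into the new buffer.
  std::memcpy(copy, s, len);
  return copy;
}
```

If this reading is right, the strdup call at init_libc.cpp:107 is likely just the first allocation to trip over state that was damaged earlier in boot, rather than the cause of the damage itself.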

Possible culprit:

Note that I think this bug is also present on master, and it is possibly the main reason master is not booting at the moment.

Things I've tried

MagnusS commented 3 months ago

Some additional references:

MagnusS commented 2 months ago

This may be resolved with #2273