Closed kyoguan closed 4 years ago
I also tried the tcmalloc benchmark, https://github.com/gperftools/gperftools/tree/master/benchmark
Mesh crashes:
./binary_trees 20 10
/home/gzwanglinggui/3rd/mesh/src/cheap_heap.h:76:char* mesh::CheapHeap<allocSize, maxCount>::ptrFromOffset(size_t) const [with long unsigned int allocSize = 64; long unsigned int maxCount = 3145728; size_t = long unsigned int]: ASSERTION 'off < _arenaOff' FAILED:
The same thing happened: it crashes on Linux but passes on macOS.
Is this all resolved by https://github.com/plasma-umass/Mesh/pull/77?
Not all. For example, running ./test-stress 64 10 50000000 still crashes. We have found the bugs: there are some races in the function freeFor. We are still trying to fix them without losing performance.
The invariants around freeFor are pretty hairy; it's not surprising there are issues. Happy to help reason through them if you post details.
ah, I see:
#2 0x00007ffff7ed8eee in mesh::internal::__mesh_assert_fail (assertion=0x7ffff7eb3e38 "!newEntry->isLargeAlloc()", file=0x7ffff7eb865a "src/internal.h",
func=0x7ffff7eb5d3b "void mesh::ListEntry<mesh::MiniHeap, mesh::MiniHeapID>::add(mesh::ListEntry::Entry *, uint8_t, ID, Object *) [Object = mesh::MiniHeap, ID = mesh::MiniHeapID]", line=146, fmt=0x7ffff7eba3e0 "") at src/d_assert.cc:74
#3 0x00007ffff7ee63ed in mesh::ListEntry<mesh::MiniHeap, mesh::MiniHeapID>::add (this=0x7ffff7f965e8 <mesh::runtime()::buf+17896>, listHead=0x0, listId=0 '\000',
selfId=..., newEntry=0x7fffb7bd2380) at src/internal.h:146
#4 0x00007ffff7edf8d8 in mesh::GlobalHeap::postFreeLocked (this=0x7ffff7f92040 <mesh::runtime()::buf+64>, mh=0x7fffb7bd2380, sizeClass=16, inUse=1)
at src/global_heap.h:203
#5 0x00007ffff7edb763 in mesh::GlobalHeap::freeFor (this=0x7ffff7f92040 <mesh::runtime()::buf+64>, mh=0x7fffb7bd2380, ptr=0x7fefc7ed2c00, startEpoch=194)
at src/global_heap.cc:184
#6 0x00007ffff7ee8e12 in mesh::ThreadLocalHeap::free (this=0x7fefa3a29000, ptr=0x7fefc7ed2c00) at src/thread_local_heap.h:203
#7 mesh_free (ptr=0x7fefc7ed2c00) at src/libmesh.cc:115
#8 0x0000000000401d58 in thread_entry ()
#9 0x00007ffff7efe46c in mesh::Runtime::startThread (threadArgs=0x0) at src/runtime.cc:150
#10 0x00007ffff7e81432 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff7da4913 in clone () from /lib64/libc.so.6
we're basically hitting this TODO:
// TODO: we should really store 'created epoch' on mh and
// check those are the same here, too.
I believe what's happening in the crash I'm seeing is that, while this thread was waiting to acquire the miniheap lock, the miniheap was freed and re-allocated as a large allocation.
Yes, that is one of the race bugs.
Another bug I found is here:
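The 'created epoch' idea from the TODO can be sketched in isolation. This is only a toy model under stated assumptions: ToyMiniHeap, ToyGlobalHeap, createdEpoch, recycle and isSameIncarnation are hypothetical names for illustration and are not part of Mesh. The freeing thread remembers the epoch it saw at lookup time and compares it again once it holds the lock.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for mesh::MiniHeap; not Mesh's real layout.
struct ToyMiniHeap {
  uint64_t createdEpoch = 0;  // assumed field: changes every time the slot is reused
  bool largeAlloc = false;
};

// Hypothetical stand-in for the global heap and its miniheap lock.
class ToyGlobalHeap {
public:
  // Reuse a MiniHeap slot for a fresh allocation, stamping a new epoch.
  void recycle(ToyMiniHeap &mh, bool asLargeAlloc) {
    std::lock_guard<std::mutex> guard(_lock);
    mh.createdEpoch = ++_epoch;
    mh.largeAlloc = asLargeAlloc;
  }

  // Takes the lock and reports whether mh is still the incarnation
  // that the caller observed at lookup time.
  bool isSameIncarnation(const ToyMiniHeap &mh, uint64_t seenEpoch) {
    std::lock_guard<std::mutex> guard(_lock);
    return mh.createdEpoch == seenEpoch;
  }

private:
  std::mutex _lock;
  uint64_t _epoch = 0;
};

// Replays the suspected sequence on a single thread: look up the miniheap,
// let the slot be recycled as a large allocation, then re-check.
bool staleLookupIsDetected() {
  ToyGlobalHeap heap;
  ToyMiniHeap mh;
  heap.recycle(mh, /*asLargeAlloc=*/false);
  const uint64_t seen = mh.createdEpoch;    // captured at lookup time
  const bool before = heap.isSameIncarnation(mh, seen);
  heap.recycle(mh, /*asLargeAlloc=*/true);  // freed and re-allocated meanwhile
  const bool after = heap.isSameIncarnation(mh, seen);
  return before && !after;
}
```

Calling staleLookupIsDetected() exercises both cases: the epoch matches right after lookup and no longer matches once the slot has been reused, so a stale free could bail out instead of reaching the large-allocation assertion.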
  auto remaining = mh->inUseCount() - 1;
  mh->free(arenaBegin(), ptr);
  bool shouldMesh = false;
  // the epoch will be odd if a mesh was in progress when we looked up
  // the miniheap; if that is true, or a meshing started between then
  // and now we can't be sure the above free was successful
  if (startEpoch % 2 == 1 || !_meshEpoch.isSame(startEpoch)) {
    // a mesh was started in between when we looked up our miniheap
    // and now. synchronize to avoid races
    lock_guard<mutex> lock(_miniheapLock);
    const auto origMh = mh;
    mh = miniheapForWithEpoch(ptr, startEpoch);
    if (unlikely(mh != origMh)) {
      hard_assert(!mh->isMeshed());
      mh->free(arenaBegin(), ptr);
    }
Imagine two MiniHeaps, A and B. ptr is in A and the first mh->free() succeeds. If A is then meshed with B before we take the lock, mh = miniheapForWithEpoch(...) returns B, which now maps a different object at that slot, and the second mh->free() releases the wrong object.
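The interleaving described above can be replayed deterministically on one thread. This is a toy model, not Mesh code: ToyMiniHeap, toyMesh and bystanderObjectSurvives are hypothetical names, a 64-bit bitset stands in for the real bitmap, and step 3 (another thread allocating the same offset from B) is an assumption about how the "wrong object" comes to exist.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a MiniHeap: one in-use bit per object slot.
struct ToyMiniHeap {
  std::bitset<64> inUse;
  void freeOff(std::size_t off) { inUse.reset(off); }
};

// Toy meshing: fold src's live objects into dst and empty src.
void toyMesh(ToyMiniHeap &dst, ToyMiniHeap &src) {
  dst.inUse |= src.inUse;
  src.inUse.reset();
}

// Returns whether the object that a second thread allocated from B is still
// marked live at the end of the interleaving.
bool bystanderObjectSurvives() {
  ToyMiniHeap a, b;
  const std::size_t off = 5;
  a.inUse.set(off);            // ptr lives in A at offset 5

  ToyMiniHeap *mh = &a;
  mh->freeOff(off);            // 1. first free on A succeeds
  toyMesh(b, a);               // 2. A is meshed into B before the lock is taken
  b.inUse.set(off);            // 3. assumed: another thread allocates offset 5 from B
  ToyMiniHeap *relooked = &b;  // 4. the re-lookup under the lock now yields B
  if (relooked != mh) {
    relooked->freeOff(off);    // 5. second free clears the other thread's bit
  }
  return b.inUse.test(off);
}
```

In this model bystanderObjectSurvives() returns false: the second free clears a bit that belongs to an unrelated live object. If step 3 is dropped, the second free instead hits a bit that is not set, which is the shape of the '_bitmap.isSet(off)' assertion reported elsewhere in this thread.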
Yeah, great find. I think this one can be papered over more easily (we should just never call mh->free a second time, and we can more carefully/explicitly check for the situation this is supposed to be guarding against: that we set the 'free' bit on the original miniheap while it was in the process of being meshed), but I think figuring out a way to shoehorn the created epoch into the original miniheap would also be helpful here.
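The condition the guard is meant to detect can be stated on its own, separately from any second free. A minimal sketch, assuming a seqlock-style counter that is odd while a mesh is in progress; ToyMeshEpoch, beginMesh, endMesh and freeMayHaveRacedWithMesh are hypothetical names, and only the parity test and isSame(startEpoch) mirror the snippet quoted above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical epoch counter: even when idle, odd while a mesh is running.
class ToyMeshEpoch {
public:
  uint64_t current() const { return _epoch.load(std::memory_order_acquire); }
  bool isSame(uint64_t start) const { return current() == start; }
  void beginMesh() { _epoch.fetch_add(1, std::memory_order_acq_rel); }  // even -> odd
  void endMesh() { _epoch.fetch_add(1, std::memory_order_acq_rel); }    // odd -> even

private:
  std::atomic<uint64_t> _epoch{0};
};

// True when a free that started at startEpoch cannot be trusted: either a
// mesh was already running at lookup time, or one started afterwards.
bool freeMayHaveRacedWithMesh(const ToyMeshEpoch &epoch, uint64_t startEpoch) {
  return startEpoch % 2 == 1 || !epoch.isSame(startEpoch);
}
```

A caller would capture current() before looking up the miniheap and evaluate freeMayHaveRacedWithMesh() after the free. What to do when it returns true is the open design question in this thread; the sketch only isolates the detection step.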
@kyoguan I pushed some changes. Running both the debug and release builds with ./mimalloc-test-stress 64 10 50000000 for a few minutes (but not to completion; I have to shut off the computer for the night), I don't observe any crashes. I don't think this is truly solved, but I'd be interested to hear if this improves things in your testing + setup.
the patch hasn't fixed the bugs.
Program terminated with signal SIGABRT, Aborted.
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1fbaea1700 (LWP 172700))]
(gdb) bt
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f304cc6b9bc in mesh::Runtime::segfaultHandler (context=<optimized out>, siginfo=0x7f1fbaea0830, sig=11) at /home/kyo/work/Mesh/src/runtime.cc:360
#2 mesh::Runtime::segfaultHandler (sig=11, siginfo=0x7f1fbaea0830, context=0x7f1fbaea0700) at /home/kyo/work/Mesh/src/runtime.cc:320
#3 <signal handler called>
#4 std::__atomic_base<unsigned int>::load (__m=std::memory_order_acquire, this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:413
#5 mesh::Flags::is (offset=30, this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:109
#6 mesh::Flags::isMeshed (this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:103
#7 mesh::MiniHeap::isMeshed (this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:332
#8 mesh::GlobalHeap::freeFor (this=0x7f304cc94ea0 <mesh::runtime()::buf+64>, mh=<optimized out>, ptr=0x7f202e640380, startEpoch=<optimized out>) at /home/kyo/work/Mesh/src/global_heap.cc:105
#9 0x00005604f8feadf9 in free_items (p=<optimized out>) at test-stress.c:114
#10 stress (tid=<optimized out>) at test-stress.c:149
#11 0x00005604f8fea5fe in thread_entry (param=<optimized out>) at test-stress.c:289
#12 0x00007f304cc0d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f304cb34293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
That's a more surprising one and feels like a different bug - I would never expect us to observe through the _mhIndex a MiniHeap that has been meshed into another while we are holding the _mhLock.
with your fix, I got another crash:
Core was generated by `./test-stress 64 10 500000'.
Program terminated with signal SIGABRT, Aborted.
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1fbaea1700 (LWP 172700))]
(gdb) bt
#0 raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f304cc6b9bc in mesh::Runtime::segfaultHandler (context=<optimized out>, siginfo=0x7f1fbaea0830, sig=11) at /home/kyo/work/Mesh/src/runtime.cc:360
#2 mesh::Runtime::segfaultHandler (sig=11, siginfo=0x7f1fbaea0830, context=0x7f1fbaea0700) at /home/kyo/work/Mesh/src/runtime.cc:320
#3 <signal handler called>
#4 0x00007f304cc6ff21 in mesh::MiniHeap::freeOff (off=0, this=0x7f300c90dd40) at /usr/include/c++/9/bits/atomic_base.h:413
#5 mesh::MiniHeap::free (ptr=0x7f202e640380, arenaBegin=<optimized out>, this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:191
#6 mesh::GlobalHeap::freeFor (this=0x7f304cc94ea0 <mesh::runtime()::buf+64>, mh=<optimized out>, ptr=0x7f202e640380, startEpoch=139845423437016) at /home/kyo/work/Mesh/src/global_heap.cc:89
#7 0x00005604f8feadf9 in free_items (p=<optimized out>) at test-stress.c:114
#8 stress (tid=<optimized out>) at test-stress.c:149
#9 0x00005604f8fea5fe in thread_entry (param=<optimized out>) at test-stress.c:289
#10 0x00007f304cc0d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007f304cb34293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
all resolved by #81
Mesh can't pass the test on Linux. The test code is from the mimalloc test case: https://github.com/microsoft/mimalloc/blob/master/test/test-stress.c
I made a small change to remove the mimalloc dependency.
kyo@kyo-1080:~/work/mimalloc/test$ LD_PRELOAD=~/work/Mesh/build/lib/libmesh.so ./test-stress
Using 32 threads with a 10% load-per-thread and 50 iterations
/home/kyo/work/Mesh/src/mini_heap.h:195:void mesh::MiniHeap::freeOff(size_t): ASSERTION '_bitmap.isSet(off)' FAILED: MiniHeap(0x7f0a09020e80) expected bit 5 to be set (svOff:0)
or
kyo@kyo-1080:~/work/mimalloc/test$ LD_PRELOAD=~/work/Mesh/build/lib/libmesh.so ./test-stress
Using 32 threads with a 10% load-per-thread and 50 iterations
libmesh: caught null pointer dereference (signal: 11)
I also found that Mesh passes the test-stress test on macOS.
test-stress.txt