plasma-umass / Mesh

A memory allocator that automatically reduces the memory footprint of C/C++ applications.
Apache License 2.0

Mesh crash on test-stress.c #73

Closed kyoguan closed 4 years ago

kyoguan commented 4 years ago

Mesh fails this test on Linux. The test code comes from the mimalloc test suite: https://github.com/microsoft/mimalloc/blob/master/test/test-stress.c

I made a small change to remove the mimalloc dependency.

kyo@kyo-1080:~/work/mimalloc/test$ LD_PRELOAD=~/work/Mesh/build/lib/libmesh.so ./test-stress
Using 32 threads with a 10% load-per-thread and 50 iterations
/home/kyo/work/Mesh/src/mini_heap.h:195:void mesh::MiniHeap::freeOff(size_t): ASSERTION '_bitmap.isSet(off)' FAILED: MiniHeap(0x7f0a09020e80) expected bit 5 to be set (svOff:0)

or

kyo@kyo-1080:~/work/mimalloc/test$ LD_PRELOAD=~/work/Mesh/build/lib/libmesh.so ./test-stress
Using 32 threads with a 10% load-per-thread and 50 iterations
libmesh: caught null pointer dereference (signal: 11)

I also found that Mesh passes the test-stress test on macOS.

test-stress.txt

kyoguan commented 4 years ago

I also tried the tcmalloc benchmark (https://github.com/gperftools/gperftools/tree/master/benchmark), and Mesh crashes there too:

./binary_trees 20 10

/home/gzwanglinggui/3rd/mesh/src/cheap_heap.h:76:char* mesh::CheapHeap<allocSize, maxCount>::ptrFromOffset(size_t) const [with long unsigned int allocSize = 64; long unsigned int maxCount = 3145728; size_t = long unsigned int]: ASSERTION 'off < _arenaOff' FAILED:

The same thing happens here: it crashes on Linux but passes on macOS.

emeryberger commented 4 years ago

Is this all resolved by https://github.com/plasma-umass/Mesh/pull/77?

kyoguan commented 4 years ago

Not all. For example, running ./test-stress 64 10 50000000 still crashes. We have found the bugs: there are some races in the function freeFor. We are still trying to fix them without losing performance.

bpowers commented 4 years ago

the invariants around freeFor are pretty hairy; it's not surprising there are issues. Happy to help reason through them if you post details

bpowers commented 4 years ago

ah, I see:

#2  0x00007ffff7ed8eee in mesh::internal::__mesh_assert_fail (assertion=0x7ffff7eb3e38 "!newEntry->isLargeAlloc()", file=0x7ffff7eb865a "src/internal.h", 
    func=0x7ffff7eb5d3b "void mesh::ListEntry<mesh::MiniHeap, mesh::MiniHeapID>::add(mesh::ListEntry::Entry *, uint8_t, ID, Object *) [Object = mesh::MiniHeap, ID = mesh::MiniHeapID]", line=146, fmt=0x7ffff7eba3e0 "") at src/d_assert.cc:74
#3  0x00007ffff7ee63ed in mesh::ListEntry<mesh::MiniHeap, mesh::MiniHeapID>::add (this=0x7ffff7f965e8 <mesh::runtime()::buf+17896>, listHead=0x0, listId=0 '\000', 
    selfId=..., newEntry=0x7fffb7bd2380) at src/internal.h:146
#4  0x00007ffff7edf8d8 in mesh::GlobalHeap::postFreeLocked (this=0x7ffff7f92040 <mesh::runtime()::buf+64>, mh=0x7fffb7bd2380, sizeClass=16, inUse=1)
    at src/global_heap.h:203
#5  0x00007ffff7edb763 in mesh::GlobalHeap::freeFor (this=0x7ffff7f92040 <mesh::runtime()::buf+64>, mh=0x7fffb7bd2380, ptr=0x7fefc7ed2c00, startEpoch=194)
    at src/global_heap.cc:184
#6  0x00007ffff7ee8e12 in mesh::ThreadLocalHeap::free (this=0x7fefa3a29000, ptr=0x7fefc7ed2c00) at src/thread_local_heap.h:203
#7  mesh_free (ptr=0x7fefc7ed2c00) at src/libmesh.cc:115
#8  0x0000000000401d58 in thread_entry ()
#9  0x00007ffff7efe46c in mesh::Runtime::startThread (threadArgs=0x0) at src/runtime.cc:150
#10 0x00007ffff7e81432 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffff7da4913 in clone () from /lib64/libc.so.6
bpowers commented 4 years ago

we're basically hitting this TODO:

          // TODO: we should really store 'created epoch' on mh and
          // check those are the same here, too.

I believe what's happening in the crash I'm seeing is that while this thread was waiting to acquire the miniheap lock, the miniheap was freed and re-allocated as a large allocation.
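A minimal sketch of the 'created epoch' idea from the TODO, replayed on a single thread. Everything here (ToySlot, ToyHeap, recycleAsLarge, postFree, replayStaleFree) is a hypothetical stand-in and not Mesh's actual code: the freeing thread remembers the epoch it saw at lookup time and, once it holds the lock, refuses to touch a slot whose epoch has changed.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a miniheap slot that can be recycled.
struct ToySlot {
  uint64_t createdEpoch = 0;  // bumped every time the slot is reused
  bool isLargeAlloc = false;
  int inUse = 0;
};

struct ToyHeap {
  std::mutex lock;
  ToySlot slot;

  // Simulates the slot being freed and handed out again as a large allocation.
  void recycleAsLarge() {
    std::lock_guard<std::mutex> guard(lock);
    slot.createdEpoch++;
    slot.isLargeAlloc = true;
    slot.inUse = 1;
  }

  // Post-free bookkeeping. Returns false (and does nothing) when the slot
  // was recycled after the caller looked it up.
  bool postFree(uint64_t seenEpoch) {
    std::lock_guard<std::mutex> guard(lock);
    if (slot.createdEpoch != seenEpoch) {
      return false;  // stale: this is no longer the miniheap we looked up
    }
    assert(!slot.isLargeAlloc);
    slot.inUse--;
    return true;
  }
};

// Single-threaded replay of the interleaving described above.
inline bool replayStaleFree() {
  ToyHeap heap;
  heap.slot.inUse = 2;
  const uint64_t seen = heap.slot.createdEpoch;  // lookup, lock not yet held
  heap.recycleAsLarge();                         // the "other thread" wins the race
  return heap.postFree(seen);                    // stale lookup is rejected
}
```

In this toy, replayStaleFree() returns false because the epoch recorded at lookup no longer matches, so the assertion on isLargeAlloc is never reached. Whether a created epoch can actually be fitted into the real MiniHeap layout is a separate question.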

kyoguan commented 4 years ago

Yes, that is one of the race bugs.

Another bug I found is here:

  auto remaining = mh->inUseCount() - 1;
  mh->free(arenaBegin(), ptr);

  bool shouldMesh = false;

  // the epoch will be odd if a mesh was in progress when we looked up
  // the miniheap; if that is true, or a meshing started between then
  // and now we can't be sure the above free was successful
  if (startEpoch % 2 == 1 || !_meshEpoch.isSame(startEpoch)) {
    // a mesh was started in between when we looked up our miniheap
    // and now.  synchronize to avoid races
    lock_guard<mutex> lock(_miniheapLock);

    const auto origMh = mh;
    mh = miniheapForWithEpoch(ptr, startEpoch);

    if (unlikely(mh != origMh)) {
      hard_assert(!mh->isMeshed());
      mh->free(arenaBegin(), ptr);
    }

Imagine two MiniHeaps, A and B, with ptr in A. The first mh->free() succeeds on A. If A is then meshed with B before the lock is taken, miniheapForWithEpoch returns B instead of A, so mh != origMh and the second mh->free() releases a different object in B. This code would free the wrong object.
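The interleaving can be replayed on one thread with a toy model. This is only a sketch: ToyMiniHeap, meshInto and victimSurvives are hypothetical names, the bitmap is a plain std::bitset rather than Mesh's atomic bitmap, and it assumes another thread reuses the freed offset in B before the second free runs.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical toy miniheap: one "live" bit per object slot.
struct ToyMiniHeap {
  std::bitset<8> live;

  // First-fit allocation; returns the slot index, or live.size() if full.
  std::size_t alloc() {
    for (std::size_t i = 0; i < live.size(); i++) {
      if (!live.test(i)) {
        live.set(i);
        return i;
      }
    }
    return live.size();
  }

  void free(std::size_t off) { live.reset(off); }
};

// Meshing: move src's live bits into dst. Only legal when no offsets collide.
inline void meshInto(ToyMiniHeap& dst, ToyMiniHeap& src) {
  assert((dst.live & src.live).none());
  dst.live |= src.live;
  src.live.reset();
}

// Returns true if the object allocated by the "other thread" is still live
// at the end of the replay.
inline bool victimSurvives(bool freeTwice) {
  ToyMiniHeap a, b;
  a.live.set(3);  // ptr lives at offset 3 of A
  b.live.set(0);
  b.live.set(1);
  b.live.set(2);

  a.free(3);       // first mh->free(): succeeds on A
  meshInto(b, a);  // A is meshed into B before the lock is taken
  const std::size_t victim = b.alloc();  // another thread reuses offset 3 in B
  assert(victim == 3);

  if (freeTwice) {
    b.free(3);  // second mh->free() after the re-lookup resolves to B
  }
  return b.live.test(victim);
}
```

With freeTwice set, the unrelated object at offset 3 of B is released even though only ptr was passed to free. If the slot had not been reused, the second free would instead hit an already-clear bit, which is consistent with the '_bitmap.isSet(off)' assertion reported at the top of this issue.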

bpowers commented 4 years ago

yeah, great find. I think this one can be papered over more easily (we should just never call mh->free a second time, and we can more carefully/explicitly check for the situation this is supposed to be guarding against: that we set the 'free' bit on the original miniheap while it was in the process of being meshed), but I think figuring out a way to shoehorn the created epoch into the original miniheap would also be helpful here.
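One way to make "free only once" explicit, sketched under stated assumptions: the mesher takes the source bitmap with a single atomic exchange, and the free path uses the result of an atomic fetch_and to learn whether it cleared the bit itself or the bit had already been moved. ToyBitmap, meshInto, freeOnce and FreeResult are hypothetical names, this is not the change that was pushed to Mesh, and the destination free is assumed to run with the miniheap lock held.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical 32-slot atomic bitmap.
struct ToyBitmap {
  std::atomic<uint32_t> bits{0};

  // Clears bit `off`; returns true only if this call cleared a set bit.
  bool clear(unsigned off) {
    const uint32_t mask = 1u << off;
    return (bits.fetch_and(~mask) & mask) != 0;
  }

  // Mesher side: atomically take every live bit, leaving the bitmap empty.
  uint32_t takeAll() { return bits.exchange(0); }
};

enum class FreeResult { DoneOnSource, DoneOnDestination, AlreadyFree };

// Move src's live bits into dst (the toy version of a mesh).
inline void meshInto(ToyBitmap& dst, ToyBitmap& src) {
  dst.bits.fetch_or(src.takeAll());
}

// Frees offset `off` exactly once. The destination is touched only when the
// bit was already gone from the source, meaning a mesh moved it first.
inline FreeResult freeOnce(ToyBitmap& src, ToyBitmap& dst, unsigned off) {
  if (src.clear(off)) {
    return FreeResult::DoneOnSource;  // our free won; never free again
  }
  if (dst.clear(off)) {  // assumed to run under the miniheap lock
    return FreeResult::DoneOnDestination;
  }
  return FreeResult::AlreadyFree;
}
```

Because exchange and fetch_and act on the same atomic word, exactly one side observes the bit: either the free clears it on the source, or the mesher carries it to the destination and the free follows it there. The sketch does not address a slot that is reused between the mesh and the destination free, which is where a created epoch would still be needed.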

bpowers commented 4 years ago

@kyoguan I pushed some changes. Running both the debug and release builds with ./mimalloc-test-stress 64 10 50000000 for a few minutes (but not to completion, since I have to shut off the computer for the night), I don't observe any crashes. I don't think this is truly solved, but I'd be interested to hear if this improves things in your testing + setup.

kyoguan commented 4 years ago

the patch hasn't fixed the bugs.

Program terminated with signal SIGABRT, Aborted.
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1fbaea1700 (LWP 172700))]
(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f304cc6b9bc in mesh::Runtime::segfaultHandler (context=<optimized out>, siginfo=0x7f1fbaea0830, sig=11) at /home/kyo/work/Mesh/src/runtime.cc:360
#2  mesh::Runtime::segfaultHandler (sig=11, siginfo=0x7f1fbaea0830, context=0x7f1fbaea0700) at /home/kyo/work/Mesh/src/runtime.cc:320
#3  <signal handler called>
#4  std::__atomic_base<unsigned int>::load (__m=std::memory_order_acquire, this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:413
#5  mesh::Flags::is (offset=30, this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:109
#6  mesh::Flags::isMeshed (this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:103
#7  mesh::MiniHeap::isMeshed (this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:332
#8  mesh::GlobalHeap::freeFor (this=0x7f304cc94ea0 <mesh::runtime()::buf+64>, mh=<optimized out>, ptr=0x7f202e640380, startEpoch=<optimized out>) at /home/kyo/work/Mesh/src/global_heap.cc:105
#9  0x00005604f8feadf9 in free_items (p=<optimized out>) at test-stress.c:114
#10 stress (tid=<optimized out>) at test-stress.c:149
#11 0x00005604f8fea5fe in thread_entry (param=<optimized out>) at test-stress.c:289
#12 0x00007f304cc0d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007f304cb34293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
bpowers commented 4 years ago

that's a more surprising one and feels like a different bug - I would never expect us to observe through the _mhIndex a MiniHeap that has been meshed into another while we are holding the _mhLock.

kyoguan commented 4 years ago

with your fix, I got another crash:


Core was generated by `./test-stress 64 10 500000'.
Program terminated with signal SIGABRT, Aborted.
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1fbaea1700 (LWP 172700))]
(gdb) bt
#0  raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f304cc6b9bc in mesh::Runtime::segfaultHandler (context=<optimized out>, siginfo=0x7f1fbaea0830, sig=11) at /home/kyo/work/Mesh/src/runtime.cc:360
#2  mesh::Runtime::segfaultHandler (sig=11, siginfo=0x7f1fbaea0830, context=0x7f1fbaea0700) at /home/kyo/work/Mesh/src/runtime.cc:320
#3  <signal handler called>
#4  0x00007f304cc6ff21 in mesh::MiniHeap::freeOff (off=0, this=0x7f300c90dd40) at /usr/include/c++/9/bits/atomic_base.h:413
#5  mesh::MiniHeap::free (ptr=0x7f202e640380, arenaBegin=<optimized out>, this=<optimized out>) at /home/kyo/work/Mesh/src/mini_heap.h:191
#6  mesh::GlobalHeap::freeFor (this=0x7f304cc94ea0 <mesh::runtime()::buf+64>, mh=<optimized out>, ptr=0x7f202e640380, startEpoch=139845423437016) at /home/kyo/work/Mesh/src/global_heap.cc:89
#7  0x00005604f8feadf9 in free_items (p=<optimized out>) at test-stress.c:114
#8  stress (tid=<optimized out>) at test-stress.c:149
#9  0x00005604f8fea5fe in thread_entry (param=<optimized out>) at test-stress.c:289
#10 0x00007f304cc0d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007f304cb34293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
kyoguan commented 4 years ago

all resolved by #81