[BUG] Failed to free an rmem pointer, memory leak?

casperisfine commented 1 year ago

We've been digging into this issue with @peterzhu2118 for a couple weeks. We've noticed some crashes in production, but couldn't figure it out.

So we implemented https://github.com/msgpack/msgpack-ruby/pull/323 to make it easier to debug, and now we caught an instance of this bug on CI:

[BUG] Failed to free an rmem pointer, memory leak?

-- C level backtrace information -------------------------------------------
/usr/local/ruby/bin/ruby(rb_print_backtrace+0x11) [0x556452739d98] vm_dump.c:785
/usr/local/ruby/bin/ruby(rb_vm_bugreport) vm_dump.c:1080
/usr/local/ruby/bin/ruby(bug_report_end+0x0) [0x5564528d5d0f] error.c:790
/usr/local/ruby/bin/ruby(rb_bug_without_die) error.c:790
/usr/local/ruby/bin/ruby(die+0x0) [0x556452526d39] error.c:798
/usr/local/ruby/bin/ruby(rb_bug) error.c:800
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(_msgpack_unpacker_destroy+0x6f) [0x7f06b824791f] unpacker.c:78
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(Unpacker_free+0x21) [0x7f06b8248461] unpacker_class.c:47
/usr/local/ruby/bin/ruby(obj_free+0xec) [0x5564525442ec] gc.c:3606
/usr/local/ruby/bin/ruby(gc_sweep_plane+0x51) [0x55645254bc19] gc.c:5596
/usr/local/ruby/bin/ruby(gc_sweep_page) gc.c:5681
/usr/local/ruby/bin/ruby(gc_sweep_step) gc.c:5992
/usr/local/ruby/bin/ruby(gc_sweep_rest+0x1e) [0x556452553a16] gc.c:6058
/usr/local/ruby/bin/ruby(gc_rest) gc.c:9586
/usr/local/ruby/bin/ruby(gc_rest+0x20) [0x556452556ce7] gc.c:9576
/usr/local/ruby/bin/ruby(objspace_malloc_increase_body) gc.c:12167
/usr/local/ruby/bin/ruby(objspace_malloc_increase_body) gc.c:12148
/usr/local/ruby/bin/ruby(objspace_malloc_fixup+0x13) [0x5564525576ef] gc.c:12258
/usr/local/ruby/bin/ruby(objspace_xmalloc0) gc.c:12335
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(_msgpack_rmem_alloc2+0x94) [0x7f06b8245ab4] rmem.c:74
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(_msgpack_unpacker_init+0x8c) [0x7f06b82478ac] rmem.h:82
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(MessagePack_Unpacker_alloc+0x2b) [0x7f06b824837b] unpacker_class.c:87
/tmp/bundle/ruby/3.2.0/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/msgpack.so(MessagePack_Factory_unpacker+0x39) [0x7f06b8240939] factory_class.c:172
/usr/local/ruby/bin/ruby(vm_cfp_consistent_p+0x0) [0x556452718fd5] vm_insnhelper.c:3268
/usr/local/ruby/bin/ruby(vm_call_cfunc_with_frame) vm_insnhelper.c:3270
/usr/local/ruby/bin/ruby(vm_sendish+0x0) [0x55645272b56c] vm_insnhelper.c:5080
/usr/local/ruby/bin/ruby(vm_exec_core) insns.def:820
/usr/local/ruby/bin/ruby(rb_vm_exec+0xcb) [0x55645271b61b] vm.c:2374
/usr/local/ruby/bin/ruby(rb_vm_invoke_proc+0x33) [0x556452723473] vm.c:1603
/usr/local/ruby/bin/ruby(vm_call0_body+0x59f) [0x556452723a5f] vm_eval.c:274
/usr/local/ruby/bin/ruby(rb_funcallv_scope+0x1b0) [0x556452727a40] vm_eval.c:87

casperisfine commented 1 year ago

What is interesting in this backtrace is that we're triggering GC when allocating a new arena:

    pm->head.mask = 0xffffffff & (~1);  /* "& (~1)" means first chunk is already allocated */
    pm->head.pages = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);

This may help find a repro.

casperisfine commented 1 year ago

Additionally, the bug trigger in a recursive unpacker:

/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:150:in `load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:101:in `with'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:152:in `block in load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:152:in `full_unpack'
/gems/paquito-0.10.0/lib/paquito/types.rb:355:in `block in register_serializable_type'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:64:in `load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:64:in `unpacker'

https://github.com/Shopify/paquito/blob/7b97f481a3690fe5531aebe2273b779481cfa4ac/lib/paquito/types.rb#L355

casperisfine commented 1 year ago

Ok, so I managed to find a semi-decent local repro via bootsnap precompile of one of our large apps, using a modified msgpack to always trigger GC in that spot. It's reliable enough that I can get a crash relatively quickly, but I doubt I can turn it into a standalone repro.

Either way, adding some debug, I managed to confirm that the same pointer get assigned to two distinct chunks:

allocated = 0x118280000
// snip...
chunk = 0x600003b055f0 msgpack_rmem_free 0x118280000
// snip...
chunk_free freeing = 0x118280000
// snip...
chunk = 0x600003b05510 msgpack_rmem_free 0x118280000
/Users/byroot/.gem/ruby/3.2.1/gems/bootsnap-1.16.0/lib/bootsnap/compile_cache/yaml.rb:192: [BUG] MessagePack::Buffer: Failed to free an rmem pointer (0x0000000118280000), memory leak?

So now the question is to figure out how that happens.

casperisfine commented 1 year ago

More debug:

chunk_malloc (alloc) chunk = 0x600003b208d0 mem = 0x1183b8000
chunk_malloc (alloc) chunk = 0x600003b195f0 mem = 0x1183b8000

That debug is right there in _msgpack_buffer_chunk_malloc: https://github.com/msgpack/msgpack-ruby/blob/51da8d82cb7ae92698442be830d9577398ecff10/ext/msgpack/buffer.c#L351-L352

Interestingly, these two are right one after the other, so somehow msgpack_rmem_alloc returned the same address twice consecutively, without any free happening in between.

casperisfine commented 1 year ago

More debug:

static inline void* msgpack_rmem_alloc(msgpack_rmem_t* pm)
{
    if(_msgpack_rmem_chunk_available(&pm->head)) {
        void *ptr = _msgpack_rmem_chunk_alloc(&pm->head);
        fprintf(stderr, "msgpack_rmem_alloc available = %p\n", ptr);
        return ptr;
    } else {
        void *ptr = _msgpack_rmem_alloc2(pm);
        fprintf(stderr, "msgpack_rmem_alloc new = %p\n", ptr);
        return ptr;
    }
}

msgpack_rmem_alloc new = 0x1184c8000
chunk_malloc (alloc) chunk = 0x600003b0dcf0 mem = 0x1184c8000
msgpack_rmem_alloc available = 0x1184c8000
chunk_malloc (alloc) chunk = 0x600003b0deb0 mem = 0x1184c8000

So what we see here is:

alloc a new chunk, no place left in the arena, so we call _msgpack_rmem_alloc2 and return that pointer.
alloc a second chunk, and somehow _msgpack_rmem_chunk_alloc return the same pointer _msgpack_rmem_alloc2 just returned us.

Something's up here.

casperisfine commented 1 year ago

Using the following debug:

    pm->head.mask = 0xffffffff & (~1);  /* "& (~1)" means first chunk is already allocated */
    void *ptr = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);
    fprintf(stderr, "allocated = %p, mask = %08x\n", ptr, pm->head.mask);
    fprintf(stderr, "gc_start ------------------------------------------------\n");
    rb_gc_start(); // try to simulate crash
    fprintf(stderr, "gc_end ------------------------------------------------\n");
    fprintf(stderr, "after_gc mask = %08x\n", pm->head.mask);
    pm->head.pages = ptr;

I was able to notice the following pattern:

allocated = 0x110510000, mask = fffffffe
gc_start ------------------------------------------------
.... lots of free
gc_end ------------------------------------------------
after_gc mask = ffffffff

So when the GC triggers here, some chunks are freed and incorrectly release the first page of that chunk.

I'll keep digging as of why.

NB: this is with -O0.

casperisfine commented 1 year ago

There is also this very weird transition:

chunk = 0x600000ca2ca0 msgpack_rmem_free 0x138030000
rmem_chunk_alloc rmem_chunk = 0x107fe8010, mask = fffffffe -> ffffffff, pos = 0, mem = 0x138030000
chunk = 0x600003b200f0 msgpack_rmem_free 0x138031000
rmem_chunk_alloc rmem_chunk = 0x107fe8010, mask = ffffffff -> ffffffff, pos = 1, mem = 0x138031000

Note the ffffffff -> ffffffff, which means we're freeing a slot that is already marked as free.

I can now crash earlier with:

static inline bool _msgpack_rmem_chunk_try_free(msgpack_rmem_chunk_t* c, void* mem)
{
    ptrdiff_t pdiff = ((char*)(mem)) - ((char*)(c)->pages);
    if(0 <= pdiff && pdiff < MSGPACK_RMEM_PAGE_SIZE * 32) {
        size_t pos = pdiff / MSGPACK_RMEM_PAGE_SIZE;
        unsigned int mask_before = (c)->mask;
        if ((c)->mask & (1 << pos)) {
            rb_bug("_msgpack_rmem_chunk_try_free %p was already freed", mem);
        }
        (c)->mask |= (1 << pos);
        fprintf(stderr, "rmem_chunk_alloc rmem_chunk = %p, mask = %08x -> %08x, pos = %lu, mem = %p\n", c, mask_before, (c)->mask, pos, mem);
        return true;
    }
    return false;
}

[BUG] _msgpack_rmem_chunk_try_free 0x000000010817e000 was already freed

/opt/rubies/3.2.1/lib/libruby.3.2.dylib(rb_bug+0x1c) [0x100ad1f84]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_rmem_chunk_try_free+0xa4) [0x1206d33c0]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_rmem_free+0x20) [0x1206d3228]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_chunk_destroy+0x74) [0x1206d15a4]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_destroy+0x68) [0x1206d14e8]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_destroy+0x18) [0x1206d713c]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(Packer_free+0x3c) [0x1206d9840]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(obj_free+0x8a4) [0x1008199c0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep_page+0x274) [0x10081901c]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep_step+0x134) [0x1008177c0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep+0xa44) [0x100817170]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_start+0xd9c) [0x10081c5b0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(rb_gc_start+0x60) [0x10080c8d8]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_rmem_alloc2+0x21c) [0x1206dd488]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_rmem_alloc+0x64) [0x1206d3748]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_chunk_malloc+0x74) [0x1206d25d4]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_expand+0x174) [0x1206d222c]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_append_impl+0x98) [0x1206d8acc]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_append+0x30) [0x1206d89d0]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_append_string+0x68) [0x1206d8904]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_write_string_value+0x124) [0x1206d82bc]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_write_value+0xcc) [0x1206d7498]

casperisfine commented 1 year ago

I figured it out!

diff --git a/ext/msgpack/rmem.c b/ext/msgpack/rmem.c
index 7e5f5e5..d480372 100644
--- a/ext/msgpack/rmem.c
+++ b/ext/msgpack/rmem.c
@@ -70,6 +70,7 @@ void* _msgpack_rmem_alloc2(msgpack_rmem_t* pm)
     pm->head = *c;
     *c = tmp;

+    pm->head.pages = NULL; /* make sure we don't point to another chunk's pages in case xmalloc triggers GC */
     pm->head.mask = 0xffffffff & (~1);  /* "& (~1)" means first chunk is already allocated */
     pm->head.pages = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);

We just copied the head at the back of the list, but pm->head.pages hasn't changed. So if we free any buffer that was on that previous head, we'll free it from the wrong chunk.

(Somehow GitHub won't let me push my branch with the fix, but I'll open a PR as soon as it let me do it).

msgpack / msgpack-ruby

[BUG] Failed to free an rmem pointer, memory leak? #327