Closed casperisfine closed 1 year ago
What is interesting in this backtrace is that we're triggering GC when allocating a new arena:
pm->head.mask = 0xffffffff & (~1); /* "& (~1)" means first chunk is already allocated */
pm->head.pages = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);
This may help find a repro.
Additionally, the bug triggers in a recursive unpacker:
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:150:in `load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:101:in `with'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:152:in `block in load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:152:in `full_unpack'
/gems/paquito-0.10.0/lib/paquito/types.rb:355:in `block in register_serializable_type'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:64:in `load'
/bundler/gems/msgpack-ruby-51da8d82cb7a/lib/msgpack/factory.rb:64:in `unpacker'
Ok, so I managed to find a semi-decent local repro via bootsnap precompile of one of our large apps, using a modified msgpack to always trigger GC in that spot. It's reliable enough that I can get a crash relatively quickly, but I doubt I can turn it into a standalone repro.
Either way, after adding some debug output, I managed to confirm that the same pointer gets assigned to two distinct chunks:
allocated = 0x118280000
// snip...
chunk = 0x600003b055f0 msgpack_rmem_free 0x118280000
// snip...
chunk_free freeing = 0x118280000
// snip...
chunk = 0x600003b05510 msgpack_rmem_free 0x118280000
/Users/byroot/.gem/ruby/3.2.1/gems/bootsnap-1.16.0/lib/bootsnap/compile_cache/yaml.rb:192: [BUG] MessagePack::Buffer: Failed to free an rmem pointer (0x0000000118280000), memory leak?
So now the question is to figure out how that happens.
More debug:
chunk_malloc (alloc) chunk = 0x600003b208d0 mem = 0x1183b8000
chunk_malloc (alloc) chunk = 0x600003b195f0 mem = 0x1183b8000
That debug is right there in _msgpack_buffer_chunk_malloc: https://github.com/msgpack/msgpack-ruby/blob/51da8d82cb7ae92698442be830d9577398ecff10/ext/msgpack/buffer.c#L351-L352
Interestingly, these two are right one after the other, so somehow msgpack_rmem_alloc returned the same address twice consecutively, without any free happening in between.
More debug:
static inline void* msgpack_rmem_alloc(msgpack_rmem_t* pm)
{
    if(_msgpack_rmem_chunk_available(&pm->head)) {
        void *ptr = _msgpack_rmem_chunk_alloc(&pm->head);
        fprintf(stderr, "msgpack_rmem_alloc available = %p\n", ptr);
        return ptr;
    } else {
        void *ptr = _msgpack_rmem_alloc2(pm);
        fprintf(stderr, "msgpack_rmem_alloc new = %p\n", ptr);
        return ptr;
    }
}
msgpack_rmem_alloc new = 0x1184c8000
chunk_malloc (alloc) chunk = 0x600003b0dcf0 mem = 0x1184c8000
msgpack_rmem_alloc available = 0x1184c8000
chunk_malloc (alloc) chunk = 0x600003b0deb0 mem = 0x1184c8000
So what we see here is:
First, msgpack_rmem_alloc goes through _msgpack_rmem_alloc2 and returns that pointer.
Then the very next call goes through _msgpack_rmem_chunk_alloc, which returns the same pointer _msgpack_rmem_alloc2 just returned us.
Something's up here.
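To make the symptom concrete, here is a minimal sketch of a bitmask page allocator in the same spirit. It is not the msgpack code: the toy_* names and the 4096-byte page size are assumptions made for illustration, and the bit scan is a plain loop. It shows that if bit 0 of the mask gets set while the first page is still handed out, the next allocation from that chunk returns the base pointer again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size, for illustration only */
#define TOY_PAGES 32u       /* one mask bit per page */

/* Hypothetical stand-in for a chunk: a set bit means the page is free. */
typedef struct {
    uint32_t mask;
    char *pages;
} toy_chunk_t;

static int toy_chunk_available(const toy_chunk_t *c)
{
    return c->mask != 0;
}

/* Hand out the lowest-numbered free page and clear its bit. */
static void *toy_chunk_alloc(toy_chunk_t *c)
{
    assert(toy_chunk_available(c));
    unsigned pos = 0;
    while (((c->mask >> pos) & 1u) == 0) {
        pos++;
    }
    c->mask &= ~(UINT32_C(1) << pos);
    return c->pages + (size_t)pos * TOY_PAGE_SIZE;
}

/* Set up a fresh chunk the way the new-chunk path does: page 0 is
   treated as already handed out, so its bit starts cleared. */
static void *toy_chunk_init(toy_chunk_t *c, char *backing)
{
    c->mask = 0xffffffffu & ~1u;
    c->pages = backing;
    return c->pages;
}
```

With a chunk initialised by toy_chunk_init, setting bit 0 by hand (c.mask |= 1u) and then calling toy_chunk_alloc yields the same address that toy_chunk_init returned, which is the "new" then "available" pair in the log above.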
Using the following debug:
pm->head.mask = 0xffffffff & (~1); /* "& (~1)" means first chunk is already allocated */
void *ptr = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);
fprintf(stderr, "allocated = %p, mask = %08x\n", ptr, pm->head.mask);
fprintf(stderr, "gc_start ------------------------------------------------\n");
rb_gc_start(); // try to simulate crash
fprintf(stderr, "gc_end ------------------------------------------------\n");
fprintf(stderr, "after_gc mask = %08x\n", pm->head.mask);
pm->head.pages = ptr;
I was able to notice the following pattern:
allocated = 0x110510000, mask = fffffffe
gc_start ------------------------------------------------
.... lots of free
gc_end ------------------------------------------------
after_gc mask = ffffffff
So when the GC triggers here, some chunks are freed, and one of those frees incorrectly releases the first page of the chunk we're in the middle of setting up: bit 0 of the mask flips from 0 to 1.
I'll keep digging as to why.
NB: this is with -O0.
There is also this very weird transition:
chunk = 0x600000ca2ca0 msgpack_rmem_free 0x138030000
rmem_chunk_alloc rmem_chunk = 0x107fe8010, mask = fffffffe -> ffffffff, pos = 0, mem = 0x138030000
chunk = 0x600003b200f0 msgpack_rmem_free 0x138031000
rmem_chunk_alloc rmem_chunk = 0x107fe8010, mask = ffffffff -> ffffffff, pos = 1, mem = 0x138031000
Note the ffffffff -> ffffffff, which means we're freeing a slot that is already marked as free.
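The arithmetic behind those two log lines can be sketched as follows. This is an illustration only: the toy_* helpers are hypothetical, and the 4096-byte page size is an assumption that matches the 0x1000 gap between the two logged addresses.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096 /* assumed; consistent with the 0x1000 gap in the log */
#define TOY_PAGES 32

/* Page index of mem inside a chunk starting at base, or -1 if mem is outside it. */
static int toy_page_index(uint64_t base, uint64_t mem)
{
    if (mem < base || mem - base >= (uint64_t)TOY_PAGE_SIZE * TOY_PAGES) {
        return -1;
    }
    return (int)((mem - base) / TOY_PAGE_SIZE);
}

/* Mark a page free. Returns 1 if the page was in use, 0 if its bit was
   already set, i.e. the free is a double free or hit the wrong chunk. */
static int toy_mark_free(uint32_t *mask, int pos)
{
    assert(pos >= 0 && pos < TOY_PAGES);
    uint32_t bit = UINT32_C(1) << pos;
    int was_in_use = (*mask & bit) == 0;
    *mask |= bit;
    return was_in_use;
}
```

Starting from mask fffffffe, freeing 0x138030000 gives index 0 and a legitimate transition to ffffffff; freeing 0x138031000 gives index 1, whose bit is already set, so toy_mark_free returns 0.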
I can now crash earlier with:
static inline bool _msgpack_rmem_chunk_try_free(msgpack_rmem_chunk_t* c, void* mem)
{
    ptrdiff_t pdiff = ((char*)(mem)) - ((char*)(c)->pages);
    if(0 <= pdiff && pdiff < MSGPACK_RMEM_PAGE_SIZE * 32) {
        size_t pos = pdiff / MSGPACK_RMEM_PAGE_SIZE;
        unsigned int mask_before = (c)->mask;
        /* 1U rather than 1: shifting a signed 1 into bit 31 is undefined behaviour */
        if ((c)->mask & (1U << pos)) {
            rb_bug("_msgpack_rmem_chunk_try_free %p was already freed", mem);
        }
        (c)->mask |= (1U << pos);
        fprintf(stderr, "rmem_chunk_alloc rmem_chunk = %p, mask = %08x -> %08x, pos = %zu, mem = %p\n", c, mask_before, (c)->mask, pos, mem);
        return true;
    }
    return false;
}
[BUG] _msgpack_rmem_chunk_try_free 0x000000010817e000 was already freed
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(rb_bug+0x1c) [0x100ad1f84]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_rmem_chunk_try_free+0xa4) [0x1206d33c0]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_rmem_free+0x20) [0x1206d3228]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_chunk_destroy+0x74) [0x1206d15a4]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_destroy+0x68) [0x1206d14e8]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_destroy+0x18) [0x1206d713c]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(Packer_free+0x3c) [0x1206d9840]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(obj_free+0x8a4) [0x1008199c0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep_page+0x274) [0x10081901c]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep_step+0x134) [0x1008177c0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_sweep+0xa44) [0x100817170]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(gc_start+0xd9c) [0x10081c5b0]
/opt/rubies/3.2.1/lib/libruby.3.2.dylib(rb_gc_start+0x60) [0x10080c8d8]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_rmem_alloc2+0x21c) [0x1206dd488]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_rmem_alloc+0x64) [0x1206d3748]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_chunk_malloc+0x74) [0x1206d25d4]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_expand+0x174) [0x1206d222c]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(_msgpack_buffer_append_impl+0x98) [0x1206d8acc]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_append+0x30) [0x1206d89d0]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_buffer_append_string+0x68) [0x1206d8904]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_write_string_value+0x124) [0x1206d82bc]
/Users/byroot/src/github.com/Shopify/msgpack-ruby/lib/msgpack/msgpack.bundle(msgpack_packer_write_value+0xcc) [0x1206d7498]
I figured it out!
diff --git a/ext/msgpack/rmem.c b/ext/msgpack/rmem.c
index 7e5f5e5..d480372 100644
--- a/ext/msgpack/rmem.c
+++ b/ext/msgpack/rmem.c
@@ -70,6 +70,7 @@ void* _msgpack_rmem_alloc2(msgpack_rmem_t* pm)
     pm->head = *c;
     *c = tmp;
+    pm->head.pages = NULL; /* make sure we don't point to another chunk's pages in case xmalloc triggers GC */
     pm->head.mask = 0xffffffff & (~1); /* "& (~1)" means first chunk is already allocated */
     pm->head.pages = xmalloc(MSGPACK_RMEM_PAGE_SIZE * 32);
We just copied the head to the back of the list, but pm->head.pages hasn't changed. So if xmalloc triggers GC and we free any buffer that was on that previous head, we'll free it from the wrong chunk.
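The failure and the fix can be reproduced in a small model. This is a simplified sketch, not the real rmem.c: every toy_* name is hypothetical, the page size is assumed, the list is a fixed-size array, and the GC that may run inside xmalloc is replaced by an explicit free issued at the same point. The NULL check in toy_chunk_try_free is spelled out for clarity in the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size */
#define TOY_PAGES 32u

typedef struct { uint32_t mask; char *pages; } toy_chunk_t;

typedef struct {
    toy_chunk_t head;
    toy_chunk_t array[4]; /* fixed capacity keeps the model short */
    size_t array_len;
} toy_rmem_t;

/* Free mem from chunk c if it lies inside c's pages; 1 on success. */
static int toy_chunk_try_free(toy_chunk_t *c, char *mem)
{
    if (c->pages == NULL) return 0;
    uintptr_t base = (uintptr_t)c->pages;
    uintptr_t addr = (uintptr_t)mem;
    if (addr < base || addr - base >= (uintptr_t)TOY_PAGE_SIZE * TOY_PAGES) return 0;
    c->mask |= UINT32_C(1) << ((addr - base) / TOY_PAGE_SIZE);
    return 1;
}

/* The head is checked first, then the rest of the list. */
static int toy_rmem_free(toy_rmem_t *pm, char *mem)
{
    if (toy_chunk_try_free(&pm->head, mem)) return 1;
    for (size_t i = 0; i < pm->array_len; i++) {
        if (toy_chunk_try_free(&pm->array[i], mem)) return 1;
    }
    return 0;
}

/* Model of the new-chunk path. freed_during_gc stands in for a buffer
   that GC releases while the new pages are being allocated. */
static char *toy_new_chunk(toy_rmem_t *pm, char *backing,
                           char *freed_during_gc, int clear_pages)
{
    assert(pm->array_len < 4);
    pm->array[pm->array_len++] = pm->head; /* old head goes to the back */
    if (clear_pages) pm->head.pages = NULL; /* the one-line fix */
    pm->head.mask = 0xffffffffu & ~1u;
    if (freed_during_gc != NULL) toy_rmem_free(pm, freed_during_gc);
    pm->head.pages = backing;
    return pm->head.pages;
}
```

Without the fix, a free of the old head's first page during toy_new_chunk lands on the new head and leaves its mask at ffffffff, so the page just returned is considered free, while the old chunk never gets its page back. With clear_pages set, the head is skipped and the free reaches the chunk that actually owns the page.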
(Somehow GitHub won't let me push my branch with the fix, but I'll open a PR as soon as it lets me.)
We've been digging into this issue with @peterzhu2118 for a couple of weeks. We've noticed some crashes in production, but couldn't figure them out.
So we implemented https://github.com/msgpack/msgpack-ruby/pull/323 to make it easier to debug, and now we caught an instance of this bug on CI: