Built executables hang in stack/heap queries

stephenrkell commented 5 years ago

(reported by @clearyf -- with thanks!)

Note that the resulting executables still don't work fully, e.g. the sample test.c in README.md is broken and produces:

$ LD_PRELOAD=/usr/local/src/liballocs/lib/liballocs_preload.so ./test
xed could not decode instruction at 0x0000557aa6486a2a
xed could not decode instruction at 0x00007f3aeb0da64a
xed could not decode instruction at 0x00007f3aeb122c36
xed could not decode instruction at 0x00007f3aeb54ae7e
xed could not decode instruction at 0x00007f3aeb557a4e
test: Warning: mapping of (null) could not extend preceding bigalloc
At 0x557aa64868c5 is a static-allocated object of size 0, type __FUN_FROM___ARG0_int$32__ARG1___PTR___PTR_signed_char$8__FUN_TO_int$32

and just hangs there with 100% CPU usage and requires SIGKILL to kill it.

Querying static storage is fine (functions & static/global variables); stack & heap allocations hang.

stephenrkell commented 5 years ago

I don't think the xed warnings are the issue here. We use xed to walk over the instructions in certain places (principally in the libc) and replace any syscalls with ud2. Then the actual syscall is performed (perhaps in modified form) in the SIGILL handler. All this happens in libsystrap (in the trap-syscalls repository, for now).

clearyf commented 5 years ago

Update: it doesn't hang indefinitely, only for about 5 minutes. A second point is that alloc_get_allocator returns NULL for stack-allocated variables. I modified the test program a bit:

#include <allocs.h>
#include <stdio.h>
#include <stdlib.h> /* malloc */
#define ARRAY_SIZE(x) (sizeof x / sizeof x[0])

int main(int const argc, char const **argv)
{
  int const num = 42;
  int const*const allocated = malloc(sizeof *allocated);

  void const*const ptrs[] = { main, argv, &argc, &num, allocated };
  for (size_t i = 0; i < ARRAY_SIZE(ptrs); ++i) {
    struct allocator const*const allocator = alloc_get_allocator(ptrs[i]);
    if (allocator) {
      printf("At %p is a %s-allocated object of size %zu, type %s\n",
             ptrs[i],
             allocator->name,
             alloc_get_size(ptrs[i]),
             UNIQTYPE_NAME(alloc_get_type(ptrs[i]))
             );
    } else {
      printf("At %p is an object without an allocator\n", ptrs[i]);
    }
  }
  return 0;
}

And it outputs:

$ time LD_PRELOAD=/usr/local/src/liballocs/lib/liballocs_preload.so ./test
xed could not decode instruction at 0x000056127a657a0a
xed could not decode instruction at 0x00007f8e780df64a
xed could not decode instruction at 0x00007f8e78127c36
xed could not decode instruction at 0x00007f8e7854fe7e
xed could not decode instruction at 0x00007f8e7855ca4e
test: Warning: mapping of (null) could not extend preceding bigalloc
test: Warning: mapping of /usr/lib/meta/home/user/test-meta.so could not extend preceding bigalloc
At 0x56127a6578c5 is a static-allocated object of size 0, type __FUN_FROM___ARG0_int$32__ARG1___PTR___PTR_signed_char$8__FUN_TO_int$32
At 0x7ffd87a298e8 is a auxv-allocated object of size 16, type __ARR2___PTR_signed_char$8
At 0x7ffd87a297ac is an object without an allocator
At 0x7ffd87a297dc is an object without an allocator
Segmentation fault (core dumped)

real    5m15.888s
user    0m0.432s
sys 5m15.102s

It almost instantly prints everything up to and including the two "object without an allocator" lines, then hangs for the next 5 minutes before eventually segfaulting. ps -al and top show that it is 100% system time, but wchan is always 0. Also interesting is the virtual memory usage, which is about 16.3TB according to htop. I've attached the /proc/$PID/smaps file; there are two enormous mappings in there, but with 4kB pages. Is this expected? I'm guessing the "hang" is the process somehow scanning all (or a large portion) of the 16.3TB of mapped memory in kernel space, which is going to take a while no matter what.

However, when I run the program inside gdb, it segfaults immediately, without the 5-minute wait:

(gdb) set environment LD_PRELOAD = /usr/local/src/liballocs/lib/liballocs_preload.so
(gdb) handle SIGILL noprint nostop
Signal        Stop  Print   Pass to program Description
SIGILL        No    No  Yes     Illegal instruction
(gdb) run
Starting program: /home/user/test 
xed could not decode instruction at 0x00005611db94be2a
xed could not decode instruction at 0x00007f30e8f1a64a
xed could not decode instruction at 0x00007f30e8f62c36
xed could not decode instruction at 0x00007f30e8fabe7e
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
xed could not decode instruction at 0x000055d0b0a0ba0a
xed could not decode instruction at 0x00007f8cc0fe064a
xed could not decode instruction at 0x00007f8cc1028c36
xed could not decode instruction at 0x00007f8cc1450e7e
xed could not decode instruction at 0x00007f8cc145da4e
test: Warning: mapping of (null) could not extend preceding bigalloc
test: Warning: mapping of /usr/lib/meta/home/user/test-meta.so could not extend preceding bigalloc
At 0x55d0b0a0b8c5 is a static-allocated object of size 0, type __FUN_FROM___ARG0_int$32__ARG1___PTR___PTR_signed_char$8__FUN_TO_int$32
At 0x7ffc78010268 is a auxv-allocated object of size 16, type __ARR2___PTR_signed_char$8
At 0x7ffc7801012c is an object without an allocator
At 0x7ffc7801015c is an object without an allocator

Program received signal SIGSEGV, Segmentation fault.
__generic_heap_get_info (obj=<optimized out>, maybe_bigalloc=<optimized out>, out_type=0x0, out_base=0x0, out_size=0x7ffc780100f8, 
    out_site=0x0) at /usr/local/src/liballocs/src/allocators/generic_malloc.c:1422
1422                *out_size = requested_size_for_chunk(*out_base, alloc_usable_chunksize);
(gdb) bt
#0  __generic_heap_get_info (obj=<optimized out>, maybe_bigalloc=<optimized out>, out_type=0x0, out_base=0x0, 
    out_size=0x7ffc780100f8, out_site=0x0) at /usr/local/src/liballocs/src/allocators/generic_malloc.c:1422
#1  0x00007f8cc156f506 in __liballocs_get_alloc_info (out_alloc_site=0x0, out_alloc_uniqtype=0x0, 
    out_alloc_size_bytes=0x7ffc780100f8, out_alloc_start=0x0, out_allocator=0x0, obj=0x55d0b0c922c0) at ../include/pageindex.h:191
#2  __liballocs_get_alloc_size (obj=0x55d0b0c922c0) at /usr/local/src/liballocs/src/liballocs.c:1792
#3  0x000055d0b0a0b963 in main (argc=<optimized out>, argv=0x7ffc78010268) at test.c:15
(gdb)

liballocs-issue32-smaps.txt

Any ideas on how to proceed from here? While I might eventually be able to figure this out, any pointers in the right direction would help a lot.

stephenrkell commented 5 years ago

Sorry for the delayed reply. The huge mappings are intentional, but are supposed to be used only sparsely; indeed this bug sounds like some code is scanning/writing to a huge portion of this. I have seen some bugs of this form in the past, although I believed them all to be fixed! Usually the fix amounts to "not shadowing the shadows" (the huge array is for keeping metadata, but we don't keep metadata on it), or not interpreting (void*)-1 as a legit address.

I would try breaking on memset_bigalloc in pageindex.c. To do so is sometimes a problem in gdb... 'starti' might get you control early enough, but otherwise run with LIBALLOCS_DELAY_STARTUP=1 in your environment, and then attach gdb from the outside.

I can look at this myself if you can get me a sharp enough reproducer... was this built in a container?

clearyf commented 5 years ago

Don't worry, I've been busy myself. This was built inside a docker container, using the command provided in README.md: docker build -t liballocs_built liballocs/buildtest/debian-buster/. I checked out liballocs at the end of September; I've attached the output of git submodule status --recursive, and my branch is clearyf/debian-buster. This is running inside my nominally buster system. I say nominally because it was on unstable for a long time, but coming up to the buster release I stopped tracking unstable and have stayed on buster since the release. However, some packages slipped through. I've finally gotten that cleaned up sufficiently that I can apt-get the entire list of liballocs dependencies and test on my local machine, but that won't happen until the weekend. In the meantime I've attached the dpkg -l output from the buster docker image too.

submodule-status.txt dpkg-l.txt

clearyf commented 5 years ago

So I have rebuilt the docker image from current origin/master (ce0d7dc), with the submodules still the same, using the same buster image as before (I hadn't removed it). And now my test program works just fine:

$ LD_PRELOAD=/usr/local/src/liballocs/lib/liballocs_preload.so ./test
xed could not decode instruction at 0x00005561ee8e3a0a
xed could not decode instruction at 0x00007f1237d0064a
xed could not decode instruction at 0x00007f1237d48c36
xed could not decode instruction at 0x00007f1238170e7e
xed could not decode instruction at 0x00007f123817da4e
test: Warning: mapping of (null) could not extend preceding bigalloc
test: Warning: mapping of /usr/lib/meta/home/user/test-meta.so could not extend preceding bigalloc
At 0x5561ee8e38c5 is a static-allocated object of size 0, type __FUN_FROM___ARG0_int$32__ARG1___PTR___PTR_signed_char$8__FUN_TO_int$32
At 0x7ffc30ad76b8 is a auxv-allocated object of size 16, type __ARR2___PTR_signed_char$8
At 0x7ffc30ad757c is a stackframe-allocated object of size 104, type _test_cil_c_main_vaddrs_0x1915_0x1995
At 0x7ffc30ad75ac is a stackframe-allocated object of size 104, type _test_cil_c_main_vaddrs_0x1915_0x1995
At 0x5561effb22c0 is a generic malloc-allocated object of size 4, type __ARR_int$32
====================================================
liballocs summary: 
----------------------------------------------------
queries aborted for unknown storage:               0
queries handled by static case:                    3
queries handled by stack case:                     6
queries handled by heap case:                      3
----------------------------------------------------
queries aborted for unindexed heap:                0
queries aborted for unknown heap allocsite:        0
queries aborted for unknown stackframes:           0
queries aborted for unknown static obj:            0
====================================================

I don't know what's at fault here. I can change the Dockerfile to check out where I was (roughly clearyf/debian-buster) and retry the docker build, but that's not going to happen this evening. In the meantime we can ponder what is going on. At this stage, if nothing related has changed recently in your liballocs master branch (which seems true based on the git log), then I'd almost be ready to chalk this one up to bitflips when the system is under load, or something along those lines.

stephenrkell commented 5 years ago

Thanks for this. I'm sort-of glad it works now! But yes, this is troubling. I suspect there is a real bug there, but that it is sensitive to memory placement. I have never quite understood how the kernel decides to place things, but sometimes your environment can make a difference (e.g. it's quite common for some bugs of this kind to show when running in gdb but not otherwise, or vice-versa).

So let's leave this open and see if it crops up again. If you do get time to investigate further, that would be great, though don't feel obliged!

clearyf commented 5 years ago

I have had things like this before, e.g. code generating an impossible result, fixed by cleaning out the ccache and rebuilding. I'll try to keep an eye out in future.

stephenrkell commented 3 years ago

I haven't seen any recurrence of this, and enough has changed that a new issue would be merited anyway... so closing this one.

stephenrkell / liballocs

Built executables hang in stack/heap queries #32