tarantool / tarantool-qa

QA related issues of Tarantool
3 stars 0 forks source link

Segfault while running tests against ASAN build (Clang-16) #321

Closed ylobankov closed 1 year ago

ylobankov commented 1 year ago

Tarantool

Tarantool 3.0.0-alpha1-66-g10870343a
Target: Linux-x86_64-RelWithDebInfo
Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=TRUE
Compiler: Clang-16.0.6
C_FLAGS: -fexceptions -funwind-tables -fasynchronous-unwind-tables -fno-common -msse2 -fsanitize=fuzzer-no-link -fsanitize=address -fsanitize-blacklist=/tarantool/asan/asan.supp  -fmacro-prefix-map=/tarantool=. -std=c11 -Wall -Wextra -fsanitize=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -fno-sanitize-recover=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror
CXX_FLAGS: -fexceptions -funwind-tables -fasynchronous-unwind-tables -fno-common -msse2 -fsanitize=fuzzer-no-link -fsanitize=address -fsanitize-blacklist=/tarantool/asan/asan.supp  -fmacro-prefix-map=/tarantool=. -std=c++11 -Wall -Wextra -fsanitize=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -fno-sanitize-recover=alignment,bool,bounds,builtin,enum,float-cast-overflow,float-divide-by-zero,function,integer-divide-by-zero,return,shift,unreachable,vla-bound -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Wno-cast-function-type -Werror

Steps to reproduce

docker run -it --rm tarantool/testing:ubuntu-jammy-clang16
git clone https://github.com/tarantool/tarantool.git
cd tarantool && git submodule update --recursive --init --jobs $(nproc)

Open .test.mk file and remove run-luajit-test target from L112. Save and exit.

CC=clang-16 CXX=clang++-16 TEST_RUN_EXTRA_PARAMS=app/fiber.test.lua make -f .test.mk test-release-asan

The command above runs one test only. If you want to get all failures, remove TEST_RUN_EXTRA_PARAMS=app/fiber.test.lua from the command.

Expected result: Test passed.

Actual result:

======================================================================================
WORKR TEST                                            PARAMS          RESULT
---------------------------------------------------------------------------------
[001]
[001] [Instance "app"] Failed to start: Broken tarantool console handshake
[001]
[001] [test-run server "app"] Last 15 lines of the log file /tmp/t/001_app/app.log:
[001]   fs       0x0                0
[001]   cr2      0x7f9294552f60     140267530563424
[001]   err      0x4                4
[001]   oldmask  0x0                0
[001]   trapno   0xe                14
[001] Current time: 1691493315
[001] Please file a bug at https://github.com/tarantool/tarantool/issues
[001] Attempting backtrace... Note: since the server has already crashed,
[001] this may fail as well
[001] #1  0x55e5a1f1581f in crash_collect+575
[001] #2  0x55e5a1f155a8 in crash_signal_cb+424
[001] #3  0x7f9298145520 in __sigaction+80
[001] #4  0x55e5a1f260f9 in fiber_yield+1113
[001] #5  0x55e5a1f2d973 in fiber_loop+2259
[001] #6  0x55e5a2860273 in coro_init+307
[001]
[001] [Instance "app"] Failed to start: Broken tarantool console handshake
---------------------------------------------------------------------------------
locker commented 1 year ago

To reproduce the issue, it's suffice to start tarantool in the interactive mode and then exit. The problem is buried somewhere in the on_shutdown triggers because disabling them helps.

diff --git a/src/main.cc b/src/main.cc
index 5c7587135915..40b89f6c9368 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -162,7 +162,7 @@ on_shutdown_f(va_list ap)
    while (!is_shutting_down)
        fiber_yield();

-   if (trigger_fiber_run(&box_on_shutdown_trigger_list, NULL,
+   if (0 && trigger_fiber_run(&box_on_shutdown_trigger_list, NULL,
                  on_shutdown_trigger_timeout) != 0) {
        say_error("on_shutdown triggers failed");
        diag_log();
Gumix commented 1 year ago

This change also helps...

diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index 958e4a02f..4ab7bd282 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -710,6 +710,7 @@ fiber_yield(void)
        cord->fiber = callee;
        callee->flags = (callee->flags & ~FIBER_IS_READY) | FIBER_IS_RUNNING;

+       __asm__ __volatile__("": : :"memory");
        ASAN_START_SWITCH_FIBER(asan_state,
                                (caller->flags & FIBER_IS_DEAD) == 0,
                                callee->stack,

This also helps:

diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index 958e4a02f..3e717d64e 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -711,7 +711,7 @@ fiber_yield(void)
        callee->flags = (callee->flags & ~FIBER_IS_READY) | FIBER_IS_RUNNING;

        ASAN_START_SWITCH_FIBER(asan_state,
-                               (caller->flags & FIBER_IS_DEAD) == 0,
+                               true,
                                callee->stack,
                                callee->stack_size);
        coro_transfer(&caller->ctx, &callee->ctx);

A similar crash: https://github.com/google/sanitizers/issues/189#issuecomment-1007798300

Gumix commented 1 year ago

So, what happens when Ctrl+D is pressed:

tarantool> ^D
fiber_recycle(103/interactive)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
==2744649==T0: FakeStack destroyed: 0: 594/8192; 1: 153/4096; 2: 74/2048; 3: 0/1024; 4: 30/512; 5: 0/256; 6: 4/128; 7: 0/64; 8: 87/32; 9: 0/16; 10: 0/8;
fiber_call_impl(1/sched -> 102/on_shutdown)
==2744649==T0: FakeStack created: 0x7f1c5027b000 -- 0x7f1c50800000 stack_size_log: 19; mmapped 5652K, noreserve=0
fiber_call_impl(102/on_shutdown -> 104/trigger_fiber0)
tarantool: src/lib/core/fiber.c:728: void fiber_yield(void): Assertion `asan_state != NULL' failed.
  1. The 103/interactive fiber is recycled and fiber_yield() is called to switch to the scheduler.
  2. In fiber_yield() ASAN allocates a fake stack by __asan_stack_malloc_0() for the stack-use-after-return detection.
  3. __sanitizer_start_switch_fiber() is called with fake_stack_save == NULL, because the fiber is marked as dead.
  4. __sanitizer_start_switch_fiber() destroys the fake stack.
  5. coro_transfer() switches to the scheduler, which switches to 102/on_shutdown.
  6. In trigger_fiber_run() ASAN allocates a new fake stack at the same addresses as [2].
  7. A new fiber 104/trigger_fiber0 is created from the recycled 103/interactive.
  8. 104/trigger_fiber0 is started, so now we are just after coro_transfer(), see [5].
  9. At the end of fiber_yield() ASAN tries to access the fake stack and crashes, because it was unmapped at [4] and recreated at [6].

With ASAN_START_SWITCH_FIBER(asan_state, true, ...) it doesn't crash, because the fake stack is not destroyed:

tarantool> ^D
fiber_recycle(103/interactive)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
fiber_call_impl(1/sched -> 102/on_shutdown)
==2750102==T0: FakeStack created: 0x7fb22f771000 -- 0x7fb22fcf6000 stack_size_log: 19; mmapped 5652K, noreserve=0
fiber_call_impl(102/on_shutdown -> 104/trigger_fiber0)
fiber_yield(104/session.shutdown, FIBER_IS_DEAD) -> 102/on_shutdown
fiber_recycle(104/session.shutdown)
fiber_recycle(102/on_shutdown)
fiber_yield(0/, FIBER_IS_DEAD) -> 1/sched
cord_destroy()

TLDR: We said to ASAN that we wouldn't switch back to a fiber, but switched to its recycled incarnation.