scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

test_crash_coordinator_before_streaming.test_kill_coordinator_during_op.debug.1 times out #21114

Open xemul opened 6 days ago

xemul commented 6 days ago

https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/12400/

...
        # kill coordinator during bootstrap
        logger.debug("Kill coordinator during bootstrap")
        nodes = await manager.running_servers()
        coordinator_host = await get_coordinator_host(manager)
        other_nodes = [srv for srv in nodes if srv.server_id != coordinator_host.server_id]
        new_node = await manager.server_add(start=False)
        await manager.api.enable_injection(coordinator_host.ip_addr, "crash_coordinator_before_stream", one_shot=True)
        await manager.server_start(new_node.server_id,
                                   expected_error="Startup failed: std::runtime_error")
        await wait_new_coordinator_elected(manager, 4, time.time() + 60)
>       await manager.server_restart(coordinator_host.server_id, wait_others=1)

test/topology_experimental_raft/test_crash_coordinator_before_streaming.py:99: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test/pylib/manager_client.py:238: in server_restart
    await self.server_start(server_id=server_id, wait_others=wait_others, wait_interval=wait_interval)
test/pylib/manager_client.py:226: in server_start
    await self.client.put_json(f"/cluster/server/{server_id}/start", data, timeout=timeout)
test/pylib/rest_client.py:114: in put_json
    ret = await self._fetch("PUT", resource_uri, response_type = response_type, host = host,
test/pylib/rest_client.py:65: in _fetch
    async with request(method, uri,
/usr/lib64/python3.12/site-packages/aiohttp/client.py:1246: in __aenter__
    self._resp = await self._coro
/usr/lib64/python3.12/site-packages/aiohttp/client.py:608: in _request
    await resp.start(conn)
/usr/lib64/python3.12/site-packages/aiohttp/client_reqrep.py:971: in start
    with self._timer:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiohttp.helpers.TimerContext object at 0x7f49a9952000>
exc_type = <class 'asyncio.exceptions.CancelledError'>
exc_val = CancelledError(), exc_tb = <traceback object at 0x7f49a853d740>

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        if self._tasks:
            self._tasks.pop()

        if exc_type is asyncio.CancelledError and self._cancelled:
>           raise asyncio.TimeoutError from None
E           TimeoutError
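
For context on how a hung `PUT /cluster/server/{server_id}/start` request surfaces as a `TimeoutError`: the `__exit__` shown in the traceback converts a `CancelledError` into a timeout when the timer itself cancelled the task. Below is a minimal asyncio-only sketch of that pattern. `MiniTimer`, `slow_request` and `run` are hypothetical names used only for illustration, and this is not aiohttp's actual implementation.

```python
import asyncio
from typing import Optional


class MiniTimer:
    """Hypothetical, simplified timer context (not aiohttp's TimerContext)."""

    def __init__(self, timeout: float) -> None:
        self._timeout = timeout
        self._cancelled = False
        self._handle: Optional[asyncio.TimerHandle] = None

    def _on_timeout(self, task: asyncio.Task) -> None:
        # Remember that the cancellation came from the timer, then cancel.
        self._cancelled = True
        task.cancel()

    def __enter__(self) -> "MiniTimer":
        loop = asyncio.get_running_loop()
        task = asyncio.current_task()
        self._handle = loop.call_later(self._timeout, self._on_timeout, task)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> Optional[bool]:
        if self._handle is not None:
            self._handle.cancel()
        # Same check as in the traceback: only a timer-initiated
        # cancellation is reported as a timeout.
        if exc_type is asyncio.CancelledError and self._cancelled:
            raise asyncio.TimeoutError from None
        return None


async def slow_request(delay: float, timeout: float) -> str:
    # asyncio.sleep stands in for a server that takes `delay` seconds to answer.
    with MiniTimer(timeout):
        await asyncio.sleep(delay)
    return "done"


def run(delay: float, timeout: float) -> str:
    try:
        return asyncio.run(slow_request(delay, timeout))
    except asyncio.TimeoutError:
        return "timeout"
```

In this sketch `run(0.01, 1.0)` completes normally, while `run(1.0, 0.01)` reports a timeout. So the `TimeoutError` above only says that the start request did not return in time; it does not say why the node failed to come up.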

Several nodes have this crash in their logs:

...
DEBUG 2024-10-15 00:24:19,389 [shard 0: gms] raft_topology - obtaining group 0 guard...
DEBUG 2024-10-15 00:24:19,397 [shard 0: gms] raft_topology - guard taken, prev_state_id: ac4539fa-8a72-11ef-f9cf-e340a1016174, new_state_id: ac55bb36-8a72-11ef-9399-9be3ab08e1b9, coordinator term: 3, current Raft term: 3
Aborting on shard 0, in scheduling group gossip.
Backtrace:
  0x3058755
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3832e17
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3832aa5
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3674aeb
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x36aa265
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3798be9
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3798f17
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x3798d3e
  /lib64/libc.so.6+0x40cff
  /lib64/libc.so.6+0x99663
  /lib64/libc.so.6+0x40c4d
  /lib64/libc.so.6+0x28901
  0x8f17db2
  0x8e91f30
  0x433d036
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x36974ce
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x36a34bd
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x36a8990
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x36a6969
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x31f86f4
  /jenkins/workspace/scylla-master/scylla-ci/scylla/build/debug/seastar/libseastar.so+0x31f508a
  0x30f231b
  0x30efef3
  /lib64/libc.so.6+0x2a087
  /lib64/libc.so.6+0x2a14a
  0x30144a4

Decoded:

Backtrace:
[Backtrace #0]
__interceptor_backtrace at /mnt/clang_build/llvm-project-x86_64/compiler-rt/lib/asan/../sanitizer_common/sanitizer_common_interceptors.inc:4358
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/debug/seastar/./seastar/include/seastar/util/backtrace.hh:68
seastar::backtrace_buffer::append_backtrace() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:825
seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:858
seastar::print_with_backtrace(char const*, bool) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:870
seastar::sigabrt_action() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:4003
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3980
seastar::install_oneshot_signal_handler<6, (void (*)())(&seastar::sigabrt_action)>()::{lambda(int, siginfo_t*, void*)#1}::__invoke(int, siginfo_t*, void*) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3975
/data/scylla-s3-reloc.cache/by-build-id/0f90c478f81b9f37cd9245207e200a3ec986cc50/extracted/scylla/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=77c77fee058b19c6f001cf2cb0371ce3b8341211, for GNU/Linux 3.2.0, not stripped

__GI___sigaction at :?
__pthread_kill_implementation at ??:?
__GI_raise at :?
__GI_abort at :?
service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872
 (inlined by) void std::__invoke_impl<void, service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}&>(std::__invoke_other, service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}&>, void>::type std::__invoke_r<void, service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}&>(service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
utils::error_injection<true>::inject(std::basic_string_view<char, std::char_traits<char> > const&, std::function<void ()>) at ././utils/error_injection.hh:373
 (inlined by) service::topology_coordinator::handle_topology_transition(service::group0_guard) at ././service/topology_coordinator.cc:1872
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<bool>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<bool>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:2621
seastar::reactor::run_some_tasks() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3087
seastar::reactor::do_run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3255
seastar::reactor::run() at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/reactor.cc:3145
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:276
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/debug/seastar/./build/debug/seastar/./seastar/src/core/app-template.cc:167
scylla_main(int, char**) at ././main.cc:703
main at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
__libc_start_call_main at ??:?
__libc_start_main_alias_2 at :?
_start at ??:?

kbr-scylla commented 6 days ago

several nodes have this crash in their logs

service::topology_coordinator::handle_topology_transition(service::group0_guard)::{lambda()#1}::operator()() const at ././service/topology_coordinator.cc:1872

Nothing unexpected here:

                utils::get_local_injector().inject("crash_coordinator_before_stream", [] { abort(); });
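
To illustrate why the abort is expected, here is a rough Python model of a one-shot injection point, as enabled by the test with `one_shot=True`. `InjectionRegistry`, `SimulatedCrash`, `crash` and `transition_step` are hypothetical names; this is only a sketch of the idea, not Scylla's `utils::error_injection` code, and it raises an exception where the real handler calls `abort()`.

```python
from typing import Callable, Dict


class InjectionRegistry:
    """Hypothetical model of named injection points (not Scylla's code)."""

    def __init__(self) -> None:
        # Maps injection name -> whether it is one-shot.
        self._enabled: Dict[str, bool] = {}

    def enable(self, name: str, one_shot: bool = False) -> None:
        self._enabled[name] = one_shot

    def inject(self, name: str, handler: Callable[[], None]) -> None:
        # Disabled injections are a no-op.
        if name not in self._enabled:
            return
        # A one-shot injection disarms itself before firing.
        if self._enabled[name]:
            del self._enabled[name]
        handler()


class SimulatedCrash(Exception):
    """Stands in for abort() so the sketch can be exercised safely."""


def crash() -> None:
    raise SimulatedCrash("crash_coordinator_before_stream")


def transition_step(injector: InjectionRegistry) -> str:
    # Assumed placement: the injection point sits just before streaming.
    injector.inject("crash_coordinator_before_stream", crash)
    return "streaming started"
```

With the injection disabled, `transition_step` returns normally. After `enable(..., one_shot=True)` the first call raises `SimulatedCrash`, and later calls on the same registry pass through again.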

kbr-scylla commented 6 days ago

test_kill_coordinator_during_op.1.debug.1.zip

node-4165 was trying to restart. It began restarting at 00:24:27:

WARN  2024-10-15 00:24:27,745 seastar - Seastar compiled with default allocator, --memory option won't take effect

The last log message was at 00:24:40:

INFO  2024-10-15 00:24:40,859 [shard 0:comp] compaction - [Compact system.peers b91439b0-8a72-11ef-834b-50ec3982def2] Compacted 4 sstables to [/scylladir/testlog/x86_64/debug/scylla-4165/data/system/peers-37f71aca7dc2383ba70672528af04d4f/me-3gkd_1nh4_4p3yo1zvma01fxspv6-big-Data.db:level=0]. 38kB to 9489 bytes (~24% of original) in 68ms = 562kB/s. ~512 total partitions merged to 5.

So it got stuck 13 seconds into the restart procedure.
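
The 13-second figure can be checked from the two timestamps quoted above. This is a small sketch; `seconds_between` is a hypothetical helper, and the format string assumes the `YYYY-MM-DD HH:MM:SS,mmm` layout seen in these log lines.

```python
from datetime import datetime

# Timestamp layout used in the log lines quoted in this thread.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end`."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


restart_began = "2024-10-15 00:24:27,745"  # first WARN line of the restart
last_message = "2024-10-15 00:24:40,859"   # final compaction INFO line
elapsed = seconds_between(restart_began, last_message)
print(f"{elapsed:.3f} s")  # prints "13.114 s"
```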

Last message from storage_service is:

INFO  2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode

Less than a second earlier it started reloading topology state:

DEBUG 2024-10-15 00:24:37,989 [shard 0: gms] raft_topology - reload raft topology state

It's unclear whether this operation finished; did it get stuck here?
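
A quick way to find the last message from each logger in a node log is sketched below. The regular expression is an assumption based only on the line shapes quoted in this thread (an optional `[shard N:group]` field before the logger name), and `last_by_logger` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed line shape: LEVEL, timestamp, optional [shard N:group], logger, message.
LOG_LINE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?:\[shard\s*\d+:\s*\w+\]\s+)?"
    r"(?P<logger>\S+) - (?P<msg>.*)$"
)


def last_by_logger(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each logger name to the (timestamp, message) of its last line."""
    last: Dict[str, Tuple[str, str]] = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            last[m["logger"]] = (m["ts"], m["msg"])
    return last


# Lines copied from the comment above.
sample = [
    "WARN  2024-10-15 00:24:27,745 seastar - Seastar compiled with default allocator, --memory option won't take effect",
    "DEBUG 2024-10-15 00:24:37,989 [shard 0: gms] raft_topology - reload raft topology state",
    "INFO  2024-10-15 00:24:38,026 [shard 0:strm] storage_service - The node is already in group 0 and will restart in raft mode",
]
result = last_by_logger(sample)
print(result["raft_topology"])
```

Running this over the full node-4165 log would show whether anything from `raft_topology` was logged after the reload started, which would help tell a stuck reload apart from one that finished quietly.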

Annamikhlin commented 5 days ago

Also seen on https://jenkins.scylladb.com/job/scylla-enterprise/job/next/3179/testReport/junit/(root)/test_crash_coordinator_before_streaming/Build___x86___Unit_Tests_x86___test_kill_coordinator_during_op_debug_3_2/