risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
6.88k stars 569 forks source link

ux: malfunctioning "graceful shutdown": pressing ctrl-c in playground or single-node terminates the process with panicking #17351

Open BugenZhao opened 3 months ago

BugenZhao commented 3 months ago
^C
2024-06-19T17:59:58.02596+08:00  INFO        rw-main risingwave_meta_node: receive ctrl+c
2024-06-19T17:59:58.025957+08:00  INFO           main risingwave_cmd_all::standalone: Ctrl+C received, now exiting
2024-06-19T17:59:58.026082+08:00  INFO        rw-main risingwave_meta_node::server: Shutting down services
2024-06-19T17:59:58.026072+08:00  INFO        rw-main risingwave_rpc_client::meta_client: Heartbeat loop is stopped

2024-06-19T17:59:58.029081+08:00  INFO        rw-main risingwave_rpc_client::meta_client: Heartbeat loop is stopped
2024-06-19T17:59:58.02926+08:00  INFO        rw-main risingwave_meta::manager::system_param: System params notifier is stopped
2024-06-19T17:59:58.029655+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker start cleaning up
2024-06-19T17:59:58.029702+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker finished cleaning up
2024-06-19T17:59:58.029733+08:00  INFO        rw-main risingwave_meta::manager::sink_coordination::manager: sink manager worker exited
2024-06-19T17:59:58.029791+08:00  INFO        rw-main risingwave_meta::hummock: Hummock version checkpoint is stopped
2024-06-19T17:59:58.029846+08:00  INFO        rw-main risingwave_meta::hummock: Vacuum metadata loop is stopped
2024-06-19T17:59:58.029915+08:00  INFO        rw-main risingwave_meta::hummock: Vacuum object loop is stopped
2024-06-19T17:59:58.030093+08:00  INFO        rw-main risingwave_meta::rpc::metrics: Worker number monitor is stopped
2024-06-19T17:59:58.030216+08:00  INFO        rw-main risingwave_meta::rpc::metrics: Fragment info monitor is stopped
2024-06-19T17:59:58.031265+08:00  INFO        rw-main risingwave_meta::hummock::manager::timer_task: Hummock timer loop is stopped

2024-06-19T17:59:58.033336+08:00 DEBUG ThreadId(57) risingwave_rpc_client::meta_client: matching tonic code Unknown error
2024-06-19T17:59:58.033468+08:00 DEBUG ThreadId(57) risingwave_rpc_client::meta_client: skipping the current refresh, somewhere else is already doing it
2024-06-19T17:59:58.033732+08:00 DEBUG ThreadId(56) risingwave_rpc_client::meta_client: matching tonic code Unknown error
2024-06-19T17:59:58.033922+08:00 DEBUG ThreadId(56) risingwave_rpc_client::meta_client: skipping the current refresh, somewhere else is already doing it

thread 'rw-main' panicked at src/compute/src/lib.rs:227:31:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(178))
thread 'rw-main' panicked at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/shutdown.rs:51:21:
Cannot drop a runtime in a context where blocking is not allowed. This happens when a runtime is dropped from within an asynchronous context.
stack backtrace:
2024-06-19T17:59:58.036347+08:00  WARN        rw-main risingwave_storage::hummock::local_version::pinned_version: failed to send req unpin version id: 908
2024-06-19T17:59:58.03839+08:00 ERROR        rw-main risingwave_storage::hummock::store::hummock_storage: unable to send shutdown event=HummockEvent { debug_string: "Shutdown" }

   0: rust_begin_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panicking.rs:72:14
   2: core::result::unwrap_failed
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/result.rs:1654:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/result.rs:1077:23
   4: risingwave_compute::start::{{closure}}
             at ./src/compute/src/lib.rs:227:13
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/future/future.rs:123:9
   6: risingwave_cmd_all::standalone::standalone::{{closure}}::{{closure}}
             at ./src/cmd_all/src/standalone.rs:218:89
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
   8: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
   9: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
  10: tokio::runtime::task::core::Core<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:13
  11: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panic/unwind_safe.rs:272:9
  13: std::panicking::try::do_call
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:552:40
  14: ___rust_try
  15: std::panicking::try
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:516:19
  16: std::panic::catch_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panic.rs:146:14
  17: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
  18: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
  19: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
  20: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:271:5
  21: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
  22: tokio::runtime::task::LocalNotified<S>::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:427:9
  23: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:639:17
  24: tokio::runtime::coop::with_budget
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:107:5
  25: tokio::runtime::coop::budget
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/coop.rs:73:5
  26: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:575:9
  27: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:526:24
  28: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:491:21
  29: tokio::runtime::context::scoped::Scoped<T>::set
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/scoped.rs:40:9
  30: tokio::runtime::context::set_scheduler::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context.rs:176:26
  31: std::thread::local::LocalKey<T>::try_with
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/thread/local.rs:284:16
  32: std::thread::local::LocalKey<T>::with
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/thread/local.rs:260:9
  33: tokio::runtime::context::set_scheduler
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context.rs:176:9
  34: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:486:9
  35: tokio::runtime::context::runtime::enter_runtime
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/context/runtime.rs:65:16
  36: tokio::runtime::scheduler::multi_thread::worker::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:478:5
  37: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/scheduler/multi_thread/worker.rs:447:45
  38: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/task.rs:42:21
  39: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  40: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:328:17
  41: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/loom/std/unsafe_cell.rs:16:9
  42: tokio::runtime::task::core::Core<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/core.rs:317:13
  43: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:485:19
  44: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/core/src/panic/unwind_safe.rs:272:9
  45: std::panicking::try::do_call
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:552:40
  46: ___rust_try
  47: std::panicking::try
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panicking.rs:516:19
  48: std::panic::catch_unwind
             at /rustc/4a0cc881dcc4d800f10672747f61a94377ff6662/library/std/src/panic.rs:146:14
  49: tokio::runtime::task::harness::poll_future
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:473:18
  50: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:208:27
  51: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:153:15
  52: tokio::runtime::task::raw::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:271:5
  53: tokio::runtime::task::raw::RawTask::poll
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/raw.rs:201:18
  54: tokio::runtime::task::UnownedTask<S>::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/mod.rs:464:9
  55: tokio::runtime::blocking::pool::Task::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:159:9
  56: tokio::runtime::blocking::pool::Inner::run
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:513:17
  57: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /Users/bugenzhao/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/blocking/pool.rs:471:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Two panics will occur when pressing Ctrl-C, resulting in an ungraceful exit process. We should consider improve this user experience.

BugenZhao commented 3 months ago

The cause is that we override the system behavior of killing the process immediately on SIGINT by handling them with tokio::signal::ctrl_c in order to implement graceful shutdown, but not actually comprehensive or correct.

I'm wondering that if some sort of graceful shutdown is really necessary in our system. What about simply letting the components be killed by the system? As...

cc @zwang28 @yezizp2012

yezizp2012 commented 3 months ago

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

BugenZhao commented 3 months ago

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

I guess madsim kill is still like a soft killing that triggers ctrl-c signal and runs destructors? @wangrunji0408 👀

yezizp2012 commented 3 months ago

What about simply letting the components be killed by the system?

I think it's acceptable since this has always been done in the madsim recovery test.

I guess madsim kill is still like a soft killing that triggers ctrl-c signal and runs destructors? @wangrunji0408 👀

IIUC, madsim only issue a soft killing to shutdown the cluster after all tests are completed. During recovery it's hard killing.

BugenZhao commented 3 months ago

I think graceful shutdown still has some benefits. For example, a compute node can actively tell the meta node to remove it when being killed, which is more responsive compared to the current approach that meta node passively finds that based on the timeout of heartbeat messages. (manual operation is required now: https://docs.risingwave.com/docs/current/k8s-cluster-scaling/#scale-in)

However, what we're doing now for graceful shutdown seems not to help at all, that is, shutting down each manager/worker one by one without doing any extra clean-ups. This can be replaced by exiting the process directly without any loss but make the code cleaner.

BugenZhao commented 3 months ago

Also link to https://github.com/risingwavelabs/risingwave/pull/7675 which added the tests for graceful shutdown under simulation. cc @wangrunji0408 Could you share some updates on this (if any)? For example, is it still used or enforced currently?

BugenZhao commented 1 month ago

Only clean-up & refactor tasks remaining.