risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
7k stars 576 forks

frontend-node, compute-node-0 and compactor-0 suddenly crashed #12074

Closed QuantumBear closed 1 year ago

QuantumBear commented 1 year ago

### Describe the bug

No operation was in progress, but frontend-node-0, compute-node-0, and compactor-0 suddenly crashed.

NAME                IMAGE                                                COMMAND                   SERVICE             CREATED             STATUS                       PORTS
compactor-0         ghcr.io/risingwavelabs/risingwave:nightly-20230828   "/risingwave/bin/ris…"    compactor-0         About an hour ago   Exited (1) 12 minutes ago
compute-node-0      ghcr.io/risingwavelabs/risingwave:nightly-20230828   "/risingwave/bin/ris…"    compute-node-0      About an hour ago   Exited (1) 12 minutes ago
connector-node      ghcr.io/risingwavelabs/risingwave:nightly-20230828   "/risingwave/bin/con…"    connector-node      About an hour ago   Up About an hour             0.0.0.0:53393->50051/tcp, 0.0.0.0:53394->50052/tcp
etcd-0              quay.io/coreos/etcd:v3.5.7                           "/usr/local/bin/etcd…"    etcd-0              About an hour ago   Up About an hour (healthy)   2379-2380/tcp, 0.0.0.0:2388-2389->2388-2389/tcp
frontend-node-0     ghcr.io/risingwavelabs/risingwave:nightly-20230828   "/risingwave/bin/ris…"    frontend-node-0     About an hour ago   Exited (1) 12 minutes ago
grafana-0           grafana/grafana-oss:latest                           "/run.sh"                 grafana-0           About an hour ago   Up About an hour (healthy)   3000/tcp, 0.0.0.0:3001->3001/tcp
message_queue       docker.vectorized.io/vectorized/redpanda:latest      "/entrypoint.sh redp…"    message_queue       About an hour ago   Up About an hour (healthy)   0.0.0.0:8081->8081/tcp, 0.0.0.0:9092->9092/tcp, 0.0.0.0:9644->9644/tcp, 0.0.0.0:29092->29092/tcp, 8082/tcp
meta-node-0         ghcr.io/risingwavelabs/risingwave:nightly-20230828   "/risingwave/bin/ris…"    meta-node-0         About an hour ago   Up About an hour (healthy)   1250/tcp, 0.0.0.0:5690-5691->5690-5691/tcp
minio-0             quay.io/minio/minio:latest                           "/bin/sh -c '\nset -e…"   minio-0             About an hour ago   Up About an hour (healthy)   0.0.0.0:9301->9301/tcp, 9000/tcp, 0.0.0.0:9400->9400/tcp
prometheus-0        prom/prometheus:latest                               "/bin/prometheus --c…"    prometheus-0        About an hour ago   Up About an hour (healthy)   9090/tcp, 0.0.0.0:9500->9500/tcp

### Error message/log

log in compactor

2023-09-05T03:40:39.077988408Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 3, worker not found
2023-09-05T03:40:39.078039953Z ERROR risingwave_common_service::observer_manager: Stream of notification terminated.

log in compute-node

2023-09-05T03:40:39.134634829Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 2, worker not found
2023-09-05T03:40:39.134767658Z ERROR risingwave_common_service::observer_manager: Stream of notification terminated.

log in frontend-node

2023-09-05T03:40:39.032757743Z ERROR risingwave_common_service::observer_manager: Stream of notification terminated.
2023-09-05T03:40:39.067245561Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 2001, worker not found
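
To see how tightly the three `worker expired` errors above cluster, the lines can be parsed with a short script. This is only a sketch: the regular expression and the `parse` helper are illustrative assumptions that fit the log lines quoted in this report, not a general RisingWave log parser.

```python
import re
from datetime import datetime, timezone

# The three "worker expired" lines quoted above (compactor, compute-node, frontend-node).
LOG_LINES = [
    "2023-09-05T03:40:39.077988408Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 3, worker not found",
    "2023-09-05T03:40:39.134634829Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 2, worker not found",
    "2023-09-05T03:40:39.067245561Z ERROR risingwave_rpc_client::meta_client: worker expired: Invalid worker: 2001, worker not found",
]

# Timestamp (seconds part), fractional digits, then the worker ID.
PATTERN = re.compile(
    r"^(\S+?)\.(\d+)Z\s+ERROR\s+\S+ worker expired: Invalid worker: (\d+),"
)


def parse(line):
    """Return (worker_id, timestamp_in_nanoseconds) or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    # datetime only keeps microseconds, so the nanosecond fraction is handled by hand.
    frac = m.group(2).ljust(9, "0")[:9]
    nanos = int(base.timestamp()) * 1_000_000_000 + int(frac)
    return int(m.group(3)), nanos


events = [e for e in map(parse, LOG_LINES) if e is not None]
worker_ids = sorted(worker_id for worker_id, _ in events)
timestamps = [ts for _, ts in events]
spread_ms = (max(timestamps) - min(timestamps)) / 1_000_000

print(f"workers {worker_ids} reported expiry within {spread_ms:.1f} ms of each other")
```

All three workers report the error within well under a second of each other, which points at a single event on the meta side rather than three independent failures.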

meta-node didn't crash, but its log contains an exception backtrace

2023-09-05T03:40:39.60402545Z  WARN risingwave_meta::barrier: Failed to complete epoch 5023305923756032: Rpc error: gRPC error (Unknown error): transport error
  backtrace of `MetaError`:
   0: <risingwave_meta::error::MetaError as core::convert::From<risingwave_meta::error::MetaErrorInner>>::from
             at ./risingwave/src/meta/src/error.rs:87:33
   1: <T as core::convert::Into<U>>::into
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/convert/mod.rs:717:9
   2: <risingwave_meta::error::MetaError as core::convert::From<risingwave_rpc_client::error::RpcError>>::from
             at ./risingwave/src/meta/src/error.rs:174:37
   3: <T as core::convert::Into<U>>::into
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/convert/mod.rs:717:9
   4: core::ops::function::FnOnce::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/ops/function.rs:250:5
   5: core::result::Result<T,E>::map_err
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/result.rs:828:27
   6: risingwave_meta::barrier::GlobalBarrierManager<S>::collect_barrier::{{closure}}
             at ./risingwave/src/meta/src/barrier/mod.rs:822:22
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   8: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/core.rs:334:17
   9: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/loom/std/unsafe_cell.rs:16:9
  10: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/core.rs:323:30
  11: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:485:19
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/panic/unwind_safe.rs:271:9
  13: std::panicking::try::do_call
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:500:40
  14: std::panicking::try
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:464:19
  15: std::panic::catch_unwind
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panic.rs:142:14
  16: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:473:18
  17: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:208:27
  18: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:153:15
  19: tokio::runtime::task::raw::RawTask::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/raw.rs:200:18
  20: tokio::runtime::task::LocalNotified<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/mod.rs:400:9
  21: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:639:22
  22: tokio::runtime::coop::with_budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/coop.rs:107:5
  23: tokio::runtime::coop::budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/coop.rs:73:5
  24: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:575:9
  25: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:538:24
  26: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:491:21
  27: tokio::runtime::context::scoped::Scoped<T>::set
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/context/scoped.rs:40:9
  28: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:486:9
  29: tokio::runtime::context::runtime::enter_runtime
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/context/runtime.rs:65:16
  30: tokio::runtime::scheduler::multi_thread::worker::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:478:5
  31: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/scheduler/multi_thread/worker.rs:447:45
  32: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/blocking/task.rs:42:21
  33: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  34: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/core.rs:334:17
  35: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/loom/std/unsafe_cell.rs:16:9
  36: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/core.rs:323:30
  37: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:485:19
  38: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/panic/unwind_safe.rs:271:9
  39: std::panicking::try::do_call
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:500:40
  40: std::panicking::try
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:464:19
  41: std::panic::catch_unwind
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panic.rs:142:14
  42: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:473:18
  43: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:208:27
  44: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/harness.rs:153:15
  45: tokio::runtime::task::raw::RawTask::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/raw.rs:200:18
  46: tokio::runtime::task::UnownedTask<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/task/mod.rs:437:9
  47: tokio::runtime::blocking::pool::Task::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/blocking/pool.rs:159:9
  48: tokio::runtime::blocking::pool::Inner::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/blocking/pool.rs:513:17
  49: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.31.0/src/runtime/blocking/pool.rs:471:13
  50: std::sys_common::backtrace::__rust_begin_short_backtrace
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/sys_common/backtrace.rs:135:18
  51: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/thread/mod.rs:529:17
  52: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/panic/unwind_safe.rs:271:9
  53: std::panicking::try::do_call
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:500:40
  54: std::panicking::try
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panicking.rs:464:19
  55: std::panic::catch_unwind
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/panic.rs:142:14
  56: std::thread::Builder::spawn_unchecked_::{{closure}}
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/thread/mod.rs:528:30
  57: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/core/src/ops/function.rs:250:5
  58: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/alloc/src/boxed.rs:1985:9
  59: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/alloc/src/boxed.rs:1985:9
  60: std::sys::unix::thread::Thread::new::thread_start
             at ./rustc/f0411ffcebcd7f75ac02ed45feb53ffd07b75398/library/std/src/sys/unix/thread.rs:108:17
  61: start_thread
             at ./nptl/pthread_create.c:442:8
  62: clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

2023-09-05T03:40:39.775946244Z  INFO failure_recovery{err=Rpc error: gRPC error (Unknown error): transport error prev_epoch=5023305792684032}: risingwave_meta::barrier::recovery: recovery start!
2023-09-05T03:40:39.777269411Z  INFO risingwave_meta::manager::sink_coordination::manager: sink manager worker start cleaning up
2023-09-05T03:40:39.777288204Z  INFO risingwave_meta::manager::sink_coordination::manager: sink manager worker finished cleaning up
2023-09-05T03:40:39.777301182Z  INFO failure_recovery{err=Rpc error: gRPC error (Unknown error): transport error prev_epoch=5023305792684032}: risingwave_meta::manager::sink_coordination::manager: successfully stop coordinator: None

### To Reproduce

_No response_

### Expected behavior

_No response_

### How did you deploy RisingWave?

_No response_

### The version of RisingWave

dev=> select version();
version

PostgreSQL 9.5-RisingWave-1.1.0-alpha (5818f62f6a1c8001f371db4c4e74534a505a10cd)



### Additional context

_No response_

BugenZhao commented 1 year ago

It seems all connections between worker nodes are closed unexpectedly. Did you deploy the cluster on a personal computer? Did it lock the screen or enter sleep mode at that time?
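
To illustrate why a suspended host could produce `worker not found` on every node at once, here is a toy heartbeat model. It is a sketch under stated assumptions, not RisingWave's actual implementation: the `HeartbeatRegistry` class, its method names, the 30-second TTL, and the 10-minute suspension are all hypothetical. Whether suspended time counts toward expiry in a real deployment also depends on which clock the service uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatRegistry:
    """Toy model of a meta service that expires silent workers (hypothetical)."""

    ttl_secs: float
    last_seen: Dict[int, float] = field(default_factory=dict)

    def register(self, worker_id: int, now: float) -> None:
        self.last_seen[worker_id] = now

    def heartbeat(self, worker_id: int, now: float) -> bool:
        """Return False if the worker is unknown, i.e. it has already been expired."""
        if worker_id not in self.last_seen:
            return False
        self.last_seen[worker_id] = now
        return True

    def expire(self, now: float) -> List[int]:
        """Drop and return every worker whose last heartbeat is older than the TTL."""
        stale = sorted(w for w, t in self.last_seen.items() if now - t > self.ttl_secs)
        for w in stale:
            del self.last_seen[w]
        return stale


registry = HeartbeatRegistry(ttl_secs=30.0)
workers = [2, 3, 2001]  # worker IDs taken from the logs in this issue
for w in workers:
    registry.register(w, now=0.0)

# Normal operation: a heartbeat every second keeps all workers alive.
for t in range(1, 11):
    for w in workers:
        registry.heartbeat(w, now=float(t))
still_alive_check = registry.expire(now=10.0)

# The host is suspended for 10 minutes, so no heartbeats arrive in the meantime.
wake_up = 10.0 + 600.0
expired = registry.expire(now=wake_up)
# After wake-up, each worker's next heartbeat is rejected as "not found".
rejected = [w for w in workers if not registry.heartbeat(w, now=wake_up)]

print("expired:", expired, "rejected:", rejected)
```

In this model every worker is dropped in the same expiry pass, so all of them see the rejection at nearly the same moment once they resume sending heartbeats.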

QuantumBear commented 1 year ago

No, I was working on other stuff at that moment.

fuyufjh commented 1 year ago

Please feel free to reopen if it recurs.