risingwavelabs / risingwave

Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
https://go.risingwave.com/slack
Apache License 2.0
6.78k stars 561 forks source link

The sink actor occasionally encounters errors during the process of scale-in #13787

Open shanicky opened 9 months ago

shanicky commented 9 months ago

Describe the bug

No response

Error message/log

gRPC request to stream service failed: Internal error: Actor 10548 exit unexpectedly: Executor error: Sink error: Remote sink error: INVALID_ARGUMENT: invalid epoch: new epoch ID should be larger than current epoch
  backtrace of inner error:
   0: anyhow::kind::Adhoc::new
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/anyhow-1.0.75/src/kind.rs:71:36
   1: risingwave_jni_core::Java_com_risingwave_java_binding_Binding_sendSinkWriterErrorToChannel::{{closure}}
             at ./risingwave/src/jni_core/src/lib.rs:954:50
   2: risingwave_jni_core::execute_and_catch::{{closure}}
             at ./risingwave/src/jni_core/src/lib.rs:206:59
   3: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
   4: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
   5: std::panicking::try
 at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
   6: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
   7: risingwave_common::util::panic::rw_catch_unwind::{{closure}}
             at ./risingwave/src/common/src/util/panic.rs:44:9
   8: tokio::task::task_local::LocalKey<T>::scope_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:217:19
   9: tokio::task::task_local::LocalKey<T>::sync_scope
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:173:15
  10: risingwave_common::util::panic::rw_catch_unwind
             at ./risingwave/src/common/src/util/panic.rs:42:5
  11: risingwave_jni_core::execute_and_catch
             at ./risingwave/src/jni_core/src/lib.rs:206:11
  12: Java_com_risingwave_java_binding_Binding_sendSinkWriterErrorToChannel
             at ./risingwave/src/jni_core/src/lib.rs:948:5
  13: <unknown>

  backtrace of `StreamError`:
   0: <risingwave_stream::error::Inner as core::convert::From<risingwave_stream::error::ErrorKind>>::from
             at ./risingwave/src/stream/src/error.rs:37:10
   1: <T as core::convert::Into<U>>::into
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/convert/mod.rs:757:9
   2: <risingwave_stream::error::StreamError as core::convert::From<risingwave_stream::error::ErrorKind>>::from
             at ./risingwave/src/stream/src/error.rs:110:34
   3: <T as core::convert::Into<U>>::into
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/convert/mod.rs:757:9
   4: <risingwave_stream::error::StreamError as core::convert::From<risingwave_stream::executor::error::StreamExecutorError>>::from
             at ./risingwave/src/stream/src/error.rs:139:36
   5: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/result.rs:1963:27
   6: <risingwave_stream::executor::dispatch::DispatchExecutor as risingwave_stream::executor::StreamConsumer>::execute::{{closure}}
             at ./risingwave/src/stream/src/executor/dispatch.rs:318:36
   7: <futures_async_stream::try_stream::GenTryStream<G> as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-async-stream-0.2.9/src/lib.rs:506:33
   8: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.28/src/stream.rs:120:9
   9: <&mut S as futures_core::stream::Stream>::poll_next
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-core-0.3.28/src/stream.rs:104:9
  10: <tokio_stream::stream_ext::next::Next<St> as core::future::future::Future>::poll
             at ./root/.cargo/git/checkouts/tokio-968c02b7a1a41bea/fe39bb8/tokio-stream/src/stream_ext/next.rs:42:9
  11: <tokio_stream::stream_ext::try_next::TryNext<St> as core::future::future::Future>::poll
             at ./root/.cargo/git/checkouts/tokio-968c02b7a1a41bea/fe39bb8/tokio-stream/src/stream_ext/try_next.rs:43:9
  12: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  13: <await_tree::future::Instrumented<F,_> as core::future::future::Future>::poll
    at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/await-tree-0.1.1/src/future.rs:124:23
  14: risingwave_stream::executor::actor::Actor<C>::run_consumer::{{closure}}
             at ./risingwave/src/stream/src/executor/actor.rs:196:18
  15: <tokio::future::maybe_done::MaybeDone<Fut> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/future/maybe_done.rs:68:48
  16: risingwave_stream::executor::actor::Actor<C>::run::{{closure}}::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/macros/join.rs:126:24
  17: <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/future/poll_fn.rs:58:9
  18: risingwave_stream::executor::actor::Actor<C>::run::{{closure}}
             at ./risingwave/src/stream/src/executor/actor.rs:155:9
  19: risingwave_stream::task::stream_manager::LocalStreamManagerCore::build_actors::{{closure}}::{{closure}}
             at ./risingwave/src/stream/src/task/stream_manager.rs:698:51
  20: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:347:31
  21: tokio::task::task_local::LocalKey<T>::scope_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:217:19
  22: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/task/task_local.rs:343:19
  23: await_tree::registry::TreeRoot::instrument::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/await-tree-0.1.1/src/registry.rs:51:45
  24: <futures_util::future::either::Either<A,B> as core::future::future::Future>::poll
     at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/either.rs:109:32
  25: core::ops::function::FnOnce::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/ops/function.rs:250:5
  26: tokio_metrics::task::instrument_poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-metrics-0.3.0/src/task.rs:2530:15
  27: <tokio_metrics::task::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-metrics-0.3.0/src/task.rs:2430:9
  28: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  29: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:334:17
  30: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/loom/std/unsafe_cell.rs:16:9
  31: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:323:13
  32: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:485:19
  33: <core::panic::unwind_safe::AssertUnwindSafe<F> ascore::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  34: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  35: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  36: std::panic::catch_unwind
     at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  37: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:473:18
  38: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:208:27
  39: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:153:15
  40: tokio::runtime::task::raw::RawTask::poll
   at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/raw.rs:200:18
  41: tokio::runtime::task::LocalNotified<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/mod.rs:400:9
  42: tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:576:18
  43: tokio::runtime::coop::with_budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:107:5
  44: tokio::runtime::coop::budget
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/coop.rs:73:5
  45: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:575:9
  46: tokio::runtime::scheduler::multi_thread::worker::Context::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:526:24
  47: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:491:21
  48: tokio::runtime::context::scoped::Scoped<T>::set
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/scoped.rs:40:9
  49: tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:486:9
  50: tokio::runtime::context::runtime::enter_runtime
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/context/runtime.rs:65:16
  51: tokio::runtime::scheduler::multi_thread::worker::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:478:5
  52: tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/scheduler/multi_thread/worker.rs:447:45
  53: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/task.rs:42:21
  54: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  55: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:334:17
  56: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/loom/std/unsafe_cell.rs:16:9
  57: tokio::runtime::task::core::Core<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/core.rs:323:13
  58: tokio::runtime::task::harness::poll_future::{{closure}}
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:485:19
  59: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  60: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  61: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  62: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  63: tokio::runtime::task::harness::poll_future
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:473:18
  64: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:208:27
  65: tokio::runtime::task::harness::Harness<T,S>::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/harness.rs:153:15
  66: tokio::runtime::task::raw::RawTask::poll
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/raw.rs:200:18
  67: tokio::runtime::task::UnownedTask<S>::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/task/mod.rs:437:9
  68: tokio::runtime::blocking::pool::Task::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:159:9
  69: tokio::runtime::blocking::pool::Inner::run
             at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:513:17
  70: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
            at ./root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.32.0/src/runtime/blocking/pool.rs:471:13
  71: std::sys_common::backtrace::__rust_begin_short_backtrace
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/sys_common/backtrace.rs:154:18
  72: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/thread/mod.rs:529:17
  73: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/panic/unwind_safe.rs:271:9
  74: std::panicking::try::do_call
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:504:40
  75: std::panicking::try
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panicking.rs:468:19
  76: std::panic::catch_unwind
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/panic.rs:142:14
  77: std::thread::Builder::spawn_unchecked_::{{closure}}
  at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/thread/mod.rs:528:30
  78: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/core/src/ops/function.rs:250:5
  79: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/alloc/src/boxed.rs:2007:9
  80: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/alloc/src/boxed.rs:2007:9
  81: std::sys::unix::thread::Thread::new::thread_start
             at ./rustc/249624b5043013d18c00f0401ca431c1a6baa8cd/library/std/src/sys/unix/thread.rs:108:17
  82: start_thread
             at ./nptl/pthread_create.c:442:8
  83: clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Full log

t.txt

hzxa21 commented 9 months ago

@wenym1 PTAL

wenym1 commented 9 months ago

@shanicky Any context for how the bug comes from? Is there any way to reproduce the bug?

fuyufjh commented 9 months ago

@shanicky Any context for how the bug comes from? Is there any way to reproduce the bug?

Please talk to @shanicky. This happened in a customer's environment and blocked us to scaling in a cluster.

wenym1 commented 9 months ago

This code is a sanity check on whether the epoch is increasing monotonically. I just check the logic of starting epoch and sending barrier to the remote sink, and the logic LGTM.

Is it possible that the barrier of an epoch is sent twice during scale-in? @shanicky

wenym1 commented 9 months ago

More information can be provided in the log with https://github.com/risingwavelabs/risingwave/pull/13806