risingwavelabs / risingwave

bug(cdc source): should retry on upstream failure instead of blocking recovery #17807

Open fuyufjh opened 1 month ago

fuyufjh commented 1 month ago

Describe the bug

{"timestamp":"2024-07-25T01:39:26.827182302Z","level":"WARN","fields":{"message":"build_actors failed","error":"merged RPC Error: worker node 140004, gRPC request to stream service failed: Internal error: Executor error: Connector error: Postgres error: db error: ERROR: relation \"public.source_priorities\" does not exist;"},"target":"risingwave_meta::barrier::recovery","spans":[{"prev_epoch":6857419982438400,"name":"bootstrap_recovery"},{"name":"recovery_attempt"}]}

Currently, if the upstream of a PG CDC table is unavailable, recovery fails, making the entire cluster unavailable.

The expected behavior is to keep retrying instead.
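A minimal sketch of the expected behavior, assuming a hypothetical `build_with_retry` helper (the real entry point is RisingWave's actor-build path, which is not shown here): instead of propagating the connector error into barrier recovery, the build step retries with capped exponential backoff until the upstream comes back.

```rust
use std::time::Duration;

/// Keep retrying a build step with capped exponential backoff instead of
/// bubbling the error up into barrier recovery.
fn build_with_retry(max_backoff: Duration, mut build: impl FnMut() -> Result<(), String>) {
    let mut backoff = Duration::from_millis(100);
    loop {
        match build() {
            Ok(()) => return,
            Err(err) => {
                eprintln!("build_actors failed, retrying in {backoff:?}: {err}");
                std::thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}

fn main() {
    // Simulate an upstream that recovers after two failures (e.g. the
    // missing relation is re-created on the Postgres side).
    let mut attempts = 0;
    build_with_retry(Duration::from_millis(500), || {
        attempts += 1;
        if attempts < 3 {
            Err(r#"relation "public.source_priorities" does not exist"#.to_string())
        } else {
            Ok(())
        }
    });
    println!("source built after {attempts} attempts");
}
```

In a real implementation the backoff parameters would come from configuration, and the retry loop would still need to respond to cancellation so that dropping the source does not hang.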


fuyufjh commented 1 month ago

A similar problem exists for the Pulsar source:

2024-07-25T12:11:08.180149+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::manager::sink_coordination::manager: successfully stop coordinator: None
2024-07-25T12:11:08.201655+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: update actors request_id="00cc74b3-3496-40b8-95b1-aea0cf83b6f1" actors=[40, 39, 38, 37, 46, 48, 45, 47, 44, 43, 42, 41]
2024-07-25T12:11:08.204114+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: build actors request_id="1ea1826e-c12a-4be0-8feb-708dfed3a7ed" actors=[46, 47, 48, 40, 43, 39, 45, 38, 44, 42, 37, 41]
2024-07-25T12:11:13.371849+08:00  WARN bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: get error from response stream node=WorkerNode { id: 2, r#type: ComputeNode, host: Some(HostAddress { host: "127.0.0.1", port: 5688 }), state: Running, parallel_units: [ParallelUnit { id: 0, worker_node_id: 2 }, ParallelUnit { id: 1024, worker_node_id: 2 }, ParallelUnit { id: 2048, worker_node_id: 2 }, ParallelUnit { id: 3072, worker_node_id: 2 }], property: Some(Property { is_streaming: true, is_serving: true, is_unschedulable: false }), transactional_id: Some(0), resource: Some(Resource { rw_version: "1.11.0-alpha", total_memory_bytes: 34359738368, total_cpu_cores: 10 }), started_at: Some(1721880487) } err=gRPC request to stream service failed: Internal error: failed to collect barrier for epoch [6746070748823552]: Actor 45 exited unexpectedly: Executor error: Connector error: Connection error: fatal error when connecting to the Pulsar server
2024-07-25T12:11:16.238496+08:00 DEBUG build:new:new:connect:connect_inner: pulsar::connection_manager: ConnectionManager::connect(BrokerAddress { url: Url { scheme: "pulsar+ssl", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("********")), port: Some(6651), path: "", query: None, fragment: None }, broker_url: "************", proxy: false })    
2024-07-25T12:11:16.243887+08:00 DEBUG build:new:new:connect:connect_inner:new: pulsar::connection: Connecting to pulsar+ssl://************, as 9a1ffb8f-c466-4350-9c89-0644b07fea92    
2024-07-25T12:11:16.405823+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovering mview progress
2024-07-25T12:11:16.45146+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovered mview progress
2024-07-25T12:11:16.484738+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: start resetting actors distribution
BugenZhao commented 1 month ago

I find this a dilemma. 😕 If we fail fast, a user who creates a misconfigured source gets immediate feedback; if we always retry, recovery never blocks, but genuine configuration errors can go unnoticed.

Perhaps we need to differentiate these two cases for a better experience. For example, we could introduce a validation step for all kinds of sources at creation time. However, given the difference in behavior (e.g., checking the connection and schema vs. actually selecting data from the source), I guess it's still possible to have some false positives. 🤔
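A hedged sketch of what such a creation-time validation hook could look like; the trait, struct, and check below are illustrative only and do not match RisingWave's actual connector API:

```rust
/// Hypothetical creation-time validation hook; RisingWave's real
/// connector traits are more involved than this.
trait ValidateSource {
    /// Check connectivity and schema without consuming data.
    fn validate(&self) -> Result<(), String>;
}

struct PostgresCdcSource {
    table_name: String,
}

impl ValidateSource for PostgresCdcSource {
    fn validate(&self) -> Result<(), String> {
        // In a real implementation this would open a connection and
        // query pg_catalog / information_schema for the table's
        // existence, columns, and replication settings. Here we just
        // simulate the failing relation from the bug report.
        if self.table_name == "public.source_priorities" {
            Err(format!("relation \"{}\" does not exist", self.table_name))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let source = PostgresCdcSource {
        table_name: "public.source_priorities".to_string(),
    };
    // CREATE-time: surface the error to the user instead of deferring
    // it to recovery, where it would block the whole cluster.
    if let Err(e) = source.validate() {
        eprintln!("CREATE SOURCE rejected: {e}");
    }
}
```

Because validation only probes connectivity and schema rather than consuming data, it can pass while the first real read still fails, which is the false-positive risk mentioned above.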

fuyufjh commented 1 month ago

Good point.

From the Meta node's perspective, these create-actors requests are issued via different paths (initial creation vs. recovery), so the two cases can be distinguished.

To address the dilemma, I believe

"introduce the validation step for all kinds of sources when creating"

is the perfect solution. I am not entirely sure, but in my mind the existing validation stage can already eliminate most problems, such as missing tables, wrong credentials, or an unreachable network. With a well-designed validation step, we can treat both cases above as "always retry".
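As an illustrative sketch of that division of labor (the error taxonomy below is invented for this example, not RisingWave's real error type): once creation-time validation filters out permanent misconfiguration, the runtime can uniformly treat connector errors as transient and retry instead of failing recovery.

```rust
/// Illustrative connector-error taxonomy; not RisingWave's real type.
enum ConnectorError {
    /// Normally caught by creation-time validation: missing table,
    /// bad credentials, unreachable host, ...
    Misconfiguration(String),
    /// Anything observed after the source was validated and created.
    Runtime(String),
}

/// With validation at CREATE time, the runtime policy becomes simple:
/// never fail recovery for a connector error; always retry.
fn on_runtime_error(err: ConnectorError) {
    match err {
        // Should have been rejected at CREATE; if the upstream changed
        // underneath us, retrying still beats blocking recovery.
        ConnectorError::Misconfiguration(msg) => {
            eprintln!("upstream misconfigured, retrying anyway: {msg}");
        }
        ConnectorError::Runtime(msg) => {
            eprintln!("transient upstream failure, retrying: {msg}");
        }
    }
}

fn main() {
    on_runtime_error(ConnectorError::Runtime(
        "fatal error when connecting to the Pulsar server".to_string(),
    ));
}
```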