risingwavelabs / risingwave

bug(cdc source): should retry on upstream failure instead of blocking recovery #17807

Open fuyufjh opened 1 month ago

fuyufjh commented 1 month ago

Describe the bug

{"timestamp":"2024-07-25T01:39:26.827182302Z","level":"WARN","fields":{"message":"build_actors failed","error":"merged RPC Error: worker node 140004, gRPC request to stream service failed: Internal error: Executor error: Connector error: Postgres error: db error: ERROR: relation \"public.source_priorities\" does not exist;"},"target":"risingwave_meta::barrier::recovery","spans":[{"prev_epoch":6857419982438400,"name":"bootstrap_recovery"},{"name":"recovery_attempt"}]}

Currently, if the upstream of a PG CDC table is unavailable, recovery fails, making the entire cluster unavailable.

The expected behavior is to keep retrying instead.
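A minimal sketch of the expected behavior, assuming a hypothetical `build_with_retry` helper (the real entry point is RisingWave's actor-build path, which is not shown here): instead of propagating the connector error into barrier recovery, the build step retries with capped exponential backoff until the upstream comes back.

```rust
use std::time::Duration;

/// Keep retrying a build step with capped exponential backoff instead of
/// bubbling the error up into barrier recovery.
fn build_with_retry(max_backoff: Duration, mut build: impl FnMut() -> Result<(), String>) {
    let mut backoff = Duration::from_millis(100);
    loop {
        match build() {
            Ok(()) => return,
            Err(err) => {
                eprintln!("build_actors failed, retrying in {backoff:?}: {err}");
                std::thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}

fn main() {
    // Simulate an upstream that recovers after two failures (e.g. the
    // missing relation is re-created on the Postgres side).
    let mut attempts = 0;
    build_with_retry(Duration::from_millis(500), || {
        attempts += 1;
        if attempts < 3 {
            Err(r#"relation "public.source_priorities" does not exist"#.to_string())
        } else {
            Ok(())
        }
    });
    println!("source built after {attempts} attempts");
}
```

In a real implementation the backoff parameters would come from configuration, and the retry loop would still need to respond to cancellation so that dropping the source does not hang.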


fuyufjh commented 1 month ago

A similar problem exists for the Pulsar source:

2024-07-25T12:11:08.180149+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::manager::sink_coordination::manager: successfully stop coordinator: None
2024-07-25T12:11:08.201655+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: update actors request_id="00cc74b3-3496-40b8-95b1-aea0cf83b6f1" actors=[40, 39, 38, 37, 46, 48, 45, 47, 44, 43, 42, 41]
2024-07-25T12:11:08.204114+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: build actors request_id="1ea1826e-c12a-4be0-8feb-708dfed3a7ed" actors=[46, 47, 48, 40, 43, 39, 45, 38, 44, 42, 37, 41]
2024-07-25T12:11:13.371849+08:00  WARN bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: get error from response stream node=WorkerNode { id: 2, r#type: ComputeNode, host: Some(HostAddress { host: "127.0.0.1", port: 5688 }), state: Running, parallel_units: [ParallelUnit { id: 0, worker_node_id: 2 }, ParallelUnit { id: 1024, worker_node_id: 2 }, ParallelUnit { id: 2048, worker_node_id: 2 }, ParallelUnit { id: 3072, worker_node_id: 2 }], property: Some(Property { is_streaming: true, is_serving: true, is_unschedulable: false }), transactional_id: Some(0), resource: Some(Resource { rw_version: "1.11.0-alpha", total_memory_bytes: 34359738368, total_cpu_cores: 10 }), started_at: Some(1721880487) } err=gRPC request to stream service failed: Internal error: failed to collect barrier for epoch [6746070748823552]: Actor 45 exited unexpectedly: Executor error: Connector error: Connection error: fatal error when connecting to the Pulsar server
2024-07-25T12:11:16.238496+08:00 DEBUG build:new:new:connect:connect_inner: pulsar::connection_manager: ConnectionManager::connect(BrokerAddress { url: Url { scheme: "pulsar+ssl", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("********")), port: Some(6651), path: "", query: None, fragment: None }, broker_url: "************", proxy: false })    
2024-07-25T12:11:16.243887+08:00 DEBUG build:new:new:connect:connect_inner:new: pulsar::connection: Connecting to pulsar+ssl://************, as 9a1ffb8f-c466-4350-9c89-0644b07fea92    
2024-07-25T12:11:16.405823+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovering mview progress
2024-07-25T12:11:16.45146+08:00  INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovered mview progress
2024-07-25T12:11:16.484738+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: start resetting actors distribution
BugenZhao commented 1 month ago

I find this a dilemma. 😕 If we fail fast, a user who creates a misconfigured source gets immediate feedback; if we always retry, recovery never blocks, but genuine configuration errors can go unnoticed.

Perhaps we need to differentiate these two cases for a better experience. For example, we could introduce a validation step for all kinds of sources at creation time. However, given the difference in behavior (e.g., checking the connection and schema vs. actually selecting data from the source), I guess it's still possible to have some false positives. 🤔
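A hedged sketch of what such a creation-time validation hook could look like; the trait, struct, and check below are illustrative only and do not match RisingWave's actual connector API:

```rust
/// Hypothetical creation-time validation hook; RisingWave's real
/// connector traits are more involved than this.
trait ValidateSource {
    /// Check connectivity and schema without consuming data.
    fn validate(&self) -> Result<(), String>;
}

struct PostgresCdcSource {
    table_name: String,
}

impl ValidateSource for PostgresCdcSource {
    fn validate(&self) -> Result<(), String> {
        // In a real implementation this would open a connection and
        // query pg_catalog / information_schema for the table's
        // existence, columns, and replication settings. Here we just
        // simulate the failing relation from the bug report.
        if self.table_name == "public.source_priorities" {
            Err(format!("relation \"{}\" does not exist", self.table_name))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let source = PostgresCdcSource {
        table_name: "public.source_priorities".to_string(),
    };
    // CREATE-time: surface the error to the user instead of deferring
    // it to recovery, where it would block the whole cluster.
    if let Err(e) = source.validate() {
        eprintln!("CREATE SOURCE rejected: {e}");
    }
}
```

Because validation only probes connectivity and schema rather than consuming data, it can pass while the first real read still fails, which is the false-positive risk mentioned above.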

fuyufjh commented 1 month ago

Good point.

From the Meta node's perspective, these create-actors requests are issued via different paths (initial creation vs. recovery), so the two cases can be distinguished.

To address the dilemma, I believe

"introduce the validation step for all kinds of sources when creating"

is the perfect solution. I am not entirely sure, but in my mind the existing validation stage can already eliminate most problems, such as missing tables, wrong credentials, or an unreachable network. With a well-designed validation step, we can treat both cases above as "always retry".
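As an illustrative sketch of that division of labor (the error taxonomy below is invented for this example, not RisingWave's real error type): once creation-time validation filters out permanent misconfiguration, the runtime can uniformly treat connector errors as transient and retry instead of failing recovery.

```rust
/// Illustrative connector-error taxonomy; not RisingWave's real type.
enum ConnectorError {
    /// Normally caught by creation-time validation: missing table,
    /// bad credentials, unreachable host, ...
    Misconfiguration(String),
    /// Anything observed after the source was validated and created.
    Runtime(String),
}

/// With validation at CREATE time, the runtime policy becomes simple:
/// never fail recovery for a connector error; always retry.
fn on_runtime_error(err: ConnectorError) {
    match err {
        // Should have been rejected at CREATE; if the upstream changed
        // underneath us, retrying still beats blocking recovery.
        ConnectorError::Misconfiguration(msg) => {
            eprintln!("upstream misconfigured, retrying anyway: {msg}");
        }
        ConnectorError::Runtime(msg) => {
            eprintln!("transient upstream failure, retrying: {msg}");
        }
    }
}

fn main() {
    on_runtime_error(ConnectorError::Runtime(
        "fatal error when connecting to the Pulsar server".to_string(),
    ));
}
```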