fuyufjh opened this issue 1 month ago
A similar problem exists for the Pulsar source:
2024-07-25T12:11:08.180149+08:00 INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::manager::sink_coordination::manager: successfully stop coordinator: None
2024-07-25T12:11:08.201655+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: update actors request_id="00cc74b3-3496-40b8-95b1-aea0cf83b6f1" actors=[40, 39, 38, 37, 46, 48, 45, 47, 44, 43, 42, 41]
2024-07-25T12:11:08.204114+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: build actors request_id="1ea1826e-c12a-4be0-8feb-708dfed3a7ed" actors=[46, 47, 48, 40, 43, 39, 45, 38, 44, 42, 37, 41]
2024-07-25T12:11:13.371849+08:00 WARN bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::rpc: get error from response stream node=WorkerNode { id: 2, r#type: ComputeNode, host: Some(HostAddress { host: "127.0.0.1", port: 5688 }), state: Running, parallel_units: [ParallelUnit { id: 0, worker_node_id: 2 }, ParallelUnit { id: 1024, worker_node_id: 2 }, ParallelUnit { id: 2048, worker_node_id: 2 }, ParallelUnit { id: 3072, worker_node_id: 2 }], property: Some(Property { is_streaming: true, is_serving: true, is_unschedulable: false }), transactional_id: Some(0), resource: Some(Resource { rw_version: "1.11.0-alpha", total_memory_bytes: 34359738368, total_cpu_cores: 10 }), started_at: Some(1721880487) } err=gRPC request to stream service failed: Internal error: failed to collect barrier for epoch [6746070748823552]: Actor 45 exited unexpectedly: Executor error: Connector error: Connection error: fatal error when connecting to the Pulsar server
2024-07-25T12:11:16.238496+08:00 DEBUG build:new:new:connect:connect_inner: pulsar::connection_manager: ConnectionManager::connect(BrokerAddress { url: Url { scheme: "pulsar+ssl", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("********")), port: Some(6651), path: "", query: None, fragment: None }, broker_url: "************", proxy: false })
2024-07-25T12:11:16.243887+08:00 DEBUG build:new:new:connect:connect_inner:new: pulsar::connection: Connecting to pulsar+ssl://************, as 9a1ffb8f-c466-4350-9c89-0644b07fea92
2024-07-25T12:11:16.405823+08:00 INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovering mview progress
2024-07-25T12:11:16.45146+08:00 INFO bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: recovered mview progress
2024-07-25T12:11:16.484738+08:00 DEBUG bootstrap_recovery{prev_epoch=6746070748823552}:recovery_attempt: risingwave_meta::barrier::recovery: start resetting actors distribution
I find this a dilemma. 😕 When building actors fails (e.g., when the connector is constructed in `from_proto`), we can't tell whether the cause is a permanent misconfiguration or a transient connectivity issue. Perhaps we need to differentiate these two cases for a better experience. For example, introduce the validation step for all kinds of sources when creating. However, due to the difference in behavior (e.g., checking the connection & schema vs. actually selecting data from the source), I guess it's still possible to have some false positives. 🤔
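For illustration, a minimal Rust sketch of what such a creation-time validation step could look like; `SourceValidate`, `check_connectivity`, and `check_schema` are hypothetical names, not RisingWave's actual connector API:

```rust
use anyhow::{Context, Result};

/// Hypothetical hook every source connector would implement before the
/// job is accepted (illustrative only, not the real API).
trait SourceValidate {
    /// Cheap, side-effect-free checks: reachability, credentials, schema.
    fn validate(&self) -> Result<()>;
}

struct PulsarSource {
    broker_url: String,
    topic: String,
}

impl SourceValidate for PulsarSource {
    fn validate(&self) -> Result<()> {
        // Reject the CREATE SOURCE statement early instead of letting the
        // error surface later during recovery.
        check_connectivity(&self.broker_url)
            .context("cannot reach the Pulsar broker")?;
        check_schema(&self.broker_url, &self.topic)
            .context("topic missing or schema mismatch")?;
        Ok(())
    }
}

// Stubs standing in for the real connector calls.
fn check_connectivity(_url: &str) -> Result<()> {
    Ok(())
}
fn check_schema(_url: &str, _topic: &str) -> Result<()> {
    Ok(())
}
```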
Good point. From the Meta node's perspective, these `create actors` requests are called in different paths.
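As a sketch of what distinguishing those two call paths could look like (the enum and handling below are illustrative, not the actual Meta code):

```rust
/// Hypothetical tag for why Meta asks a compute node to build actors.
enum BuildActorsReason {
    /// A user is creating a new streaming job: fail fast so the
    /// configuration error is reported back immediately.
    Creation,
    /// Bootstrap recovery of existing jobs: the config worked before, so
    /// treat failures as transient and retry.
    Recovery,
}

/// Returns whether the caller should retry the `create actors` request.
fn should_retry(reason: &BuildActorsReason) -> bool {
    match reason {
        BuildActorsReason::Creation => false,
        BuildActorsReason::Recovery => true,
    }
}
```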
To address the dilemma, I believe

> introduce the validation step for all kinds of sources when creating

is the perfect solution. I am not very sure, but in my mind the existing validation stage can already eliminate most problems, such as missing tables, wrong credentials, or an unreachable network. With a well-designed validation, we can treat both cases above as "always retry".
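A minimal sketch of that "always retry" recovery loop, assuming creation-time validation has already weeded out permanent misconfigurations (`build_actors` here stands in for the real recovery attempt and is an illustrative name):

```rust
use std::time::Duration;

/// Keep retrying recovery with capped exponential backoff instead of
/// failing and leaving the whole cluster unavailable.
fn recover_with_retry(mut build_actors: impl FnMut() -> Result<(), String>) {
    let mut backoff = Duration::from_millis(100);
    let max_backoff = Duration::from_secs(10);
    loop {
        match build_actors() {
            Ok(()) => return,
            Err(e) => {
                // Assumed transient, e.g. the upstream PG or Pulsar broker
                // is temporarily down.
                eprintln!("recovery attempt failed: {e}; retrying in {backoff:?}");
                std::thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
}
```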
Describe the bug
Currently, if the upstream of a PG CDC table is unavailable, recovery fails, making the entire cluster unavailable. The expected behavior is to keep retrying.