ly9chee opened 5 months ago
@tabVersion Any recommendations / thoughts on this?
Seeing the log here
ERROR risingwave_stream::task::stream_manager: actor exit with error actor_id=4490 error=Executor error: Sink error: Kafka error: Message production error: QueueFull (Local: Queue full)
It indicates the batching is too small, which triggers a small failover that re-ingests the data within the same epoch over and over again.
I'd suggest increasing `properties.queue.buffering.max.ms`
and `` to allow a larger batching buffer and reduce the per-batch overhead. Related issue: https://github.com/confluentinc/librdkafka/issues/2247
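To make the suggestion concrete, here is a minimal sketch of what these knobs look like at the librdkafka level via rust-rdkafka. In RisingWave they would be passed as `properties.*` entries in the sink's `WITH` clause instead. The specific values, and the pairing with `queue.buffering.max.messages`, are illustrative assumptions on my part, not tuned recommendations:

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::FutureProducer;

fn build_producer() -> FutureProducer {
    ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        // Wait up to 100 ms to accumulate messages into a batch
        // before sending (larger batches, fewer requests).
        .set("queue.buffering.max.ms", "100")
        // Assumption: also raise the local queue capacity so the
        // producer hits QueueFull less often under bursty input.
        .set("queue.buffering.max.messages", "1000000")
        .create()
        .expect("producer creation failed")
}
```

Note that a larger buffer only delays, rather than eliminates, `QueueFull` when sustained throughput exceeds what the brokers can absorb.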
@tabVersion Thanks for the suggestions, but I have a concern about setting those properties well: in practice the upstream throughput can change frequently. A setting may work fine when the upstream throughput is 1k/s, but the sink enters a recovery loop when the throughput gets high (100k/s) or Kafka is under heavy load. And when we hit this error, the only thing we can do is drop the sink to prevent the cluster from crashing repeatedly.
It seems that in the `KafkaPayloadWriter` implementation, when a `QueueFull` error is encountered, we only await one delivery and then immediately create a new delivery future.
https://github.com/risingwavelabs/risingwave/blob/91b7ee29ce4d846f9c2ee6d9f56264bab414250a/src/connector/src/sink/kafka.rs#L511-L522
In this case, I think we might await all inflight deliveries being drained, or wait a sufficient time, before retrying; otherwise the producer queue may keep reaching full.
It seems this issue is not urgent, because I can't reproduce it.😅
remove from milestone, keep open for tracking
Describe the bug
Description
The cluster entered a recovery loop when creating a Kafka sink from a materialized view (about 4 million records); when the problematic sink was dropped, the system returned to normal.
Twenty minutes later, the sink was successfully recreated.
Other observed phenomena
Kafka itself appears healthy: within the recovery period the topic had 140 million records written to it, even though the upstream mv only had 4 million records.
Not sure whether increasing
properties.retry.max
can solve this issue.
Error message/log
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
v1.8.2
Additional context
No response