Detect offline shotover nodes for KafkaSinkCluster

justinweng-instaclustr commented 1 month ago

After introducing ShotoverNodeState to ShotoverNode in https://github.com/shotover/shotover-proxy/pull/1758, we should add a task to detect down shotover nodes and set ShotoverNodeState accordingly.

This PR adds a background task, check_shotover_peers, that loops over the peer shotover nodes and tries to open a TCP connection to each one. If the connection cannot be established within connect_timeout_ms, the peer node is marked as down.

connect_timeout_ms is the same configuration used when creating a connection to a destination kafka broker.
Each check is delayed for (check_shotover_peers_delay_ms + random(-check_shotover_peers_delay_ms/10, check_shotover_peers_delay_ms/10)) before moving to the next peer shotover node.
start_shotover_peers_check is called when the instance of KafkaSinkClusterBuilder is being created and hence is called exactly once.
check_shotover_peers is not invoked at all if there is no peer shotover node (i.e., there is only 1 shotover node in the cluster).
check_shotover_peers is restarted if the creation of the random number generator fails.
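
To make the check concrete, here is a minimal std-only sketch of the probe and the jittered delay described above. It is not the code in this PR: the real task is a background task inside shotover, while this sketch is synchronous, and the names `NodeState`, `probe_peer`, `jittered_delay` and `check_peers_once` are hypothetical stand-ins. The random sample is passed in as a parameter so the example does not depend on any particular random number generator.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical stand-in for the ShotoverNodeState added in #1758.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Up,
    Down,
}

/// Try to open a TCP connection to `peer`. If it cannot be established
/// within `connect_timeout` (or fails outright), treat the peer as down.
fn probe_peer(peer: &SocketAddr, connect_timeout: Duration) -> NodeState {
    match TcpStream::connect_timeout(peer, connect_timeout) {
        Ok(_) => NodeState::Up,
        Err(_) => NodeState::Down,
    }
}

/// Compute delay_ms + random(-delay_ms/10, delay_ms/10).
/// `unit` is assumed to be a random sample in [0.0, 1.0] supplied by the caller.
fn jittered_delay(delay_ms: u64, unit: f64) -> Duration {
    let spread = delay_ms as f64 / 10.0;
    let offset = (unit * 2.0 - 1.0) * spread;
    Duration::from_millis((delay_ms as f64 + offset).round() as u64)
}

/// One pass over all peers. With no peers this returns an empty list,
/// mirroring the case where the check is never started.
/// A real loop would sleep for `jittered_delay(..)` between probes.
fn check_peers_once(
    peers: &[SocketAddr],
    connect_timeout: Duration,
) -> Vec<(SocketAddr, NodeState)> {
    peers
        .iter()
        .map(|peer| (*peer, probe_peer(peer, connect_timeout)))
        .collect()
}
```

With check_shotover_peers_delay_ms set to 1000, for example, the delay between two probes falls between 900 ms and 1100 ms, which keeps several shotover nodes from probing each other in lockstep.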

The next PR will change metadata rewrites to exclude down shotover nodes.

codspeed-hq[bot] commented 1 month ago

CodSpeed Performance Report

Merging #1762 will degrade performances by 11.83%

Comparing justinweng-instaclustr:handle-offline-shotover-nodes (b1e7742) with main (2b11e0c)

Summary

❌ 1 regressions ✅ 38 untouched benchmarks

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `main` | `justinweng-instaclustr:handle-offline-shotover-nodes` | Change |
|---|---|---|---|---|
| ❌ | `encode_system.local_result_v5_no_compression` | 93.1 µs | 105.6 µs | -11.83% |

justinweng-instaclustr commented 1 month ago

The regressed benchmark, encode_system.local_result_v5_no_compression, is a Cassandra benchmark, so the reported regression is noise.
