shotover / shotover-proxy

L7 data-layer proxy
https://docs.shotover.io
Apache License 2.0

KafkaSinkCluster route by NodeState #1734

Closed rukai closed 1 week ago

rukai commented 2 weeks ago

blocked on https://github.com/shotover/shotover-proxy/pull/1715

Tests

A new test case is added to several existing tests: it kills a Kafka node while accessing an RF3 partition. An extra node is added to these tests' docker-compose.yaml (making a total of 4 nodes) to give the cluster room to add another replica and maintain RF3.

To support clusters of more than 3 nodes with TLS (needed for the cluster_sasl_scram_over_mtls_single_shotover test), extra DNS names (kafka3, kafka4, kafka5) are added to the TLS cert we generate for testing. Technically we only need kafka3 added, but I added more since these kinds of TLS failures are tricky to debug.
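For reference, extra DNS names like these are usually added as subjectAltName entries in the openssl config used when generating a test certificate. This is a generic sketch under that assumption, not shotover's actual cert-generation tooling:

```ini
# Generic openssl req config sketch (not shotover's actual script).
[ req ]
distinguished_name = dn
req_extensions     = san
prompt             = no

[ dn ]
CN = kafka

[ san ]
subjectAltName = @alt_names

[ alt_names ]
# New entries for the larger test cluster:
DNS.1 = kafka3
DNS.2 = kafka4
DNS.3 = kafka5
```

Adding the extra names up front means growing the compose cluster later will not require regenerating the cert.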

Implementation

Routing

The methods store_topic_names/store_topic_ids/store_group contain logic for checking whether we are missing any metadata needed for routing incoming requests. This PR extends that logic: the methods now also check whether the metadata points at a node that is currently down, since that indicates the metadata may be out of date and should be refetched.
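The extended check can be sketched as follows. This is a hypothetical illustration, assuming types like `KafkaNode`/`NodeState` and a helper `metadata_needs_refetch`; the real shotover code lives inside the `store_*` methods and uses its own types:

```rust
// Illustrative types, not shotover's actual definitions.
#[derive(Clone, Copy, PartialEq, Debug)]
enum NodeState {
    Up,
    Down,
}

struct KafkaNode {
    broker_id: i32,
    state: NodeState,
}

/// Returns true when cached routing metadata should be refetched:
/// either we have no metadata for the leader at all, or the metadata
/// points at a node that is missing or known to be down, which
/// suggests the metadata is stale.
fn metadata_needs_refetch(nodes: &[KafkaNode], leader_id: Option<i32>) -> bool {
    match leader_id {
        // No metadata stored yet: must fetch.
        None => true,
        // Metadata exists: refetch if the node it names is gone or down.
        Some(id) => nodes
            .iter()
            .find(|n| n.broker_id == id)
            .map_or(true, |n| n.state == NodeState::Down),
    }
}
```

The key point is that "metadata present" is no longer sufficient; it must also name a live node to be trusted for routing.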

Don't explode when nodes disappear

Some of the logic in KafkaSinkCluster naively assumes that once a node has appeared in the cluster's metadata it will always exist. However, that is not true: a "down" node will eventually be removed from the cluster's advertised list of nodes. To fix this, this PR replaces some `unwrap`s in the rewrite_metadata_response method with `if let`s, gracefully handling the absence of nodes we expected to exist.
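The shape of that change can be sketched like this. The function name `rewrite_leader_address` and the `HashMap` of brokers are hypothetical stand-ins, not the actual code in rewrite_metadata_response:

```rust
use std::collections::HashMap;

// Before (panics if the metadata names a node that has since been
// removed from the cluster):
//
//     let address = brokers.get(&leader_id).unwrap();
//
// After: handle the absence gracefully and let the caller skip the
// entry instead of crashing.
fn rewrite_leader_address(brokers: &HashMap<i32, String>, leader_id: i32) -> Option<String> {
    if let Some(address) = brokers.get(&leader_id) {
        Some(address.clone())
    } else {
        // The node was removed from the cluster's advertised node
        // list; returning None lets the caller skip this entry.
        None
    }
}
```

Returning `Option` (or matching with `if let` in place) keeps the proxy running when the cluster shrinks, at the cost of the caller having to decide what "node missing" means for each rewrite.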

codspeed-hq[bot] commented 2 weeks ago

CodSpeed Performance Report

Merging #1734 will not alter performance

Comparing rukai:kafka_sink_cluster_route_by_node_state (cd075d6) with main (b5c98c9)

Summary

✅ 39 untouched benchmarks