Tests
A new test case is added to certain tests to kill a kafka node while accessing an RF3 partition.
An extra node is added to these tests' docker-compose.yaml (for a total of 4 nodes) to give the cluster room to add another replica and maintain RF3.
To support clusters of more than 3 nodes with TLS (needed for the `cluster_sasl_scram_over_mtls_single_shotover` test), extra DNS names (kafka3, kafka4, kafka5) are added to the TLS cert we generate for testing.
Technically only kafka3 is needed, but I added more since these kinds of TLS failures are tricky to debug.
Implementation
Routing
The methods `store_topic_names`/`store_topic_ids`/`store_group` contain logic for checking whether we are missing any metadata needed to route incoming requests.
This PR extends that logic by making them also check whether the metadata points at a node that is now down, since that indicates the metadata may be out of date and should be refetched.
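The extended check can be sketched as follows. This is an illustrative example only; the type and function names (`Node`, `Partition`, `metadata_needs_refetch`) are hypothetical stand-ins for the real shotover types, not the actual implementation:

```rust
#[derive(Clone, Copy, PartialEq)]
enum NodeState {
    Up,
    Down,
}

struct Node {
    broker_id: i32,
    state: NodeState,
}

struct Partition {
    leader_id: i32,
}

/// Returns true when the cached metadata must be refetched: either we
/// have no entry at all, or the entry points at a node that is
/// currently down, in which case the metadata may be stale.
fn metadata_needs_refetch(cached: Option<&Partition>, nodes: &[Node]) -> bool {
    match cached {
        // No metadata stored yet: must fetch.
        None => true,
        // Metadata exists: refetch if the node it points at is
        // missing from the node list or marked as down.
        Some(partition) => nodes
            .iter()
            .find(|node| node.broker_id == partition.leader_id)
            .map_or(true, |node| node.state == NodeState::Down),
    }
}

fn main() {
    let nodes = vec![
        Node { broker_id: 0, state: NodeState::Up },
        Node { broker_id: 1, state: NodeState::Down },
    ];
    // Leader is up: cached metadata is usable.
    assert!(!metadata_needs_refetch(Some(&Partition { leader_id: 0 }), &nodes));
    // Leader is down: metadata may be out of date, refetch.
    assert!(metadata_needs_refetch(Some(&Partition { leader_id: 1 }), &nodes));
    // Nothing cached: refetch.
    assert!(metadata_needs_refetch(None, &nodes));
}
```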
Don't explode when nodes disappear
Some of the logic in `KafkaSinkCluster` is naive and assumes that once a node has appeared in the metadata of kafka nodes it will always exist.
This is not true: a "down" node will eventually be removed from the cluster's advertised list of nodes.
To fix this, in the `rewrite_metadata_response` method, this PR replaces some `unwrap`s with `if let`s to gracefully handle the absence of nodes that we expect to exist.
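The shape of that change looks roughly like the sketch below. This is not the actual shotover code; `Broker` and `rewrite_broker_address` are hypothetical names used only to illustrate the unwrap-to-if-let pattern:

```rust
use std::collections::HashMap;

struct Broker {
    host: String,
    port: i32,
}

/// Before: `brokers.get_mut(&broker_id).unwrap()` would panic if the
/// broker had been removed from the cluster's advertised nodes.
/// After: `if let` skips the rewrite gracefully when the node is gone.
fn rewrite_broker_address(
    brokers: &mut HashMap<i32, Broker>,
    broker_id: i32,
    shotover_host: &str,
    shotover_port: i32,
) {
    if let Some(broker) = brokers.get_mut(&broker_id) {
        broker.host = shotover_host.to_string();
        broker.port = shotover_port;
    }
    // If the broker no longer exists there is nothing to rewrite;
    // previously this case would panic via unwrap.
}

fn main() {
    let mut brokers = HashMap::from([
        (0, Broker { host: "kafka0".to_string(), port: 9092 }),
    ]);
    // Broker 0 exists: its address is rewritten.
    rewrite_broker_address(&mut brokers, 0, "shotover0", 9192);
    assert_eq!(brokers[&0].host, "shotover0");
    // Broker 1 has disappeared from the cluster: no panic, just a no-op.
    rewrite_broker_address(&mut brokers, 1, "shotover1", 9192);
    assert_eq!(brokers.len(), 1);
}
```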
blocked on https://github.com/shotover/shotover-proxy/pull/1715