Open scholzj opened 3 weeks ago
It looks like there is no way to get the list of registered nodes. The Admin API describeMetadataQuorum method seems to list them as observers until they are rolled, but not afterwards. So, for example, if you find out about the issue only after a Kafka upgrade, when trying to update the metadata, you have no way to find out the list of nodes. That also means that it might be hard for Strimzi to track and unregister the nodes without keeping the list of used node IDs somewhere in the Kafka CR status.
Update: I opened https://issues.apache.org/jira/browse/KAFKA-17094 to track the Kafka limitations related to this.
The most obvious solution for this would be to query the registered nodes from Kafka, compare them with the list of current nodes, and unregister those that were removed. However, Kafka cannot provide this information reliably today because of the issue linked above, and assuming it is confirmed, it seems unlikely to be fixed in 3.8, which should be shortly before RC1.
We can work around this Kafka issue by storing a full list of used node IDs in the Kafka CR status. That way, we would have our own reliable tracking of the nodes that existed, and we could unregister them. However, if we do this, we will change the API and it will be hard to revert the change. So even if Kafka fixes this later, we would be stuck with the node IDs field in the Kafka CR status.
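A minimal sketch of how such tracking could work, assuming the status holds a plain set of node IDs. The class and method names (UsedNodeIdTracker, mergeUsed, toUnregister) are hypothetical and only illustrate the set arithmetic. They are not existing Strimzi or Kafka APIs.

```java
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper: not part of Strimzi or Kafka.
final class UsedNodeIdTracker {
    // Union of the IDs already recorded in the status and the IDs in use now.
    // The result is what would be written back to the status.
    static SortedSet<Integer> mergeUsed(Set<Integer> previouslyUsed, Set<Integer> current) {
        SortedSet<Integer> merged = new TreeSet<>(previouslyUsed);
        merged.addAll(current);
        return merged;
    }

    // IDs that were used at some point but are not part of the cluster anymore.
    // These are the candidates for unregistration.
    static SortedSet<Integer> toUnregister(Set<Integer> used, Set<Integer> current) {
        SortedSet<Integer> removed = new TreeSet<>(used);
        removed.removeAll(current);
        return removed;
    }
}
```

In this sketch, an ID would be dropped from the tracked set only after its unregistration succeeded, so a failed attempt can be retried in the next reconciliation. That ordering is an assumption, not something decided in this issue.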
We should decide:
Discussed on the community call on 10.7.2024: KAFKA-17094 is currently under discussion in the Kafka project. We should wait for that discussion to be finished. That should give us a better idea of when and how it might be addressed in Kafka, and then we can decide how to deal with it in Strimzi.
Apparently, the Kafka nodes should be unregistered using the Kafka Admin API when scaling down. Without that, the cluster will still expect them to be present and will, for example, be unable to handle the metadata upgrade:
This would be simple to implement for regular scale-downs, but it will be non-trivial for node pool deletions.
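For the regular scale-down case, the flow could be sketched roughly as follows. The class ScaleDownUnregistration and its method are hypothetical. The actual unregistration is passed in as a callback so the sketch stays self-contained. With the Kafka Admin client, the callback would presumably wrap Admin.unregisterBroker(int), but how Strimzi would wire that up is an assumption here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntConsumer;

// Hypothetical helper: not part of Strimzi or Kafka.
final class ScaleDownUnregistration {
    // Calls the unregister callback for every node ID that existed before the
    // scale-down but does not exist after it. Returns the IDs for which the
    // callback failed, so that the caller can retry them later.
    static List<Integer> unregisterRemoved(Set<Integer> before, Set<Integer> after, IntConsumer unregister) {
        Set<Integer> removed = new TreeSet<>(before);
        removed.removeAll(after);

        List<Integer> failed = new ArrayList<>();
        for (int nodeId : removed) {
            try {
                // Assumption: with a real client this might be something like
                // admin.unregisterBroker(nodeId).all().get()
                unregister.accept(nodeId);
            } catch (RuntimeException e) {
                failed.add(nodeId);
            }
        }
        return failed;
    }
}
```

The node pool deletion case is harder because the "before" set is no longer available once the pool resource is gone, which is where a tracked list of used node IDs would come in.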