Open ppatierno opened 2 months ago
Triaged on 19.9.2024: This should be fixed.
After some investigation, I came to the conclusion that it is not straightforward to solve this issue with the suggested flag because of our current rebalance reconcile flow. The current flow is:
1. The `KafkaRebalance` is in the `Rebalancing` state, as the rebalance is in progress.
2. The operator calls the `/stop_proposal_execution` endpoint to stop the current rebalance operation.
3. The `KafkaRebalance` moves to the `ProposalReady` state.
4. In the next reconciliation the `KafkaRebalance` is in the `ProposalReady` state, therefore the operator sends a request to execute the removal of the updated set of brokers. When we send this request, we cannot set `stop_ongoing_execution` conditionally because the annotation was already removed in the previous round of reconciliation.
5. The `KafkaRebalance` ends up in the `NotReady` state.

Setting the `stop_ongoing_execution` flag to true whenever we request a full rebalance run would stop all kinds of in-progress executions, including unrelated executions from the Topic Operator. Currently it is also not straightforward to set this flag only on the refresh annotation. The flag can only be set to true when `dryrun` is set to false (both cannot be true at the same time).
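To make the constraint between the two flags concrete, here is a minimal sketch in Python of how the query string for `/remove_broker` could be built. The helper name `remove_broker_query` is hypothetical and is not the operator's actual code. The parameter names are taken from the request URL in the error message and from the Cruise Control hint, and appending `stop_ongoing_execution` at the end is only an assumption.

```python
from urllib.parse import urlencode


def remove_broker_query(broker_ids, dryrun=False, stop_ongoing_execution=False):
    """Build the query string for a POST to /remove_broker (illustrative only)."""
    # Stopping an ongoing execution only makes sense for a real run,
    # so the two flags must never both be true.
    if dryrun and stop_ongoing_execution:
        raise ValueError("stop_ongoing_execution=true requires dryrun=false")

    params = {
        "json": "true",
        "dryrun": str(dryrun).lower(),
        "verbose": "true",
        "skip_hard_goal_check": "false",
        "brokerid": ",".join(str(b) for b in broker_ids),
    }
    # Assumption: the flag is only added when explicitly requested.
    if stop_ongoing_execution:
        params["stop_ongoing_execution"] = "true"
    return urlencode(params)


print(remove_broker_query([2, 4, 5]))
print(remove_broker_query([2, 4, 5], stop_ongoing_execution=True))
```

The first call reproduces the query string of the failing request. The second shows what the request would look like with the flag set, which is exactly the variant that would also stop unrelated executions.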
Although it is not simple for the operator to automatically refresh the rebalance in this scenario, the user is notified of the reason for the `NotReady` status. The error message makes it clear that the current execution needs to complete before a new one is submitted, so the user can wait and then apply the refresh annotation again on the `KafkaRebalance` CR. This moves the `KafkaRebalance` state from `NotReady` to `New`.
status:
  conditions:
  - lastTransitionTime: "2024-10-29T09:32:38.728781722Z"
    message: 'Error for request: my-cluster-cruise-control.default.svc:9090/kafkacruisecontrol/remove_broker?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&brokerid=2%2C4%2C5.
      Server returned: Error processing POST request ''/remove_broker'' due to:
      ''java.lang.IllegalStateException: Cannot start a new execution while there
      is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing
      execution and start a new one.''.'
    reason: CruiseControlRestException
    status: "True"
    type: NotReady
One improvement we could make is to handle this error and modify the error message slightly. Instead of the "Please use stop_ongoing_execution=true to stop ongoing execution and start a new one." part, it could be something like "Please wait for a few minutes until the ongoing execution is completed and then use the refresh annotation to ask for a new rebalance request again."
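The proposed message change can be sketched as a small text substitution. This is only an illustration in Python, not the operator's implementation: the function name `user_facing_message` and both constants are hypothetical, and the replacement wording is the one suggested above.

```python
# Hint emitted by Cruise Control, as seen in the NotReady condition.
CC_HINT = (
    "Please use stop_ongoing_execution=true to stop ongoing execution "
    "and start a new one."
)
# Wording proposed in this comment (an assumption, not an agreed message).
OPERATOR_HINT = (
    "Please wait for a few minutes until the ongoing execution is completed "
    "and then use the refresh annotation to ask for a new rebalance request again."
)


def user_facing_message(cc_message: str) -> str:
    """Swap the Cruise Control hint for one the user can act on."""
    # Collapse line breaks and repeated spaces first, so the hint is
    # matched even when the message was wrapped across several lines.
    normalized = " ".join(cc_message.split())
    return normalized.replace(CC_HINT, OPERATOR_HINT)


raw = (
    "java.lang.IllegalStateException: Cannot start a new execution while there\n"
    "is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing\n"
    "execution and start a new one."
)
print(user_facing_message(raw))
```

Messages that do not contain the hint pass through with only their whitespace normalized, so other Cruise Control errors would still be reported as received.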
@ppatierno please let me know what you think.
So on one side, I would leave the message as it is because that's exactly what we get from Cruise Control instead of starting to handle specific errors (we don't know how many others we can face in the future) and changing the message for a more understandable one for the user. On the other side, this message doesn't really explain to the user what to do. They could just apply the 'refresh' again hoping that there is no execution running and it will go through. I am interested to know what the other maintainers think as well @strimzi/maintainers ?
Create a `Kafka` custom resource (for example with 7 brokers) with the `cruiseControl` field to run Cruise Control within the cluster deployment. Run a rebalancing by creating a `KafkaRebalance` custom resource to remove nodes 5, 6 (with auto-approval enabled), like this:

Wait for the rebalancing to go from `ProposalPending` to `ProposalReady` and automatically (auto-approval enabled) to `Rebalancing`. While the rebalancing is running, ask for a new rebalancing (using the "refresh" annotation on the already existing custom resource) that includes nodes 3, 4 as well, so all of 3, 4, 5 and 6, like this:
Sometimes (it could depend on the timing and on where Cruise Control is in the current rebalancing), the operator logs the following error and the `KafkaRebalance` moves to the `NotReady` state:

It seems that asking for a new rebalancing with a different set of nodes to remove requires the currently running task to be stopped via `stop_ongoing_execution=true` in the query string of the POST request to the REST API. Maybe we should add this to any POST operation for rebalancing when our intention is not to wait for the current operation to end but to start a new one straight away.