strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

Refreshing a remove-brokers operation on a `KafkaRebalance` resource while a rebalance is already running can lead to a `NotReady` state #10571

Open ppatierno opened 1 month ago

ppatierno commented 1 month ago

Create a Kafka custom resource (for example, with 7 brokers) with the cruiseControl field set, so that Cruise Control runs as part of the cluster deployment. Start a rebalance by creating a KafkaRebalance custom resource that removes nodes 5 and 6 (with auto-approval enabled), like this:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    strimzi.io/rebalance-auto-approval: "true"
# no goals specified, using the default goals from the Cruise Control configuration
spec:
  mode: remove-brokers
  brokers: [5, 6]

Wait for the rebalance to go from PendingProposal to ProposalReady and then automatically (because auto-approval is enabled) to Rebalancing. While the rebalance is running, request a new one (using the "refresh" annotation on the existing custom resource) that also includes nodes 3 and 4, so that the list covers 3, 4, 5 and 6, like this:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
  annotations:
    strimzi.io/rebalance-auto-approval: "true"
    strimzi.io/rebalance: "refresh"
# no goals specified, using the default goals from the Cruise Control configuration
spec:
  mode: remove-brokers
  brokers: [3, 4, 5, 6]

Sometimes (it may depend on the timing and on how far Cruise Control has progressed with the current rebalance), the operator logs the following error and the KafkaRebalance moves to the NotReady state:

2024-09-12 13:51:10 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Requesting Cruise Control rebalance [dryrun=true]
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:351 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance state is now updated to [ProposalReady] with annotation strimzi.io/rebalance=refresh applied on the KafkaRebalance resource
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:359 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Removing annotation strimzi.io/rebalance=refresh
2024-09-12 13:51:11 INFO  AbstractOperator:520 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance in namespace myproject was MODIFIED
2024-09-12 13:51:11 INFO  AbstractOperator:520 - Reconciliation #61(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance in namespace myproject was MODIFIED
2024-09-12 13:51:11 INFO  CrdOperator:123 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): Status of KafkaRebalance my-rebalance in namespace myproject has been updated
2024-09-12 13:51:11 INFO  AbstractOperator:546 - Reconciliation #59(watch) KafkaRebalance(myproject/my-rebalance): reconciled
2024-09-12 13:51:11 INFO  AbstractOperator:266 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): KafkaRebalance my-rebalance will be checked for creation or modification
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:317 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Rebalance action is performed and KafkaRebalance resource is currently in [ProposalReady] state
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:788 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Auto-approval set on the KafkaRebalance resource
2024-09-12 13:51:11 INFO  KafkaRebalanceAssemblyOperator:1113 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Requesting Cruise Control rebalance [dryrun=false]
2024-09-12 13:51:11 ERROR KafkaRebalanceAssemblyOperator:378 - Reconciliation #60(watch) KafkaRebalance(myproject/my-rebalance): Status updated to [NotReady] due to error: Error for request: my-cluster-cruise-control.myproject.svc:9090/kafkacruisecontrol/remove_broker?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&brokerid=3%2C4%2C5%2C6. Server returned: Error processing POST request '/remove_broker' due to: 'java.lang.IllegalStateException: Cannot start a new execution while there is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing execution and start a new one.'.

It seems that requesting a new rebalance with a different set of nodes to remove requires the currently running task to be stopped by passing stop_ongoing_execution=true in the query string of the POST request to the REST API. Maybe we should add this parameter to any POST operation for rebalancing when our intention is not to wait for the current operation to finish but to start a new one straight away.

im-konge commented 1 month ago

Triaged on 19.9.2024: This should be fixed.

tinaselenge commented 1 week ago

After some investigation, I came to the conclusion that it is not straightforward to solve this issue with the suggested flag because of our current rebalance reconcile flow. The current flow is:

Setting the stop_ongoing_execution flag to true whenever we request a full run rebalance would stop all kinds of in-progress executions, including unrelated executions from topic operators. Currently it is also not straightforward to set this flag only when the refresh annotation is applied. The flag can only be set to true when dryrun is set to false (both cannot be set to true).
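To make the constraint concrete, here is a minimal sketch of how the remove_broker query string could be assembled. It is plain Python for illustration only, not the operator's actual Java code. The function name `remove_broker_query` is hypothetical; the parameter names are taken from the request URL in the log above and from the Cruise Control error message.

```python
from urllib.parse import urlencode


def remove_broker_query(broker_ids, dryrun, stop_ongoing_execution=False):
    """Build the query string for a POST to the remove_broker endpoint.

    Hypothetical helper: mirrors the parameters visible in the operator log.
    """
    # As noted in this thread, the flag cannot be combined with a dry run.
    if dryrun and stop_ongoing_execution:
        raise ValueError("stop_ongoing_execution=true requires dryrun=false")

    params = {
        "json": "true",
        "dryrun": str(dryrun).lower(),
        "verbose": "true",
        "skip_hard_goal_check": "false",
        "brokerid": ",".join(str(b) for b in broker_ids),
    }
    # Only add the flag when explicitly requested, since it would stop
    # any execution in progress, not just the one this resource started.
    if stop_ongoing_execution:
        params["stop_ongoing_execution"] = "true"
    return urlencode(params)
```

Calling `remove_broker_query([3, 4, 5, 6], dryrun=False)` reproduces the query string from the failing request. Passing `stop_ongoing_execution=True` appends the extra parameter, which is the part that would need to be limited to the refresh case.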

Although it is not simple for the operator to automatically refresh the rebalance in this scenario, the user is notified of the reason for the NotReady status. The error message makes it clear that the current execution needs to complete before a new one is submitted, so the user can wait and then apply the refresh annotation again on the KafkaRebalance CR. This would move the KafkaRebalance state from NotReady to New.

 status:
    conditions:
    - lastTransitionTime: "2024-10-29T09:32:38.728781722Z"
      message: 'Error for request: my-cluster-cruise-control.default.svc:9090/kafkacruisecontrol/remove_broker?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&brokerid=2%2C4%2C5.
        Server returned: Error processing POST request ''/remove_broker'' due to:
        ''java.lang.IllegalStateException: Cannot start a new execution while there
        is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing
        execution and start a new one.''.'
      reason: CruiseControlRestException
      status: "True"
      type: NotReady
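A user-side script could recognise this specific condition before deciding to wait and re-apply the refresh annotation. The sketch below assumes the status has already been loaded into a Python dict (for example from the resource's YAML); the function name and the marker constant are hypothetical, and the marker text is copied from the Cruise Control error above.

```python
# Text taken from the Cruise Control error message quoted in this issue.
ONGOING_EXECUTION_MARKER = (
    "Cannot start a new execution while there is an ongoing execution"
)


def blocked_by_ongoing_execution(status):
    """Return True if a NotReady condition reports an ongoing execution.

    `status` is the KafkaRebalance .status section as a plain dict.
    """
    for cond in status.get("conditions", []):
        if cond.get("type") == "NotReady" and cond.get("status") == "True":
            # Collapse whitespace so a message wrapped over several
            # lines still matches the marker.
            message = " ".join(cond.get("message", "").split())
            if ONGOING_EXECUTION_MARKER in message:
                return True
    return False
```

When this returns True, the user would wait for the running rebalance to finish and then apply the refresh annotation again, as described above.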

One improvement we could make is to handle this error and modify the error message slightly. Instead of the "Please use stop_ongoing_execution=true to stop ongoing execution and start a new one." part, it could say something like "Please wait for a few minutes until the ongoing execution is completed and then use the refresh annotation to ask for a new rebalance request again."
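As a rough illustration of the proposed message change, the following Python sketch swaps the Cruise Control hint for the operator-oriented one and leaves every other message untouched. It is not the operator's implementation; `rewrite_hint` and both constants are hypothetical names, and the two sentences are the ones quoted in this comment.

```python
# Hint as returned by Cruise Control.
CC_HINT = (
    "Please use stop_ongoing_execution=true to stop ongoing execution "
    "and start a new one."
)
# Proposed replacement wording from this comment.
OPERATOR_HINT = (
    "Please wait for a few minutes until the ongoing execution is completed "
    "and then use the refresh annotation to ask for a new rebalance request again."
)


def rewrite_hint(message):
    """Replace the Cruise Control hint; other messages pass through unchanged."""
    return message.replace(CC_HINT, OPERATOR_HINT)
```

Because the replacement matches one exact sentence, any other Cruise Control error would still reach the user verbatim.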

@ppatierno please let me know what you think.

ppatierno commented 1 week ago

On one hand, I would leave the message as it is, because that is exactly what we get from Cruise Control, rather than starting to handle specific errors (we don't know how many others we may face in the future) and replacing the message with one that is more understandable for the user. On the other hand, this message doesn't really explain to the user what to do. They could just apply the 'refresh' annotation again, hoping that no execution is running and that it will go through. I am interested to know what the other maintainers think as well, @strimzi/maintainers.