[Bug]: Cruise Control rebalancing seems to time out for long rebalancing tasks

eslam-gomaa commented 1 year ago

Bug Description

This is a benchmarking cluster, and I was performing a full rebalance (moving 8 TB+ of data), which takes hours. It seems that the operator either times out after some time or re-executes the rebalance API call.

The KafkaRebalance resource switched from "Rebalancing" to "NotReady" because it tried to execute the command again, which failed with: "Cannot start a new execution while there is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing execution and start a new one."

kubectl -n kafka get kafkarebalance full-rebalance-job
NAME                 CLUSTER         PENDINGPROPOSAL   PROPOSALREADY   REBALANCING   READY   NOTREADY
full-rebalance-job   kafka-cluster                                     True                  

kubectl -n kafka get kafkarebalance full-rebalance-job
NAME                 CLUSTER         PENDINGPROPOSAL   PROPOSALREADY   REBALANCING   READY   NOTREADY
full-rebalance-job   kafka-cluster                                                           True

Status:
  Conditions:
    Last Transition Time:  2023-04-07T23:13:57.793951102Z
    Message:               Error for request: kafka-cluster-cruise-control.kafka.svc:9090/kafkacruisecontrol/rebalance?json=true&dryrun=false&verbose=true&skip_hard_goal_check=true&goals=MinTopicLeadersPerBrokerGoal%2CDiskCapacityGoal%2CDiskUsageDistributionGoal%2CCpuCapacityGoal%2CCpuUsageDistributionGoal%2CReplicaCapacityGoal%2CReplicaDistributionGoal%2CLeaderReplicaDistributionGoal%2CLeaderBytesInDistributionGoal%2CTopicReplicaDistributionGoal%2CPreferredLeaderElectionGoal&rebalance_disk=false. Server returned: Error processing POST request '/rebalance' due to: 'java.lang.IllegalStateException: Cannot start a new execution while there is an ongoing execution. Please use stop_ongoing_execution=true to stop ongoing execution and start a new one.'.
    Reason:                CruiseControlRestException
    Status:                True
    Type:                  NotReady
  Observed Generation:     1
Events:                    <none>
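
For anyone triaging a similar status, here is a minimal Python sketch that checks whether a KafkaRebalance status carries this specific "ongoing execution" rejection. It assumes the status has been loaded as a dict from `kubectl get kafkarebalance <name> -o json`, where the condition keys are lower-case (`type`, `status`, `reason`, `message`) rather than the capitalized labels printed by `kubectl describe`. The function names and the sample dict are illustrative and not part of Strimzi.

```python
# Hypothetical helpers (not part of Strimzi). The marker text is copied
# from the error message reported above.
ONGOING_EXECUTION_MARKER = (
    "Cannot start a new execution while there is an ongoing execution"
)


def active_condition(status):
    """Return the first condition whose status is "True", or None."""
    for cond in status.get("conditions", []):
        if cond.get("status") == "True":
            return cond
    return None


def is_ongoing_execution_conflict(status):
    """True if the resource is NotReady because Cruise Control refused
    a second rebalance while another execution was still running."""
    cond = active_condition(status)
    if cond is None:
        return False
    return (
        cond.get("type") == "NotReady"
        and cond.get("reason") == "CruiseControlRestException"
        and ONGOING_EXECUTION_MARKER in cond.get("message", "")
    )


# Sample shaped like the status above (message shortened for readability).
example_status = {
    "conditions": [
        {
            "type": "NotReady",
            "status": "True",
            "reason": "CruiseControlRestException",
            "message": "Server returned: Error processing POST request "
            "'/rebalance' due to: 'java.lang.IllegalStateException: "
            "Cannot start a new execution while there is an ongoing "
            "execution.'",
        }
    ],
    "observedGeneration": 1,
}

if __name__ == "__main__":
    print(is_ongoing_execution_conflict(example_status))
```

A match only shows that a second rebalance request reached Cruise Control while the first was still running. It does not show who sent that request, which is why the operator and Cruise Control logs are still needed.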

Steps to reproduce

Produce a huge amount of data to the Kafka cluster.
Run a KafkaRebalance that will move terabytes of data (8 TB+ in my case).

Expected behavior

Wait until the rebalancing is finished (preferably reporting the task progress as part of the result).

Strimzi version

0.34.0

Kubernetes version

1.22

Installation method

Helm Chart

Infrastructure

Amazon EKS

Configuration files and logs

No response

Additional context

No response

scholzj commented 1 year ago

I'm not sure if this really is a Strimzi bug. I think you need to provide the operator and Cruise Control logs for the period when this happens, from when you created the KafkaRebalance resource up to the failure. Otherwise, it is not really clear what timed out and where.

eslam-gomaa commented 1 year ago

I'm creating the KafkaRebalance resource from a CronJob that runs every night (the CronJob doesn't do anything special, it just creates the KafkaRebalance resource), and the job completes immediately.

I'll reproduce it again and share the operator and cruise-control logs
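
If a nightly job like this could fire while a multi-hour rebalance is still running (the thread does not confirm that this was the cause), one option is to have the job check for an in-progress rebalance before creating a new resource. The sketch below assumes the input is the parsed output of `kubectl -n kafka get kafkarebalance -o json` and that each resource carries the `strimzi.io/cluster` label. The function name `rebalance_in_progress` and the sample data are hypothetical.

```python
def rebalance_in_progress(rebalance_list, cluster):
    """Return True if any KafkaRebalance for `cluster` currently has a
    Rebalancing condition with status "True".

    `rebalance_list` is assumed to be the dict parsed from
    `kubectl get kafkarebalance -o json` (a Kubernetes List object).
    """
    for item in rebalance_list.get("items", []):
        labels = item.get("metadata", {}).get("labels") or {}
        if labels.get("strimzi.io/cluster") != cluster:
            continue  # belongs to a different Kafka cluster
        conditions = item.get("status", {}).get("conditions") or []
        for cond in conditions:
            if cond.get("type") == "Rebalancing" and cond.get("status") == "True":
                return True
    return False


# Illustrative input: one rebalance still running for kafka-cluster.
sample = {
    "items": [
        {
            "metadata": {
                "name": "full-rebalance-job",
                "labels": {"strimzi.io/cluster": "kafka-cluster"},
            },
            "status": {
                "conditions": [{"type": "Rebalancing", "status": "True"}]
            },
        }
    ]
}

if __name__ == "__main__":
    if rebalance_in_progress(sample, "kafka-cluster"):
        print("rebalance still running; skipping tonight's run")
    else:
        print("no rebalance in progress; safe to create a new KafkaRebalance")
```

The check is a plain function over already-parsed JSON, so it can be exercised without a cluster. How the job fetches the list and what it does when it skips a run are left open.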

eslam-gomaa commented 1 year ago

I re-tested it and didn't hit the issue. Thank you, I'll close the issue then.

strimzi / strimzi-kafka-operator