strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0
4.81k stars 1.29k forks source link

[Enhancement]: Enrich KafkaRebalance status with progress info #10278

Open Vince-Chenal opened 3 months ago

Vince-Chenal commented 3 months ago

Related problem

Today a KafkaRebalance object has only a Rebalancing status which means the rebalance is ongoing. It would be nice to have progress info in the KafkaRebalance status.

Suggested solution

During KafkaRebalance reconcile, when UserTask status is IN_EXECUTION, we could call kafkacruisecontrol/state?substates=EXECUTOR&verbose=true&json=true in order to get info on the ongoing rebalancing (executor_state_example.json).

This way we would be able to enrich KafkaRebalance status with progress info like:

Alternatives

No response

Additional context

No response

ppatierno commented 3 months ago

While I think this is a good idea, at the same time I was wondering which info would be really useful but mainly how much big the verbose response could taking into account a big cluster with hundreds of topics and corresponding partitions. From the example JSON you shared I could see useful totalDataToMove together with finishedDataMovement, numTotalPartitionMovements together with numFinishedPartitionMovements. But then you also have more details for specific topic partitions. From your example I see just one topic (it's mytopic) and I was wondering how much this JSON can grow with tons of topics? Maybe it's just possible to avoid the verbose=true to get what we really need? Did you try it?

scholzj commented 3 months ago

Discussed on the community call on 10.7.2024: This should be discussed on the next call with more Cruise Control SMEs on the call.

ppatierno commented 2 months ago

Not able to join today's call but I am waiting for @Vince-Chenal to think about my last questions/thoughts.

Vince-Chenal commented 2 months ago

Hey @ppatierno, sorry I completely missed the notification.

I just ran another test and you're right without verbose option we still have enough information to show rebalance progress:

{
    "ExecutorState": {
        "finishedDataMovement": 0,
        "numInProgressPartitionMovements": 0,
        "triggeredTaskReason": "No reason provided (Client: 10.66.25.180, Date: 2024-08-09T08:00:32Z)",
        "abortingPartitions": 0,
        "triggeredSelfHealingTaskId": "",
        "numFinishedPartitionMovements": 8,
        "maximumConcurrentPartitionMovementsPerBroker": 1,
        "numPendingPartitionMovements": 86,
        "triggeredUserTaskId": "9eefaa9f-4c2a-47fb-9fcb-95c3d350c45a",
        "totalDataToMove": 91212,
        "minimumConcurrentPartitionMovementsPerBroker": 1,
        "state": "INTER_BROKER_REPLICA_MOVEMENT_TASK_IN_PROGRESS",
        "averageConcurrentPartitionMovementsPerBroker": 1.0,
        "numTotalPartitionMovements": 94
    },
    "version": 1
}

verbose adds details for every topic partitions in completedPartitionMovement/pendingPartitionMovement fields which we don't need :+1:

ppatierno commented 2 months ago

Thanks @Vince-Chenal, at this point I think it's something we can add. Are you willing to contribute to it?

scholzj commented 2 months ago

@ppatierno I do not think it is that simple. You would also need to create a mechanism to avoid the reconciliation running in a tight loop because of this. So understanding the information that might be interesting is just one part. But it still needs to be triaged for the technical part and likely have a proposal.

ppatierno commented 2 months ago

I agree. I was asking because, it's not something as high priority imho. So it could not have an ETA or not being planned soon. So if anyone from the community would help, it's much better.

Vince-Chenal commented 2 months ago

I have the impression that this getCruiseControlState function is never called anywhere

Do you know where it was called before and why it's been removed? We could have reuse this existing code

scholzj commented 2 months ago

Discussed on a community call on 22.8.2024: This would be a good feature. But it should have a proposal to cover: