strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Error getting status of rebalance task via /user_tasks endpoint results in "NotReady" state #10704

Open tinaselenge opened 1 week ago

tinaselenge commented 1 week ago

Bug Description

When getting the status of a rebalance task via /user_tasks, Cruise Control can return a 500 with an error such as:

2024-10-09 15:44:34 ERROR KafkaCruiseControlRequestHandler:88 - Error processing GET request '/user_tasks' due to: 'There are already 5 active user tasks, which has reached the servlet capacity.'.
java.lang.RuntimeException: There are already 5 active user tasks, which has reached the servlet capacity.

This has nothing to do with the actual rebalance task itself, which may still be in progress. It seems to be a failure to generate a new user task for the status request. When one of the existing user tasks completes, it gets removed from the active user task list, e.g.:

2024-10-09 15:44:36 INFO  UserTaskManager:349 - UserTask 7e280130-47d2-4940-99da-f57f117c3f26 is completed and removed from active tasks list
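
To illustrate the behavior described above, here is a toy model of a capped active-task list. It is only a sketch and not Cruise Control's actual UserTaskManager: the class and method names are hypothetical, and it assumes (as the logs suggest) that every request needs a free slot and that completing a task frees one.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;

/** Toy model of a capped active-task list; NOT Cruise Control's real UserTaskManager. */
final class ActiveTaskLimitSketch {
    private final int maxActiveTasks;
    private final Set<UUID> activeTasks = new LinkedHashSet<>();

    ActiveTaskLimitSketch(int maxActiveTasks) {
        this.maxActiveTasks = maxActiveTasks;
    }

    /** Registers a task for an incoming request, or throws if the cap is reached. */
    UUID register() {
        if (activeTasks.size() >= maxActiveTasks) {
            throw new RuntimeException("There are already " + activeTasks.size()
                    + " active user tasks, which has reached the servlet capacity.");
        }
        UUID id = UUID.randomUUID();
        activeTasks.add(id);
        return id;
    }

    /** Marks a task as completed, which frees one slot for later requests. */
    void complete(UUID id) {
        activeTasks.remove(id);
    }

    int activeCount() {
        return activeTasks.size();
    }
}
```

With a cap of 5, a sixth `register()` call fails until `complete()` is called on one of the earlier tasks, regardless of what the rebalance itself is doing.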

Once an existing task is completed and removed, we should be able to send a request to /user_tasks without hitting a 500. Since this failure does not reflect the actual status of the rebalance task we are querying, I don't think it makes sense for it to move the KafkaRebalance to "NotReady". We should maybe retry the endpoint in the next reconciliation.
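
A minimal sketch of the retry idea, assuming the operator can see the HTTP status code and response body of the /user_tasks call. The class, enum, and method names are hypothetical, and matching on the error message text is an assumption made for illustration, not how Strimzi actually classifies Cruise Control errors.

```java
/** Hypothetical classification of a /user_tasks response; not Strimzi's actual code. */
final class StatusPollSketch {
    enum Outcome { USE_STATUS, RETRY_NEXT_RECONCILIATION, NOT_READY }

    /** Assumption: the capacity error is recognizable by its message text. */
    static boolean isCapacityError(int httpStatus, String body) {
        return httpStatus == 500
                && body != null
                && body.contains("reached the servlet capacity");
    }

    /** Decides what the reconciliation should do with the response. */
    static Outcome classify(int httpStatus, String body) {
        if (httpStatus == 200) {
            return Outcome.USE_STATUS;                 // real task status is available
        }
        if (isCapacityError(httpStatus, body)) {
            return Outcome.RETRY_NEXT_RECONCILIATION;  // transient: keep the current state
        }
        return Outcome.NOT_READY;                      // any other error is still reported
    }
}
```

The point of the design is that only the capacity error is treated as transient; other failures would still surface as "NotReady".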

Steps to reproduce

Create a KafkaRebalance CR for removing/adding brokers with auto-approval set, and then immediately apply the refresh annotation to create a new rebalance task. The failure is intermittent and depends on how quickly tasks complete.

Expected behavior

No response

Strimzi version

main

Kubernetes version

1.29

Installation method

No response

Infrastructure

No response

Configuration files and logs

No response

Additional context

No response

ppatierno commented 1 week ago

That's interesting because, as stated in https://github.com/strimzi/strimzi-kafka-operator/pull/10701, we see errors coming from Cruise Control being ignored rather than moving the KafkaRebalance to the NotReady state.

ppatierno commented 6 days ago

Ignore last comment ;-) We were wrong.

That said, good catch @tinaselenge. I think it could be made easily reproducible by lowering max.active.user.tasks when configuring Cruise Control in the Kafka custom resource. Its value is 5, which is exactly the limit in your log.

ppatierno commented 4 days ago

Triaged on 17/10/2024: agreed to fix this, at least by not moving the KafkaRebalance to the NotReady state straight away when this happens, and instead treating the next reconciliation(s) as retries. @tinaselenge is going to take a look at it. Thanks Tina!