strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Topic Operator cannot start due to a readiness probe failure. #8517

Closed ruslan-maiboroda closed 1 year ago

ruslan-maiboroda commented 1 year ago

Bug Description

I have ~40,000 Kafka topics.

After rebooting the entity-operator, it failed the readiness probe, and the issue seems to be with the topic-operator pod.

I have discovered that increasing the maxbuffer might resolve the issue. However, the ZooKeeper documentation does not recommend exceeding the default value for this property, so there might be an alternative approach to fix the problem.

Steps to reproduce

No response

Expected behavior

Readiness probe should pass without error

Strimzi version

0.33.0

Kubernetes version

1.24

Installation method

Helm chart

Infrastructure

Amazon EKS

Configuration files and logs

Here are the logs for the entity-operator pod:

tls-sidecar.txt user-operator.txt topic-operator.txt

Additional context

No response

scholzj commented 1 year ago

I wonder what your architecture looks like with 40k topics. If you want, you can try to tune the health checks or give it more resources. But I think the Topic Operator was not designed for this kind of scale. It is also being replaced to make it compatible with ZooKeeper-less Kafka (see https://github.com/strimzi/proposals/blob/main/051-unidirectional-topic-operator.md for more details). So I do not think any improvements to the scalability of the old version are planned.
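
As a rough sketch of what tuning the health checks and resources could look like, the fragment below sets the probe and resource fields of the Topic Operator in the `Kafka` custom resource. The cluster name and every numeric value are illustrative assumptions, not tested recommendations for 40k topics.

```yaml
# Fragment of a Kafka custom resource (kafka and zookeeper sections omitted).
# All values below are placeholders chosen for illustration only.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster              # hypothetical cluster name
spec:
  entityOperator:
    topicOperator:
      readinessProbe:
        initialDelaySeconds: 120  # wait longer before the first check
        timeoutSeconds: 30
        periodSeconds: 30
        failureThreshold: 10      # tolerate more failed checks before marking unready
      livenessProbe:
        initialDelaySeconds: 120
        timeoutSeconds: 30
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 2Gi
    userOperator: {}
```

Looser probes only give the operator more time to start; they do not address the exceptions in the logs if those are the real cause of the failure.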

ruslan-maiboroda commented 1 year ago

I'm not convinced that tuning the health checks will be effective, since exceptions are constantly being recorded in the logs. Additionally, resource utilization is extremely low.

scholzj commented 1 year ago

That is fine - I'm not convinced they would really solve it either. But it is probably the only thing you can try easily. As I said - I think 40k topics are out of scale for the Topic Operator. TBH, I would probably not want to have 40k KafkaTopic resources in the Kubernetes cluster itself as that might cause a lot of issues even in Kubernetes alone.

tombentley commented 1 year ago

Triaged on community call 18/5/2023: The bidirectional topic operator was not designed to scale to 40,000 topics, and with the proposal for the unidirectional topic operator now accepted it doesn't really make sense to try to improve the old one. The new topic operator has been written with the need to scale to a larger number of topics in mind, but 40,000 seems ambitious even in that case. Not least, the effect of having that many resources in Kube (irrespective of the operator accessing them) needs to be understood. Marking as won't fix, at least for the bidirectional topic operator case.

sachintandon-nexla commented 1 year ago

@tombentley Do we have an ETA for the release in which the unidirectional topic operator will be available?

scholzj commented 1 year ago

The Unidirectional Topic Operator is available from 0.36.0. It is behind a feature gate which is disabled by default. So you would need to enable it (also, you should be aware that things might change or not work when behind an alpha feature gate). The current plan is for it to be enabled by default in Strimzi 0.39.
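
For illustration, Strimzi feature gates are switched on through the `STRIMZI_FEATURE_GATES` environment variable of the Cluster Operator, with a `+` prefix to enable a gate. The fragment below is a minimal sketch of that setting in the Cluster Operator Deployment; the container name is assumed to match the default installation files, and with a Helm installation the same value would normally be passed through the chart's values rather than by editing the Deployment directly.

```yaml
# Fragment of the Cluster Operator Deployment (other fields omitted).
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator   # assumed default container name
          env:
            - name: STRIMZI_FEATURE_GATES
              value: "+UnidirectionalTopicOperator"  # "+" enables the gate
```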

sachintandon-nexla commented 1 year ago

Thanks @scholzj for the update, this helps us run Strimzi Kafka more confidently in production. Just want to check: do you expect any impact on Strimzi Kafka if we run close to 40K topics on our Kafka cluster? Our use case needs a new topic per request.

scholzj commented 1 year ago

I have no idea.

But I think using topic-per-request might be a bad pattern. It sounds like you need something like ActiveMQ for example rather than Kafka.

sachintandon-nexla commented 1 year ago

40K topics in Kubernetes without using the Topic Operator

scholzj commented 1 year ago

Not sure I follow that. The KafkaTopic resources have no meaning without the Topic Operator. Plus I guess 40k resources might be a bit of a challenge for a regular Kube cluster as well.

sachintandon-nexla commented 10 months ago

@scholzj with the unidirectional topic operator released,