Open. see-quick opened this issue 6 months ago.
Discussed on the community call on 16.5.2024: @mimaison will try to have a look at the Kafka error in more detail and we can get back to it next time.
@see-quick, do you have the full log for the error case? I am wondering whether there were some errors while creating topics earlier.
You can find all kinds of logs here [1]. Or do you mean something else? I have also attached both logs, i.e. topic-operator and Kafka, in the description.
Identified a bug in upstream Kafka. Filed KAFKA-16814 for this issue.
Triaged on 30/5/2024: let's keep this open while we wait for the corresponding Kafka issue to be fixed and made available in the next Kafka releases, 3.8.0 and 3.7.1.
Related problem
In performance capacity tests for the Topic Operator using KRaft-based Kafka clusters, several scaling limitations were observed. Notably, the ARM architecture scaled better than Intel, with issues arising when the number of KafkaTopics approached 4000. Errors related to the LogManager and connection refusals were noted in the Kafka broker logs, indicating potential faults in topic ID management and network communication under high load.
Here is the problem with the LogManager (full log [1]):
And this is the Topic Operator container log (full log [2]):
[1] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/logs-pod-cluster-03ebd4e7-b-cc352780-0-container-kafka.log
[2] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/logs-pod-cluster-03ebd4e7-entity-operator-d78d79d5c-sjnrg-container-topic-operator.log
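For anyone triaging the linked logs locally, a minimal sketch of counting the two error families mentioned above might help. It assumes the log has already been downloaded as plain text; the marker strings are illustrative substrings chosen for this sketch, not the exact messages Kafka emits.

```python
from collections import Counter
from typing import Iterable, Sequence

# Assumption: plain substrings picked for illustration only. Adjust them to
# the actual messages seen in the broker log.
MARKERS = ("LogManager", "Connection refused")


def count_markers(lines: Iterable[str], markers: Sequence[str] = MARKERS) -> Counter:
    """Count how many lines mention each marker, ignoring case."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts
```

To use it, open the downloaded broker log and pass the file object to `count_markers`; the returned counter gives a rough picture of which error family dominates before reading the full log.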
Suggested solution
Not sure; we should first identify the root cause.
Alternatives
No response
Additional context
The issue was specifically noted in scenarios where the number of KafkaTopics was scaled up to 4000, suggesting a threshold where current configurations begin to falter. Performance logs and detailed test outcomes from both ARM and Intel architecture setups (available in linked artefacts [1]) underline the need for targeted improvements in scalability and error management. All measurements:
And these are my specific findings:
Multi-node (ZK-based)
Multi-node (KRaft-based)
Testing farm ARM, Use Case: capacityUseCase
Testing farm Intel, Use Case: capacityUseCase
[1] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/
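To try reproducing the scale-up toward roughly 4000 KafkaTopics outside the systemtest suite, a sketch like the following could generate the custom resources. It is not how the `TopicOperatorPerformance` test creates topics; the topic name prefix, the cluster name, and the partition and replica counts are assumptions made for illustration.

```python
def kafka_topic_manifest(name: str, cluster: str,
                         partitions: int = 1, replicas: int = 1) -> str:
    """Render one Strimzi KafkaTopic custom resource as YAML text."""
    return (
        "apiVersion: kafka.strimzi.io/v1beta2\n"
        "kind: KafkaTopic\n"
        "metadata:\n"
        f"  name: {name}\n"
        "  labels:\n"
        f"    strimzi.io/cluster: {cluster}\n"
        "spec:\n"
        f"  partitions: {partitions}\n"
        f"  replicas: {replicas}\n"
    )


def generate_batch(count: int, cluster: str,
                   prefix: str = "capacity-topic") -> str:
    """Join `count` manifests into one multi-document YAML string.

    The prefix is hypothetical; the real test may name topics differently.
    """
    return "---\n".join(
        kafka_topic_manifest(f"{prefix}-{i}", cluster) for i in range(count)
    )
```

Writing the output of `generate_batch(4000, "my-cluster")` to a file and applying it in smaller batches would make it easier to spot the point at which the broker errors start appearing.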