strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Flaky test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip #10295

Closed: fvaleri closed this issue 2 months ago

fvaleri commented 3 months ago

Bug Description

The system test TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip often fails with a timeout when the entire class is run locally, while it always passes when run in isolation. I think this has already been observed by other people.

Steps to reproduce

mvn verify -B -Pall -pl systemtest -Dit.test="TopicReplicasChangeST" \
  -DexcludedGroups=flaky,loadbalancer -Dmaven.javadoc.skip -Dskip.surefire.tests

Expected behavior

The test should always pass.

Strimzi version

main

Kubernetes version

1.30

scholzj commented 3 months ago

It does not seem to be a problem in the Azure pipelines.

fvaleri commented 3 months ago

> It does not seem to be a problem in the Azure pipelines.

I know, but have you tried locally? It would be good to have some confirmation besides me.

scholzj commented 3 months ago

Funnily enough, I saw it fail once in the very next ST run after I wrote that it doesn't seem to fail. :-( That said, I still think it fails only rarely in the CI. There are several tests that are significantly more flaky. The general issue is that unless it is a real problem, nobody is going to jump on it and fix it, so if you think the test is flaky, it will usually be up to you to deal with it.

> I know, but have you tried locally? It would be good to have some confirmation besides me.

Interesting. I ran it 5 times and it failed twice. Normally my local environment is much more reliable than Azure, not the other way around.

fvaleri commented 3 months ago

> There are several tests that are significantly more flaky.

Sure, and we should log them so that they can eventually be fixed.

This one is tricky, as you also found. The test looks correct, and we applied some optimizations to make the CC model generation faster. There must be some race condition going on, possibly related to the testing infrastructure.

scholzj commented 3 months ago

I think all of the flaky tests are related to Cruise Control. It is really unpredictable. Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.
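For context, the tests already cope with this variability by polling until a condition holds or a deadline passes. A minimal sketch of that pattern, with hypothetical names (the real helper is io.strimzi.test.TestUtils.waitFor, which appears in the stack trace later in this thread; the poll interval and class name here are illustrative assumptions):

import java.util.function.BooleanSupplier;

// Sketch of the poll-with-timeout pattern used by the system tests.
// Names and intervals are illustrative, not Strimzi's actual values.
public class WaitForSketch {

    static void waitFor(String description, long pollMs, long timeoutMs, BooleanSupplier ready) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (ready.getAsBoolean()) {
                return; // fast case: CC produced a proposal in ~30 seconds
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        // slow case: if CC needs 5+ minutes for the cluster model, a 5-minute
        // deadline lands here even though the operation would eventually succeed
        throw new RuntimeException("Timeout after " + timeoutMs + " ms waiting for " + description);
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Example: the condition becomes true after ~2 seconds, well within the deadline
        waitFor("replica change status not present", 500, 300_000,
                () -> System.currentTimeMillis() - start > 2_000);
        System.out.println("Condition met");
    }
}

Polling only helps when the deadline exceeds the worst case, though: with a 30-second-to-5-minute spread, a 5-minute timeout sits right at the edge.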

fvaleri commented 3 months ago

That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

see-quick commented 3 months ago

> That's true. In this class I added some optimizations to make it faster and a bit more predictable. If I remember correctly, the flakiness was already there before that.

Correct. I'm not sure if increasing the timeout would solve a race there. There have been countless problems with CC, where many race conditions happen.

> Sometimes you have a proposal and rebalance done in 30 seconds. Sometimes you wait 5 minutes for a proposal.

This is true... and even with our optimizations it is still unpredictable sometimes. So I'm not sure how we could improve it further.

fvaleri commented 3 months ago

The first thing we have to confirm is that this issue is actually related to the cluster model not being ready in time, which I'm not sure about at the moment. I'll try to have a look if I find some time.

Frawless commented 3 months ago

The test executor should collect logs from the operands and the CO, so I guess the problem should be visible in the CC log, or maybe in the operator/Kafka logs.

scholzj commented 3 months ago

Discussed on the community call on 10.7.2024: @fvaleri plans to look at this and try to improve the stability.

fvaleri commented 2 months ago

It took me many runs to hit the timeout a couple of times. As we thought, it is related to Cruise Control's cluster model not being ready in time. This is confirmed by the TO logs, where more than 5 minutes pass between the first pending state and the ongoing state, and by the NotEnoughValidWindowsException in the CC log, which means the cluster model had zero valid metric windows when the topic configuration request ran.

# test logs
2024-07-13 08:05:07 [main] INFO  [KafkaTopicUtils:315] Starting wait for condition: replica change status not present
io.strimzi.test.WaitException: Timeout after 300000 ms waiting for replica change status not present
    at io.strimzi.test.TestUtils.waitFor(TestUtils.java:182)
    at io.strimzi.test.TestUtils.waitFor(TestUtils.java:133)
    at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForCondition(KafkaTopicUtils.java:316)
    at io.strimzi.systemtest.utils.kafkaUtils.KafkaTopicUtils.waitForReplicaChangeStatusNotPresent(KafkaTopicUtils.java:377)
    at io.strimzi.systemtest.operators.topic.TopicReplicasChangeST.testKafkaTopicReplicaChangePositiveRoundTrip(TopicReplicasChangeST.java:170)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
        ....    

# TO logs
2024-07-13 08:04:57,18740 INFO  [LoopRunnable-0] ReplicasChangeHandler:176 - Replicas change pending, Topics: [my-topic-1976557641-236539828]
...
2024-07-13 08:10:11,05290 INFO  [LoopRunnable-0] ReplicasChangeHandler:184 - Replicas change ongoing, Topics: [my-topic-1976557641-236539828]

# CC logs
2024-07-13 08:05:09 WARN  OperationRunnable:50 - Received exception when trying to execute runnable for "Update Topic Configuration"
com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: com.linkedin.cruisecontrol.exception.NotEnoughValidWindowsException: There are only 0 valid windows when aggregating in range [-1, 1720857909207] for aggregation options (minValidEntityRatio=1.00, minValidEntityGroupRatio=0.00, minValidWindows=1, numEntitiesToInclude=68, granularity=ENTITY)
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.computeResult(GoalBasedOperationRunnable.java:167) ~[cruise-control-2.5.137.jar:?]
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:157) ~[cruise-control-2.5.137.jar:?]
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.UpdateTopicConfigurationRunnable.getResult(UpdateTopicConfigurationRunnable.java:73) ~[cruise-control-2.5.137.jar:?]
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.OperationRunnable.run(OperationRunnable.java:45) ~[cruise-control-2.5.137.jar:?]
    at com.linkedin.kafka.cruisecontrol.servlet.handler.async.runnable.GoalBasedOperationRunnable.run(GoalBasedOperationRunnable.java:36) ~[cruise-control-2.5.137.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:840) ~[?:?]
    ....
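
For reference, a rough sketch of why a fresh cluster model can take minutes to produce its first valid window. The window size and sampling interval below are assumed values for illustration, not Cruise Control's actual configuration:

// Back-of-the-envelope arithmetic behind "0 valid windows". ALL values are
// assumptions; the real window size and sampling interval come from CC config.
public class WindowEstimateSketch {
    public static void main(String[] args) {
        long windowMs = 5 * 60 * 1000L;           // assumed metrics window size: 5 min
        long samplingIntervalMs = 2 * 60 * 1000L; // assumed sampling interval: 2 min

        // The request above needs minValidWindows=1, so it cannot be served
        // before the first window fills with enough samples. In the worst case
        // the request arrives just after a window opens, so roughly a full
        // window plus one sampling interval must elapse first.
        long worstCaseMs = windowMs + samplingIntervalMs;
        System.out.printf("Worst-case wait for the first valid window: ~%d min%n",
                worstCaseMs / 60_000);
        // ~7 min: of the same order as the >5 min pending-to-ongoing gap in
        // the TO logs, and longer than the test's 300000 ms (5 min) timeout.
    }
}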

I opened a new PR which should help make the model generation faster. If that doesn't work, we could try raising the timeout to 8 or 10 minutes, as sketched below.
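
A sketch of what that fallback would look like, with hypothetical names standing in for the actual test code (the real call chain is KafkaTopicUtils.waitForReplicaChangeStatusNotPresent -> TestUtils.waitFor, per the stack trace above):

import java.time.Duration;

// Illustration of raising the wait deadline at the test's call site.
public class TimeoutTuningSketch {
    // Timeout from the failing run, per the WaitException above: 300000 ms.
    static final long CURRENT_TIMEOUT_MS = Duration.ofMinutes(5).toMillis();
    // Proposed fallback from the comment above: 8 or 10 minutes, to outlast
    // Cruise Control's worst-case cluster model generation time.
    static final long PROPOSED_TIMEOUT_MS = Duration.ofMinutes(10).toMillis();

    public static void main(String[] args) {
        // Hypothetical call site, shaped like the helper sketched earlier:
        // waitFor("replica change status not present", 5_000, PROPOSED_TIMEOUT_MS, () -> statusGone());
        System.out.println(CURRENT_TIMEOUT_MS + " ms -> " + PROPOSED_TIMEOUT_MS + " ms");
    }
}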