strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Bug]: Fix StrimziUpgradeST.testUpgradeAcrossVersionsWithUnsupportedKafkaVersion to use unsupported Kafka version instead of a supported one #10495

Open egyedt opened 2 weeks ago

egyedt commented 2 weeks ago

Related problem

StrimziUpgradeST.testUpgradeAcrossVersionsWithUnsupportedKafkaVersion uses a supported Kafka version from the start, so the intended test with an unsupported Kafka version never actually happens.

Suggested solution

Alternatives

Remove the test case, since it adds no value in its current form.

Additional context

This is related to Strimzi upgrade tests.

im-konge commented 2 weeks ago

Hey, they are not the same, as the names suggest. One is testing an upgrade without a Kafka version and one is testing an upgrade with an unsupported Kafka version. That's why the waits are not there.

When the Kafka version is not supported, you need to update it and then check the rolling updates. So I don't think that you would "fix" it by adding additional waits for rolling updates before the Kafka version change.

im-konge commented 2 weeks ago

"but in the first test case we are not waiting for pods to be rolling updated, so timing issues can occur. (for me it happened that the entity-operators were not upgraded in time)"

TBH we are not hitting this anywhere, so maybe it is a race condition in your environment? And what does "for me it happened that the entity-operators were not upgraded in time" mean? Do you have some logs from the test run where it fails for you? Thanks.

egyedt commented 2 weeks ago

Yeah I already knew that the two tests are not the same... thanks for the clarification!

As I stated, they are almost the same. The differences:

Other tests usually use the waitForKafkaClusterRollingUpdate method of AbstractUpgradeST to wait until the upgrade has really happened on all related pods (brokers, ZK, CO, EO, ...). So the StrimziUpgradeST.testUpgradeAcrossVersionsWithUnsupportedKafkaVersion test is an exception, since it contains no wait for the upgrade.
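For readers unfamiliar with this kind of wait, here is a minimal sketch of a generic polling helper. It is not the actual waitForKafkaClusterRollingUpdate implementation; the class name WaitSketch, the method waitFor, and its parameters are all hypothetical and only illustrate the idea of blocking until a readiness condition (for example "all pods run the new image") holds or a timeout expires.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

public class WaitSketch {

    /**
     * Polls the given condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     * (Hypothetical helper, not part of the Strimzi test suite.)
     */
    static boolean waitFor(BooleanSupplier ready, Duration pollInterval, Duration timeout) {
        long start = System.nanoTime();
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] checks = {0};
        // Stand-in condition: becomes true on the third check.
        boolean rolled = waitFor(() -> ++checks[0] >= 3, Duration.ofMillis(10), Duration.ofSeconds(1));
        System.out.println("Condition met: " + rolled);
    }
}
```

In a real test the condition would query the cluster, which is outside the scope of this sketch.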

Related logs:

java.lang.AssertionError: 
Used image for Pod: co-namespace/my-cluster-entity-operator-5688b48f4b-84sb2 is not valid!
Expected: a string containing " strimzi/operator:latest"
     but: was ".../ strimzi/operator:0.40"
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
    at io.strimzi.systemtest.upgrade.AbstractUpgradeST.checkContainerImages(AbstractUpgradeST.java:349)
    at io.strimzi.systemtest.upgrade.AbstractUpgradeST.checkContainerImages(AbstractUpgradeST.java:341)
    at io.strimzi.systemtest.upgrade.AbstractUpgradeST.checkAllImages(AbstractUpgradeST.java:336)
    at io.strimzi.systemtest.upgrade.regular.StrimziUpgradeST.testUpgradeAcrossVersionsWithUnsupportedKafkaVersion(StrimziUpgradeST.java:125)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
    at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
    at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
    at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
    at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:214)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:210)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:135)
    at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:66)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
    at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
    at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
    at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
    at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
    at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
    at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
    at org.apache.maven.surefire.junitplatform.LazyLauncher.execute(LazyLauncher.java:56)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:184)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:148)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:122)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.apache.maven.surefire.api.util.ReflectionUtils.invokeMethodWithArray2(ReflectionUtils.java:137)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:148)
    at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:88)
    at org.apache.maven.plugin.surefire.InPluginVMSurefireStarter.runSuitesInProcess(InPluginVMSurefireStarter.java:91)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1212)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1090)
    at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:910)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:193)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)

im-konge commented 2 weeks ago

I'm checking the test case now; it seems that it even sets the wrong Kafka version that should be used there. And thanks for the exception. I will let you know what I find, but maybe the test itself is wrong. Just for clarification: are you running the test on the main branch? Because I see the 0.40.0 version there, which has not been there for a few releases. Thanks for all the info :)

im-konge commented 2 weeks ago

Yeah, so as I thought, the test is parsing the Kafka version wrongly. That means it takes a supported Kafka version (instead of an unsupported one), deploys everything, and then upgrades the CO, which causes all the ZK, Kafka, and EO Pods to roll. When I really switched it to an unsupported Kafka version, I got:

  status:
    clusterId: wvnIjVd_R4mfeA8TUHU4tQ
    conditions:
    - lastTransitionTime: "2024-08-26T18:25:05.618017238Z"
      message: 'Unsupported Kafka.spec.kafka.version: 3.6.0. Supported versions are:
        [3.7.0, 3.7.1, 3.8.0]'
      reason: UnsupportedKafkaVersionException
      status: "True"
      type: NotReady
    kafkaMetadataState: ZooKeeper
    kafkaVersion: 3.6.0
    observedGeneration: 1
    operatorLastSuccessfulVersion: 0.42.0

and the only rolled Pod was the CO's:

NAME                                          READY   STATUS    RESTARTS   AGE
cluster-a1f38780-consumer-continuous-mt5g4    1/1     Running   0          4m19s
cluster-a1f38780-producer-continuous-t56md    1/1     Running   0          4m19s
my-cluster-entity-operator-7c99854844-jfhq7   2/2     Running   0          5m28s
my-cluster-kafka-0                            1/1     Running   0          6m2s
my-cluster-kafka-1                            1/1     Running   0          6m2s
my-cluster-kafka-2                            1/1     Running   0          6m2s
my-cluster-zookeeper-0                        1/1     Running   0          6m47s
my-cluster-zookeeper-1                        1/1     Running   0          6m47s
my-cluster-zookeeper-2                        1/1     Running   0          6m47s
strimzi-cluster-operator-678d4f8595-cxvq9     1/1     Running   0          3m16s

So if you added those waits there, the test would still contain the same fault and, yeah, it would not differ from the other tests.

Anyway, good catch :) we will fix the parsing of the Kafka version (that should fix your race condition as well).

egyedt commented 2 weeks ago

Thanks, @im-konge, this explains the issue and timing problems in my executions. Should we fix the wrong version parsing problem in this issue and in the related PR or should we close this and open another Issue-PR pair?

im-konge commented 2 weeks ago

If you want to have a look at it (I will be really glad :) ) we can do it as part of this issue and the PR you opened, I would just edit the name and description of the issue. Should I do it or do you want to do it? Thanks a lot once more for checking it :)

egyedt commented 2 weeks ago

@im-konge, I fixed the title and description. Do you have an idea how we can always have one or more unsupported Kafka versions?

If I understand the main reason behind this test, then the expected behaviour of this case is the following:

Since we read the config from bundleUpgrade.yaml, it may happen that two releases support exactly the same set of Kafka versions. How can we be sure about this test then? If the sets of supported Kafka versions are exactly the same, we will not be able to create a proper starting state for this test...

Do you have an idea how we should avoid scenarios like this? Should we skip this test with a JUnit Assumption in these edge cases?
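To make the edge case concrete, here is a rough sketch of the check I have in mind. The class and method names (UnsupportedVersionCheck, versionsDroppedByUpgrade) and the version lists are made up for illustration; the real lists would have to come from the release metadata. It computes which Kafka versions the old release supports but the new one does not. If that set is empty, there is nothing to test with, and a JUnit 5 test could skip at that point with Assumptions.assumeFalse(dropped.isEmpty(), "...").

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UnsupportedVersionCheck {

    /**
     * Returns the Kafka versions supported by the old operator release
     * but no longer supported by the new one, keeping the order of the old list.
     * (Hypothetical helper, not existing Strimzi code.)
     */
    static Set<String> versionsDroppedByUpgrade(List<String> supportedByOld, List<String> supportedByNew) {
        Set<String> dropped = new LinkedHashSet<>(supportedByOld);
        dropped.removeAll(supportedByNew);
        return dropped;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<String> oldRelease = List.of("3.6.0", "3.6.1", "3.7.0");
        List<String> newRelease = List.of("3.7.0", "3.7.1", "3.8.0");

        Set<String> dropped = versionsDroppedByUpgrade(oldRelease, newRelease);
        if (dropped.isEmpty()) {
            // A JUnit 5 test would skip here instead of printing.
            System.out.println("No unsupported Kafka version available, the test should be skipped");
        } else {
            System.out.println("Candidate unsupported versions: " + dropped);
        }
    }
}
```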

im-konge commented 2 weeks ago

@egyedt the scenario is exactly as you wrote, good job. I will need to think about what the best way will be, as doing something complicated doesn't make sense IMHO.

I can think about something like this:

So basically as you mentioned there.

Most probably you will need to add some supporting methods to TestKafkaVersion. Currently there is just the method that returns the list of Kafka versions from a file at the specified URL. So from there you will need to list the Kafka versions that are supported and pick the oldest one. Then you can use it in the UpgradeKafkaVersion class.
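Something along these lines could work as a starting point. This is only a sketch: KafkaVersionEntry is a hypothetical stand-in for whatever TestKafkaVersion holds, and oldestSupported and compareVersions are made-up names, not existing methods. It filters the entries to the supported ones and picks the lowest version, comparing the dotted numbers component by component so that 3.10.0 sorts after 3.6.0.

```java
import java.util.List;
import java.util.Optional;

public class OldestSupportedVersion {

    /** Hypothetical stand-in for one entry of the Kafka versions file. */
    record KafkaVersionEntry(String version, boolean supported) { }

    /** Compares dotted numeric versions such as "3.7.1" component by component. */
    static int compareVersions(String a, String b) {
        String[] partsA = a.split("\\.");
        String[] partsB = b.split("\\.");
        for (int i = 0; i < Math.max(partsA.length, partsB.length); i++) {
            // Missing components are treated as 0, so "3.7" equals "3.7.0".
            int x = i < partsA.length ? Integer.parseInt(partsA[i]) : 0;
            int y = i < partsB.length ? Integer.parseInt(partsB[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** Returns the oldest version flagged as supported, or empty if none is. */
    static Optional<String> oldestSupported(List<KafkaVersionEntry> entries) {
        return entries.stream()
                .filter(KafkaVersionEntry::supported)
                .map(KafkaVersionEntry::version)
                .min(OldestSupportedVersion::compareVersions);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        List<KafkaVersionEntry> entries = List.of(
                new KafkaVersionEntry("3.5.0", false),
                new KafkaVersionEntry("3.7.0", true),
                new KafkaVersionEntry("3.6.0", true));
        System.out.println(oldestSupported(entries).orElse("none"));
    }
}
```

The Optional return value leaves it to the caller to decide what to do when no supported version is listed.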

If there is anything I can help you with or some question I can answer, please let me know :)

And thanks for changing the description (and that you are working on this)!