spring-cloud / spring-cloud-kubernetes

Kubernetes integration with Spring Cloud Discovery Client, Configuration, etc...
Apache License 2.0
3.45k stars 1.03k forks source link

Investivate IT Failures #1494

Closed ryanjbaxter closed 10 months ago

ryanjbaxter commented 10 months ago

@wind57 I am seeing some failing IT on our Jenkins builds. There are two failures

ActuatorRefreshIT

16:50:53 [INFO] Running org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT
16:50:53 20:50:53.371 [main] INFO  org.testcontainers.utility.ImageNameSubstitutor - Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor')
16:50:53 20:50:53.640 [main] INFO  org.testcontainers.dockerclient.DockerClientProviderStrategy - Loaded org.testcontainers.dockerclient.UnixSocketClientProviderStrategy from ~/.testcontainers.properties, will try it first
16:50:53 20:50:53.955 [main] INFO  org.testcontainers.dockerclient.DockerClientProviderStrategy - Found Docker environment with local Unix socket (unix:///var/run/docker.sock)
16:50:53 20:50:53.975 [main] INFO  org.testcontainers.DockerClientFactory - Docker host IP address is localhost
16:50:53 20:50:53.994 [main] INFO  org.testcontainers.DockerClientFactory - Connected to docker: 
16:50:53   Server Version: 24.0.2
16:50:53   API Version: 1.43
16:50:53   Operating System: Ubuntu 18.04.6 LTS
16:50:53   Total Memory: 16040 MB
16:50:54 20:50:54.092 [main] INFO  tc.testcontainers/ryuk:0.4.0 - Creating container for image: testcontainers/ryuk:0.4.0
16:50:54 20:50:54.342 [main] INFO  tc.testcontainers/ryuk:0.4.0 - Container testcontainers/ryuk:0.4.0 is starting: a2791ce5582f13aaa7a3de373ef89513ec3384c81fed3552976772e725160b69
16:50:54 20:50:54.810 [main] INFO  tc.testcontainers/ryuk:0.4.0 - Container testcontainers/ryuk:0.4.0 started in PT0.79561106S
16:50:54 20:50:54.821 [main] INFO  org.testcontainers.utility.RyukResourceReaper - Ryuk started - will monitor and terminate Testcontainers containers on JVM exit
16:50:54 20:50:54.821 [main] INFO  org.testcontainers.DockerClientFactory - Checking the system...
16:50:54 20:50:54.822 [main] INFO  org.testcontainers.DockerClientFactory - ✔︎ Docker server version should be at least 1.6.0
16:50:54 20:50:54.823 [main] INFO  tc.rancher/k3s:v1.25.4-k3s1 - Creating container for image: rancher/k3s:v1.25.4-k3s1
16:50:55 20:50:55.543 [main] WARN  tc.rancher/k3s:v1.25.4-k3s1 - Reuse was requested but the environment does not support the reuse of containers
16:50:55 To enable reuse of containers, you must set 'testcontainers.reuse.enable=true' in a file located at /home/jenkins/.testcontainers.properties
16:50:56 20:50:56.835 [main] INFO  tc.rancher/k3s:v1.25.4-k3s1 - Container rancher/k3s:v1.25.4-k3s1 is starting: 65511c88e6da78637adee334e8c2eb9a5abf41d86a97fcab4bc48dc75429c457
16:50:56 20:50:56.986 [main] ERROR tc.rancher/k3s:v1.25.4-k3s1 - Could not start container
16:50:56 com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"driver failed programming external connectivity on endpoint ecstatic_sutherland (323372cadd8703631131595d75df66a1537f55be41027f9ff9be9b34d6c45cb0): Bind for 0.0.0.0:9092 failed: port is already allocated"}
16:50:56 
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:247)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.post(DefaultInvocationBuilder.java:102)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:31)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.exec.StartContainerCmdExec.execute(StartContainerCmdExec.java:13)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:35)
16:50:56    at org.testcontainers.shaded.com.github.dockerjava.core.command.StartContainerCmdImpl.exec(StartContainerCmdImpl.java:43)
16:50:56    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:441)
16:50:56    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:344)
16:50:56    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
16:50:56    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:334)
16:50:56    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:322)
16:50:56    at org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT.beforeAll(ActuatorRefreshIT.java:66)
16:50:56    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
16:50:56    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
16:50:56    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
16:50:56    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
16:50:56    at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:727)
16:50:56    at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
16:50:56    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
16:50:57    at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:156)
16:50:57    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptLifecycleMethod(TimeoutExtension.java:128)
16:50:57    at org.junit.jupiter.engine.extension.TimeoutExtension.interceptBeforeAllMethod(TimeoutExtension.java:70)
16:50:57    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(InterceptingExecutableInvoker.java:103)
16:50:57    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.lambda$invoke$0(InterceptingExecutableInvoker.java:93)
16:50:57    at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
16:50:57    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
16:50:57    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
16:50:57    at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
16:50:57    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:92)
16:50:57    at org.junit.jupiter.engine.execution.InterceptingExecutableInvoker.invoke(InterceptingExecutableInvoker.java:86)
16:50:57    at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.lambda$invokeBeforeAllMethods$13(ClassBasedTestDescriptor.java:411)
16:50:57    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
16:50:57    at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.invokeBeforeAllMethods(ClassBasedTestDescriptor.java:409)
16:50:57    at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:215)
16:50:57    at org.junit.jupiter.engine.descriptor.ClassBasedTestDescriptor.before(ClassBasedTestDescriptor.java:84)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:148)
16:50:57    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
16:50:57    at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
16:50:57    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
16:50:57    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
16:50:57    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
16:50:57    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
16:50:57    at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
16:50:57    at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
16:50:57    at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
16:50:57    at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
16:50:57    at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
16:50:57    at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:147)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:127)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:90)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:55)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:102)
16:50:57    at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:54)
16:50:57    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
16:50:57    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
16:50:57    at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
16:50:57    at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
16:50:57    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
16:50:57    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
16:50:57    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
16:50:57    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
16:50:57    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
16:50:57    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
16:50:57 20:50:56.997 [main] ERROR tc.rancher/k3s:v1.25.4-k3s1 - There are no stdout/stderr logs available for the failed container
16:50:57 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.898 s <<< FAILURE! - in org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT
16:50:57 [ERROR] org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT  Time elapsed: 3.898 s  <<< ERROR!
16:50:57 org.testcontainers.containers.ContainerLaunchException: Container startup failed for image rancher/k3s:v1.25.4-k3s1
16:50:57    at org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT.beforeAll(ActuatorRefreshIT.java:66)
16:50:57 Caused by: org.rnorth.ducttape.RetryCountExceededException: Retry limit hit with exception
16:50:57    at org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT.beforeAll(ActuatorRefreshIT.java:66)
16:50:57 Caused by: org.testcontainers.containers.ContainerLaunchException: Could not create/start container
16:50:57    at org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT.beforeAll(ActuatorRefreshIT.java:66)
16:50:57 Caused by: com.github.dockerjava.api.exception.InternalServerErrorException: 
16:50:57 Status 500: {"message":"driver failed programming external connectivity on endpoint ecstatic_sutherland (323372cadd8703631131595d75df66a1537f55be41027f9ff9be9b34d6c45cb0): Bind for 0.0.0.0:9092 failed: port is already allocated"}
16:50:57 
16:50:57    at org.springframework.cloud.kubernetes.configuration.watcher.ActuatorRefreshIT.beforeAll(ActuatorRefreshIT.java:66)

This then causes ActuatorRefreshMultipleNamespacesIT to fail because I believe its assuming the first test starts the container.

However when I try to run ActuatorRefreshIT locally on my mac I am getting an error trying to import the configuration watcher image

ctr i import /var/folders/k3/zv8hzdm17vv69j485fv3cf9r0000gp/T/spring-cloud-kubernetes-configura
tion-watcher.tar
unpacking docker.io/springcloud/spring-cloud-kubernetes-configuration-watcher:3.0.5-SNAPSHOT (sha256:33464bca4f9f92a099928b6bd2f6aa668f1cf760a2b7ab63caddf2b806f412a1)...ctr: content digest sha256:548a79621a426b4eb077c926eabac5a8620c454fb230640253e1b44dc7dd7562: not found

Any idea what is going on?

wind57 commented 10 months ago

what branch is this failing in?

ryanjbaxter commented 10 months ago

3.0.x and main

ryanjbaxter commented 10 months ago

It seems like one of the other rancher containers is not being shut down (on jenkins, the local issue is something different clearly)

wind57 commented 10 months ago

I can't reproduce the local problem, which I am trying to figure out first.

I deleted my local docker images, then from 3.0.x:

wind57@wind57s-MacBook-Pro ~/D/p/s/s/s/spring-cloud-kubernetes-configuration-watcher (3.0.x)> java -version
openjdk version "17.0.1" 2021-10-19 LTS
OpenJDK Runtime Environment (build 17.0.1+12-LTS)
OpenJDK 64-Bit Server VM (build 17.0.1+12-LTS, mixed mode, sharing)

mvn clean install -Dspring-boot.build-image.builder=dashaun/builder:tiny -DskipTests

Then from Intellij, Run ActuatorRefreshIT. Can you try re-building the entire project and images? I usually do:

wind57@wind57s-MacBook-Pro ~/D/p/s/spring-cloud-kubernetes (3.0.x)> mvn clean install -Dskip.build.image=true -DskipITs -DskipTests  -T1C

wind57@wind57s-MacBook-Pro ~/D/p/s/spring-cloud-kubernetes (3.0.x)> cd spring-cloud-kubernetes-controllers/spring-cloud-kubernetes-configuration-watcher/
wind57@wind57s-MacBook-Pro ~/D/p/s/s/s/spring-cloud-kubernetes-configuration-watcher (3.0.x)> mvn clean install -Dspring-boot.build-image.builder=dashaun/builder:tiny -DskipTests

wind57@wind57s-MacBook-Pro ~/D/p/s/s/s/spring-cloud-kubernetes-configuration-watcher (3.0.x)> cd ../../spring-cloud-kubernetes-integration-tests/spring-cloud-kubernetes-k8s-client-configuration-watcher/
wind57@wind57s-MacBook-Pro ~/D/p/s/s/s/spring-cloud-kubernetes-k8s-client-configuration-watcher (3.0.x)> mvn clean install -Dspring-boot.build-image.builder=dashaun/builder:tiny -DskipTests

Looks very odd the content digest ... not found issue

ryanjbaxter commented 10 months ago

I forgot -Dspring-boot.build-image.builder=dashaun/builder:tiny that fixed it locally, sorry about the noise on that 🤦‍♂️

wind57 commented 10 months ago

No worries at all, I've done it far too many times too ))


The jenkins issue is rather interesting too:

16:50:55 20:50:55.543 [main] WARN  tc.rancher/k3s:v1.25.4-k3s1 - Reuse was requested but the environment does not support the reuse of containers
16:50:55 To enable reuse of containers, you must set 'testcontainers.reuse.enable=true' in a file located at /home/jenkins/.testcontainers.properties

Has anything changed in your jenkins server recently? The thing is, without that re-use support, this issue is rather expected

wind57 commented 10 months ago

In github actions, we set that re-use support via proper actions, for example:

https://github.com/spring-cloud/spring-cloud-kubernetes/blob/3a6aabdf8eda586a74fb108a7c124d117d6bc95e/.github/workflows/maven.yaml#L132

but I don't know if that happens in jenkins

ryanjbaxter commented 10 months ago

Not that I know of. If I look at the last green build I see that we were not reusing the container there as well

08:33:56.518 [main] INFO  tc.testcontainers/ryuk:0.5.1 - Creating container for image: testcontainers/ryuk:0.5.1
08:33:56.715 [main] INFO  tc.testcontainers/ryuk:0.5.1 - Container testcontainers/ryuk:0.5.1 is starting: 92c0fb507a19cbf49c33627c678971e91419deb7e249ec6cc10fc59baddce6fe
08:33:57.089 [main] INFO  tc.testcontainers/ryuk:0.5.1 - Container testcontainers/ryuk:0.5.1 started in PT0.57130536S
08:33:57.095 [main] INFO  org.testcontainers.utility.RyukResourceReaper - Ryuk started - will monitor and terminate Testcontainers containers on JVM exit
08:33:57.095 [main] INFO  org.testcontainers.DockerClientFactory - Checking the system...
08:33:57.095 [main] INFO  org.testcontainers.DockerClientFactory - ✔︎ Docker server version should be at least 1.6.0
08:33:57.096 [main] INFO  tc.rancher/k3s:v1.25.4-k3s1 - Creating container for image: rancher/k3s:v1.25.4-k3s1
08:33:57.796 [main] WARN  tc.rancher/k3s:v1.25.4-k3s1 - Reuse was requested but the environment does not support the reuse of containers
To enable reuse of containers, you must set 'testcontainers.reuse.enable=true' in a file located at /home/jenkins/.testcontainers.properties
08:33:59.040 [main] INFO  tc.rancher/k3s:v1.25.4-k3s1 - Container rancher/k3s:v1.25.4-k3s1 is starting: 0dc8dcf679c57b699885f29e713fbb57a897789129afd8486a69995d96e4f2a1
08:34:18.540 [main] INFO  tc.rancher/k3s:v1.25.4-k3s1 - Container rancher/k3s:v1.25.4-k3s1 started in PT21.444034897S
wind57 commented 10 months ago

Do you run parallel builds on jenkins too? How we do it on github?

ryanjbaxter commented 10 months ago

No its all run sequentially

wind57 commented 10 months ago

I am honestly confused how this works on jenkins tbh. In our integration tests, all of them have a @BeforeAll that issue a K3sContainer::start, where K3sContainer CONTAINER = new FixedPortsK3sContainer ... with fixed ports. So I would have expected any second integration test to fail with the error that you are seeing.

I am sure I miss something on how jenkins is set-up of what happens there :(

ryanjbaxter commented 10 months ago

Maybe it is that for most cases the container is shut down before the next test is run but we happen to uncover a situation now where that is not the case. I ignored the tests for now and the builds are passing. I am working through our 2023.0.0-RC1 release at the moment but once that is done I will revert the changes and try to create the testcontainers.properties file in our jenkins builds. I am guessing that will make the issue go away.

wind57 commented 10 months ago

I see, it was probably ryuk that was shutting down the containers in this case. Otherwise, I agree that creating a testcontainers.properties file with re-use enabled should fix it. This should also speed up tests a lot. Anyway, not going to bug you until you are done with the release.

wind57 commented 10 months ago

so you reverted the code and tests are now enabled. this issue can be closed as such?

ryanjbaxter commented 10 months ago

I made some changes to our Jenkins build so we create the .testcontainers.properties file. It seems like this worked, but now its failing for another reason. In any case I think we can close this one