Open raviagarwal7 opened 3 years ago
https://github.com/apache/geode/pull/5804#issuecomment-738050602
@jdeppe-pivotal thanks! We will release 1.15.1 soon, which updates docker-java with some changes to Socket read operations. I hope it will solve this issue, so please give it a try once released! :)
@bsideup I see that 1.15.1 was released 5 days ago. Do we know if this issue was resolved?
@raviagarwal7 I guess it is better to ask @jdeppe-pivotal whether it helped them or not (I hope it did! :))
Thanks, @bsideup. I have asked @jdeppe-pivotal.
I will also try out 1.15.1 and report back if it fixes the issue for us.
I tried on 1.15.1 and the issue still exists.
The same is the case reported by @jdeppe-pivotal: https://github.com/apache/geode/pull/5804#issuecomment-746446587
I ran into the same issue; starting Ryuk seems to just hang on me:
[20:29:14] : [Step 1/1] 2021-01-18 20:29:14,669 INFO | main | o.t.dockerclient.DockerClientProviderStrategy | Loaded org.testcontainers.dockerclient.UnixSocketClientProviderStrategy from ~/.testcontainers.properties, will try it first
[20:29:15] : [Step 1/1] 2021-01-18 20:29:15,293 INFO | main | o.t.dockerclient.DockerClientProviderStrategy | Found Docker environment with local Unix socket (unix:///var/run/docker.sock)
[20:29:15] : [Step 1/1] 2021-01-18 20:29:15,294 INFO | main | org.testcontainers.DockerClientFactory | Docker host IP address is localhost
[20:29:15] : [Step 1/1] 2021-01-18 20:29:15,321 INFO | main | org.testcontainers.DockerClientFactory | Connected to docker:
[20:29:15] : [Step 1/1] Server Version: 18.09.0
[20:29:15] : [Step 1/1] API Version: 1.39
[20:29:15] : [Step 1/1] Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
[20:29:15] : [Step 1/1] Total Memory: 31211 MB
[20:29:15] : [Step 1/1] 2021-01-18 20:29:15,325 INFO | main | org.testcontainers.utility.ImageNameSubstitutor | Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor')
[20:29:15]i: [Step 1/1] Docker event: {"status":"create","id":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","from":"testcontainers/ryuk:0.3.0","Type":"container","Action":"create","Actor":{"ID":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","Attributes":{"image":"testcontainers/ryuk:0.3.0","name":"testcontainers-ryuk-53ff9e8d-c141-45c9-a9a3-f2e9bd9e4741","org.testcontainers":"true"}},"scope":"local","time":1611001755,"timeNano":1611001755485478257}
[20:29:15]i: [Step 1/1] Create docker info file: /home/ec2-user/buildAgent/temp/buildTmp/.teamcity/docker/build_73/events.json
[20:29:15]i: [Step 1/1] Docker event: {"status":"start","id":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","from":"testcontainers/ryuk:0.3.0","Type":"container","Action":"start","Actor":{"ID":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","Attributes":{"image":"testcontainers/ryuk:0.3.0","name":"testcontainers-ryuk-53ff9e8d-c141-45c9-a9a3-f2e9bd9e4741","org.testcontainers":"true"}},"scope":"local","time":1611001755,"timeNano":1611001755779756677}
[00:00:17]E: [Step 1/1] The build Release Process Builds::Test Release #73 {buildId=796001} has been running for more than 240 minutes. Terminating...
This is with testcontainers 1.15.1.
This might be a duplicate of https://github.com/testcontainers/testcontainers-java/issues/3183
I am using 1.15.1 and also running into this issue. The behavior is basically the same as what @raviagarwal7 describes.
For now, the workaround for us would be to downgrade to 1.14.3, although that brings its own challenges.
Can't wait to see a fix coming out.
EDIT: Details of our setup: Running a Jenkins pipeline on a repo containing 70+ modules, many of them using Testcontainers. Typical Jenkins node setup: CPUs: 4, RAM: 8192MB, Disk: 40GB, SWAP: 0MB, Ephemeral: 0GB. An alternative way to reproduce is to run a single module (the one we tested uses the PostgreSQL test container) many times in a row. With 50 iterations, the incidence of the error rises above 80%, I would guess.
Updated to 1.15.2. Unfortunately, the issue still happens. Same setup as in my previous comment.
@jsmrcka could you please try the httpclient5 transport? You can enable it by setting transport.type=httpclient5 in your config (see the available locations here: https://www.testcontainers.org/features/configuration/).
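For reference, a minimal sketch of enabling this via ~/.testcontainers.properties (one of the config locations described on the page linked above; the same setting can also be supplied as the TESTCONTAINERS_TRANSPORT_TYPE environment variable, as a later comment in this thread does):

```properties
# ~/.testcontainers.properties
transport.type=httpclient5
```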
I tried to run quite a few tests on 1.15.2 with transport.type=httpclient5 and was not able to reproduce the issue so far. It seems like an effective workaround, thanks for the tip.
@bsideup Is this setting documented somewhere? I could not find anything but a short mention in the 1.15.0 release notes (https://github.com/testcontainers/testcontainers-java/releases/tag/1.15.0).
I think I might be getting this as well. We run about 70-80 builds a day on Jenkins, and we get anywhere from 1-5 builds a day that hang forever running our integration tests.
We've been on 1.15.1 for a while and were getting this. I updated to 1.15.2 yesterday, and we're still getting Testcontainers hangs.
I can see in the Surefire reports that the test completes, but the build never progresses past whatever test it decides to hang on. If I log in to the build server and run docker ps, I just see the Ryuk container.
I've set TESTCONTAINERS_TRANSPORT_TYPE="httpclient5" as an environment variable to see if that does anything, and I'll try to capture a stack trace if it happens again.
No hung jobs today at all. That's very unusual. The transport.type fix seems good so far.
Quick update: after a few more runs, 1.15.2 with transport.type=httpclient5 works fine.
I have also not had the issue recur. I am very happy about this. httpclient5 is now my favorite transport type.
Thanks everyone for reporting back! We will switch to httpclient5 soon, and, thanks to your input, we now have more confidence in it: it seems not only to be 100% compatible with everything we're doing, but also more reliable 👍
This is a race condition in the docker-java library between two threads reading the same okio.RealBufferedSource obtained from an okhttp3.Response. The two threads are thread #1214 (docker-java-stream-2071761001) and thread #28 (DelegateRunnerWithTimeout-1) in the thread dump of the issue description. Both threads read simultaneously from the buffered source, which is not thread-safe; this corrupts the buffer's internal bookkeeping and leads to an infinite loop in both of them.
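For illustration only, this is the shape of that race in a minimal, self-contained form (assuming okio is on the classpath, e.g. via okhttp; this does not reproduce the hang deterministically, it only shows the unsafe pattern of two threads pulling from one shared buffered source):

```java
import okio.BufferedSource;
import okio.Okio;

import java.io.ByteArrayInputStream;

public class SharedSourceRaceSketch {
    public static void main(String[] args) throws Exception {
        // One shared, non-thread-safe buffered source (Okio.buffer returns a RealBufferedSource).
        BufferedSource shared = Okio.buffer(Okio.source(new ByteArrayInputStream(new byte[1 << 20])));

        // Plays the role of docker-java-stream-* (thread #1214): keeps pulling from the source.
        Thread streamer = new Thread(() -> {
            try {
                while (!shared.exhausted()) {
                    shared.readByte();
                }
            } catch (Exception ignored) {
            }
        }, "streamer");

        // Plays the role of thread #28: reads the same source concurrently while shutting down.
        Thread closer = new Thread(() -> {
            try {
                shared.readByteArray();
            } catch (Exception ignored) {
            }
        }, "closer");

        streamer.start();
        closer.start();
        streamer.join();
        closer.join();
    }
}
```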
The issue is triggered in the ResourceReaper with the logCallback initialized here: https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L114-L124
The .exec invocation internally spawns a thread named docker-java-stream-... in the docker-java library that starts streaming the response of the log container command:
https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/DefaultInvocationBuilder.java#L262-L284
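The pattern at the ResourceReaper lines linked above looks roughly like this (a simplified sketch using the public docker-java API, not the verbatim Testcontainers code); the exec(...) call is what ends up spawning the docker-java-stream-* thread that reads the response and feeds onNext:

```java
import com.github.dockerjava.api.DockerClient;
import com.github.dockerjava.api.async.ResultCallback;
import com.github.dockerjava.api.model.Frame;

class LogFollowSketch {

    static ResultCallback.Adapter<Frame> followLogs(DockerClient dockerClient, String containerId) {
        ResultCallback.Adapter<Frame> logCallback = new ResultCallback.Adapter<Frame>() {
            @Override
            public void onNext(Frame frame) {
                // Invoked by the docker-java-stream-* thread for every frame read from the response.
                System.out.print(new String(frame.getPayload()));
            }
        };

        // exec(...) returns immediately; the actual streaming happens on the spawned thread.
        dockerClient.logContainerCmd(containerId)
                .withFollowStream(true)
                .withStdOut(true)
                .withStdErr(true)
                .exec(logCallback);

        return logCallback;
    }
}
```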
The callback parameter contains the same object as logCallback in the ResourceReaper above. The DockerHttpClient.Response retrieved on line 269 is handed to the sourceConsumer. The sourceConsumer is a FramedInputStreamConsumer (https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/FramedInputStreamConsumer.java) that was initialized with the same callback. The consumer starts reading the response stream and invokes onNext (ResourceReaper line 121 above) with each new frame read from the stream.
Additionally, the DockerHttpClient.Response retrieved on line 269 is also used in the lambda that is passed into the callback via the onStart method. This lambda will be called when the close method of the callback is invoked (it instantiates the Closeable functional interface):
https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L36-L40
https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L72-83
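For readers who cannot follow the permalinks, the onStart/close interplay described here boils down to roughly this shape (a simplified, hypothetical sketch of the ResultCallbackTemplate pattern, not the real class):

```java
import java.io.Closeable;
import java.io.IOException;

abstract class ResultCallbackSketch implements Closeable {

    private Closeable stream; // in docker-java this ends up referring to the response (e.g. response::close)

    // Called by the docker-java-stream-* thread once the response is available.
    public void onStart(Closeable stream) {
        this.stream = stream;
    }

    // Called by whichever thread decides to stop the callback (thread #28 in the dump above).
    // Closing the stream here touches the same response the streaming thread is still reading.
    @Override
    public void close() throws IOException {
        if (stream != null) {
            stream.close();
        }
    }
}
```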
Now, to close the circle: while the response is being streamed by the streaming thread, the callback is eventually closed in the ResourceReaper by the thread that issued the log container command:
https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L205
As seen above, this closes the DockerHttpClient.Response in the lambda, which internally reads the buffered source (see the stack trace of thread #28). Together with the streaming thread, this makes two threads reading simultaneously from the buffered source.
The issue was most likely introduced by this PR: https://github.com/docker-java/docker-java/pull/1421.
I'm not entirely sure how to fix it though. I guess ideally, the streaming thread should close the response itself and the main thread should only cancel the request, such that the response stream is finished and the streaming thread stops reading.
(embedding code snippets from another repository apparently doesn't work 😢)
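A rough sketch of that fix idea (hypothetical names, not an actual docker-java patch): the streaming thread is the only one that ever reads and closes the response, while other threads merely signal cancellation. A real fix would additionally have to abort the in-flight request so that a read blocked on the socket returns.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

class SingleReaderStreamingSketch implements Closeable {

    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void stream(InputStream response) {
        Thread streamer = new Thread(() -> {
            // Only this thread reads the response, and it also closes it (try-with-resources).
            try (InputStream in = response) {
                byte[] buf = new byte[8192];
                while (!cancelled.get() && in.read(buf) != -1) {
                    // decode frames and hand them to the callback here
                }
            } catch (IOException ignored) {
                // a cancelled/aborted request typically surfaces here as an IOException
            }
        }, "docker-java-stream-sketch");
        streamer.start();
    }

    @Override
    public void close() {
        // Other threads only request cancellation; they never touch the response themselves.
        cancelled.set(true);
    }
}
```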
Hello all, we are using 1.17.6 with Spring Boot and Kotlin, and one of our test classes hangs in the Bitbucket Cloud pipeline. It also happens when running locally on macOS Ventura. We had to disable the problematic test cases for the time being. There are no errors in the stack trace; we just had to kill the instance manually.
I recently observed this behavior with testcontainers 1.15.2 (with the default okhttp transport) and found that upgrading to testcontainers 1.16.0 (and using the new default httpclient5 transport) immediately resolved the issue.
Can we consider this issue resolved/closed by #4287 (changing the default transport to httpclient5) and/or #5113 (removing the okhttp transport)?
The JVM gets stuck and has to be terminated manually. This happens intermittently and is hard to reproduce. The issue only started after upgrading to 1.15.0 (we were also seeing it on 1.15.0-rc2); 1.14.3 was working fine. Would appreciate any help with figuring out why this happens.
Environment information
Here is the JVM thread dump