testcontainers / testcontainers-java

Testcontainers is a Java library that supports JUnit tests, providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container.
https://testcontainers.org
MIT License

Execution gets stuck on ResourceReaper.start on linux host #3531

Open raviagarwal7 opened 3 years ago

raviagarwal7 commented 3 years ago

The JVM gets stuck and has to be terminated manually. This happens intermittently and is hard to reproduce. The issue only appears after upgrading to 1.15.0 (we were also seeing it on 1.15.0-rc2; 1.14.3 was working fine). Would appreciate any help with figuring out why this happens.

Environment information

$ docker version 
Client: Docker Engine - Community
 Version:           19.03.12
 API version:       1.40
 Go version:        go1.13.10
 Git commit:        48a66213fe
 Built:             Tue Jul 28 02:14:52 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Tue Jul 28 02:14:13 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.6
  GitCommit:        be75852b8d7849474a20192f9ed1bf34fdd454f1
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

$ uname -r
4.14.158

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian

Here is the JVM thread dump:

Test has timed out!  Here is a trace of what it is currently doing:
2020-11-23 13:41:46
Full thread dump OpenJDK 64-Bit Server VM (25.265-b01 mixed mode):

"Attach Listener" #1216 daemon prio=9 os_prio=0 tid=0x00007fbde8001000 nid=0x563 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
    - None

"testcontainers-ryuk" #1215 daemon prio=5 os_prio=0 tid=0x00007fbd80488000 nid=0xf592 in Object.wait() [0x00007fb8157d6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.testcontainers.utility.ResourceReaper.lambda$null$1(ResourceReaper.java:142)
    - locked <0x000000073a8214c8> (a java.util.ArrayList)
    at org.testcontainers.utility.ResourceReaper$$Lambda$589/432240114.run(Unknown Source)
    at org.rnorth.ducttape.ratelimits.RateLimiter.doWhenReady(RateLimiter.java:27)
    at org.testcontainers.utility.ResourceReaper.lambda$start$2(ResourceReaper.java:133)
    at org.testcontainers.utility.ResourceReaper$$Lambda$588/1491766901.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

"docker-java-stream-2071761001" #1214 daemon prio=5 os_prio=0 tid=0x00007fbd80377800 nid=0xf568 runnable [0x00007fb8153d2000]
   java.lang.Thread.State: RUNNABLE
    at org.testcontainers.shaded.okio.Buffer.getByte(Buffer.java:312)
    at org.testcontainers.shaded.okio.RealBufferedSource.readHexadecimalUnsignedLong(RealBufferedSource.java:310)
    at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.java:492)
    at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.java:471)
    at org.testcontainers.shaded.okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.java:286)
    at org.testcontainers.shaded.okio.RealBufferedSource$1.read(RealBufferedSource.java:436)
    at org.testcontainers.shaded.com.github.dockerjava.core.FramedInputStreamConsumer.accept(FramedInputStreamConsumer.java:30)
    at org.testcontainers.shaded.com.github.dockerjava.core.FramedInputStreamConsumer.accept(FramedInputStreamConsumer.java:12)
    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:275)
    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder$$Lambda$578/715686037.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

"OkHttp ConnectionPool" #1206 daemon prio=5 os_prio=0 tid=0x00007fb76003c000 nid=0xf32b in Object.wait() [0x00007fb8155d4000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:460)
    at org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool.lambda$new$0(RealConnectionPool.java:62)
    - locked <0x000000070f5aed10> (a org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool)
    at org.testcontainers.shaded.okhttp3.internal.connection.RealConnectionPool$$Lambda$550/880380434.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - <0x000000073df35e20> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"spatialIndexComponent-258" #1051 daemon prio=5 os_prio=0 tid=0x00007fbd80139000 nid=0xf0cc waiting on condition [0x00007fb815ad9000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0000000710880420> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
    - None

"spatialIndexComponent-257"    
.
.
.
"spatialIndexComponent-0"

"DelegateRunnerWithTimeout-1" #28 prio=5 os_prio=0 tid=0x00007fbfdd9dd000 nid=0xca86 runnable [0x00007fbda08af000]
   java.lang.Thread.State: RUNNABLE
    at org.testcontainers.shaded.okio.Buffer.indexOf(Buffer.java:1463)
    at org.testcontainers.shaded.okio.RealBufferedSource.indexOf(RealBufferedSource.java:352)
    at org.testcontainers.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:230)
    at org.testcontainers.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:224)
    at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.java:489)
    at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.java:471)
    at org.testcontainers.shaded.okhttp3.internal.Util.skipAll(Util.java:204)
    at org.testcontainers.shaded.okhttp3.internal.Util.discard(Util.java:186)
    at org.testcontainers.shaded.okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.close(Http1ExchangeCodec.java:511)
    at org.testcontainers.shaded.okio.ForwardingSource.close(ForwardingSource.java:43)
    at org.testcontainers.shaded.okhttp3.internal.connection.Exchange$ResponseBodySource.close(Exchange.java:313)
    at org.testcontainers.shaded.okio.RealBufferedSource.close(RealBufferedSource.java:476)
    at org.testcontainers.shaded.okhttp3.internal.Util.closeQuietly(Util.java:139)
    at org.testcontainers.shaded.okhttp3.ResponseBody.close(ResponseBody.java:192)
    at org.testcontainers.shaded.okhttp3.Response.close(Response.java:290)
    at org.testcontainers.shaded.com.github.dockerjava.okhttp.OkDockerHttpClient$OkResponse.close(OkDockerHttpClient.java:280)
    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$null$0(DefaultInvocationBuilder.java:272)
    at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder$$Lambda$579/1129446976.close(Unknown Source)
    at com.github.dockerjava.api.async.ResultCallbackTemplate.close(ResultCallbackTemplate.java:77)
    at org.testcontainers.utility.ResourceReaper.start(ResourceReaper.java:177)
    at org.testcontainers.DockerClientFactory.client(DockerClientFactory.java:203)
    - locked <0x0000000710684178> (a [Ljava.lang.Object;)
    at org.testcontainers.LazyDockerClient.getDockerClient(LazyDockerClient.java:14)
    at org.testcontainers.LazyDockerClient.createContainerCmd(LazyDockerClient.java:12)
    at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:355)
    at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:322)
    at org.testcontainers.containers.GenericContainer$$Lambda$534/2098911188.call(Unknown Source)
    at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:320)
    at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:308)
    at com.uber.fievel.testing.mysql.UberMariaDBContainer.start(UberMariaDBContainer.java:56)
    at org.testcontainers.containers.GenericContainer.starting(GenericContainer.java:1021)
    at org.testcontainers.containers.FailureDetectingExternalResource$1.evaluate(FailureDetectingExternalResource.java:29)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
raviagarwal7 commented 3 years ago

https://github.com/apache/geode/pull/5804#issuecomment-738050602

> @jdeppe-pivotal thanks! We will release 1.15.1 soon that updates docker-java with some changes to Socket read operations, I hope it will solve this issue, so please give it a try once released! :)

@bsideup I see that 1.15.1 was released 5 days ago. Do we know if this issue was resolved?

bsideup commented 3 years ago

@raviagarwal7 I guess it is better to ask @jdeppe-pivotal whether it helped them or not (I hope it did! :))

raviagarwal7 commented 3 years ago

Thanks, @bsideup. I have asked @jdeppe-pivotal.

I will also try out 1.15.1 and report back if it fixes the issue for us.

raviagarwal7 commented 3 years ago

I tried 1.15.1 and the issue still exists.

The same is reported by @jdeppe-pivotal: https://github.com/apache/geode/pull/5804#issuecomment-746446587

gabor-hargitai-privitar commented 3 years ago

I ran into the same issue; starting Ryuk seems to just hang on me:

[20:29:14] :     [Step 1/1] 2021-01-18 20:29:14,669  INFO | main | o.t.dockerclient.DockerClientProviderStrategy           | Loaded org.testcontainers.dockerclient.UnixSocketClientProviderStrategy from ~/.testcontainers.properties, will try it first 
[20:29:15] :     [Step 1/1] 2021-01-18 20:29:15,293  INFO | main | o.t.dockerclient.DockerClientProviderStrategy           | Found Docker environment with local Unix socket (unix:///var/run/docker.sock) 
[20:29:15] :     [Step 1/1] 2021-01-18 20:29:15,294  INFO | main | org.testcontainers.DockerClientFactory                  | Docker host IP address is localhost 
[20:29:15] :     [Step 1/1] 2021-01-18 20:29:15,321  INFO | main | org.testcontainers.DockerClientFactory                  | Connected to docker: 
[20:29:15] :     [Step 1/1]   Server Version: 18.09.0
[20:29:15] :     [Step 1/1]   API Version: 1.39
[20:29:15] :     [Step 1/1]   Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
[20:29:15] :     [Step 1/1]   Total Memory: 31211 MB 
[20:29:15] :     [Step 1/1] 2021-01-18 20:29:15,325  INFO | main | org.testcontainers.utility.ImageNameSubstitutor         | Image name substitution will be performed by: DefaultImageNameSubstitutor (composite of 'ConfigurationFileImageNameSubstitutor' and 'PrefixingImageNameSubstitutor') 
[20:29:15]i:     [Step 1/1] Docker event: {"status":"create","id":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","from":"testcontainers/ryuk:0.3.0","Type":"container","Action":"create","Actor":{"ID":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","Attributes":{"image":"testcontainers/ryuk:0.3.0","name":"testcontainers-ryuk-53ff9e8d-c141-45c9-a9a3-f2e9bd9e4741","org.testcontainers":"true"}},"scope":"local","time":1611001755,"timeNano":1611001755485478257}
[20:29:15]i:     [Step 1/1] Create docker info file: /home/ec2-user/buildAgent/temp/buildTmp/.teamcity/docker/build_73/events.json
[20:29:15]i:     [Step 1/1] Docker event: {"status":"start","id":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","from":"testcontainers/ryuk:0.3.0","Type":"container","Action":"start","Actor":{"ID":"bfef100aecc22dff2c2193685593f6c42172fccdf365edcb9df940035b64208b","Attributes":{"image":"testcontainers/ryuk:0.3.0","name":"testcontainers-ryuk-53ff9e8d-c141-45c9-a9a3-f2e9bd9e4741","org.testcontainers":"true"}},"scope":"local","time":1611001755,"timeNano":1611001755779756677}
[00:00:17]E:     [Step 1/1] The build Release Process Builds::Test Release #73 {buildId=796001} has been running for more than 240 minutes. Terminating...

This is with Testcontainers 1.15.1.

gabor-hargitai-privitar commented 3 years ago

This might be a duplicate of https://github.com/testcontainers/testcontainers-java/issues/3183

jsmrcka commented 3 years ago

I am using 1.15.1 and also running into this issue. The behavior is basically the same as what @raviagarwal7 describes.

For now, the workaround for us would be to downgrade to 1.14.3, although that brings its own challenges.

Can't wait to see a fix coming out.

EDIT: Details of our setup: we run a Jenkins pipeline on a repo containing 70+ modules, many of them using Testcontainers. Typical Jenkins node setup: CPUs: 4, RAM: 8192 MB, Disk: 40 GB, SWAP: 0 MB, Ephemeral: 0 GB. An alternative way to reproduce is to run a single module (the one we tested uses a PostgreSQL test container) many times in a row. With 50 iterations, I would guess the incidence of the error rises above 80%.

jsmrcka commented 3 years ago

Updated to 1.15.2. Unfortunately, the issue still happens. Same setup as in my previous comment.

bsideup commented 3 years ago

@jsmrcka could you please try the httpclient5 transport? You can enable it by setting transport.type=httpclient5 in your config (see the available locations here: https://www.testcontainers.org/features/configuration/ )
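For example, a minimal sketch assuming the per-user config file (other supported locations are listed on that page):

$ cat ~/.testcontainers.properties
transport.type=httpclient5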

jsmrcka commented 3 years ago

> @jsmrcka could you please try the httpclient5 transport? You can enable it by setting transport.type=httpclient5 in your config (see the available locations here: https://www.testcontainers.org/features/configuration/ )

I tried to run quite a few tests on 1.15.2 with transport.type=httpclient5 and was not able to reproduce the issue so far. It seems like an effective workaround, thanks for the tip.

@bsideup Is this setting documented somewhere? I could not find anything but a short mention in 1.15.0 release notes (https://github.com/testcontainers/testcontainers-java/releases/tag/1.15.0).

malcyon commented 3 years ago

I think I might be getting this as well. We run about 70-80 builds a day on Jenkins, and anywhere from 1-5 builds a day hang forever while running our integration tests.

We've been on 1.15.1 for a while and were getting this. I updated to 1.15.2 yesterday, and we're still getting Testcontainers hangs.

I can see in the Surefire reports that the test completes, but the build never progresses past whatever test it decides to hang on. If I log in to the build server and run docker ps, I just see the Ryuk container.

I've set TESTCONTAINERS_TRANSPORT_TYPE="httpclient5" as an environment variable to see if that does anything, and I'll try to capture a stacktrace if it happens again.
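For reference, the exact line I added to the agent environment (assuming a POSIX shell; Testcontainers maps the transport.type property to this upper-cased, underscore-separated variable):

export TESTCONTAINERS_TRANSPORT_TYPE=httpclient5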

malcyon commented 3 years ago

No hung jobs today at all. That's very unusual. The transport.type fix seems good so far.

jsmrcka commented 3 years ago

Quick update: after a few more runs, 1.15.2 with transport.type=httpclient5 works fine.

malcyon commented 3 years ago

I have also not had the issue recur. I am very happy about this. httpclient5 is now my favorite transport type.

bsideup commented 3 years ago

Thanks everyone for reporting back! We will switch to httpclient5 soon, and, thanks to your input, we have more confidence in it, as it seems to be not only 100% compatible with everything we're doing, but also more reliable 👍

jonathan-meier commented 3 years ago

This is a race condition in the docker-java library between two threads reading the same okio.RealBufferedSource obtained from an okhttp3.Response. The two threads are thread #1214 (docker-java-stream-2071761001) and thread #28 (DelegateRunnerWithTimeout-1) in the thread dump of the issue description. Both threads read simultaneously from the buffered source, which is not thread-safe; this corrupts the buffer's internal bookkeeping and leaves both of them in an infinite loop.
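To illustrate the failure mode (a self-contained toy, not okio's actual internals, which keep a linked list of segments rather than a flat cursor): a buffered source does unsynchronized check-then-act bookkeeping, and two racing readers lose updates and hand out bytes that were already consumed:

import java.util.concurrent.atomic.AtomicLong;

public class SharedBufferRace {
    static final int LIMIT = 1_000_000; // bytes "available" in the buffer
    static int pos = 0;                 // unsynchronized read cursor (toy analogue of the source's state)

    static int read() {
        if (pos >= LIMIT) return -1; // bounds check...
        return pos++;                // ...then act: not atomic across threads
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong delivered = new AtomicLong();
        Runnable reader = () -> { while (read() != -1) delivered.incrementAndGet(); };
        Thread t1 = new Thread(reader);
        Thread t2 = new Thread(reader);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // A thread-safe source would deliver exactly LIMIT reads in total;
        // two racing readers lose pos++ updates and overshoot.
        System.out.println("delivered=" + delivered.get() + ", expected=" + LIMIT);
    }
}

With okio's more complex segment bookkeeping, the same kind of interleaving can leave the internal state inconsistent enough that both readers spin forever instead of merely over-reading.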

The issue is triggered in the ResourceReaper with the logCallback initialized here: https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L114-L124

The .exec invocation internally spawns a thread named docker-java-stream-... in the docker-java library that starts streaming the response of the log container command: https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/DefaultInvocationBuilder.java#L262-L284

The callback parameter contains the same object as logCallback in the ResourceReaper above. The DockerHttpClient.Response retrieved on line 269 is handed to the sourceConsumer. The sourceConsumer is a FramedInputStreamConsumer (https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-core/src/main/java/com/github/dockerjava/core/FramedInputStreamConsumer.java) that was initialized with the same callback. The consumer starts reading the response stream and invokes onNext (ResourceReaper line 121 above) with each new frame read from the stream.

Additionally, the DockerHttpClient.Response retrieved on line 269 is also used in the lambda that is passed to the callback via the onStart method. This lambda is called when the close method of the callback is invoked (the lambda is the instance of the Closeable functional interface):

https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L36-L40 https://github.com/docker-java/docker-java/blob/f9d2db6efff4d7fc7092b53209027d6846de1894/docker-java-api/src/main/java/com/github/dockerjava/api/async/ResultCallbackTemplate.java#L72-83
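The shape of that pattern, paraphrased as a sketch (hypothetical names, not docker-java's exact code):

import java.io.Closeable;
import java.io.IOException;

class CallbackSketch {
    private Closeable dispose; // set via onStart, e.g. () -> response.close()

    void onStart(Closeable dispose) { this.dispose = dispose; }

    // Runs the stored lambda on whichever thread calls close() --
    // here, the thread that issued the log container command.
    void close() throws IOException {
        if (dispose != null) {
            dispose.close();
        }
    }
}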

Now to close the circle, while the response is being streamed by the streaming thread, the callback is eventually closed in the ResourceReaper by the thread that issued the log container command: https://github.com/testcontainers/testcontainers-java/blob/f588082636dc2749f4129e943b02ede62527b41c/core/src/main/java/org/testcontainers/utility/ResourceReaper.java#L205

As seen above, this closes the DockerHttpClient.Response via the lambda, which internally reads the buffered source (see the stack trace of thread #28). Together with the streaming thread, that makes two threads reading the same buffered source simultaneously.
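Modeled as a sketch (hypothetical names; the toy Response.close() discards the remaining body by reading it, which is what OkHttp's ChunkedSource.close() does via Util.discard):

import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

class Response implements AutoCloseable {
    private final InputStream body;
    Response(InputStream body) { this.body = body; }

    int readFrame() throws IOException { return body.read(); }

    @Override public void close() throws IOException {
        while (body.read() != -1) { } // discard-on-close READS the body
        body.close();
    }
}

public class TwoThreadsOneResponse {
    public static void main(String[] args) throws Exception {
        PipedOutputStream server = new PipedOutputStream();
        Response response = new Response(new PipedInputStream(server));

        // "docker-java-stream-...": streams frames until end of stream
        Thread streamer = new Thread(() -> {
            try { while (response.readFrame() != -1) { } }
            catch (IOException ignored) { }
        });
        streamer.start();

        server.write(new byte[] {1, 2, 3});
        server.close();
        // The issuing thread closes the callback, which closes the response;
        // because close() reads, two threads now read the same source.
        // (PipedInputStream is synchronized, so this toy survives it;
        // okio's RealBufferedSource is not, hence the hang.)
        response.close();
        streamer.join();
    }
}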

The issue was most likely introduced by this PR: https://github.com/docker-java/docker-java/pull/1421.

I'm not entirely sure how to fix it, though. Ideally, I guess, the streaming thread should close the response itself and the main thread should only cancel the request, so that the response stream finishes and the streaming thread stops reading.
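Something in this direction, perhaps (a rough sketch under my own naming, not a patch): the streaming thread is the sole owner of the response and the only thread that ever closes it, while the requester just flips a cancellation flag:

import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

class CancellableStreamer {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    void cancel() { cancelled.set(true); } // safe to call from any thread

    void streamLoop(InputStream response) { // runs on the streaming thread only
        try {
            int b;
            while (!cancelled.get() && (b = response.read()) != -1) {
                // hand the frame to the callback's onNext(...)
            }
        } catch (IOException e) {
            // stream aborted
        } finally {
            // close() may read to discard the body, but only this
            // thread ever reads, so the source is never shared
            try { response.close(); } catch (IOException ignored) { }
        }
    }
}

The remaining gap is a read() already blocked on the socket, which won't see the flag; the requester would also need a way to abort the underlying HTTP call itself.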

(embedding code snippets from another repository apparently doesn't work 😢)

pouyan021 commented 1 year ago

Hello all, we are using 1.17.6 with Spring Boot and Kotlin, and one of our test classes hangs in the Bitbucket Cloud pipeline. It also happens when running locally on macOS Ventura. We had to disable the problematic test cases for the time being. There are no errors in the stack trace; we just had to kill the JVM manually.

sumbach commented 9 months ago

I recently observed this behavior with testcontainers 1.15.2 (with the default okhttp transport) and found that upgrading to testcontainers 1.16.0 (and using its new default httpclient5 transport) immediately resolved the issue.

Can we consider this issue resolved/closed by #4287 (changing the default transport to httpclient5) and/or #5113 (removing the okhttp transport)?