tswift242 closed this issue 7 years ago.
/cc @tonistiigi @estesp
Wanted to mention that the "context deadline exceeded" error is affecting execs as well as creates/starts. In the docker log file I attached above, you'll see both
"Create container failed with error: shim error: containerd-shim: opening /var/run/docker/libcontainerd/1367eefd0a921b11d29b72b3ae9b7a69ec63ed9287f37316b4ff6cc75515a205/init-stdout failed: context deadline exceeded"
and also
"Error running exec in container: rpc error: code = 2 desc = shim error: containerd-shim: opening /var/run/docker/libcontainerd/a51d471bd301f8391791526f09d9bf14cdffc2664dc6810d2c042f25d5ebd171/8633e3c0cec2933cade81cd15b90fe15b718c0c99bdc03653d9a251989509678-stdout failed: context deadline exceeded"
We do a docker exec to run a script in each container immediately after it's restored.
Ping @thaJeztah. Can someone please take a look at this? Really interested to know whether there's something Docker can do to prevent or mitigate this, or something we can do on our end to mitigate it.
"context deadline exceeded" is quite generic, but possibly @mlaventure has some ideas for debugging
Is the "stream copy error: reading from a closed fifo" (with stack trace) less generic? That always precedes "context deadline exceeded".
oh, you're right, I see I pinged @tonistiigi about that. You meant this stack trace, correct?
level=error msg="stream copy error: reading from a closed fifo
github.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read
/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142
bufio.(*Reader).fill
/usr/local/go/src/bufio/bufio.go:97
bufio.(*Reader).WriteTo
/usr/local/go/src/bufio/bufio.go:472
io.copyBuffer
/usr/local/go/src/io/io.go:380
io.Copy
/usr/local/go/src/io/io.go:360
github.com/docker/docker/pkg/pools.Copy
/go/src/github.com/docker/docker/pkg/pools/pools.go:60
github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
/go/src/github.com/docker/docker/container/stream/streams.go:119
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2086"
Have you also tried upgrading to 1.13.0-rc4? Not sure if there were changes in this area, but worth checking.
Errors are output here: container/stream/streams.go#L120, vendor/github.com/tonistiigi/fifo/fifo.go#L142
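To make the "reading from a closed fifo" part concrete, here is a minimal, self-contained Go sketch (not the daemon's actual code) of the failure shape those two locations produce: a copier goroutine is blocked reading one end of a pipe when that end is closed underneath it. The real daemon reads through the tonistiigi/fifo package, whose Read reports "reading from a closed fifo"; the plain *os.File used here reports "file already closed" instead.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	r, w, err := os.Pipe() // stand-in for the container's stdout FIFO
	if err != nil {
		panic(err)
	}

	errCh := make(chan error, 1)
	go func() {
		// Daemon-side stream copy: blocks until data, EOF, or an error arrives.
		_, err := io.Copy(io.Discard, r)
		errCh <- err
	}()

	// Give the copier time to block in Read, then tear down the read end the
	// way the daemon does when the shim side never shows up.
	time.Sleep(100 * time.Millisecond)
	r.Close()
	w.Close()

	// With Go 1.9+ (pipe fds are poller-backed) the pending Read is unblocked
	// and io.Copy surfaces the "closed" error instead of a clean EOF.
	fmt.Println("stream copy error:", <-errCh)
}
```

The point is just that the copier's error is a symptom of its FIFO being torn down, not of anything wrong with the data itself.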
@thaJeztah Yes, that's the stack trace and error I mean. Both with my actual application and the toy example I provided, I always see about a dozen instances of that error in the daemon logs before I start seeing instances of "context deadline exceeded".
I have not tried 1.13.0-rc4 yet. It didn't seem like it had any fixes in this space, but I could be wrong.
The deadline being exceeded is this one: https://github.com/docker/containerd/blob/bb338ce1b949372986f784f491301caa4cc2e5f9/containerd-shim/process_linux.go#L33
This is most likely occurring due to a lack of CPU time being allocated to that program, or the kernel being contended on a lock in the open syscall.
Unfortunately, as of now your best option is to limit how many restores run concurrently, so as not to tax the CPU / filesystem path too much.
Regarding the criu error, it looks like you may be calling restore before all the mountpoints that were present at the time of the checkpoint creation are mounted again.
@mlaventure I don't think CPU exhaustion is the cause of this. Using both AWS CloudWatch and top, I've confirmed that the CPU usage isn't ever close to 100% on the instance when these concurrent restores are failing. I've got a screenshot of the CloudWatch CPU utilization below. The "stream copy error" and "context deadline exceeded" errors both started occurring at 16:39 UTC according to the daemon logs. The CPU utilization at this point is less than 12%.
@mlaventure As far as locking contention on the open syscall, is that applicable here? I would assume that the criu image files are being opened read-only, and so wouldn't be subject to contention. Is this assumption incorrect? Or, what else is being opened that is subject to locking contention? Note that I can start 45 of these same containers concurrently without any of these issues when I don't use C/R. So the fact that a start from a checkpoint is happening, rather than just a normal start, matters.
I don't think it would, unless there's a global lock for FIFO. Just mentioned it there in case. My bet is on scheduling :-)
@tswift242 Just saw your graph. It indeed doesn't look like the CPU is the cause.
containerd-shim is trying to open this file in WRONLY, meaning it'll block until the other side opens it in RDONLY. For some reason this doesn't happen within 15 secs (those 15 secs currently cover the opening of all 3 IOs, btw).
I'll try to spend some time with @tonistiigi to think about it :)
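To make the mechanism above concrete, here is a rough sketch (assumed from the linked process_linux.go, not the shim's actual code) of opening a stdio FIFO under a deadline: the blocking WRONLY open runs in a goroutine, and the caller gives up once the context deadline expires, yielding an "opening ... failed: context deadline exceeded" message. The path and the shortened deadline are placeholders for the demo.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"syscall"
	"time"
)

// openFifoWithDeadline opens path with the given flags, but gives up once ctx
// expires. O_WRONLY on a FIFO blocks until a reader opens the other end.
func openFifoWithDeadline(ctx context.Context, path string, flag int) (*os.File, error) {
	type result struct {
		f   *os.File
		err error
	}
	ch := make(chan result, 1)
	go func() {
		f, err := os.OpenFile(path, flag, 0) // blocks until the other side opens
		ch <- result{f, err}
	}()
	select {
	case r := <-ch:
		return r.f, r.err
	case <-ctx.Done():
		return nil, fmt.Errorf("opening %s failed: %v", path, ctx.Err())
	}
}

func main() {
	const path = "/tmp/example-stdout"
	_ = syscall.Mkfifo(path, 0600) // a FIFO with no reader, so the WRONLY open blocks
	defer os.Remove(path)

	// The shim budgets 15 seconds across all three stdio FIFOs; a shorter
	// deadline is used here only so the demo fails quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := openFifoWithDeadline(ctx, path, os.O_WRONLY); err != nil {
		fmt.Println(err) // "opening /tmp/example-stdout failed: context deadline exceeded"
	}
}
```

If the other side never opens its end of the FIFO within the budget, the deadline error is all the shim can report, which is why the message is so generic.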
@mlaventure Thanks a lot for thinking about this! Really appreciate it!
@tswift242 I think we figured it out. Here is our assumption:
I had forgotten, but containerd only allows starting 10 containers at a time; the others are queued. Since docker doesn't account for this, the fifo opening on its side times out before containerd-shim even has time to start. This is the cause of the stream copy error. Then later on, containerd-shim finally starts and times out on its own side because docker no longer has the other end open. This occurs because restoring a checkpoint slows down the whole container start time quite a bit.
As a workaround, could you update your system to only issue 10 restore requests at a time? That should prevent the issue without too big of a performance cost.
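For anyone else hitting this, a sketch of that workaround: a simple counting semaphore that keeps at most 10 restores in flight, matching containerd's start queue. This is a hypothetical helper (not from this issue's attachments); it assumes the experimental docker start --checkpoint / --checkpoint-dir flags that go with the repro steps below, and the container names are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// restoreAll starts each container from the same checkpoint, keeping at most
// `limit` docker start calls in flight at once.
func restoreAll(containers []string, checkpoint, checkpointDir string, limit int) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for _, name := range containers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` restores are already running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			cmd := exec.Command("docker", "start",
				"--checkpoint", checkpoint,
				"--checkpoint-dir", checkpointDir,
				name)
			if out, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("restore of %s failed: %v\n%s", name, err, out)
			}
		}(name)
	}
	wg.Wait()
}

func main() {
	containers := []string{"cr1", "cr2", "cr3"} // placeholder names
	restoreAll(containers, "test_cp", "/var/lib/docker/checkpoints", 10)
}
```

Staggering with sleeps (as mentioned later in the description) works too; a semaphore just avoids guessing at timings.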
@mlaventure Thanks for the explanation! Yea, we're working on limiting the number of concurrent restores to get unblocked by this. Going forward, will there be changes to docker and/or containerd to address this situation? Either by relaxing the 10 containers at a time containerd limit (which would be great), and/or making sure docker accounts for the containerd queuing.
One question I still have is how exec plays into this. In one of my comments, I mentioned that docker exec calls were failing with the same "context deadline exceeded" error as well, in addition to docker start calls. docker execs were sometimes failing even when run on containers for which the restore was successful. EDIT: In this scenario, the daemon was under heavy load because we were retrying each restore that failed, and so things like docker ps took a very long time to run. Not sure if this is the cause of the execs failing, or if there's more to it.
Yes, we will be trying to address this issue, hence why we haven't closed it
@mlaventure Awesome, thanks! Is there any chance that this fix (or part of it) will make it into the upcoming 1.13 release?
Ping @mlaventure any updates on this?
@mlaventure Maybe increase the timeout as a quick fix for v1.13.
@mlaventure Any updates?
Saw on https://github.com/docker/docker/pull/30140 that the fix will land in 1.13.1.
I encountered this issue too, but in a different way. I have a script to bring up all my services using docker service create (approx. 60+ services). It never reached the point where every container is up and running. I had to split the script into multiple smaller scripts and run them one at a time, waiting for every container to come up before running the next script. I think this could be the same issue, but please correct me if I'm wrong. Will report again when 1.13.1 is released.
@PanJ without the error messages you get when trying to bring them all up, it's hard to tell. If it's "context deadline exceeded" then indeed it may be this issue.
Now that 1.13.1-rc1 is out I'm not seeing this fix referenced? Is it still planned for 1.13.1?
@stuart-c the fix is in rc1
Great :-) Didn't see it in the changelog
ping @vieux for changelog update :)
Description
I'm using checkpoint/restore to restore multiple containers from the same checkpoint. Sometimes these restores happen concurrently. In particular, as part of our bootstrap process, we restore 45 containers from the same checkpoint when a new EC2 machine is launched; these restores happen concurrently. What I'm finding is that doing many concurrent restores usually leads to a number of these restores failing. The failure mode is the containerd-shim "context deadline exceeded" error quoted earlier in this thread.
In the daemon logs, I always see several occurrences of the "stream copy error: reading from a closed fifo" error (stack trace above) prior to any occurrences of the "context deadline exceeded" error.
The application I'm running in my container has a large memory footprint (~1.1G when idle), and so it's not too surprising that a few dozen concurrent restores of this application leads to some kind of issue. In fact, I've confirmed that adding sleep calls in between the restores in our bootstrap prevents the issue from occurring. So, resource overload/exhaustion is definitely at least one part of this. However, I was curious if there's anything to be done on the Docker/criu side to better support use cases like this.
I've been looking into how memory and I/O usage play into this. When I tried doing 45+ concurrent restores of a simple Ubuntu container running a bash loop, I was unable to reproduce the issue. I have, however, been able to reproduce the issue when I run a simple C program in a container which just allocates several hundred MB of memory. I've confirmed that the issue doesn't seem to be with exhausting the host's memory. I've measured memory usage on the host during the bootstrap of our actual application, and the "context deadline exceeded" error starts happening when there's still 90GB of free memory on the host. Experiments with the toy container have also supported this.
As for I/O, this seems more likely to be where the problem is, and we've had issues before with being throttled due to doing too much I/O on our EC2 instance's root device. However, I/O throttling doesn't seem to be the issue here. I've tried setting the exec-root, docker root ("graph"), and checkpoint-dir all to locations on an instance's ephemeral storage, which isn't subject to throttling by AWS and which should achieve better performance, and the issue did not go away or improve. I also tried increasing the number of IOPS provisioned for the root device from 100 to 1000, and this also did not seem to really make a difference.
The other thing I've noticed is that the number of concurrent restores is a factor, independent of e.g. memory consumption. I've tried playing around with the size of my toy C program and the number of concurrent restores, and I've found that the latter perhaps matters more than the former for reproducing the "context deadline exceeded" issue. With a 1.1GB application running in the container, doing 100 concurrent restores will rarely lead to any failures (total of 110GB of memory usage). However, doing 180 concurrent restores of a 550MB application will lead to at least 16 failed restores (total of 99GB of memory usage). Furthermore, doing 250 concurrent restores of a 300MB application will lead to at least 44 failed restores (total of just 75GB of memory usage). It seems like concurrent restores don't quite scale as well as maybe they should.
Although in my experiments I've often done 100-200+ concurrent restores to reproduce the issue, keep in mind that with our actual application, the issue is seen when doing just 45 concurrent restores, which doesn't seem to be an unreasonably high number.
One last thing is that while doing these experiments, in addition to the error provided above, I've also occasionally seen the following criu error: "Error (criu/mount.c:414): mnt: No root found for mountpoint 978 (@./sys/fs/pstore)". The actual mountpoint flagged varies, but the message is the same.
I'm attaching the docker daemon logs from the bootstrapping of my actual application, as well as the sample C program, Dockerfile, and shell script I used to reproduce the issue more simply. docker_cr_concurrent_restores.zip
Steps to reproduce the issue:
docker build -t test:latest .
docker run -d --security-opt=seccomp:unconfined --name cr test
docker checkpoint create --checkpoint-dir /var/lib/docker/checkpoints cr test_cp
./stress_cr.sh 250
You can vary the number of concurrent restores (in stress_cr.sh) and the application size (in lots_of_memory.c).
Describe the results you received: Many restores fail with "context deadline exceeded"
Describe the results you expected: Restores succeed
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2. Using r3.4xlarge instance type. 20GB root device with 100 IOPS. Also tried using 1000 IOPS.
Have tried using criu version 2.8 and 2.9.