tswift242 closed this issue 7 years ago.
/cc @tonistiigi @estesp
Wanted to mention that the "context deadline exceeded" error is affecting execs as well as creates/starts. In the docker log file I attached above, you'll see both
"Create container failed with error: shim error: containerd-shim: opening /var/run/docker/libcontainerd/1367eefd0a921b11d29b72b3ae9b7a69ec63ed9287f37316b4ff6cc75515a205/init-stdout failed: context deadline exceeded"
and also
"Error running exec in container: rpc error: code = 2 desc = shim error: containerd-shim: opening /var/run/docker/libcontainerd/a51d471bd301f8391791526f09d9bf14cdffc2664dc6810d2c042f25d5ebd171/8633e3c0cec2933cade81cd15b90fe15b718c0c99bdc03653d9a251989509678-stdout failed: context deadline exceeded"
We do a docker exec to run a script in each container immediately after it's restored.
Ping @thaJeztah. Can someone please take a look at this? Really interested to know whether there's something Docker can do to prevent or mitigate this, or something we can do on our end to mitigate it.
"context deadline exceeded" is quite generic, but possibly @mlaventure has some ideas for debugging
Is the "stream copy error: reading from a closed fifo" (with stack trace) less generic? That always precedes "context deadline exceeded".
oh, you're right, I see I pinged @tonistiigi about that. You meant this stack trace, correct?
level=error msg="stream copy error: reading from a closed fifo
github.com/docker/docker/vendor/github.com/tonistiigi/fifo.(*fifo).Read
/go/src/github.com/docker/docker/vendor/github.com/tonistiigi/fifo/fifo.go:142
bufio.(*Reader).fill
/usr/local/go/src/bufio/bufio.go:97
bufio.(*Reader).WriteTo
/usr/local/go/src/bufio/bufio.go:472
io.copyBuffer
/usr/local/go/src/io/io.go:380
io.Copy
/usr/local/go/src/io/io.go:360
github.com/docker/docker/pkg/pools.Copy
/go/src/github.com/docker/docker/pkg/pools/pools.go:60
github.com/docker/docker/container/stream.(*Config).CopyToPipe.func1.1
/go/src/github.com/docker/docker/container/stream/streams.go:119
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:2086"
Have you also tried upgrading to 1.13.0-rc4? Not sure if there were changes in this area, but worth checking.
Errors are output here: container/stream/streams.go#L120, vendor/github.com/tonistiigi/fifo/fifo.go#L142
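To make the "reading from a closed fifo" part concrete, here is a minimal, self-contained Go sketch (not the daemon's actual code) of the failure shape those two locations produce: a copier goroutine is blocked reading one end of a pipe when that end is closed underneath it. The real daemon reads through the tonistiigi/fifo package, whose Read reports "reading from a closed fifo"; the plain *os.File used here reports "file already closed" instead.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	r, w, err := os.Pipe() // stand-in for the container's stdout FIFO
	if err != nil {
		panic(err)
	}

	errCh := make(chan error, 1)
	go func() {
		// Daemon-side stream copy: blocks until data, EOF, or an error arrives.
		_, err := io.Copy(io.Discard, r)
		errCh <- err
	}()

	// Give the copier time to block in Read, then tear down the read end the
	// way the daemon does when the shim side never shows up.
	time.Sleep(100 * time.Millisecond)
	r.Close()
	w.Close()

	// With Go 1.9+ (pipe fds are poller-backed) the pending Read is unblocked
	// and io.Copy surfaces the "closed" error instead of a clean EOF.
	fmt.Println("stream copy error:", <-errCh)
}
```

The point is just that the copier's error is a symptom of its FIFO being torn down, not of anything wrong with the data itself.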
@thaJeztah Yes, that's the stack trace and error I mean. Both with my actual application and the toy example I provided, I always see about a dozen instances of that error in the daemon logs before I start seeing instances of "context deadline exceeded".
I have not tried 1.13.0-rc4 yet. It didn't seem like it had any fixes in this space, but I could be wrong.
The deadline being exceeded is this one: https://github.com/docker/containerd/blob/bb338ce1b949372986f784f491301caa4cc2e5f9/containerd-shim/process_linux.go#L33
This is most likely occurring due to a lack of CPU time being allocated to that program, or the kernel being contended on a lock in the open syscall.
Unfortunately, as of now your best option is to limit how many restores run concurrently, so as not to tax the CPU / filesystem path too much.
Regarding the criu error, it looks like you may be calling restore before all the mountpoints that were present at the time of the checkpoint creation are mounted again.
@mlaventure I don't think CPU exhaustion is the cause of this. Using both AWS CloudWatch and top, I've confirmed that the CPU usage isn't ever close to 100% on the instance when these concurrent restores are failing. I've got a screenshot of the CloudWatch CPU utilization below. The "stream copy error" and "context deadline exceeded" errors both started occurring at 16:39 UTC according to the daemon logs. The CPU utilization at this point is less than 12%.
@mlaventure As far as locking contention on the open syscall, is that applicable here? I would assume that the criu image files are being opened read-only, and so wouldn't be subject to contention. Is this assumption incorrect? Or, what else is being opened that is subject to locking contention? Note that I can start 45 of these same containers concurrently without any of these issues when I don't use C/R. So the fact that a start from a checkpoint is happening, rather than just a normal start, matters.
I don't think it would, unless there's a global lock for FIFO. Just mentioned it there in case. My bet is on scheduling :-)
@tswift242 Just saw your graph. It indeed doesn't look like the CPU is the cause.
containerd-shim is trying to open this file in WRONLY, meaning it'll block until the other side opens it in RDONLY. For some reason this doesn't happen within 15 secs (those 15 secs currently cover the opening of all 3 IOs, btw).
I'll try to spend some time with @tonistiigi to think about it :)
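To make the mechanism above concrete, here is a rough sketch (assumed from the linked process_linux.go, not the shim's actual code) of opening a stdio FIFO under a deadline: the blocking WRONLY open runs in a goroutine, and the caller gives up once the context deadline expires, yielding an "opening ... failed: context deadline exceeded" message. The path and the shortened deadline are placeholders for the demo.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"syscall"
	"time"
)

// openFifoWithDeadline opens path with the given flags, but gives up once ctx
// expires. O_WRONLY on a FIFO blocks until a reader opens the other end.
func openFifoWithDeadline(ctx context.Context, path string, flag int) (*os.File, error) {
	type result struct {
		f   *os.File
		err error
	}
	ch := make(chan result, 1)
	go func() {
		f, err := os.OpenFile(path, flag, 0) // blocks until the other side opens
		ch <- result{f, err}
	}()
	select {
	case r := <-ch:
		return r.f, r.err
	case <-ctx.Done():
		return nil, fmt.Errorf("opening %s failed: %v", path, ctx.Err())
	}
}

func main() {
	const path = "/tmp/example-stdout"
	_ = syscall.Mkfifo(path, 0600) // a FIFO with no reader, so the WRONLY open blocks
	defer os.Remove(path)

	// The shim budgets 15 seconds across all three stdio FIFOs; a shorter
	// deadline is used here only so the demo fails quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := openFifoWithDeadline(ctx, path, os.O_WRONLY); err != nil {
		fmt.Println(err) // "opening /tmp/example-stdout failed: context deadline exceeded"
	}
}
```

If the other side never opens its end of the FIFO within the budget, the deadline error is all the shim can report, which is why the message is so generic.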
@mlaventure Thanks a lot for thinking about this! Really appreciate it!
@tswift242 I think we figured it out. Here is our assumption:
I had forgotten, but containerd only allows starting 10 containers at a time; the others are queued. Since docker doesn't account for this, the fifo opening on its side times out before containerd-shim even has time to start. This is the cause of the stream copy error. Then later on, containerd-shim finally starts and times out on its own side because docker no longer has the other end open. This occurs because restoring a checkpoint slows down the whole container start time quite a bit.
As a workaround, could you update your system to only issue 10 restore requests at a time? That should prevent the issue without too big of a performance cost.
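For anyone else hitting this, a sketch of that workaround: a simple counting semaphore that keeps at most 10 restores in flight, matching containerd's start queue. This is a hypothetical helper (not from this issue's attachments); it assumes the experimental docker start --checkpoint / --checkpoint-dir flags that go with the repro steps below, and the container names are placeholders.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// restoreAll starts each container from the same checkpoint, keeping at most
// `limit` docker start calls in flight at once.
func restoreAll(containers []string, checkpoint, checkpointDir string, limit int) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for _, name := range containers {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` restores are already running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			cmd := exec.Command("docker", "start",
				"--checkpoint", checkpoint,
				"--checkpoint-dir", checkpointDir,
				name)
			if out, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("restore of %s failed: %v\n%s", name, err, out)
			}
		}(name)
	}
	wg.Wait()
}

func main() {
	containers := []string{"cr1", "cr2", "cr3"} // placeholder names
	restoreAll(containers, "test_cp", "/var/lib/docker/checkpoints", 10)
}
```

Staggering with sleeps (as mentioned later in the description) works too; a semaphore just avoids guessing at timings.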
@mlaventure Thanks for the explanation! Yea, we're working on limiting the number of concurrent restores to get unblocked by this. Going forward, will there be changes to docker and/or containerd to address this situation? Either by relaxing the 10 containers at a time containerd limit (which would be great), and/or making sure docker accounts for the containerd queuing.
One question I still have is how exec plays into this. In one of my comments, I mentioned that docker exec calls were failing with the same "context deadline exceeded" error as well, in addition to docker start calls. docker execs were sometimes failing even when run on containers for which the restore was successful. EDIT: In this scenario, the daemon was under heavy load because we were retrying each restore that failed, and so things like docker ps took a very long time to run. Not sure if this is the cause of the execs failing, or if there's more to it.
Yes, we will be trying to address this issue, hence why we haven't closed it
@mlaventure Awesome, thanks! Is there any chance that this fix (or part of it) will make it into the upcoming 1.13 release?
Ping @mlaventure any updates on this?
@mlaventure Maybe increase the timeout as a quick fix for v1.13.
@mlaventure Any updates?
Saw on https://github.com/docker/docker/pull/30140 that the fix will land in 1.13.1.
I encountered this issue too, but in a different way. I have a script to bring up all my services using docker service create (approx. 60+ services). It never reached the point where every container is up and running. I had to split the script into multiple smaller scripts and run them one at a time, waiting for every container to come up before running the next script. I think this could be the same issue, but please correct me if I'm wrong. Will report again when 1.13.1 is released.
@PanJ without the error messages you get when trying to bring them all up, it's hard to tell. If it's "context deadline exceeded" then indeed it may be this issue.
Now that 1.13.1-rc1 is out I'm not seeing this fix referenced? Is it still planned for 1.13.1?
@stuart-c the fix is in rc1
Great :-) Didn't see it in the changelog
ping @vieux for changelog update :)
Description
I'm using checkpoint/restore to restore multiple containers from the same checkpoint. Sometimes these restores happen concurrently. In particular, as part of our bootstrap process, we restore 45 containers from the same checkpoint when a new EC2 machine is launched; these restores happen concurrently. What I'm finding is that doing many concurrent restores usually leads to a number of these restores failing. The failure mode is the containerd-shim "context deadline exceeded" error quoted earlier in this thread.
In the daemon logs, I always see several occurrences of the "stream copy error: reading from a closed fifo" error (stack trace above) prior to any occurrences of the "context deadline exceeded" error.
The application I'm running in my container has a large memory footprint (~1.1G when idle), and so it's not too surprising that a few dozen concurrent restores of this application leads to some kind of issue. In fact, I've confirmed that adding sleep calls in between the restores in our bootstrap prevents the issue from occurring. So, resource overload/exhaustion is definitely at least one part of this. However, I was curious if there's anything to be done on the Docker/criu side to better support use cases like this.
I've been looking into how memory and I/O usage play into this. When I tried doing 45+ concurrent restores of a simple Ubuntu container running a bash loop, I was unable to reproduce the issue. I have, however, been able to reproduce the issue when I run a simple C program in a container which just allocates several hundred MB of memory. I've confirmed that the issue doesn't seem to be with exhausting the host's memory. I've measured memory usage on the host during the bootstrap of our actual application, and the "context deadline exceeded" error starts happening when there's still 90GB of free memory on the host. Experiments with the toy container have also supported this.
As for I/O, this seems more likely to be where the problem is, and we've had issues before with being throttled due to doing too much I/O on our EC2 instance's root device. However, I/O throttling doesn't seem to be the issue here. I've tried setting the exec-root, docker root ("graph"), and checkpoint-dir all to locations on an instance's ephemeral storage, which isn't subject to throttling by AWS and which should achieve better performance, and the issue did not go away or improve. I also tried increasing the number of IOPS provisioned for the root device from 100 to 1000, and this also did not seem to really make a difference.
The other thing I've noticed is that the number of concurrent restores is a factor, independent of e.g. memory consumption. I've tried playing around with the size of my toy C program and the number of concurrent restores, and I've found that the latter perhaps matters more than the former for reproducing the "context deadline exceeded" issue. With a 1.1GB application running in the container, doing 100 concurrent restores will rarely lead to any failures (total of 110GB of memory usage). However, doing 180 concurrent restores of a 550MB application will lead to at least 16 failed restores (total of 99GB of memory usage). Furthermore, doing 250 concurrent restores of a 300MB application will lead to at least 44 failed restores (total of just 75GB of memory usage). It seems like concurrent restores don't quite scale as well as maybe they should.
Although in my experiments I've often done 100-200+ concurrent restores to reproduce the issue, keep in mind that with our actual application, the issue is seen when doing just 45 concurrent restores, which doesn't seem to be an unreasonably high number.
One last thing is that while doing these experiments, in addition to the error provided above, I've also occasionally seen the following criu error: "Error (criu/mount.c:414): mnt: No root found for mountpoint 978 (@./sys/fs/pstore)". The actual mountpoint flagged varies, but the message is the same.
I'm attaching the docker daemon logs from the bootstrapping of my actual application, as well as the sample C program, Dockerfile, and shell script I used to reproduce the issue more simply. docker_cr_concurrent_restores.zip
Steps to reproduce the issue:
docker build -t test:latest .
docker run -d --security-opt=seccomp:unconfined --name cr test
docker checkpoint create --checkpoint-dir /var/lib/docker/checkpoints cr test_cp
./stress_cr.sh 250
You can vary the number of concurrent restores (in stress_cr.sh) and the application size (in lots_of_memory.c).
Describe the results you received: Many restores fail with "context deadline exceeded"
Describe the results you expected: Restores succeed
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.): AWS EC2. Using r3.4xlarge instance type. 20GB root device with 100 IOPS. Also tried using 1000 IOPS.
Have tried using criu version 2.8 and 2.9.