Closed grahamwhaley closed 5 years ago
What I found was that I had two runtimes active (`ps -ef | fgrep kata-runtime`). Looking at their open files (`cd /proc/<pid>/fd; ls -la`), as I had a strong suspicion we had a lock file issue (one of the symptoms is that once things have 'gone wrong', a `kata-runtime list` will hang if somebody holds the lock for writing), I found that both runtime processes had their lock files held open. An `lsof` in their `/run/vc/sb/*` dirs confirmed that.
After some debugging (initially with gdb and the golang plugin, but then @jodh-intel pointed me at the `kill -SIGUSR1` stackdump, available when debug is enabled), what I found looks like a `killCommand` stuck, with its lock held, trying to gRPC over to the agent. The thing is, if we do a `ps -ef | fgrep kata`, we find that both containers have a runtime/proxy/shim, but no QEMUs are running.
Hence, my current conclusion is that QEMU has 'gone away' for some reason, but our gRPC code has not handled that case, and is sat waiting for a reply.
Here is the relevant stackdump from the go thread in question. Look for `api.go` in there.
level=error command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/cli/main.go:391 +0x3e" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="main.main()" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/cli/main.go:383 +0x4f" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="main.createRuntime()" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/cli/main.go:340 +0x1bd" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="main.createRuntimeApp(0xc4200b0000, 0xa, 0xa, 0xc42019c7e0, 0xc420071f70)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/github.com/urfave/cli/app.go:255 +0x6a0" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/github.com/urfave/cli.(*App).Run(0xc4200c71e0, 0xc4200b0000, 0xa, 0xa, 0x0, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/github.com/urfave/cli/command.go:210 +0xa36" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/github.com/urfave/cli.Command.Run(0xb866d9, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0xba9243, 0x32, 0x0, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/github.com/urfave/cli/app.go:490 +0xc8" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/github.com/urfave/cli.HandleAction(0xa81280, 0xbbb178, 0xc42010b8c0, 0xc4200cc700, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/cli/kill.go:50 +0xdc" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="main.glob..func6(0xc42010b8c0, 0x0, 0xc42010b8c0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/cli/kill.go:111 +0x2ba" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="main.kill(0xc4200eab80, 0x40, 0x7ffe5de64f33, 0x1, 0x9c1a00, 0xb8664d, 0x4)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/implementation.go:116 +0x5f" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*VCImpl).KillContainer(0x11cba80, 0xc4201b81cd, 0x40, 0xc4200eab80, 0x40, 0x9, 0x0, 0x0, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/api.go:556 +0x108" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.KillContainer(0xc4201b81cd, 0x40, 0xc4200eab80, 0x40, 0x9, 0x3030100, 0x0, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/container.go:863 +0x56" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*Container).kill(0xc4200cf400, 0x9, 0x0, 0xc4200cf400, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/container.go:875 +0x174" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*Container).signalProcess(0xc4200cf400, 0xc4200eaf80, 0x40, 0x9, 0xc420263700, 0x18, 0xc420263750)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/kata_agent.go:969 +0xbc" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*kataAgent).signalProcess(0xc4200cc960, 0xc4200cf400, 0xc4200eaf80, 0x40, 0x9, 0x0, 0x0, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/kata_agent.go:1233 +0x199" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*kataAgent).sendReq(0xc4200cc960, 0xb30740, 0xc420221980, 0x0, 0x0, 0x0, 0x0)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/virtcontainers/kata_agent.go:1179 +0x97" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/virtcontainers.(*kataAgent).installReqFunc.func8(0xc17ce0, 0xc4200a6080, 0xb30740, 0xc420221980, 0x0, 0x0, 0x0, 0x412328, 0x30, 0xafa680, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/github.com/kata-containers/agent/protocols/grpc/agent.pb.go:1675 +0xd2" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/github.com/kata-containers/agent/protocols/grpc.(*agentServiceClient).SignalProcess(0xc42000e020, 0xc17ce0, 0xc4200a6080, 0xc420221980, 0x0, 0x0, 0x0, 0xc420221980, 0xc1fa20, 0xc4202ac000)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/google.golang.org/grpc/call.go:158 +0xc1" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/google.golang.org/grpc.Invoke(0xc17ce0, 0xc4200a6080, 0xb9bc1f, 0x20, 0xb30740, 0xc420221980, 0xb34a60, 0x11cbd50, 0xc42025a780, 0x0, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/google.golang.org/grpc/call.go:150 +0x19c" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/google.golang.org/grpc.(*ClientConn).Invoke(0xc42025a780, 0xc17ce0, 0xc4200a6080, 0xb9bc1f, 0x20, 0xb30740, 0xc420221980, 0xb34a60, 0x11cbd50, 0x0, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/google.golang.org/grpc/call.go:301 +0x9ed" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/google.golang.org/grpc.invoke(0xc17d60, 0xc42028c240, 0xb9bc1f, 0x20, 0xb30740, 0xc420221980, 0xb34a60, 0x11cbd50, 0xc42025a780, 0x0, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/google.golang.org/grpc/call.go:49 +0x9f" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/google.golang.org/grpc.recvResponse(0xc17d60, 0xc42028c240, 0x0, 0x0, 0xc176a0, 0x11cbd50, 0x0, 0x0, 0x0, 0x0, ...)" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="\t/home/gwhaley/gopath/src/github.com/kata-containers/runtime/vendor/google.golang.org/grpc/transport/transport.go:283 +0x128" command=kill name=kata-runtime pid=8352 source=runtime
level=error msg="github.com/kata-containers/runtime/vendor/google.golang.org/grpc/transport.(*Stream).Header(0xc4202b0000, 0xbbadb0, 0xc4202631c0, 0xc1ad60)" command=kill name=kata-runtime pid=8352 source=runtime
/cc @sboeuf. Anybody else we need to pull in here? Tomorrow I'll look into grpc and see if there are any disconnect or timeout things we can tweak.
@grahamwhaley thanks for the detailed investigation here. We should definitely look at some appropriate timeouts to be applied here and gracefully fail based on this. Now, that being said, if we could also avoid the failure itself, this would be great. Maybe a race condition related to the stop sequence.
@sboeuf, absolutely we should figure out what is causing the death of QEMU. Yeah, I suspect a race/timeout. Last time (iirc) I looked at something like this, part of the problem was related to `rm` taking some time to kill off our runtime, and Docker having some hard-wired internal timeouts (something like 7+4 seconds), at which point it forcefully kills things off. In that case we were halfway through a `kill` or `rm` when Docker either issued another or forced some process death. I'll see what I can find in the logs.
On the actual hang-with-no-timeout issue: I think the root culprit is likely the use of a `Background` context in the `sendReq()` function, which I think ultimately issues the gRPC call. I expect we actually want a `context.WithTimeout` or `context.WithDeadline` context in there instead. I'm experimenting with that right now.
I also suspect it might not be quite that simple, as we may have to figure out whether all calls through that handler are expected to complete 'in a fair time', or whether some are meant to be blocking calls. For instance, just below there we have another use of a `Background()` context, but that one is polling the stdin/out, so I think it probably is meant to block.
Heh heh - daily update. I've been trying to figure out how and when the QEMU dies, and I've not found it yet. I added this patch whilst testing, to check whether we could time out the grpc context:
```diff
diff --git a/virtcontainers/kata_agent.go b/virtcontainers/kata_agent.go
index b49efae..868f2ae 100644
--- a/virtcontainers/kata_agent.go
+++ b/virtcontainers/kata_agent.go
@@ -967,6 +967,13 @@ func (k *kataAgent) signalProcess(c *Container, processID string, signal syscall
 }
 _, err := k.sendReq(req)
+
+ if err != nil {
+ k.Logger().WithFields(logrus.Fields{
+ "container-id": c.id,
+ }).WithError(err).Error("signalProcess() sendReq failed")
+ }
+
 return err
 }
@@ -1230,7 +1237,16 @@ func (k *kataAgent) sendReq(request interface{}) (interface{}, error) {
 return nil, errors.New("Invalid request type")
 }
- return handler(context.Background(), request)
+ ctx, cancel := context.WithTimeout(context.Background(), 120*time.Second)
+ defer cancel()
+ i, err := handler(ctx, request)
+
+ if err != nil {
+ k.Logger().WithError(err).Error("sendReq failed")
+ }
+
+ return i, err
 }
```
With that, I ran the test until it locked up. Then, grabbing the whole journalctl for the last hour or so and grepping out the CID of one of the 'stuck' containers, I get the following (after remembering to re-enable debug inside our config file - thanks 'make install' ;-) ):
time="2018-06-19T20:06:39.765624803+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="context deadline exceeded" name=kata-runtime pid=11990 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:06:34.761909547+01:00" level=info arguments="\"state 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8\"" command=state commit=6f409b9528b66c46814eacf72083e2a54645ee89-dirty name=kata-runtime pid=11990 source=runtime version=1.0.0
time="2018-06-19T20:06:34+01:00" level=error msg="failed to kill init's children" error="OCI runtime killall failed: context deadline exceeded" id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 namespace=moby path="/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8" pid=13833
time="2018-06-19T20:06:34.743413547+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="context deadline exceeded" name=kata-runtime pid=11935 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:06:24.741642631+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="context deadline exceeded" name=kata-runtime pid=11935 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:06:21.735519585+01:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Could not kill running container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8, cannot remove - Cannot kill container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n: unknown" error_type="*errors.errorString" module=api
time="2018-06-19T20:06:21.735418713+01:00" level=error msg="Handler for DELETE /v1.35/containers/3967be617317 returned error: Could not kill running container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8, cannot remove - Cannot kill container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n: unknown"
time="2018-06-19T20:06:21.735271498+01:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Could not kill running container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8, cannot remove - Cannot kill container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n: unknown" error_type="*errors.errorString" module=api
time="2018-06-19T20:06:19.738730493+01:00" level=info arguments="\"kill --all 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 9\"" command=kill commit=6f409b9528b66c46814eacf72083e2a54645ee89-dirty name=kata-runtime pid=11935 source=runtime version=1.0.0
time="2018-06-19T20:06:19.734018987+01:00" level=error msg="failed waiting for process" container=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure" exec-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 name=kata-shim pid=1 source=shim
time="2018-06-19T20:06:19.733767045+01:00" level=info msg="copy stdout failed" container=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="rpc error: code = Unavailable desc = transport is closing" exec-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 name=kata-shim pid=1 source=shim
time="2018-06-19T20:06:19.730998206+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=9650 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:01:20.78529512+01:00" level=debug msg="Replacing OCI mount (/etc/hosts) source /var/lib/docker/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8/hosts with /run/kata-containers/shared/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8-0722f9f7dc92243c-hosts" arch=amd64 name=kata-runtime pid=13841 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:01:20.785262587+01:00" level=debug msg="Replacing OCI mount (/etc/hostname) source /var/lib/docker/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8/hostname with /run/kata-containers/shared/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8-5dda9d1a96ad4e20-hostname" arch=amd64 name=kata-runtime pid=13841 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:01:20.785209919+01:00" level=debug msg="Replacing OCI mount (/etc/resolv.conf) source /var/lib/docker/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8/resolv.conf with /run/kata-containers/shared/containers/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8-f5383d412b1bf8ec-resolv.conf" arch=amd64 name=kata-runtime pid=13841 source=virtcontainers subsystem=kata_agent
time="2018-06-19T20:01:19.6376833+01:00" level=debug command=create default-kernel-parameters="ip=::::::3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8::off:: init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket" name=kata-runtime pid=13841 source=runtime
time="2018-06-19T20:01:19.636626569+01:00" level=debug msg="converting /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8/config.json" name=kata-runtime pid=13841 source=virtcontainers/oci
Whilst I can see the container create and startup, and we can see the timeout fail of the `sendReq`, what I don't see is any debug/logs of the `docker kill` command coming through that matches that hang. Odd. Any ideas on that welcome.
I'm starting to run out of ideas a touch on how to figure out what happened to my QEMU... one more 'hack' from @jodh-intel to try tomorrow...
And a final note - adding that context timeout returns an error that seems to upset Docker somewhat ;-)
time="2018-06-19T20:06:21.735519585+01:00" level=debug msg="FIXME: Got an API for which error does not match any expected type!!!: Could not kill running container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8, cannot remove - Cannot kill container 3967be617317289076fd8c052b3f765b6e33821caa7937e2eca3cab214e637d8: unknown error after kill: /usr/local/bin/kata-runtime did not terminate sucessfully: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n: unknown" error_type="*errors.errorString" module=api
I suspect we are going to need more framework or something along with that simple context timeout...
Hi @grahamwhaley - did the qemu shell hack provide anything new?
Not got to it yet. First I'm going to hack the metrics to not do a `docker rm -f $(docker ps -qa)`, as that is what hangs up and is getting us false failures on the metrics CI. I'll add some large comments etc. I'm only really happy changing that because we have this issue and https://github.com/kata-containers/tests/pull/414 open to cover the broken case.
Once I've proven that changing the sequence to an iterated pair of `docker stop` and `docker rm` works, I'll get back to chasing this a bit.
Grrr - testing that, replacing the `docker rm -f` with an iterated pair of `stop`/`rm`... it still hung up!
I'll have another shot at tracking down the missing QEMU then, and if that fails, I'll have to try harder to shore up the metrics for a bit...
Time for an update. Now we have the journald truncation figured out, I can get some full logs to work with.
Doing a run and having 4 containers hang up, then looking in the system journal (with a `grep -i deadline`), I see the following (oh, the grpc timeouts are from my grpc context timeout patch btw):
time="2018-07-02T11:49:16.217469645+01:00" level=info msg="2018/07/02 10:49:16 [ERR] yamux: keepalive failed: i/o deadline reached\n" name=kata-proxy pid=25683 source=agent
time="2018-07-02T11:49:23.356943573+01:00" level=info msg="2018/07/02 10:49:23 [ERR] yamux: keepalive failed: i/o deadline reached\n" name=kata-proxy pid=22938 source=agent
time="2018-07-02T11:49:37.08974012+01:00" level=info msg="2018/07/02 10:49:37 [ERR] yamux: keepalive failed: i/o deadline reached\n" name=kata-proxy pid=24587 source=agent
time="2018-07-02T11:49:50.249446827+01:00" level=info msg="2018/07/02 10:49:50 [ERR] yamux: keepalive failed: i/o deadline reached\n" name=kata-proxy pid=11639 source=agent
time="2018-07-02T11:51:48.881905754+01:00" level=error msg="sendReq failed" arch=amd64 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27335 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:48.887067109+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=8f2b8cc2f85c93ed30c61abcc5f25c923344c28295a693af2a23bac2a7008408 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27335 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:48.8872918+01:00" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" command=kill container=8f2b8cc2f85c93ed30c61abcc5f25c923344c28295a693af2a23bac2a7008408 name=kata-runtime pid=27335 sandbox=8f2b8cc2f85c93ed30c61abcc5f25c923344c28295a693af2a23bac2a7008408 source=runtime
time="2018-07-02T11:51:49.249058752+01:00" level=error msg="sendReq failed" arch=amd64 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27330 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:49.249604392+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=582e7fff4cf013de61ddb7d02764bc21dc7a9caad95bd8c173a6508038e9e10d error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27330 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:49.250132138+01:00" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" command=kill container=582e7fff4cf013de61ddb7d02764bc21dc7a9caad95bd8c173a6508038e9e10d name=kata-runtime pid=27330 sandbox=582e7fff4cf013de61ddb7d02764bc21dc7a9caad95bd8c173a6508038e9e10d source=runtime
time="2018-07-02T11:51:49.335716463+01:00" level=error msg="sendReq failed" arch=amd64 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27474 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:49.336945716+01:00" level=error msg="signalProcess() sendReq failed" arch=amd64 container-id=21e55708d2e9ec6853b947df1490f263f33a67da380240e3bdce999fbd3fe6a3 error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" name=kata-runtime pid=27474 source=virtcontainers subsystem=kata_agent
time="2018-07-02T11:51:49.339115749+01:00" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" command=kill container=21e55708d2e9ec6853b947df1490f263f33a67da380240e3bdce999fbd3fe6a3 name=kata-runtime pid=27474 sandbox=21e55708d2e9ec6853b947df1490f263f33a67da380240e3bdce999fbd3fe6a3 source=runtime
That `yamux` timeout is a new find for me, and is quite likely related to https://github.com/kata-containers/agent/pull/263, which has not settled out yet. At this point I suspect the agent-end yamux timed out on a keepalive because the host side was somewhat busy, and then death followed (as the agent/comms channel shut down?).
@jodh-intel - on the logging format etc. front - the line coming back from the agent/yamux:
time="2018-07-02T11:49:16.217469645+01:00" level=info msg="2018/07/02 10:49:16 [ERR] yamux: keepalive failed: i/o deadline reached\n" name=kata-proxy pid=25683 source=agent
doesn't have a pod or CID in it - which I suspect the agent has access to, you think? I have to wonder if we can (and should) hook our logrus into the yamux logger config item, which we currently don't use in either direction (proxy or agent): https://github.com/hashicorp/yamux/blob/master/mux.go#L33 Let me know what you think - I can go try to improve that logging if you think we can.
@grahamwhaley have you given https://github.com/kata-containers/agent/pull/263 a quick try on the agent, to see if you can still reproduce the issue?
Will do. I held off as it looked, from the comments over there, like that PR caused some other issues, but I can try it and compare. np.
@grahamwhaley yeah I know, you might hit some other issues, but let's cross our fingers that you won't, and that you'll be able to verify it solves your current issue.
@sboeuf yay \o/ - kata-containers/agent#263 fixed the qemu death hangup issue! I ran 5 loops of 110 nginx containers, and got 0 hangs. It would normally hang on the first loop. Now I guess we 'just' have to figure out the other wrinkles of kata-containers/agent#263...
Oh (if I've not already), I'll open an issue for the grpc context timeout, so we can discuss what the right thing to do there is.
Great news! I've just made https://github.com/kata-containers/tests/issues/195 a P2, as we clearly never want to have to debug this sort of nightmare again - we want immediate CI feedback when we break it ;)
Oh that is very good news @grahamwhaley!!! Now, as you said, we have to figure out how to merge #263 without hitting those weird issues @devimc and I ran into. @jodh-intel +1 for gating this through CI.
Closing this issue as it was fixed by https://github.com/kata-containers/agent/pull/263.
Description of problem

Running the soak test over at https://github.com/kata-containers/tests/pull/414 regularly hangs up in the `docker rm` phase - that is, when it does a `docker rm x y z` of up to 110 containers (but we have seen this fail with 20 containers).

Expected result

The test is expected to complete.

Actual result

The test hangs up in the `rm` phase.

Meta details

Mega system details here. More interesting info I'll post in a new comment on the thread.

Running `kata-collect-data.sh` version `1.0.0 (commit ca9f7abba96e5c4db734673b9e7d870076d715e2)` at `2018-06-18.17:39:28.741713727+0100`.

Runtime is `/usr/local/bin/kata-runtime`.

kata-env

Output of "`/usr/local/bin/kata-runtime kata-env`":

Runtime config files

Runtime default config files

Runtime config file contents

Config file `/etc/kata-containers/configuration.toml` not found

Output of "`cat "/usr/share/defaults/kata-containers/configuration.toml"`":

Image details

Initrd details

No initrd

Logfiles

Runtime logs

Recent runtime problems found in system journal:

Proxy logs

Recent proxy problems found in system journal:

Shim logs

Recent shim problems found in system journal:

Container manager details

Have `docker`

Docker

Output of "`docker version`":

Output of "`docker info`":

Output of "`systemctl show docker`":

No `kubectl`

Packages

Have `dpkg`

Output of "`dpkg -l|egrep "(cc-oci-runtime|cc-runtime|runv|kata-proxy|kata-runtime|kata-shim|kata-containers-image|linux-container|qemu-)"`":

Have `rpm`

Output of "`rpm -qa|egrep "(cc-oci-runtime|cc-runtime|runv|kata-proxy|kata-runtime|kata-shim|kata-containers-image|linux-container|qemu-)"`":