Open nscheer opened 5 years ago
Hm; feels like a race condition; swamkit API shutting down before the daemon shuts down (therefore some cleanup not completing) and not being up yet when the daemon starts (therefore attempting to create the object, and then discovering it already exists)
Reading this from my phone, but I'll give it a try and see if I can reproduce with your steps
Thanks for reporting!
Confirmed that I see the same errors/warnings.
I'd have to check with the SwarmKit team to see if these are actual issues, or just a red herring.
Nov 12 20:29:19 centos-test dockerd: time="2018-11-12T20:29:19.412163426Z" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.619515003Z" level=error msg="agent: session failed" backoff=100ms error="context canceled" module=node/agent node.id=q3h8ro34wxykrdehisqxepicb
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.619715427Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=q3h8ro34wxykrdehisqxepicb node.session=gg89p0ragegtuby78jqpyf51r
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.620426607Z" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Unknown desc = context canceled"
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.620700335Z" level=warning msg="grpc: addrConn.createTransport failed to connect to {/var/run/docker/swarm/control.sock 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /var/run/docker/swarm/control.sock: connect: no such file or directory\". Reconnecting..." module=grpc
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.621016081Z" level=warning msg="grpc: addrConn.transportMonitor exits due to: context canceled" module=grpc
Nov 12 20:29:27 centos-test dockerd: time="2018-11-12T20:29:27.750759696Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint abj6s1gt9dqvd15ouqnxmaqdu 1273fcc5aa566ebe4410a67c8e25e27e9e5c0c9e5539d112706bb552bbe6eb9e], retrying...."
Nov 12 20:29:32 centos-test dockerd: time="2018-11-12T20:29:32.254580485Z" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=q3h8ro34wxykrdehisqxepicb
Nov 12 20:29:32 centos-test dockerd: time="2018-11-12T20:29:32.373213186Z" level=warning msg="Could not register builder git source: failed to find git binary: exec: \"git\": executable file not found in $PATH"
Nov 12 20:29:32 centos-test dockerd: time="2018-11-12T20:29:32.491401983Z" level=error msg="error reading the kernel parameter net.ipv4.vs.expire_nodest_conn" error="open /proc/sys/net/ipv4/vs/expire_nodest_conn: no such file or directory"
Full logs below:
i have a strong suspicion it's just garbage messages (swarmkit is notoriously bad about them) but i'll go code-diving to check.
Yeah, it's garbage. The line that produces it is here:
We're not supposed to emit that error, but something got changed at some point, because the return value of CreateCluster
is actually ErrNameConflict
.
I'll open a quick PR, it's just a one-line fix. Sorry about the noise.
Also, I'm 94% sure that failed to remove node
is an expected part of the swarmkit shutdown process, and I seem to recall that fixing it and errors of its ilk is nontrivial.
Hello, Mr My enviroment are as flows: 18.09.0 Version of docker, three nodes There are some question about docker swarm.The docker was exit after executed "docker swarm leave -f" cmd at 0001 node ,detail log are as follows: Jun 18 16:50:12 moss-0001 dockerd[4047]: time="2024-06-18T16:50:12.646486713+08:00" level=error msg="error sending message to peer" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.78.78.10:2377: connect: connection refused"" Jun 18 16:50:13 moss-0001 dockerd[4047]: time="2024-06-18T16:50:13.586876613+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc001896010, CONNECTING" module=grpc Jun 18 16:50:13 moss-0001 dockerd[4047]: time="2024-06-18T16:50:13.587991196+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc001896010, READY" module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084206553+08:00" level=info msg="shutting down certificate renewal routine" module=node/tls node.id=n09yo5hghhwxuq1ub26spp1at node.role=swarm-manager Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084208854+08:00" level=info msg="Stopping manager" module=node node.id=n09yo5hghhwxuq1ub26spp1at Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084273668+08:00" level=info msg="dispatcher stopping" method="(Dispatcher).Stop" module=dispatcher node.id=n09yo5hghhwxuq1ub26spp1at Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084317588+08:00" level=info msg="dispatcher session dropped, marking node q0keq55q291o8trumxbob2lhg down" method="(Dispatcher).Session" node.id=q0keq55q291o8trumxbob2lhg node.session=cqkmb6ad2cc91yjkxv4qofg6u Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084328711+08:00" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(Dispatcher).Session" node.id=q0keq55q291o8trumxbob2lhg node.session=cqkmb6ad2cc91yjkxv4qofg6u Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084343918+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc000d15500, TRANSIENT_FAILURE" module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084368139+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc000d15500, CONNECTING" module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084356930+08:00" level=info msg="dispatcher session dropped, marking node m7p80vzt9krzz9o9tia7972a4 down" method="(Dispatcher).Session" node.id=m7p80vzt9krzz9o9tia7972a4 node.session=a6ch3h812r23c5wahjw12e4l0 Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084393572+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {/var/run/docker/swarm/control.sock 0 }. Err :connection error: desc = "transport: Error while dialing dial unix /var/run/docker/swarm/control.sock: connect: no such file or directory". Reconnecting..." module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084389703+08:00" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(Dispatcher).Session" node.id=m7p80vzt9krzz9o9tia7972a4 node.session=a6ch3h812r23c5wahjw12e4l0 Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084423587+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc000d15500, TRANSIENT_FAILURE" module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.084614208+08:00" level=error msg="failed to receive changes from store watch API" error="rpc error: code = Unknown desc = context canceled" Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.085089050+08:00" level=info msg="Manager shut down" module=node node.id=n09yo5hghhwxuq1ub26spp1at Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.085142553+08:00" level=warning msg="grpc: addrConn.transportMonitor exits due to: context canceled" module=grpc Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.361178552+08:00" level=info msg="Node 9041e40f9e98 change state NodeActive --> NodeLeft" Jun 18 16:54:04 moss-0001 dockerd[4047]: time="2024-06-18T16:54:04.361226540+08:00" level=info msg="moss-0001(9041e40f9e98): Node leave event for 9041e40f9e98/172.78.78.8" Jun 18 16:54:04 moss-0001 dockerd[4047]: panic: runtime error: invalid memory address or nil pointer dereference Jun 18 16:54:04 moss-0001 dockerd[4047]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x56408cb68044] Jun 18 16:54:04 moss-0001 dockerd[4047]: goroutine 828 [running]: Jun 18 16:54:04 moss-0001 dockerd[4047]: panic({0x56408d8cf5c0, 0x56408ea6b160}) Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/panic.go:1147 +0x3a8 fp=0xc0021d5d60 sp=0xc0021d5ca0 pc=0x56408bfeb208 Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.panicmem(...) Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/panic.go:221 Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.sigpanic() Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/signal_unix.go:735 +0x327 fp=0xc0021d5db0 sp=0xc0021d5d60 pc=0x56408c0020e7 Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb.(NetworkDB).rejoinClusterBootStrap(0xc002116a20) Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb/cluster.go:305 +0x4c4 fp=0xc0021d5f00 sp=0xc0021d5db0 pc=0x56408cb68044 Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb.(NetworkDB).rejoinClusterBootStrap-fm() Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb/cluster.go:284 +0x26 fp=0xc0021d5f18 sp=0xc0021d5f00 pc=0x56408cb8bb46 Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb.(NetworkDB).triggerFunc(0xc002116a20, 0xdf8475800, 0xc001513140, 0xc001281460) Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb/cluster.go:256 +0x112 fp=0xc0021d5fb0 sp=0xc0021d5f18 pc=0x56408cb67832 Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb.(NetworkDB).clusterInit·dwrap·4() Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb/cluster.go:178 +0x32 fp=0xc0021d5fe0 sp=0xc0021d5fb0 pc=0x56408cb67012 Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.goexit() Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0021d5fe8 sp=0xc0021d5fe0 pc=0x56408c01faa1 Jun 18 16:54:04 moss-0001 dockerd[4047]: created by github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb.(NetworkDB).clusterInit Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/docker/libnetwork/networkdb/cluster.go:178 +0x7e5 Jun 18 16:54:04 moss-0001 dockerd[4047]: goroutine 1 [chan receive, 5 minutes]: Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.gopark(0xc000f21918, 0x56408bfc1c14, 0x18, 0x0, 0x7fc622c6bd08) Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/proc.go:366 +0xd6 fp=0xc0022af898 sp=0xc0022af878 pc=0x56408bfee136 Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.chanrecv(0xc000b9a000, 0xc000f21bf8, 0x1) Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/chan.go:576 +0x56c fp=0xc0022af928 sp=0xc0022af898 pc=0x56408bfbaf2c Jun 18 16:54:04 moss-0001 dockerd[4047]: runtime.chanrecv1(0xc000b33740, 0xc001481a10) Jun 18 16:54:04 moss-0001 dockerd[4047]: /usr/lib/golang/src/runtime/chan.go:439 +0x18 fp=0xc0022af950 sp=0xc0022af928 pc=0x56408bfba958 Jun 18 16:54:04 moss-0001 dockerd[4047]: main.(DaemonCli).start(0xc000b33740, 0xc0002a3da0) Jun 18 16:54:04 moss-0001 dockerd[4047]: src/github.com/docker/docker/cmd/dockerd/daemon.go:338 +0x148b fp=0xc0022afd78 sp=0xc0022af950 pc=0x56408d3f416b Jun 18 16:54:04 moss-0001 dockerd[4047]: main.runDaemon(...) Jun 18 16:54:04 moss-0001 dockerd[4047]: src/github.com/docker/docker/cmd/dockerd/docker_unix.go:7 Jun 18 16:54:04 moss-0001 dockerd[4047]: main.newDaemonCommand.func1(0xc000a9fb80, {0xc000293e00, 0x8, 0x8}) Jun 18 16:54:04 moss-0001 dockerd[4047]: src/github.com/docker/docker/cmd/dockerd/docker.go:29 +0x5c fp=0xc0022afda0 sp=0xc0022afd78 pc=0x56408d3f8dfc Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/spf13/cobra.(Command).execute(0xc000a9fb80, {0xc00013c010, 0x8, 0x8}) Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:762 +0x60e fp=0xc0022afe60 sp=0xc0022afda0 pc=0x56408d3e93ce Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/spf13/cobra.(Command).ExecuteC(0xc000a9fb80) Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:852 +0x2dc fp=0xc0022aff18 sp=0xc0022afe60 pc=0x56408d3e98dc Jun 18 16:54:04 moss-0001 dockerd[4047]: github.com/docker/docker/vendor/github.com/spf13/cobra.(Command).Execute(...) Jun 18 16:54:04 moss-0001 dockerd[4047]: /root/rpmbuild/BUILD/components/engine/.gopath/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:800 Jun 18 16:54:04 moss-0001 dockerd[4047]: main.main() Jun 18 16:54:04 moss-0001 dockerd[4047]: src/github.com/docker/docker/cmd/dockerd/docker.go:70 +0xca fp=0xc0022aff80 sp=0xc0022aff18 pc=0x56408d3f8f0a ....... Jun 18 16:54:12 moss-0001 systemd[1]: docker.service: Main process exited, code=killed, status=6/ABRT Jun 18 16:54:12 moss-0001 systemd[1]: docker.service: Failed with result 'signal'. Jun 18 16:54:12 moss-0001 systemd[1]: docker.service: Unit process 4070 (containerd) remains running after unit stopped. Jun 18 16:54:12 moss-0001 systemd[1]: docker.service: Unit process 4401 (containerd-shim) remains running after unit stopped. Jun 18 16:54:12 moss-0001 systemd[1]: docker.service: Unit process 4466 (containerd-shim) remains running after unit stopped.
at the moment the Linux system message log file print: error during connect: Post "http://%2Fvar%2Frun8%2Fdocker.sock/v1.39/swarm/leave?force=1": EOF
After a fresh install of 18.09.0 on a CentOS 7.5 box, the following errors appear in the log:
On daemon stop:
On daemon start:
Steps to reproduce the issue:
Describe the results you received:
Error messages as depicted above.
Describe the results you expected:
No errors :)
Additional information you deem important (e.g. issue happens only occasionally):
Output of
docker version
:Output of
docker info
:Note: These errors are emitted on every start/stop, not just once.