You can check a node's CPU and memory usage with (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) and node_memory_MemFree_bytes, and enlarge the time frame to see what happened at that time.
More metrics and their meanings can be seen here and here.
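If the web UI is not convenient, both expressions can also be pulled over a longer window straight from the Prometheus HTTP API; a minimal curl sketch, where <prometheus-host>:<port> is a placeholder for your deployment's Prometheus endpoint:
# CPU usage (%) per node over the last 6 hours, at 1-minute resolution
curl -G 'http://<prometheus-host>:<port>/api/v1/query_range' \
  --data-urlencode 'query=100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
  --data-urlencode "start=$(date -d '-6 hours' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'
# free memory in bytes per node over the same window
curl -G 'http://<prometheus-host>:<port>/api/v1/query_range' \
  --data-urlencode 'query=node_memory_MemFree_bytes' \
  --data-urlencode "start=$(date -d '-6 hours' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'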
This is usually due to a system-level problem and may have many causes.
Where is your machine, a cloud VM or a physical machine? You can boot a cloud VM back up using the cloud command line or the web portal.
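The exact command depends on which cloud the VM runs on; purely as a hypothetical illustration, with the Azure CLI it would look like this (resource group and VM name are placeholders):
# start a stopped/deallocated VM
az vm start --resource-group <resource-group> --name <node-vm-name>
# if the VM shows as running but is unresponsive, a restart may be needed
az vm restart --resource-group <resource-group> --name <node-vm-name>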
Thanks @xudifsd. The node is a cloud VM, but this issue happens again after booting it up.
One fact is that the node is still alive but cannot be logged into (the login service is stopped), so is it possible that this is caused by SSH connections being made too frequently and tripping a login-service limit?
No, pai services do not ssh into a node. We have not met this problem before.
@xudifsd we got the same issue yesterday; it seems to be caused by TaskDelete (after the TaskDelete log entries, many services are killed)? The log on one node shows the following:
May 7 18:03:37 localhost dockerd[4163]: time="2019-05-07T18:03:37.888880333+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:04 localhost containerd[1473]: time="2019-05-07T18:04:04.046519878+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/8744a2372af4376b522b0560732fd6a7ba6d52ce91f8f0d4375d3f943f0c1b7b/shim.sock" debug=false pid=93485
May 7 18:04:10 localhost containerd[1473]: time="2019-05-07T18:04:10.059296363+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/c587961b097d5582dd7561bf75458e56b0ab82ca78224457e5617c6e5ea4ff4e/shim.sock" debug=false pid=94094
May 7 18:04:11 localhost containerd[1473]: time="2019-05-07T18:04:11.038139893+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/bb99fed90677d38ec56738648aece958f84b0fa0fe40f8f301092698b2485b50/shim.sock" debug=false pid=94290
May 7 18:04:13 localhost containerd[1473]: time="2019-05-07T18:04:13.295003447+08:00" level=info msg="shim reaped" id=c587961b097d5582dd7561bf75458e56b0ab82ca78224457e5617c6e5ea4ff4e
May 7 18:04:13 localhost dockerd[4163]: time="2019-05-07T18:04:13.305062134+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:13 localhost containerd[1473]: time="2019-05-07T18:04:13.443739665+08:00" level=info msg="shim reaped" id=8744a2372af4376b522b0560732fd6a7ba6d52ce91f8f0d4375d3f943f0c1b7b
May 7 18:04:13 localhost dockerd[4163]: time="2019-05-07T18:04:13.453872864+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:13 localhost containerd[1473]: time="2019-05-07T18:04:13.773523289+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/9ce1b4e04ace76039d29ed2204b53116f342b3fed957905f29dd1aaf3efe8ce5/shim.sock" debug=false pid=94568
May 7 18:04:19 localhost containerd[1473]: time="2019-05-07T18:04:19.296900629+08:00" level=info msg="shim reaped" id=9ce1b4e04ace76039d29ed2204b53116f342b3fed957905f29dd1aaf3efe8ce5
May 7 18:04:19 localhost dockerd[4163]: time="2019-05-07T18:04:19.306827799+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:19 localhost containerd[1473]: time="2019-05-07T18:04:19.552683014+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/541a2604ab79651a575ff1e778acbf17e76a076366605fc7eaa053dae93decfc/shim.sock" debug=false pid=95114
May 7 18:04:25 localhost containerd[1473]: time="2019-05-07T18:04:25.309914024+08:00" level=info msg="shim reaped" id=541a2604ab79651a575ff1e778acbf17e76a076366605fc7eaa053dae93decfc
May 7 18:04:25 localhost dockerd[4163]: time="2019-05-07T18:04:25.319706024+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:28 localhost containerd[1473]: time="2019-05-07T18:04:28.513070771+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/103395d557f107b3e27cc852b56cad27e446b7db40cb1ff8d455d92ded1e8399/shim.sock" debug=false pid=95977
May 7 18:04:30 localhost containerd[1473]: time="2019-05-07T18:04:30.518177183+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/92dcb56b0d79d3e366e4f55207484c7ec2346ac6a78c309a39381c24898d6e7e/shim.sock" debug=false pid=96129
May 7 18:04:31 localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
May 7 18:04:31 localhost systemd[1]: ssh.service: Unit entered failed state.
May 7 18:04:31 localhost systemd[1]: ssh.service: Failed with result 'signal'.
May 7 18:04:32 localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
May 7 18:04:32 localhost systemd[1]: Stopped OpenBSD Secure Shell server.
May 7 18:04:32 localhost systemd[1]: Starting OpenBSD Secure Shell server...
May 7 18:04:32 localhost systemd[1]: ssh.service: Main process exited, code=exited, status=255/n/a
May 7 18:04:32 localhost systemd[1]: Failed to start OpenBSD Secure Shell server.
May 7 18:04:32 localhost systemd[1]: ssh.service: Unit entered failed state.
May 7 18:04:32 localhost systemd[1]: ssh.service: Failed with result 'exit-code'.
May 7 18:04:36 localhost containerd[1473]: time="2019-05-07T18:04:36.129460918+08:00" level=info msg="shim reaped" id=103395d557f107b3e27cc852b56cad27e446b7db40cb1ff8d455d92ded1e8399
May 7 18:04:36 localhost dockerd[4163]: time="2019-05-07T18:04:36.139429284+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:36 localhost containerd[1473]: time="2019-05-07T18:04:36.732186530+08:00" level=info msg="shim reaped" id=92dcb56b0d79d3e366e4f55207484c7ec2346ac6a78c309a39381c24898d6e7e
May 7 18:04:36 localhost dockerd[4163]: time="2019-05-07T18:04:36.742141988+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:41 localhost containerd[1473]: time="2019-05-07T18:04:41.009693437+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/258c4644a189b5635dacc66a9e46255152fa98ef042146796f2cad79866294cb/shim.sock" debug=false pid=97166
May 7 18:04:47 localhost containerd[1473]: time="2019-05-07T18:04:47.971487482+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/7d2a753c26dc46f43a6f86798b606d77831b024e496eb0c38c383a942b4f89c0/shim.sock" debug=false pid=97735
May 7 18:04:52 localhost containerd[1473]: time="2019-05-07T18:04:52.950117186+08:00" level=info msg="shim reaped" id=258c4644a189b5635dacc66a9e46255152fa98ef042146796f2cad79866294cb
May 7 18:04:52 localhost dockerd[4163]: time="2019-05-07T18:04:52.960071477+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:54 localhost containerd[1473]: time="2019-05-07T18:04:54.241442108+08:00" level=info msg="shim reaped" id=7d2a753c26dc46f43a6f86798b606d77831b024e496eb0c38c383a942b4f89c0
May 7 18:04:54 localhost dockerd[4163]: time="2019-05-07T18:04:54.251470613+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:04:57 localhost containerd[1473]: time="2019-05-07T18:04:57.477628939+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/ff558ac8435e4063775aa6c639e19ed8d821f1fc9e3f523b71778cd7b688d829/shim.sock" debug=false pid=98431
May 7 18:04:58 localhost containerd[1473]: time="2019-05-07T18:04:58.505492385+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/90feb1f411dc6859b8be19c21baae146ea6047196fc7781073f7c44886fffc1c/shim.sock" debug=false pid=98645
May 7 18:05:00 localhost containerd[1473]: time="2019-05-07T18:05:00.513350979+08:00" level=info msg="shim reaped" id=90feb1f411dc6859b8be19c21baae146ea6047196fc7781073f7c44886fffc1c
May 7 18:05:00 localhost dockerd[4163]: time="2019-05-07T18:05:00.522974999+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:05:01 localhost containerd[1473]: time="2019-05-07T18:05:01.161405883+08:00" level=info msg="shim reaped" id=ff558ac8435e4063775aa6c639e19ed8d821f1fc9e3f523b71778cd7b688d829
May 7 18:05:01 localhost dockerd[4163]: time="2019-05-07T18:05:01.171454385+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:05:01 localhost CRON[99238]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
May 7 18:05:01 localhost CRON[99239]: (root) CMD (/opt/cloud/agent-manager-client/control start > /dev/null 2>&1)
May 7 18:05:04 localhost systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:04 localhost systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
May 7 18:05:04 localhost systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
May 7 18:05:04 localhost systemd[1]: lvm2-lvmetad.service: Service hold-off time over, scheduling restart.
May 7 18:05:04 localhost systemd[1]: Stopped LVM2 metadata daemon.
May 7 18:05:04 localhost systemd[1]: Started LVM2 metadata daemon.
May 7 18:05:05 localhost systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:05 localhost systemd[1]: systemd-udevd.service: Unit entered failed state.
May 7 18:05:05 localhost systemd[1]: systemd-udevd.service: Failed with result 'signal'.
May 7 18:05:05 localhost systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
May 7 18:05:05 localhost systemd[1]: Stopped udev Kernel Device Manager.
May 7 18:05:05 localhost systemd[1]: Starting udev Kernel Device Manager...
May 7 18:05:05 localhost systemd-udevd[99580]: Network interface NamePolicy= disabled on kernel command line, ignoring.
May 7 18:05:05 localhost systemd[1]: Started udev Kernel Device Manager.
May 7 18:05:14 localhost systemd[1]: nfs-idmapd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:14 localhost systemd[1]: nfs-idmapd.service: Unit entered failed state.
May 7 18:05:14 localhost systemd[1]: nfs-idmapd.service: Failed with result 'signal'.
May 7 18:05:14 localhost containerd[1473]: time="2019-05-07T18:05:14.644971373+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/2bb516f3bf4a61593f460775fd917dcab96e7de7c143083a8b368eec301016ae/shim.sock" debug=false pid=100206
May 7 18:05:19 localhost containerd[1473]: time="2019-05-07T18:05:19.399390794+08:00" level=info msg="shim reaped" id=2bb516f3bf4a61593f460775fd917dcab96e7de7c143083a8b368eec301016ae
May 7 18:05:19 localhost dockerd[4163]: time="2019-05-07T18:05:19.409393676+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="events.TaskDelete"
May 7 18:05:21 localhost systemd[1]: iscsid.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:21 localhost iscsiadm[100868]: iscsiadm: can not connect to iSCSI daemon (111)!
May 7 18:05:21 localhost iscsiadm[100868]: iscsiadm: initiator reported error (20 - could not connect to iscsid)
May 7 18:05:21 localhost iscsiadm[100868]: iscsiadm: Could not stop iscsid. Trying sending iscsid SIGTERM or SIGKILL signals manually
May 7 18:05:21 localhost systemd[1]: iscsid.service: Unit entered failed state.
May 7 18:05:21 localhost systemd[1]: iscsid.service: Failed with result 'signal'.
May 7 18:05:22 localhost systemd[1]: rpcbind.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:22 localhost systemd[1]: rpcbind.service: Unit entered failed state.
May 7 18:05:22 localhost systemd[1]: rpcbind.service: Failed with result 'signal'.
May 7 18:05:22 localhost systemd[1]: rpcbind.service: Service hold-off time over, scheduling restart.
May 7 18:05:22 localhost systemd[1]: Stopped target RPC Port Mapper.
May 7 18:05:22 localhost systemd[1]: Stopping RPC Port Mapper.
May 7 18:05:22 localhost systemd[1]: Stopped RPC bind portmap service.
May 7 18:05:22 localhost systemd[1]: Starting RPC bind portmap service...
May 7 18:05:22 localhost rpcbind[101042]: cannot create socket for udp6
May 7 18:05:22 localhost rpcbind[101042]: cannot create socket for tcp6
May 7 18:05:22 localhost systemd[1]: Started RPC bind portmap service.
May 7 18:05:22 localhost systemd[1]: Reached target RPC Port Mapper.
May 7 18:05:23 localhost systemd[1]: nfs-mountd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:23 localhost systemd[1]: nfs-mountd.service: Unit entered failed state.
May 7 18:05:23 localhost systemd[1]: nfs-mountd.service: Failed with result 'signal'.
May 7 18:05:25 localhost systemd[1]: atd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:25 localhost systemd[1]: atd.service: Unit entered failed state.
May 7 18:05:25 localhost systemd[1]: atd.service: Failed with result 'signal'.
May 7 18:05:26 localhost systemd[1]: acpid.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:26 localhost systemd[1]: acpid.service: Unit entered failed state.
May 7 18:05:26 localhost systemd[1]: acpid.service: Failed with result 'signal'.
May 7 18:05:28 localhost systemd[1]: agent-manager-client.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:28 localhost control[101407]: /opt/cloud/agent-manager-client/control: line 41: kill: (1410) - No such process
May 7 18:05:28 localhost control[101407]: agent-manager-client stoped...
May 7 18:05:28 localhost systemd[1]: agent-manager-client.service: Unit entered failed state.
May 7 18:05:28 localhost systemd[1]: agent-manager-client.service: Failed with result 'signal'.
May 7 18:05:29 localhost systemd[1]: avahi-daemon.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:29 localhost systemd[1]: avahi-daemon.service: Unit entered failed state.
May 7 18:05:29 localhost systemd[1]: avahi-daemon.service: Failed with result 'signal'.
May 7 18:05:32 localhost systemd[1]: accounts-daemon.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:32 localhost systemd[1]: accounts-daemon.service: Unit entered failed state.
May 7 18:05:32 localhost systemd[1]: accounts-daemon.service: Failed with result 'signal'.
May 7 18:05:34 localhost rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="101849" x-info="http://www.rsyslog.com"] start
May 7 18:05:34 localhost rsyslogd-2222: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
May 7 18:05:34 localhost rsyslogd: rsyslogd's groupid changed to 108
May 7 18:05:34 localhost rsyslogd: rsyslogd's userid changed to 104
May 7 18:05:33 localhost systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:33 localhost systemd[1]: rsyslog.service: Unit entered failed state.
May 7 18:05:33 localhost systemd[1]: rsyslog.service: Failed with result 'signal'.
May 7 18:05:34 localhost systemd[1]: rsyslog.service: Service hold-off time over, scheduling restart.
May 7 18:05:34 localhost systemd[1]: Stopped System Logging Service.
May 7 18:05:34 localhost systemd[1]: Starting System Logging Service...
May 7 18:05:34 localhost rsyslogd-2039: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]
May 7 18:05:34 localhost rsyslogd-2007: action 'action 11' suspended, next retry is Tue May 7 18:06:04 2019 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
May 7 18:05:34 localhost systemd[1]: Started System Logging Service.
May 7 18:05:34 localhost containerd[1473]: time="2019-05-07T18:05:34.850266407+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/828fe5e634f722bb3868bb6977519e48fc8fd97f3d839562df7817bd37f1fd82/shim.sock" debug=false pid=102011
May 7 18:05:35 localhost systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:35 localhost systemd[1]: snapd.service: Unit entered failed state.
May 7 18:05:35 localhost systemd[1]: snapd.service: Failed with result 'signal'.
May 7 18:05:36 localhost systemd[1]: snapd.service: Service hold-off time over, scheduling restart.
May 7 18:05:36 localhost systemd[1]: Stopped Snappy daemon.
May 7 18:05:36 localhost systemd[1]: Starting Snappy daemon...
May 7 18:05:36 localhost snapd[102084]: AppArmor status: apparmor is enabled and all features are available
May 7 18:05:36 localhost snapd[102084]: 2019/05/07 18:05:36.113977 daemon.go:323: started snapd/2.32.3.2 (series 16; classic) ubuntu/16.04 (amd64) linux/4.4.0-122-generic.
May 7 18:05:36 localhost systemd[1]: Started Snappy daemon.
May 7 18:05:37 localhost systemd[1]: containerd.service: Main process exited, code=killed, status=9/KILL
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361052345+08:00" level=info msg="blockingPicker: the picked transport is not ready, loop back to repick" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.360177027+08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=moby
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.360173279+08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = transport is closing" module=libcontainerd namespace=plugins.moby
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361154108+08:00" level=info msg="blockingPicker: the picked transport is not ready, loop back to repick" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361157325+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420aec060, TRANSIENT_FAILURE" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361188437+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420aec060, CONNECTING" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361085093+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc4209c6180, TRANSIENT_FAILURE" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361203310+08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc4209c6180, CONNECTING" module=grpc
May 7 18:05:37 localhost dockerd[4163]: time="2019-05-07T18:05:37.361213182+08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0
You can view the pai service logs. It seems this was not caused by pai services.
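For example, since the pai services run as Kubernetes pods, the job-exporter logs named in the alerts can usually be pulled with kubectl from the master node; the label selector below is an assumption, adjust it to your deployment:
# find the job-exporter pod running on the affected node (label assumed)
kubectl get pods -o wide -l app=job-exporter
# dump the logs of the pod reported in the alert, e.g. job-exporter-7x6fn
kubectl logs job-exporter-7x6fn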
OpenPAI Environment:
OS (uname -a):
Issue Description: After running the OpenPAI services for a long time (1 or 2 days?), some of the worker nodes go down. Checking the health status in Prometheus shows:
JobExporterHangs (9 active)
alert: JobExporterHangs
expr: irate(collector_iteration_count_total[5m]) == 0
for: 5m
labels:
  type: pai_service
annotations:
  summary: '{{$labels.name}} in {{$labels.instance}} hangs detected'
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="zombie_collector",pai_service_name="job-exporter",scraped_from="job-exporter-7x6fn",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="docker_daemon_collector",pai_service_name="job-exporter",scraped_from="job-exporter-7x6fn",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="container_collector",pai_service_name="job-exporter",scraped_from="job-exporter-7x6fn",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="zombie_collector",pai_service_name="job-exporter",scraped_from="job-exporter-kkcmx",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="docker_daemon_collector",pai_service_name="job-exporter",scraped_from="job-exporter-kkcmx",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="container_collector",pai_service_name="job-exporter",scraped_from="job-exporter-kkcmx",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="zombie_collector",pai_service_name="job-exporter",scraped_from="job-exporter-xltj9",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="docker_daemon_collector",pai_service_name="job-exporter",scraped_from="job-exporter-xltj9",type="pai_service"}
ALERTS{alertname="JobExporterHangs",alertstate="firing",instance=":9102",job="pai_serivce_exporter",name="container_collector",pai_service_name="job-exporter",scraped_from="job-exporter-xltj9",type="pai_service"}
NodeNotReady (3 active)
alert: NodeNotReady
expr: pai_node_count{ready!="true"}
docker_daemon_count{error="Command '['docker', 'info']' returned non-zero exit status 1.",instance=":9102",job="pai_serivce_exporter",pai_service_name="job-exporter",scraped_from="job-exporter-cf98d"}
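That docker_daemon_count error means the exporter's docker info call failed on that node. If a shell can still be reached (for example through the cloud console), the daemon state can be confirmed directly; a minimal check, assuming docker and containerd are managed by systemd:
# confirm whether the docker daemon answers at all
docker info > /dev/null; echo "docker info exit code: $?"
# check the daemon units and their recent errors
systemctl status docker containerd --no-pager
journalctl -u docker -u containerd --since "1 hour ago" --no-pager | tail -n 50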
Then I checked the system log on the specific node; it shows the following (a quick way to filter out just the kill events is sketched after the excerpt):
localhost dockerd[2596]: time="2019-05-05T11:14:23.481834565+08:00" level=warning msg="Your kernel does not support swap limit capabilities,or the cgroup is not mounted. Memory limited without swap."
localhost containerd[1464]: time="2019-05-05T11:14:23.532577840+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/7b12f492825d80b09a648a7a04aa5de74d5d94354c12088a7f5191d1faf09766/shim.sock" debug=false pid=160603
localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: ssh.service: Unit entered failed state.
localhost systemd[1]: ssh.service: Failed with result 'signal'.
localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped OpenBSD Secure Shell server.
localhost systemd[1]: Starting OpenBSD Secure Shell server...
localhost systemd[1]: Started OpenBSD Secure Shell server.
localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: ssh.service: Unit entered failed state.
localhost systemd[1]: ssh.service: Failed with result 'signal'.
localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped OpenBSD Secure Shell server.
localhost systemd[1]: Starting OpenBSD Secure Shell server...
localhost systemd[1]: Started OpenBSD Secure Shell server.
localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: ssh.service: Unit entered failed state.
localhost systemd[1]: ssh.service: Failed with result 'signal'.
localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped OpenBSD Secure Shell server.
localhost systemd[1]: Starting OpenBSD Secure Shell server...
localhost systemd[1]: Started OpenBSD Secure Shell server.
localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: ssh.service: Unit entered failed state.
localhost systemd[1]: ssh.service: Failed with result 'signal'.
localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped OpenBSD Secure Shell server.
localhost systemd[1]: Starting OpenBSD Secure Shell server...
localhost systemd[1]: Started OpenBSD Secure Shell server.
localhost systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: ssh.service: Unit entered failed state.
localhost systemd[1]: ssh.service: Failed with result 'signal'.
localhost systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped OpenBSD Secure Shell server.
localhost systemd[1]: ssh.service: Start request repeated too quickly.
localhost systemd[1]: Failed to start OpenBSD Secure Shell server.
localhost containerd[1464]: time="2019-05-05T11:14:36.426468968+08:00" level=info msg="shim reaped" id=9d9327ede31deadd518f1e2df7a29225d567cda0c3ba9f78b6cfa5dc915815f5
localhost dockerd[2596]: time="2019-05-05T11:14:36.436314973+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
localhost systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: systemd-journald.service: Unit entered failed state.
localhost systemd[1]: Starting Flush Journal to Persistent Storage...
localhost systemd[1]: Started Flush Journal to Persistent Storage.
localhost kernel: [ 3022.182297] TCP: request_sock_TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
localhost systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
localhost systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
localhost systemd[1]: lvm2-lvmetad.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped LVM2 metadata daemon.
localhost systemd[1]: Started LVM2 metadata daemon.
localhost systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: systemd-udevd.service: Unit entered failed state.
localhost systemd[1]: systemd-udevd.service: Failed with result 'signal'.
localhost systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped udev Kernel Device Manager.
localhost systemd[1]: Starting udev Kernel Device Manager...
localhost systemd-udevd[163384]: Network interface NamePolicy= disabled on kernel command line, ignoring.
localhost systemd[1]: Started udev Kernel Device Manager.
localhost CRON[164510]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[164512]: (root) CMD (/opt/cloud/agent-manager-client/control start > /dev/null 2>&1)
localhost CRON[172297]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[178742]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
localhost CRON[178743]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[185690]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[193522]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[200377]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost CRON[200379]: (root) CMD (/opt/cloud/agent-manager-client/control start > /dev/null 2>&1)
localhost CRON[200380]: (root) CMD (/sbin/lsmod | grep nvidia > /dev/null && nvidia-smi -pm ENABLED > /dev/null)
localhost CRON[200375]: (CRON) info (No MTA installed, discarding output)
localhost CRON[208476]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
localhost systemd[1]: nfs-idmapd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: nfs-idmapd.service: Unit entered failed state.
localhost systemd[1]: nfs-idmapd.service: Failed with result 'signal'.
localhost systemd[1]: iscsid.service: Main process exited, code=killed, status=9/KILL
localhost iscsiadm[209865]: iscsiadm: can not connect to iSCSI daemon (111)!
localhost iscsiadm[209865]: iscsiadm: initiator reported error (20 - could not connect to iscsid)
localhost iscsiadm[209865]: iscsiadm: Could not stop iscsid. Trying sending iscsid SIGTERM or SIGKILL signals manually
localhost systemd[1]: iscsid.service: Unit entered failed state.
localhost systemd[1]: iscsid.service: Failed with result 'signal'.
localhost systemd[1]: rpcbind.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: rpcbind.service: Unit entered failed state.
localhost systemd[1]: rpcbind.service: Failed with result 'signal'.
localhost systemd[1]: rpcbind.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped target RPC Port Mapper.
localhost systemd[1]: Stopping RPC Port Mapper.
localhost systemd[1]: Stopped RPC bind portmap service.
localhost systemd[1]: Starting RPC bind portmap service...
localhost rpcbind[209962]: cannot create socket for udp6
localhost rpcbind[209962]: cannot create socket for tcp6
localhost systemd[1]: Started RPC bind portmap service.
localhost systemd[1]: Reached target RPC Port Mapper.
localhost systemd[1]: nfs-mountd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: nfs-mountd.service: Unit entered failed state.
localhost systemd[1]: nfs-mountd.service: Failed with result 'signal'.
localhost systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
localhost avahi-daemon[1405]: Disconnected from D-Bus, exiting.
localhost avahi-daemon[1405]: Got SIGTERM, quitting.
localhost avahi-daemon[1405]: Leaving mDNS multicast group on interface docker0.IPv4 with address 172.17.0.1.
localhost avahi-daemon[1405]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.10.47.5.
localhost avahi-daemon[1405]: avahi-daemon 0.6.32-rc exiting.
localhost systemd[1]: dbus.service: Unit entered failed state.
localhost systemd[1]: dbus.service: Failed with result 'signal'.
localhost systemd[1]: cron.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: cron.service: Unit entered failed state.
localhost systemd[1]: cron.service: Failed with result 'signal'.
localhost rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="210356" x-info="http://www.rsyslog.com"] start
localhost rsyslogd-2222: command 'KLogPermitNonKernelFacility' is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
localhost rsyslogd: rsyslogd's groupid changed to 108
localhost rsyslogd: rsyslogd's userid changed to 104
localhost systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: rsyslog.service: Unit entered failed state.
localhost systemd[1]: rsyslog.service: Failed with result 'signal'.
localhost systemd[1]: rsyslog.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped System Logging Service.
localhost systemd[1]: Starting System Logging Service...
localhost rsyslogd-2039: Could not open output pipe '/dev/xconsole':: No such file or directory [v8.16.0 try http://www.rsyslog.com/e/2039 ]
localhost rsyslogd-2007: action 'action 11' suspended, next retry is Sun May 5 11:21:53 2019 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
localhost systemd[1]: Started System Logging Service.
localhost systemd[1]: atd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: atd.service: Unit entered failed state.
localhost systemd[1]: atd.service: Failed with result 'signal'.
localhost kernel: [ 3942.895559] hrtimer: interrupt took 635042 ns
localhost systemd[1]: systemd-logind.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'signal'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started D-Bus System Message Bus.
localhost systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
localhost systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
localhost systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Avahi': Device or resource busy
localhost dbus-daemon[324247]: Unknown username "whoopsie" in message bus configuration file
localhost dbus[324247]: [system] AppArmor D-Bus mediation is enabled
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: acpid.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: acpid.service: Unit entered failed state.
localhost systemd[1]: acpid.service: Failed with result 'signal'.
localhost systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: snapd.service: Unit entered failed state.
localhost systemd[1]: snapd.service: Failed with result 'signal'.
localhost systemd[1]: snapd.service: Service hold-off time over, scheduling restart.
localhost systemd[1]: Stopped Snappy daemon.
localhost systemd[1]: Starting Snappy daemon...
localhost snapd[324394]: AppArmor status: apparmor is enabled and all features are available
localhost snapd[324394]: 2019/05/05 11:40:17.246817 daemon.go:323: started snapd/2.32.3.2 (series 16; classic) ubuntu/16.04 (amd64) linux/4.4.0-122-generic.
localhost systemd[1]: Started Snappy daemon.
localhost systemd[1]: agent-manager-client.service: Main process exited, code=killed, status=9/KILL
localhost control[324623]: /opt/cloud/agent-manager-client/control: line 41: kill: (1439) - No such process
localhost control[324623]: agent-manager-client stoped...
localhost systemd[1]: agent-manager-client.service: Unit entered failed state.
localhost systemd[1]: agent-manager-client.service: Failed with result 'signal'.
localhost systemd[1]: containerd.service: Main process exited, code=killed, status=9/KILL
localhost systemd[1]: containerd.service: Unit entered failed state.
localhost systemd[1]: containerd.service: Failed with result 'signal'.
localhost systemd[1]: Stopping Docker Application Container Engine...
localhost systemd[1]: Stopped Docker Application Container Engine.
localhost systemd[1]: Closed Docker Socket for the API.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
localhost systemd[1]: Starting Login Service...
localhost systemd[1]: Started User Manager for UID 0.
localhost systemd[1]: systemd-logind.service: Start operation timed out. Terminating.
localhost systemd[1]: Failed to start Login Service.
localhost systemd[1]: systemd-logind.service: Unit entered failed state.
localhost systemd[1]: systemd-logind.service: Failed with result 'timeout'.
localhost systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
localhost systemd[1]: Stopped Login Service.
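To pull just the kill events out of a log like the one above, a simple grep over syslog is enough; a rough sketch, assuming the standard Ubuntu /var/log/syslog path and that the journal survived:
# list every service main process that was killed with SIGKILL, with timestamp and unit name
grep "status=9/KILL" /var/log/syslog | awk '{print $1, $2, $3, $6}'
# the same from journald, if a persistent journal is available
journalctl --since "2019-05-05 11:00" --no-pager | grep "status=9/KILL"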
Any suggestions?
By the way, how can we boot up the node that is down?