microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

fatal: [node02]: FAILED! => {"changed": false, "msg": "Unable to restart service docker: Job for docker.service failed because the control process exited with error code.\nSee \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"} #4612

Closed liudican closed 4 years ago

liudican commented 4 years ago

When I install OpenPAI v1.0.y and run the command /bin/bash quick-start-kubespray.sh -m ~/master.csv -w ~/worker.csv -c ~/config, it fails with this error:

RUNNING HANDLER [container-engine/docker : Docker | reload docker.socket] *****
Wednesday 10 June 2020 16:49:42 +0800 (0:00:00.707) 0:02:20.713 ****

RUNNING HANDLER [container-engine/docker : Docker | reload docker] ****
Wednesday 10 June 2020 16:49:42 +0800 (0:00:00.031) 0:02:20.745 ****
fatal: [node02]: FAILED! => {"changed": false, "msg": "Unable to restart service docker: Job for docker.service failed because the control process exited with error code.\nSee \"systemctl status docker.service\" and \"journalctl -xe\" for details.\n"}

systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─docker-dns.conf, docker-options.conf
   Active: failed (Result: exit-code) since Wed 2020-06-10 16:49:43 HKT; 7min ago
     Docs: http://docs.docker.com
  Process: 1153 ExecStart=/usr/bin/dockerd $DOCKER_OPTS $DOCKER_STORAGE_OPTIONS $DOCKER_NETWORK_OPTIONS $DOCKER_DNS_OPTIONS $INSECURE_REGISTRY (code=ex
 Main PID: 1153 (code=exited, status=1/FAILURE)

journalctl -xe:
Jun 10 16:45:16 node02 ansible-ping[31295]: Invoked with data=pong
Jun 10 16:45:48 node02 ansible-setup[31397]: Invoked with filter= gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Jun 10 16:46:29 node02 ansible-ping[31451]: Invoked with data=pong
Jun 10 16:46:30 node02 ansible-ping[31469]: Invoked with data=pong
Jun 10 16:47:02 node02 ansible-setup[31573]: Invoked with filter= gather_subset=['all'] fact_path=/etc/ansible/facts.d gather_timeout=10
Jun 10 16:48:58 node02 sshd[31281]: Received disconnect from 192.168.100.111 port 52182:11: disconnected by user
Jun 10 16:48:58 node02 sshd[31281]: Disconnected from user tsing01 192.168.100.111 port 52182

ydye commented 4 years ago

Please paste the output of this command from the failing node.

sudo journalctl -u docker | tail -n200
liudican commented 4 years ago
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.414661981+08:00" level=warning msg="Your kernel does not support swap memory limit"
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.414688209+08:00" level=warning msg="Your kernel does not support cgroup rt period"
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.414694804+08:00" level=warning msg="Your kernel does not support cgroup rt runtime"
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.414929741+08:00" level=info msg="Loading containers: start."
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.498338648+08:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.678588446+08:00" level=info msg="Loading containers: done."
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.834339957+08:00" level=info msg="Docker daemon" commit=2d0083d graphdriver(s)=overlay2 version=18.09.7
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.834513196+08:00" level=info msg="Daemon has completed initialization"
Jun 10 17:21:50 node02 dockerd[7077]: time="2020-06-10T17:21:50.852119165+08:00" level=info msg="API listen on /var/run/docker.sock"
Jun 10 17:21:50 node02 systemd[1]: Started Docker Application Container Engine.
Jun 10 17:21:54 node02 systemd[1]: Stopping Docker Application Container Engine...
Jun 10 17:21:54 node02 dockerd[7077]: time="2020-06-10T17:21:54.576921017+08:00" level=info msg="Processing signal 'terminated'"
Jun 10 17:21:54 node02 systemd[1]: Stopped Docker Application Container Engine.
Jun 10 17:21:54 node02 systemd[1]: Starting Docker Application Container Engine...
Jun 10 17:21:54 node02 dockerd[7420]: Status: invalid argument "N" for "--registry-mirror" flag: invalid mirror: unsupported scheme "" in "N"
Jun 10 17:21:54 node02 dockerd[7420]: See 'dockerd --help'., Code: 125
Jun 10 17:21:54 node02 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 17:21:54 node02 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 10 17:21:54 node02 systemd[1]: Failed to start Docker Application Container Engine.
Jun 10 17:21:54 node02 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 10 17:21:54 node02 systemd[1]: docker.service: Scheduled restart job, restart counter is at 1.
Jun 10 17:21:54 node02 systemd[1]: Stopped Docker Application Container Engine.
Jun 10 17:21:54 node02 systemd[1]: Starting Docker Application Container Engine...
Jun 10 17:21:54 node02 dockerd[7460]: Status: invalid argument "N" for "--registry-mirror" flag: invalid mirror: unsupported scheme "" in "N"
Jun 10 17:21:54 node02 dockerd[7460]: See 'dockerd --help'., Code: 125
Jun 10 17:21:54 node02 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 17:21:54 node02 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 10 17:21:54 node02 systemd[1]: Failed to start Docker Application Container Engine.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Scheduled restart job, restart counter is at 2.
Jun 10 17:21:55 node02 systemd[1]: Stopped Docker Application Container Engine.
Jun 10 17:21:55 node02 systemd[1]: Starting Docker Application Container Engine...
Jun 10 17:21:55 node02 dockerd[7499]: Status: invalid argument "N" for "--registry-mirror" flag: invalid mirror: unsupported scheme "" in "N"
Jun 10 17:21:55 node02 dockerd[7499]: See 'dockerd --help'., Code: 125
Jun 10 17:21:55 node02 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 17:21:55 node02 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 10 17:21:55 node02 systemd[1]: Failed to start Docker Application Container Engine.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Service hold-off time over, scheduling restart.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Jun 10 17:21:55 node02 systemd[1]: Stopped Docker Application Container Engine.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Start request repeated too quickly.
Jun 10 17:21:55 node02 systemd[1]: docker.service: Failed with result 'exit-code'.
Jun 10 17:21:55 node02 systemd[1]: Failed to start Docker Application Container Engine.
liudican commented 4 years ago

This is my error log. Thank you!

ydye commented 4 years ago
Jun 10 17:21:54 node02 dockerd[7460]: Status: invalid argument "N" for "--registry-mirror" flag: invalid mirror: unsupported scheme "" in "N"

Please paste the Docker configuration files at the following paths:

/etc/docker/daemon.json
/etc/systemd/system/docker.service.d/docker-options.conf
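
The dockerd line quoted above names both the rejected flag and the value it received. As a minimal sketch, assuming journal output in the format pasted in this thread, the following Python pulls those two fields out of `journalctl -u docker` text. The function name `find_invalid_flags` is hypothetical and only for illustration.

```python
import re

# Matches dockerd's startup complaint as it appears in the journal lines above.
INVALID_ARG = re.compile(r'invalid argument "([^"]*)" for "(--[\w-]+)" flag')

def find_invalid_flags(journal_text):
    """Return unique (flag, bad_value) pairs in order of first appearance."""
    seen = []
    for line in journal_text.splitlines():
        match = INVALID_ARG.search(line)
        if match:
            pair = (match.group(2), match.group(1))
            if pair not in seen:
                seen.append(pair)
    return seen

# Two lines copied from the log in this thread.
sample = (
    'Jun 10 17:21:54 node02 dockerd[7420]: Status: invalid argument "N" for '
    '"--registry-mirror" flag: invalid mirror: unsupported scheme "" in "N"\n'
    "Jun 10 17:21:54 node02 dockerd[7420]: See 'dockerd --help'., Code: 125\n"
)
print(find_invalid_flags(sample))
```

Deduplicating the pairs keeps the result short even when systemd retries the start several times, as it does in the log above.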
liudican commented 4 years ago

auto complete: /etc/systemd/system/docker.service.d/docker-options.conf [Service] Environment="DOCKER_OPTS= --registry-mirror=N --registry-mirror=o --registry-mirror=n --registry-mirror=e --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --iptables=false \ \ --registry-mirror=N --registry-mirror=o --registry-mirror=n --registry-mirror=e \ --data-root=/mnt/docker \ --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file \

/etc/docker/daemon.json

{ "registry-mirrors": ["https://qfcvhgta.mirror.aliyuncs.com"] }

liudican commented 4 years ago

But it still doesn't work!

ydye commented 4 years ago

That's your problem: --registry-mirror=N --registry-mirror=o --registry-mirror=n --registry-mirror=e

Can you paste the config.yml you used for quick-start?
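
The four single-letter mirror values spell "None". One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that a setting with no value was loaded as a null, turned into the text None, and then expanded one character at a time. The sketch below reproduces that effect in plain Python. `render_mirror_flags` is a hypothetical stand-in for a template loop, not the actual kubespray or OpenPAI code.

```python
def render_mirror_flags(mirrors):
    """Naively expand a mirrors setting into dockerd flags.

    Hypothetical stand-in for a template loop: it does not guard against
    the setting being a plain string instead of a list.
    """
    return " ".join(f"--registry-mirror={m}" for m in mirrors)

# An empty setting represented as None becomes the text "None" once stringified.
empty_setting = str(None)
print(render_mirror_flags(empty_setting))

# A real list expands to one flag per mirror.
print(render_mirror_flags(["https://mirror.aliyuncs.com"]))
```

Iterating over a string yields its characters, which is why a missing list can silently turn into four bogus flags instead of raising an error.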

liudican commented 4 years ago
user: tsing01
password: 123
branch_name: pai-1.0.y
docker_image_tag: v1.0.0

# Optional

#############################################
# Ansible-playbooks' inventory hosts' vars. #
#############################################
# ssh_key_file_path: /path/to/you/key/file

#####################################
# OpenPAI's service image registry. #
#####################################
# docker_registry_domain: docker.io
# docker_registry_namespace: openpai
# docker_registry_username: exampleuser
# docker_registry_password: examplepasswd

###########################################################################################
#                         Pre-check setting                                               #
# By default, we assume your gpu environment is nvidia. So your runtime should be nvidia. #
# If you are using AMD or other environment, you should modify it.                        #
###########################################################################################
# worker_default_docker_runtime: nvidia
# docker_check: true

# resource_check: true

# gpu_type: nvidia

########################################################################################
# Advanced docker configuration. If you are not familiar with them, don't change them. #
########################################################################################
# docker_data_root: /mnt/docker
# docker_config_file_path: /etc/docker/daemon.json
# docker_iptables_enabled: false

## An obvious use case is allowing insecure-registry access to self hosted registries.
## Can be ipaddress and domain_name.
## example define 172.19.16.11 or mirror.registry.io
# openpai_docker_insecure_registries:
#   - mirror.registry.io
#   - 172.19.16.11

## Add other registry,example China registry mirror.
openpai_docker_registry_mirrors:
#   - https://registry.docker-cn.com
#   - https://mirror.aliyuncs.com

#######################################################################
#                       kubespray setting                             #
#######################################################################

# If you couldn't access to gcr.io or docker.io, please configure it.
# gcr_image_repo: "gcr.io"
gcr_image_repo: "gcr.azk8s.cn"
# kube_image_repo: "gcr.io/google-containers"
kube_image_repo: "gcr.azk8s.cn/google_containers"
quay_image_repo: "quay.io"
#quay_image_repo: "quay.mirrors.ustc.edu.cn"
docker_image_repo: "docker.io"
#docker_image_repo: "registry.docker-cn.com"
# kubeadm_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
kubeadm_download_url: "https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubeadm"
# hyperkube_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kube_version }}/bin/linux/{{ image_arch }}/hyperkube"
hyperkube_download_url: "https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/hyperkube"

# openpai_kube_network_plugin: calico
ydye commented 4 years ago

Comment out this line:

openpai_docker_registry_mirrors:
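
The line flagged here is a key whose list items are all commented out, so the key itself is left with no value. As a rough sketch, assuming a flat config like the one pasted above, the following Python lists such keys with a simple line scan. It is not a YAML parser, and `keys_with_empty_value` is a hypothetical helper name.

```python
import re

# A top-level key followed by nothing, or only a trailing comment.
BARE_KEY = re.compile(r"^([A-Za-z_][\w-]*):\s*(#.*)?$")

def keys_with_empty_value(config_text):
    """Return top-level keys with no inline value and no uncommented children."""
    lines = config_text.splitlines()
    empty = []
    for index, line in enumerate(lines):
        match = BARE_KEY.match(line)
        if not match:
            continue
        has_child = False
        for following in lines[index + 1:]:
            stripped = following.strip()
            if not stripped or stripped.startswith("#"):
                continue  # blank lines and comments carry no value
            # An indented line or a list item belongs to the key above it.
            has_child = following[0] in " \t" or stripped.startswith("-")
            break
        if not has_child:
            empty.append(match.group(1))
    return empty

# Excerpt of the config.yml pasted in this thread.
sample = """user: tsing01
openpai_docker_registry_mirrors:
#   - https://registry.docker-cn.com
#   - https://mirror.aliyuncs.com

gcr_image_repo: "gcr.azk8s.cn"
"""
print(keys_with_empty_value(sample))
```

Either commenting out the key or uncommenting at least one list item would make this check come back empty for the excerpt.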
ydye commented 4 years ago

BTW, please use Markdown to format your logs and config.

liudican commented 4 years ago

OK, I'm sorry! Does this line need to be commented out? When I comment it out and run the script again, it reports the following error. I am in China. Is there any other solution? Thank you!

liudican commented 4 years ago

Wednesday 10 June 2020  18:53:47 +0800 (0:00:00.087)       0:01:49.437 ********
FAILED - RETRYING: download_container | Download image if required (4 retries left).
FAILED - RETRYING: download_container | Download image if required (4 retries left).
FAILED - RETRYING: download_container | Download image if required (3 retries left).
FAILED - RETRYING: download_container | Download image if required (2 retries left).
FAILED - RETRYING: download_container | Download image if required (3 retries left).
FAILED - RETRYING: download_container | Download image if required (1 retries left).
fatal: [node02 -> 192.168.100.100]: FAILED! => {"attempts": 4, "changed": true, "cmd": ["/usr/bin/docker", "pull", "gcr.azk8s.cn/google_containers/pause-amd64:3.1"], "delta": "0:00:00.268563", "end": "2020-06-10 18:54:00.979291", "msg": "non-zero return code", "rc": 1, "start": "2020-06-10 18:54:00.710728", "stderr": "Error response from daemon: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: \"<html>\\r\\n<head><title>403 Forbidden</title></head>\\r\\n<body bgcolor=\\\"white\\\">\\r\\n<center><h1>403 Forbidden</h1></center>\\r\\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\\r\\n</body>\\r\\n</html>\\r\\n\"", "stderr_lines": ["Error response from daemon: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: \"<html>\\r\\n<head><title>403 Forbidden</title></head>\\r\\n<body bgcolor=\\\"white\\\">\\r\\n<center><h1>403 Forbidden</h1></center>\\r\\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\\r\\n</body>\\r\\n</html>\\r\\n\""], "stdout": "", "stdout_lines": []}
FAILED - RETRYING: download_container | Download image if required (2 retries left).
FAILED - RETRYING: download_container | Download image if required (1 retries left).
fatal: [node01 -> 192.168.100.100]: FAILED! => {"attempts": 4, "changed": true, "cmd": ["/usr/bin/docker", "pull", "gcr.azk8s.cn/google_containers/pause-amd64:3.1"], "delta": "0:00:00.271022", "end": "2020-06-10 18:54:17.024551", "msg": "non-zero return code", "rc": 1, "start": "2020-06-10 18:54:16.753529", "stderr": "Error response from daemon: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: \"<html>\\r\\n<head><title>403 Forbidden</title></head>\\r\\n<body bgcolor=\\\"white\\\">\\r\\n<center><h1>403 Forbidden</h1></center>\\r\\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\\r\\n</body>\\r\\n</html>\\r\\n\"", "stderr_lines": ["Error response from daemon: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: \"<html>\\r\\n<head><title>403 Forbidden</title></head>\\r\\n<body bgcolor=\\\"white\\\">\\r\\n<center><h1>403 Forbidden</h1></center>\\r\\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\\r\\n</body>\\r\\n</html>\\r\\n\""], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ************************************************************************************************************************************
        to retry, use: --limit @/home/tsing01/pai-deploy/kubespray/cluster.retry

PLAY RECAP ********************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0
node01                     : ok=187  changed=6    unreachable=0    failed=1
node02                     : ok=245  changed=6    unreachable=0    failed=1

Wednesday 10 June 2020  18:54:17 +0800 (0:00:30.031)       0:02:19.468 ********
===============================================================================
download : download_container | Download image if required ------------------------------------------------------------------------------------ 30.03s
download : download_container | Download image if required ------------------------------------------------------------------------------------ 22.26s
download : download_container | Download image if required ------------------------------------------------------------------------------------ 15.70s
download : download_container | Download image if required ------------------------------------------------------------------------------------ 15.68s
container-engine/docker : ensure docker packages are installed --------------------------------------------------------------------------------- 3.49s
download : download | Download files / images -------------------------------------------------------------------------------------------------- 1.27s
bootstrap-os : Install dbus for the hostname module -------------------------------------------------------------------------------------------- 1.08s
kubernetes/preinstall : Install packages requirements ------------------------------------------------------------------------------------------ 1.04s
bootstrap-os : Fetch /etc/os-release ----------------------------------------------------------------------------------------------------------- 0.93s
kubernetes/preinstall : Create kubernetes directories ------------------------------------------------------------------------------------------ 0.83s
download : download_file | Download item ------------------------------------------------------------------------------------------------------- 0.75s
bootstrap-os : Assign inventory name to unconfigured hostnames (non-CoreOS, Suse and ClearLinux) ----------------------------------------------- 0.68s
container-engine/docker : ensure docker-ce repository public key is installed ------------------------------------------------------------------ 0.67s
container-engine/docker : Ensure old versions of Docker are not installed. | Debian ------------------------------------------------------------ 0.63s
kubernetes/preinstall : Update package management cache (APT) ---------------------------------------------------------------------------------- 0.63s
bootstrap-os : Gather host facts to get ansible_os_family -------------------------------------------------------------------------------------- 0.57s
download : download_file | Download item ------------------------------------------------------------------------------------------------------- 0.46s
container-engine/docker : ensure service is started if docker packages are already present ----------------------------------------------------- 0.46s
liudican commented 4 years ago

   Loaded: loaded (/etc/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─docker-dns.conf, docker-options.conf
   Active: active (running) since Wed 2020-06-10 18:35:29 HKT; 14h ago
     Docs: http://docs.docker.com
 Main PID: 18890 (dockerd)
    Tasks: 19
   CGroup: /system.slice/docker.service
           └─18890 /usr/bin/dockerd --data-root=/mnt/docker --log-opt max-size=2g --log-opt max-file=2 --log-driver=json-file --iptables=false --data-r

Jun 10 18:53:57 node02 dockerd[18890]: time="2020-06-10T18:53:57.608267738+08:00" level=info msg="Attempting next endpoint for pull after error: error
Jun 10 18:53:57 node02 dockerd[18890]: time="2020-06-10T18:53:57.608396659+08:00" level=error msg="Handler for POST /v1.39/images/create returned error
Jun 10 18:54:00 node02 dockerd[18890]: time="2020-06-10T18:54:00.973517501+08:00" level=info msg="Attempting next endpoint for pull after error: error
Jun 10 18:54:00 node02 dockerd[18890]: time="2020-06-10T18:54:00.973643896+08:00" level=error msg="Handler for POST /v1.39/images/create returned error
Jun 10 18:54:02 node02 dockerd[18890]: time="2020-06-10T18:54:02.228277154+08:00" level=info msg="Attempting next endpoint for pull after error: error
Jun 10 18:54:02 node02 dockerd[18890]: time="2020-06-10T18:54:02.228405458+08:00" level=error msg="Handler for POST /v1.39/images/create returned error
Jun 10 18:54:09 node02 dockerd[18890]: time="2020-06-10T18:54:09.627780146+08:00" level=info msg="Attempting next endpoint for pull after error: error
Jun 10 18:54:09 node02 dockerd[18890]: time="2020-06-10T18:54:09.627919507+08:00" level=error msg="Handler for POST /v1.39/images/create returned error
Jun 10 18:54:17 node02 dockerd[18890]: time="2020-06-10T18:54:17.018606048+08:00" level=info msg="Attempting next endpoint for pull after error: error
Jun 10 18:54:17 node02 dockerd[18890]: time="2020-06-10T18:54:17.018747998+08:00" level=error msg="Handler for POST /v1.39/images/create returned error
ydye commented 4 years ago

Try pulling the image directly on your host and see what happens.

sudo docker pull gcr.azk8s.cn/google_containers/pause-amd64:3.1

@hzy46 Any suggestions about registries in China?

liudican commented 4 years ago

Could it be that the mirror isn't accessible?


tsing01@node02:~$ sudo docker pull gcr.azk8s.cn/google_containers/pause-amd64:3.1
Error response from daemon: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor=\"white\">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n"
ydye commented 4 years ago

Could you try upgrading Docker and trying again? I'm not sure whether it is a Docker issue or a registry issue.
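
The pull error itself carries some evidence for telling the two cases apart: it embeds the HTTP status and shows that the body was an HTML page. As a small sketch, assuming the error text format shown in this thread, the following Python extracts both. `summarize_pull_error` is a hypothetical helper, and a 403 with an HTML body only suggests, and does not prove, that the registry side refused the request.

```python
import re

def summarize_pull_error(stderr):
    """Extract the HTTP status and whether the response body looks like HTML."""
    status = re.search(r"error parsing HTTP (\d{3}) response body", stderr)
    return {
        "http_status": int(status.group(1)) if status else None,
        "html_body": "<html>" in stderr.lower(),
    }

# Shortened copy of the error reported in this thread.
sample = (
    "Error response from daemon: error parsing HTTP 403 response body: "
    "invalid character '<' looking for beginning of value: "
    '"<html>\\r\\n<head><title>403 Forbidden</title></head>"'
)
print(summarize_pull_error(sample))
```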

hzy46 commented 4 years ago

gcr.azk8s.cn may only work for Azure China. Could you find another gcr.io mirror?
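
Switching mirrors changes the prefix of every image the installer pulls, because the config above sets kube_image_repo to "gcr.azk8s.cn/google_containers". A minimal sketch of that prefix swap follows, useful for working out which image to test with docker pull before editing the config. `rebase_image` is a hypothetical helper, and `mirror.example.com` is a placeholder, not a real mirror.

```python
def rebase_image(image, old_repo, new_repo):
    """Swap the repository prefix of an image reference, keeping name and tag."""
    prefix = old_repo.rstrip("/") + "/"
    if not image.startswith(prefix):
        raise ValueError(f"{image!r} is not under {old_repo!r}")
    return new_repo.rstrip("/") + "/" + image[len(prefix):]

# The image that failed to pull in this thread.
image = "gcr.azk8s.cn/google_containers/pause-amd64:3.1"

# "mirror.example.com/google_containers" is a placeholder, not a real mirror.
candidate = rebase_image(
    image,
    "gcr.azk8s.cn/google_containers",
    "mirror.example.com/google_containers",
)
print(candidate)
```

Pulling the rebased image by hand on one node first shows whether a candidate mirror is reachable before the full quick-start run is repeated.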