microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

Failed to start service rest-server #4654

Closed. zerozakiihitoshiki closed this issue 4 years ago

zerozakiihitoshiki commented 4 years ago

Short summary about the issue/question: error: unable to recognize "priority-class.yaml": no matches for kind "PriorityClass" in version "scheduling.k8s.io/v1"

Brief description of the process you are following: I cleaned the previous deployment (v0.11.0) and started installing OpenPAI v1.0.1. I hit this error when running the command python paictl.py service start. The error did not occur when deploying v0.11.0.
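For anyone hitting the same message, two quick checks show whether the cluster actually serves the API version the manifest asks for (assuming kubectl is configured against the cluster):

kubectl api-versions | grep scheduling   # which scheduling.k8s.io versions the API server serves
kubectl version --short                  # scheduling.k8s.io/v1 needs a server at 1.14 or newer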

Error Log:

2020-06-29 08:04:25,065 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to generate service rest-server's template file
2020-06-29 08:04:25,065 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Begin to generate the template file in service rest-server's configuration.
2020-06-29 08:04:25,066 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Create template mapper for service rest-server.
2020-06-29 08:04:25,066 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Done. Template mapper for service rest-server is created.
2020-06-29 08:04:25,066 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/rest-server.yaml.template.
2020-06-29 08:04:25,066 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/rest-server.yaml.
2020-06-29 08:04:25,173 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/start.sh.template.
2020-06-29 08:04:25,173 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/start.sh.
2020-06-29 08:04:25,183 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/configmap-create.sh.template.
2020-06-29 08:04:25,183 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/configmap-create.sh.
2020-06-29 08:04:25,189 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/auth-configmap/oidc.yaml.template.
2020-06-29 08:04:25,189 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/auth-configmap/oidc.yaml.
2020-06-29 08:04:25,205 [INFO] - root : It is not a service deploy file! Only support ['DaemonSet', 'Deployment', 'StatefulSet', 'Pod']
2020-06-29 08:04:25,206 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/group-configmap/group.yaml.template.
2020-06-29 08:04:25,206 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/group-configmap/group.yaml.
2020-06-29 08:04:25,232 [INFO] - root : It is not a service deploy file! Only support ['DaemonSet', 'Deployment', 'StatefulSet', 'Pod']
2020-06-29 08:04:25,232 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/job-exit-spec-config/job-exit-spec.yaml.template.
2020-06-29 08:04:25,232 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/job-exit-spec-config/job-exit-spec.yaml.
2020-06-29 08:04:25,257 [INFO] - root : It is not a service deploy file! Only support ['DaemonSet', 'Deployment', 'StatefulSet', 'Pod']
2020-06-29 08:04:25,258 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Generate the template file /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/k8s-job-exit-spec-config/k8s-job-exit-spec.yaml.template.
2020-06-29 08:04:25,258 [INFO] - deployment.paiLibrary.paiService.service_template_generate : Save the generated file to /pai/deployment/paiLibrary/paiService/../../../src/rest-server/deploy/k8s-job-exit-spec-config/k8s-job-exit-spec.yaml.
2020-06-29 08:04:25,565 [INFO] - root : It is not a service deploy file! Only support ['DaemonSet', 'Deployment', 'StatefulSet', 'Pod']
2020-06-29 08:04:25,566 [INFO] - deployment.paiLibrary.paiService.service_template_generate : The template file of service rest-server is generated.
2020-06-29 08:04:25,566 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to start service: [ rest-server ]
2020-06-29 08:04:25,567 [INFO] - deployment.paiLibrary.paiService.service_management_start : Begin to execute service rest-server's start script.
W0629 08:04:25.671594   37367 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
configmap/group-configuration configured
W0629 08:04:25.867507   37415 helpers.go:535] --dry-run is deprecated and can be replaced with --dry-run=client.
configmap/k8s-job-exit-spec-configuration configured
2020-06-29 08:04:26,990 [INFO] - legacy_user_migrate.py:217 : Starts to migrate legacy user data from etcd to kubernetes secrets
2020-06-29 08:04:26,993 [INFO] - legacy_user_migrate.py:233 : Etcd data has already been transferred to k8s secret
2020-06-29 08:04:27,842 [INFO] - user_v2_migrate.py:287 : Legacy data has already been transferred from v1 to v2. Skip it.
error: unable to recognize "priority-class.yaml": no matches for kind "PriorityClass" in version "scheduling.k8s.io/v1"
2020-06-29 08:04:28,245 [ERROR] - deployment.paiLibrary.common.linux_shell : Failed to execute the start script of service rest-server
2020-06-29 08:04:28,246 [ERROR] - deployment.paiLibrary.paiService.service_management_start : Failed to start service rest-server
2020-06-29 08:04:28,247 [INFO] - deployment.paiLibrary.paiService.service_management_start : -----------------------------------------------------------
2020-06-29 08:04:28,247 [ERROR] - deployment.paiLibrary.paiService.service_management_start : Have retried 5 times, but service rest-server doesn't start. Please check it.

OpenPAI Environment:

Binyang2014 commented 4 years ago

@zerozakiihitoshiki You should redeploy Kubernetes by following https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html.

OpenPAI v1.0.1 needs Kubernetes 1.15.x; I think v0.11.0 was using an older version of k8s.
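For context, a minimal PriorityClass manifest looks roughly like the sketch below (the name and value here are hypothetical; OpenPAI's actual priority-class.yaml may differ). The apiVersion is the point: scheduling.k8s.io/v1 only exists on Kubernetes 1.14 and newer, which is why a 1.9.x API server cannot recognize it.

apiVersion: scheduling.k8s.io/v1   # v1alpha1 in the 1.9 era, v1beta1 from 1.11, v1 from 1.14
kind: PriorityClass
metadata:
  name: example-priority           # hypothetical name
value: 100
globalDefault: false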

zerozakiihitoshiki commented 4 years ago

@zerozakiihitoshiki You should redeploy Kubernetes by following https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html.

OpenPAI v1.0.1 needs Kubernetes 1.15.x; I think v0.11.0 was using an older version of k8s.

I used the command python paictl.py cluster k8s-clean -p ~/pai-config/ to clean the previous deployment before redeploying k8s.

But when installing OpenPAI v1.0.1 I followed the old steps: I used quick-start-example.yaml to build quick-start.yaml, and then generated the cluster configuration. I think these steps may be causing the problem. Here is kubernetes-configuration.yaml in ~/pai-config/:

kubernetes:
  # Find the nameserver in /etc/resolv.conf
  cluster-dns: 8.8.8.8
  # To support k8s ha, you should set an lb address here.
  # If deploy k8s with single master node, please set master IP address here
  load-balance-ip: ******

  # specify an IP range not in the same network segment with the host machine.
  service-cluster-ip-range: 10.254.0.0/16
  # According to the etcd version, you should fill in a corresponding backend name.
  # If you are not familiar with etcd, please don't change it.
  storage-backend: etcd3
  # The docker registry used in the k8s deployment. If you can access to gcr, we suggest to use gcr.
  docker-registry: docker.io/openpai
  # http://gcr.io/google_containers/hyperkube. Or the tag in your registry.
  hyperkube-version: v1.9.9
  # http://gcr.io/google_containers/etcd. Or the tag in your registry.
  # If you are not familiar with etcd, please don't change it.
  etcd-version: 3.2.17
  # http://gcr.io/google_containers/kube-apiserver. Or the tag in your registry.
  apiserver-version: v1.9.9
  # http://gcr.io/google_containers/kube-scheduler. Or the tag in your registry.
  kube-scheduler-version: v1.9.9
  # http://gcr.io/google_containers/kube-controller-manager
  kube-controller-manager-version:  v1.9.9
  # http://gcr.io/google_containers/kubernetes-dashboard-amd64
  dashboard-version: v1.8.3
  # The path to storage etcd data.
  etcd-data-path: "/var/etcd"

Then I ran the command python paictl.py cluster k8s-bootup -p ~/pai-config/ and got a k8s cluster at version v1.9.9. But when I change the k8s version in kubernetes-configuration.yaml (I also changed docker-registry to docker.io/mirrorgcrio), the error becomes: The connection to the server *******:8080 was refused - did you specify the right host or port?

Binyang2014 commented 4 years ago

Please follow this doc to redeploy k8s: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html#installation-guide. The previous step, python paictl.py cluster k8s-bootup -p ~/pai-config/, is deprecated and does not work for the v1.0.1 release.

zerozakiihitoshiki commented 4 years ago

Please follow this doc to redeploy k8s: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html#installation-guide. The previous step, python paictl.py cluster k8s-bootup -p ~/pai-config/, is deprecated and does not work for the v1.0.1 release.

I am following #4614 to deploy, but I hit this problem:

fatal: [openpai-master-01 -> xxx.xxx.xxx.xxx]: FAILED! => {"attempts": 4, "changed": true, "cmd": ["/usr/bin/docker", "pull", "k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.6.0"], "delta": "0:00:15.224502", "end": "2020-07-03 17:56:19.658035", "msg": "non-zero return code", "rc": 1, "start": "2020-07-03 17:56:04.433533", "stderr": "Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "stderr_lines": ["Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"], "stdout": "", "stdout_lines": []}

I have changed config.yml like this:

user: xxx
password: xxxx
branch_name: pai-1.0.y
docker_image_tag: v1.0.1

# Optional

#############################################
# Ansible-playbooks' inventory hosts' vars. #
#############################################
# ssh_key_file_path: /path/to/you/key/file

#####################################
# OpenPAI's service image registry. #
#####################################
# docker_registry_domain: docker.io
# docker_registry_namespace: openpai
# docker_registry_username: exampleuser
# docker_registry_password: examplepasswd

###########################################################################################
#                         Pre-check setting                                               #
# By default, we assume your gpu environment is nvidia. So your runtime should be nvidia. #
# If you are using AMD or other environment, you should modify it.                        #
###########################################################################################
# worker_default_docker_runtime: nvidia
# docker_check: true

# resource_check: true

# gpu_type: nvidia

########################################################################################
# Advanced docker configuration. If you are not familiar with them, don't change them. #
########################################################################################
# docker_data_root: /mnt/docker
# docker_config_file_path: /etc/docker/daemon.json
# docker_iptables_enabled: false

## An obvious use case is allowing insecure-registry access to self hosted registries.
## Can be ipaddress and domain_name.
## example define 172.19.16.11 or mirror.registry.io
# openpai_docker_insecure_registries:
#   - mirror.registry.io
#   - 172.19.16.11

## Add other registry,example China registry mirror.
# openpai_docker_registry_mirrors:
#   - https://registry.docker-cn.com
#   - https://mirror.aliyuncs.com

#######################################################################
#                       kubespray setting                             #
#######################################################################

# If you couldn't access to gcr.io or docker.io, please configure it.
# gcr_image_repo: "gcr.io"
gcr_image_repo: "docker.io/kubesphere"

# kube_image_repo: "gcr.io/google-containers"
kube_image_repo: "docker.io/zhaowenlei"

# quay_image_repo: "quay.io"
quay_image_repo: "quay-mirror.qiniu.com"

# docker_image_repo: "docker.io"
docker_image_repo: "docker.io"

# etcd_image_repo: "quay.io/coreos/etcd"
# pod_infra_image_repo: "gcr.io/google_containers/pause-{{ image_arch }}"
pod_infra_image_repo: "registry.aliyuncs.com/google_containers/pause-{{ image_arch }}"

# kubeadm_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
# hyperkube_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kube_version }}/bin/linux/{{ image_arch }}/hyperkube"

# openpai_kube_network_plugin: calico

# openpai_kubespray_extra_var:
#   kay: value
#   key: value

#######################################################################
#                     host daemon port setting                        #
#######################################################################
# host_daemon_port_start: 40000
# host_daemon_port_end: 65535
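Before rewiring the repo variables, it's worth confirming that a mirror actually serves the failing image (the image name is taken from the fatal task above; the retag trick below only satisfies steps that check the local image cache, while tasks that force a pull from k8s.gcr.io still need the variable changed):

docker pull docker.io/kubesphere/cluster-proportional-autoscaler-amd64:1.6.0
docker tag docker.io/kubesphere/cluster-proportional-autoscaler-amd64:1.6.0 k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.6.0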

Binyang2014 commented 4 years ago

@ydye @hzy46 Any suggestions? It seems the customer has already changed the image repo.

zerozakiihitoshiki commented 4 years ago

@ydye @hzy46 Any suggestions? It seems the customer has already changed the image repo.

I followed this doc and changed line 337 of ~/kubespray/roles/download/defaults/main.yml to dnsautoscaler_image_repo: "docker.io/kubesphere/cluster-proportional-autoscaler-{{ image_arch }}". That works.
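An alternative that avoids editing kubespray's defaults in place would be the config.yml override below (an untested sketch: it assumes the entries under openpai_kubespray_extra_var are passed to kubespray as Ansible extra vars, which take precedence over role defaults):

# in config.yml
openpai_kubespray_extra_var:
  dnsautoscaler_image_repo: "docker.io/kubesphere/cluster-proportional-autoscaler-{{ image_arch }}"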

But I hit a new problem. I ran the command python3 script/openpai-generator.py -m path/to/master.csv -w path/to/worker.csv -c path/to/config.yml -o path/to/output, and the error is:

Traceback (most recent call last):
  File "script/openpai-generator.py", line 254, in <module>
    main()
  File "script/openpai-generator.py", line 225, in main
    wait_nvidia_device_plugin_ready()
  File "script/openpai-generator.py", line 157, in wait_nvidia_device_plugin_ready
    while pod_is_ready_or_not("name", "nvidia-device-plugin-ds", "Nvidia-Device-Plugin") != True:
  File "script/openpai-generator.py", line 91, in pod_is_ready_or_not
    config.load_kube_config()
  File "/home/xxx/.local/lib/python3.5/site-packages/kubernetes/config/kube_config.py", line 739, in load_kube_config
    persist_config=persist_config)
  File "/home/xxx/.local/lib/python3.5/site-packages/kubernetes/config/kube_config.py", line 701, in _get_kube_config_loader_for_yaml_file
    'Invalid kube-config file. '
kubernetes.config.config_exception.ConfigException: Invalid kube-config file. No configuration found.

ydye commented 4 years ago

You should copy your kubeconfig to ~/.kube/config.

Please refer to this: https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/kubernetes-boot.sh#L8
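(The earlier "connection refused on :8080" error points at the same root cause: with no kubeconfig available, clients fall back to an unauthenticated local default.) A sketch of the copy step, assuming a kubeadm-based kubespray deployment that left admin.conf in the standard location on the master:

mkdir -p ~/.kube
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config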

zerozakiihitoshiki commented 4 years ago

You should copy your kubeconfig to ~/.kube/config.

Please refer to this: https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/kubernetes-boot.sh#L8

Thank you for your help. After doing that, I reran the command, and then:

2020-07-06 10:53:58,815 [ERROR] - openpai-generator.py:188 : Allocatable GPU number in openpai-master-01 is 0, current quick start script does not allow.
2020-07-06 10:53:58,816 [ERROR] - openpai-generator.py:189 : Please remove openpai-master-01 from your workerlist, or check if the device plugin is running healthy on the node.
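Two quick checks may help here (the label comes from pod_is_ready_or_not in the traceback above; the node name is from your error message):

kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds   # is the device plugin pod running?
kubectl describe node openpai-master-01 | grep -i nvidia.com/gpu  # does the node advertise allocatable GPUs?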

ydye commented 4 years ago

Let's clarify your environment.

1: Cluster size: number of masters and number of workers. I guess you only have one node, which is both master and worker.
2: VM type: GPU or non-GPU. I guess your only node is a non-GPU VM.

If those assumptions are right, and you are at the stage of deploying OpenPAI services, you should do the following.

1: Change the python script so the hived unit is computed from CPU and memory instead of GPU:

    hived_config["unit-cpu"] = int( min_cpu )
    hived_config["unit-mem"] = int( min_mem / min_cpu )

2: Modify services-configuration.yaml.template (change the GPU-related logic): change the GPU resource line to

cpu: 1

and change https://github.com/microsoft/pai/blob/master/contrib/kubespray/quick-start/services-configuration.yaml.template#L70 to

childCellNumber: {{ env["hived"]["unit-cpu"] }}

A sketch of the resulting hived section follows these steps.

3: Following the script, please execute the commands manually. Note that the file paths may differ from yours; update them based on your environment.
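For step 2, a rough sketch of what the hived section of the generated services-configuration.yaml could end up looking like for a CPU-only node. The field names follow the hivedscheduler config schema as I understand it, and the cell names and numbers are hypothetical; treat this as illustrative, not authoritative:

hivedscheduler:
  config: |
    physicalCluster:
      skuTypes:
        DT:
          cpu: 1                      # one CPU core per cell, per the cpu: 1 edit above
          memory: 8192Mi              # unit-mem, i.e. min_mem / min_cpu from step 1
      cellTypes:
        DT-NODE:
          childCellType: DT
          childCellNumber: 8          # env["hived"]["unit-cpu"], per the template edit above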

zerozakiihitoshiki commented 4 years ago

My environment:

  1. Cluster size: one node.
  2. Physical machine with GPUs.

ydye commented 4 years ago

Did you install nvidia-drivers and nvidia-docker-runtime on the host, and set it as the default runtime?
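For reference, setting the default runtime happens in /etc/docker/daemon.json; this is the standard nvidia-docker2 configuration (the runtime path assumes the default package install location), and docker must be restarted afterwards with sudo systemctl restart docker:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}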

zerozakiihitoshiki commented 4 years ago

Did you install nvidia-drivers and nvidia-docker-runtime on the host, and set it as the default runtime?

Thank you. I set nvidia-docker-runtime as the default runtime, and that works.

But when I reran the command, an issue occurred while starting up Nvidia-Device-Plugin. This may be because of a driver/library version mismatch.

xxx@openpai-master-01:~$ kubectl get --namespace=kube-system pods | grep nvidia-device-plugin
nvidia-device-plugin-daemonset-rbjvw        0/1     CrashLoopBackOff   43         17h
xxx@openpai-master-01:~$ kubectl describe pod nvidia-device-plugin-daemonset-rbjvw -n kube-system
Name:                 nvidia-device-plugin-daemonset-rbjvw
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 openpai-master-01/172.168.3.107
Start Time:           Mon, 06 Jul 2020 21:23:01 +0800
Labels:               controller-revision-hash=74b655f448
                      name=nvidia-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   10.207.147.137
Controlled By:        DaemonSet/nvidia-device-plugin-daemonset
Containers:
  nvidia-device-plugin-ctr:
    Container ID:   docker://2595be5b0bf1d29322e70141fcb838b98a5d477116e1cb0b6a47cac29a2704ae
    Image:          nvidia/k8s-device-plugin:1.0.0-beta4
    Image ID:       docker-pullable://nvidia/k8s-device-plugin@sha256:94d46bf513cbc43c4d77a364e4bbd409d32d89c8e686e12551cc3eb27c259b90
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
      Exit Code:    128
      Started:      Tue, 07 Jul 2020 14:50:47 +0800
      Finished:     Tue, 07 Jul 2020 14:50:47 +0800
    Ready:          False
    Restart Count:  44
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bg6z5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  default-token-bg6z5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bg6z5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason   Age                      From                        Message
  ----     ------   ----                     ----                        -------
  Warning  BackOff  2m30s (x907 over 3h22m)  kubelet, openpai-master-01  Back-off restarting failed container

I also get this:

xxx@openpai-master-01:~$ nvidia-smi
Tue Jul  7 16:48:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 418.30       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:07:00.0 Off |                  Off |
| N/A   34C    P8    14W / 150W |      0MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:08:00.0 Off |                  Off |
| N/A   30C    P8    13W / 150W |      0MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:85:00.0 Off |                  Off |
| N/A   31C    P8    14W / 150W |      0MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:86:00.0 Off |                  Off |
| N/A   27C    P8    13W / 150W |      0MiB /  8129MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Should I upgrade the nvidia driver manually to 430.50? I don't know whether it will cause more problems for the installation.
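A couple of checks usually pin down a driver/library mismatch; it means the loaded kernel module and the installed user-space libraries come from different driver releases (dpkg assumed, i.e. a Debian/Ubuntu host):

cat /proc/driver/nvidia/version   # version of the kernel module actually loaded
dpkg -l | grep -i nvidia          # versions of the installed user-space packages

If they disagree, rebooting after the upgrade (so the matching kernel module gets loaded) normally clears the NVML error.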

ydye commented 4 years ago

How did you install the nvidia drivers? Through apt?

zerozakiihitoshiki commented 4 years ago

How did you install the nvidia drivers? Through apt?

I think 418.30 was installed by OpenPAI (v0.11.0).

ydye commented 4 years ago

You'd better remove the drivers installed by OpenPAI. After v1.0.0, we will remove this dependency and recommend users to install drivers through apt.

Here is a playbook to remove the OpenPAI-installed drivers; you can follow it: https://github.com/microsoft/pai/blob/master/contrib/kubespray/clean-nvidia-drivers-installed-by-paictl.yml
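A sketch of that flow (the inventory path and the driver package name are assumptions; pick whatever driver version apt offers for your distro):

# remove the paictl-installed drivers
ansible-playbook -i hosts.yml clean-nvidia-drivers-installed-by-paictl.yml

# reinstall via apt (Ubuntu example), then reboot so the matching kernel module loads
sudo apt update
sudo apt install nvidia-driver-430
sudo reboot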

zerozakiihitoshiki commented 4 years ago

You'd better remove the drivers installed by OpenPAI. After v1.0.0, we will remove this dependency and recommend users to install drivers through apt.

Here is a playbook to remove the OpenPAI-installed drivers; you can follow it: https://github.com/microsoft/pai/blob/master/contrib/kubespray/clean-nvidia-drivers-installed-by-paictl.yml

Thank you, I will try.