@zerozakiihitoshiki You should redeploy k8s by following https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html.
We need k8s version 1.15.x; I think v0.11.0 uses an old version of k8s.
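A quick way to confirm what the cluster is actually running (assuming `kubectl` is configured); note that `scheduling.k8s.io/v1`, which the PriorityClass error in the issue refers to, is only served by Kubernetes 1.14+:

```bash
# Show client and server versions; the server should be 1.15.x for OpenPAI v1.0.x
kubectl version --short
# Check whether the apiserver serves scheduling.k8s.io/v1 (needs Kubernetes 1.14+)
kubectl api-versions | grep scheduling.k8s.io
```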
I used the command `python paictl.py cluster k8s-clean -p ~/pai-config/` to clean the previous deployment before redeploying k8s.
But when I install OpenPAI v1.0.1, I follow the old steps: I use `quick-start-example.yaml` to build `quick-start.yaml`, and then generate the cluster configuration. I think these steps may cause problems.
I checked `kubernetes-configuration.yaml` in `/pai-config/`:
```yaml
kubernetes:
  # Find the nameserver in /etc/resolv.conf
  cluster-dns: 8.8.8.8
  # To support k8s HA, you should set an LB address here.
  # If deploying k8s with a single master node, please set the master IP address here
  load-balance-ip: ******
  # Specify an IP range not in the same network segment as the host machine.
  service-cluster-ip-range: 10.254.0.0/16
  # According to the etcd version, you should fill in a corresponding backend name.
  # If you are not familiar with etcd, please don't change it.
  storage-backend: etcd3
  # The docker registry used in the k8s deployment. If you can access gcr, we suggest using gcr.
  docker-registry: docker.io/openpai
  # http://gcr.io/google_containers/hyperkube. Or the tag in your registry.
  hyperkube-version: v1.9.9
  # http://gcr.io/google_containers/etcd. Or the tag in your registry.
  # If you are not familiar with etcd, please don't change it.
  etcd-version: 3.2.17
  # http://gcr.io/google_containers/kube-apiserver. Or the tag in your registry.
  apiserver-version: v1.9.9
  # http://gcr.io/google_containers/kube-scheduler. Or the tag in your registry.
  kube-scheduler-version: v1.9.9
  # http://gcr.io/google_containers/kube-controller-manager
  kube-controller-manager-version: v1.9.9
  # http://gcr.io/google_containers/kubernetes-dashboard-amd64
  dashboard-version: v1.8.3
  # The path to store etcd data.
  etcd-data-path: "/var/etcd"
```
Then I run the command `python paictl.py cluster k8s-bootup -p ~/pai-config/`, and I get k8s version v1.9.9.
But when I change the version of k8s in `kubernetes-configuration.yaml` (I also changed `docker-registry` to `docker.io/mirrorgcrio`), the error becomes `The connection to the server *******:8080 was refused - did you specify the right host or port?`.
Please follow this doc to redeploy k8s: https://openpai.readthedocs.io/en/latest/manual/cluster-admin/installation-guide.html#installation-guide.
The previous step, `python paictl.py cluster k8s-bootup -p ~/pai-config/`, is deprecated and cannot work for the v1.0.1 release.
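A rough sketch of the starting point from that guide (the branch name matches `branch_name` in the config.yml shown below; consult the doc for the full steps, which changed between releases):

```bash
# Get the v1.0.x deployment tooling
git clone https://github.com/microsoft/pai.git
cd pai
git checkout pai-1.0.y
cd contrib/kubespray   # the kubespray-based installer lives here
```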
I am following #4614 to deploy, but I hit this problem:
```
fatal: [openpai-master-01 -> xxx.xxx.xxx.xxx]: FAILED! => {"attempts": 4, "changed": true, "cmd": ["/usr/bin/docker", "pull", "k8s.gcr.io/cluster-proportional-autoscaler-amd64:1.6.0"], "delta": "0:00:15.224502", "end": "2020-07-03 17:56:19.658035", "msg": "non-zero return code", "rc": 1, "start": "2020-07-03 17:56:04.433533", "stderr": "Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "stderr_lines": ["Error response from daemon: Get https://k8s.gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"], "stdout": "", "stdout_lines": []}
```
I have changed `config.yml` like this:

```yaml
user: xxx
password: xxxx
branch_name: pai-1.0.y
docker_image_tag: v1.0.1
# Optional
#############################################
# Ansible-playbooks' inventory hosts' vars. #
#############################################
# ssh_key_file_path: /path/to/you/key/file
#####################################
# OpenPAI's service image registry. #
#####################################
# docker_registry_domain: docker.io
# docker_registry_namespace: openpai
# docker_registry_username: exampleuser
# docker_registry_password: examplepasswd
###########################################################################################
# Pre-check setting #
# By default, we assume your gpu environment is nvidia. So your runtime should be nvidia. #
# If you are using AMD or other environment, you should modify it. #
###########################################################################################
# worker_default_docker_runtime: nvidia
# docker_check: true
# resource_check: true
# gpu_type: nvidia
########################################################################################
# Advanced docker configuration. If you are not familiar with them, don't change them. #
########################################################################################
# docker_data_root: /mnt/docker
# docker_config_file_path: /etc/docker/daemon.json
# docker_iptables_enabled: false
## An obvious use case is allowing insecure-registry access to self-hosted registries.
## Can be an IP address or a domain name.
## Example: 172.19.16.11 or mirror.registry.io
# openpai_docker_insecure_registries:
# - mirror.registry.io
# - 172.19.16.11
## Add other registries, for example a China registry mirror.
# openpai_docker_registry_mirrors:
# - https://registry.docker-cn.com
# - https://mirror.aliyuncs.com
#######################################################################
# kubespray setting #
#######################################################################
# If you can't access gcr.io or docker.io, please configure these.
# gcr_image_repo: "gcr.io"
gcr_image_repo: "docker.io/kubesphere"
# kube_image_repo: "gcr.io/google-containers"
kube_image_repo: "docker.io/zhaowenlei"
# quay_image_repo: "quay.io"
quay_image_repo: "quay-mirror.qiniu.com"
# docker_image_repo: "docker.io"
docker_image_repo: "docker.io"
# etcd_image_repo: "quay.io/coreos/etcd"
# pod_infra_image_repo: "gcr.io/google_containers/pause-{{ image_arch }}"
pod_infra_image_repo: "registry.aliyuncs.com/google_containers/pause-{{ image_arch }}"
# kubeadm_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kubeadm_version }}/bin/linux/{{ image_arch }}/kubeadm"
# hyperkube_download_url: "https://storage.googleapis.com/kubernetes-release/release/{{ kube_version }}/bin/linux/{{ image_arch }}/hyperkube"
# openpai_kube_network_plugin: calico
# openpai_kubespray_extra_var:
#   key: value
#   key: value
#######################################################################
# host daemon port setting #
#######################################################################
# host_daemon_port_start: 40000
# host_daemon_port_end: 65535
```
@ydye @hzy46 Any suggestions? It seems the customer has already changed the image repo.
I follow this doc, and changed `~/kubespray/roles/download/defaults/main.yml` line 337 to `dnsautoscaler_image_repo: "docker.io/kubesphere/cluster-proportional-autoscaler-{{ image_arch }}"`. It works.
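For reference, the same edit expressed as a one-liner (line 337 is specific to the kubespray checkout here; the variable name is kubespray's own):

```bash
# Point kubespray's dnsautoscaler image at a mirror reachable without gcr.io access
sed -i 's|^dnsautoscaler_image_repo:.*|dnsautoscaler_image_repo: "docker.io/kubesphere/cluster-proportional-autoscaler-{{ image_arch }}"|' \
  ~/kubespray/roles/download/defaults/main.yml
```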
But I meet a new problem. I run the command `python3 script/openpai-generator.py -m path/to/master.csv -w path/to/worker.csv -c path/to/config.yml -o path/to/output`, and the error is:
```
Traceback (most recent call last):
  File "script/openpai-generator.py", line 254, in <module>
    main()
  File "script/openpai-generator.py", line 225, in main
    wait_nvidia_device_plugin_ready()
  File "script/openpai-generator.py", line 157, in wait_nvidia_device_plugin_ready
    while pod_is_ready_or_not("name", "nvidia-device-plugin-ds", "Nvidia-Device-Plugin") != True:
  File "script/openpai-generator.py", line 91, in pod_is_ready_or_not
    config.load_kube_config()
  File "/home/xxx/.local/lib/python3.5/site-packages/kubernetes/config/kube_config.py", line 739, in load_kube_config
    persist_config=persist_config)
  File "/home/xxx/.local/lib/python3.5/site-packages/kubernetes/config/kube_config.py", line 701, in _get_kube_config_loader_for_yaml_file
    'Invalid kube-config file. '
kubernetes.config.config_exception.ConfigException: Invalid kube-config file. No configuration found.
```
You should copy your kubeconfig to `~/.kube/config`;
please refer to this: https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/kubernetes-boot.sh#L8
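A minimal sketch of that step (the source path assumes the kubespray run left the admin kubeconfig under `~/pai-deploy/kube/`, per the linked script; adjust if yours differs):

```bash
# Put the admin kubeconfig where kubectl looks by default
mkdir -p "${HOME}/.kube"
cp "${HOME}/pai-deploy/kube/config" "${HOME}/.kube/config"
# Sanity check: this should now reach the apiserver instead of localhost:8080
kubectl get nodes
```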
Thank you for your help. After doing that, I reran the command, and then:

```
2020-07-06 10:53:58,815 [ERROR] - openpai-generator.py:188 : Allocatable GPU number in openpai-master-01 is 0, current quick start script does not allow.
2020-07-06 10:53:58,816 [ERROR] - openpai-generator.py:189 : Please remove openpai-master-01 from your workerlist, or check if the device plugin is running healthy on the node.
```
Let's clarify your environment.
1: Cluster size: number of masters, number of workers. I guess you only have one node, which is both master and worker.
2: VM type: GPU or non-GPU. I guess your only node is a non-GPU VM.
If this assumption is right, and you are at the stage of deploying the OpenPAI services, you should do the following.
1: Change the python script.
Remove the following lines (GPU-resource-related logic): https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/openpai-generator.py#L187 https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/openpai-generator.py#L193
Modify the source code at https://github.com/microsoft/pai/blob/master/contrib/kubespray/script/openpai-generator.py#L200 to:

```python
hived_config["unit-cpu"] = int(min_cpu)
hived_config["unit-mem"] = int(min_mem / min_cpu)
```
2: Modify `services-configuration.yaml.template` (change the GPU-related logic).
Remove the following line: https://github.com/microsoft/pai/blob/master/contrib/kubespray/quick-start/services-configuration.yaml.template#L64
Modify the following line https://github.com/microsoft/pai/blob/master/contrib/kubespray/quick-start/services-configuration.yaml.template#L65 to `cpu: 1`, and change the `childCellNumber` line to `childCellNumber: {{ env["hived"]["unit-cpu"] }}` (a worked example with invented numbers follows this list).
3: Following the script, please manually execute the commands. Note the file paths may be different from yours; please update them based on your environment.
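To illustrate steps 1 and 2, here is a sketch with invented numbers for a hypothetical CPU-only worker (24 cores, 96 GB of memory); `min_cpu` and `min_mem` mirror the generator's variables:

```bash
# Hypothetical smallest worker: 24 cores, 98304 MB (96 GB) of memory
min_cpu=24
min_mem=98304
# With the modified generator lines, every core becomes one 1-CPU hived cell:
echo "unit-cpu: ${min_cpu}"                # -> childCellNumber: 24 in the template
echo "unit-mem: $(( min_mem / min_cpu ))"  # -> 4096 MB of memory paired with each cell
```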
My environment:
Did you install nvidia-drivers and nvidia-docker-runtime on the host? And did you set it as the default runtime?
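For reference, the usual way to make nvidia the default Docker runtime (a sketch assuming the standard nvidia-container-runtime install path):

```bash
# Register nvidia as the default runtime, then restart docker
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker
```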
Thank you. I set nvidia-docker-runtime as the default runtime, and it works.
But when I rerun the command, an issue occurs when starting up the Nvidia-Device-Plugin. This may be because of a driver/library version mismatch.
```
xxx@openpai-master-01:~$ kubectl get --namespace=kube-system pods | grep nvidia-device-plugin
nvidia-device-plugin-daemonset-rbjvw 0/1 CrashLoopBackOff 43 17h
xxx@openpai-master-01:~$ kubectl describe pod nvidia-device-plugin-daemonset-rbjvw -n kube-system
Name: nvidia-device-plugin-daemonset-rbjvw
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: openpai-master-01/172.168.3.107
Start Time: Mon, 06 Jul 2020 21:23:01 +0800
Labels: controller-revision-hash=74b655f448
name=nvidia-device-plugin-ds
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 10.207.147.137
Controlled By: DaemonSet/nvidia-device-plugin-daemonset
Containers:
nvidia-device-plugin-ctr:
Container ID: docker://2595be5b0bf1d29322e70141fcb838b98a5d477116e1cb0b6a47cac29a2704ae
Image: nvidia/k8s-device-plugin:1.0.0-beta4
Image ID: docker-pullable://nvidia/k8s-device-plugin@sha256:94d46bf513cbc43c4d77a364e4bbd409d32d89c8e686e12551cc3eb27c259b90
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n\\\"\"": unknown
Exit Code: 128
Started: Tue, 07 Jul 2020 14:50:47 +0800
Finished: Tue, 07 Jul 2020 14:50:47 +0800
Ready: False
Restart Count: 44
Environment: <none>
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bg6z5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
default-token-bg6z5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bg6z5
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m30s (x907 over 3h22m) kubelet, openpai-master-01 Back-off restarting failed container
```
I also get this:
```
xxx@openpai-master-01:~$ nvidia-smi
Tue Jul 7 16:48:58 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 418.30 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 00000000:07:00.0 Off | Off |
| N/A 34C P8 14W / 150W | 0MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 00000000:08:00.0 Off | Off |
| N/A 30C P8 13W / 150W | 0MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | 00000000:85:00.0 Off | Off |
| N/A 31C P8 14W / 150W | 0MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | 00000000:86:00.0 Off | Off |
| N/A 27C P8 13W / 150W | 0MiB / 8129MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
Should I manually upgrade the nvidia driver to 430.50? I don't know if it will cause more problems for the installation.
How do you install nvidia drivers? Through apt?
I think 418.30 was installed by OpenPAI (v0.11.0).
You'd better remove the drivers installed by OpenPAI. Since v1.0.0, we have removed this dependency and recommend users install drivers through apt.
Here is a playbook to remove them; you can follow it: https://github.com/microsoft/pai/blob/master/contrib/kubespray/clean-nvidia-drivers-installed-by-paictl.yml
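A sketch of that route (the inventory file and the apt package name are assumptions; pick the driver series matching the 430.50 userland that nvidia-smi reported):

```bash
# Run the cleanup playbook against the node (inventory path is hypothetical)
ansible-playbook -i hosts.yml clean-nvidia-drivers-installed-by-paictl.yml

# Reinstall a matching driver through apt (package name varies by distro/series)
sudo apt-get update
sudo apt-get install -y nvidia-driver-430
sudo reboot   # reload the kernel module so NVML and the driver version agree
```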
Thank you, I will try.
Short summary about the issue/question:
```
error: unable to recognize "priority-class.yaml": no matches for kind "PriorityClass" in version "scheduling.k8s.io/v1"
```
Brief what process you are following: I have cleaned the previous deployment (v0.11.0) and started to install OpenPAI v1.0.1. I meet this error when running the command `python paictl.py service start`. This error did not occur when deploying v0.11.0.
Error Log:
OpenPAI Environment: