Closed wm10240 closed 5 years ago
1) Can you paste the error log of drivers-one-shot? 2) Regarding the link: how did you configure this field? And by the way, the install process will take a while.
@ydye The error from drivers-one-shot is below. I also found a problem: I am using a 1080 Ti and am not sure about these parameters in cluster-configuration.yaml, for example:
gpu:
# type: gpu{type}
type: geforce1080ti
count: 4
machine-type: D8SV3
machine-type: NC24R
# docker logs k8s_nvidia-drivers_drivers-one-shot-fgkws_default_f7b0489d-6963-11e8-8568-2c4d54461f8c_12
++ uname -r
+ KERNEL_FULL_VERSION=4.4.0-87-generic
+ CURRENT_DRIVER=/var/drivers/nvidia/current
+ echo ======== If NVIDIA present exit early =========
+ nvidiaPresent
+ [[ -f /proc/driver/nvidia/version ]]
+ grep -q 384.69 /proc/driver/nvidia/version
======== If NVIDIA present exit early =========
+ lsmod
+ grep -qE '^nvidia'
+ lsmod
+ grep -qE '^nvidia_uvm'
+ [[ -e /dev/nvidia0 ]]
+ [[ -e /var/drivers/nvidia/384.69/lib64/libnvidia-ml.so ]]
+ return 5
+ echo ======== If NVIDIA driver already running uninstall it =========
======== If NVIDIA driver already running uninstall it =========
+ lsmod
+ grep -qE '^nvidia'
++ lsmod
++ tr -s ' '
++ grep -E '^nvidia'
++ cut -f 4 -d ' '
+ DEP_MODS='
nvidia_drm
nvidia_modeset,nvidia_uvm'
+ for mod in '${DEP_MODS//,/ }'
+ rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
+ echo 'The driver nvidia_drm is still in use, can'\''t unload it.'
The driver nvidia_drm is still in use, can't unload it.
For machine-type, you should fill in the id of a machine-sku that you defined in this field. If your GPU type is geforce1080ti, setting the gpu type to geforce1080ti is correct.
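To make the relationship between the two fields concrete, here is a minimal sketch of a cluster-configuration.yaml fragment. This assumes the pai-0.8.y schema; the sku id "gpu-worker" and the hostname are illustrative, not required names.

```yaml
# Hypothetical fragment of cluster-configuration.yaml.
# "gpu-worker" is an illustrative machine-sku id; any id works as long
# as machine-list entries reference it exactly.
machine-sku:
  gpu-worker:
    gpu:
      type: geforce1080ti     # your GPU type
      count: 4
machine-list:
  - hostname: worker-01       # illustrative hostname
    machine-type: gpu-worker  # must equal a machine-sku id defined above
```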
I assume you installed GPU drivers on your machine before bootstrapping PAI. There are two solutions in this situation.
The drivers' docker image: https://github.com/Microsoft/pai/blob/pai-0.8.y/src/drivers/build/drivers.dockerfile
The nvidia drivers installer: https://github.com/Microsoft/pai/blob/pai-0.8.y/src/drivers/build/install-nvidia-drivers
The drivers deployment template: https://github.com/Microsoft/pai/blob/pai-0.8.y/src/drivers/deploy/drivers.yaml.template
We can see that the nvidia environment in the docker image is:
ENV NV_DRIVER=/var/drivers/nvidia/$NVIDIA_VERSION
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$NV_DRIVER/lib:$NV_DRIVER/lib64
ENV PATH=$PATH:$NV_DRIVER/bin
And the install paths in the script:
./NVIDIA-Linux-x86_64-$NVIDIA_VERSION/nvidia-installer \
--utility-prefix=$NV_DRIVER \
--opengl-prefix=$NV_DRIVER \
--x-prefix=$NV_DRIVER \
--compat32-prefix=$NV_DRIVER \
--opengl-libdir=lib64 \
--utility-libdir=lib64 \
--x-library-path=lib64 \
--compat32-libdir=lib \
-s -N || exit $?
The mount paths in the yaml file:
- mountPath: /var/drivers
name: driver-path
...
- name: driver-path
hostPath:
path: /var/drivers
So if you don't want to uninstall the GPU drivers, you should do the following:
1. Delete this folder: https://github.com/Microsoft/pai/tree/pai-0.8.y/src/drivers/deploy
2. Change this field and set it to the drivers path on your host: https://github.com/Microsoft/pai/blob/pai-0.8.y/src/hadoop-node-manager/deploy/hadoop-node-manager.yaml.template#L109
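For step 2, a rough sketch of pointing the path PAI services expect at the host's existing driver install. The host driver directory is an assumption you must adjust to your distro, and the real /var/drivers path needs root:

```shell
#!/bin/sh
# Hypothetical helper: make /var/drivers/nvidia/current point at the
# host's existing NVIDIA driver files instead of a PAI-installed copy.
# Both arguments are assumptions to adjust for your machine.
link_host_driver() {
    host_driver_dir=$1            # e.g. /usr/lib/nvidia-384 (assumption)
    pai_root=${2:-/var/drivers}   # the path the yaml templates mount
    mkdir -p "$pai_root/nvidia"
    # -sfn: replace any existing "current" symlink atomically-ish
    ln -sfn "$host_driver_dir" "$pai_root/nvidia/current"
}
```

Run it as root for the real path, e.g. `link_host_driver /usr/lib/nvidia-384`, then verify that libnvidia-ml.so is reachable under /var/drivers/nvidia/current before starting services.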
@ydye I uninstalled the nvidia driver and it works, thanks a lot. But when I don't uninstall the nvidia driver and edit the files as you suggested, I get some errors:
# ./deploy.py -d -p /cluster-configuration
Enter labelExpend!
Login Succeeded
2018-06-11 01:44:16,835 [INFO] - __main__ : docker registry login successfully
Error from server (AlreadyExists): configmaps "host-configuration" already exists
Error from server (AlreadyExists): configmaps "docker-credentials" already exists
Error from server (AlreadyExists): configmaps "gpu-configuration" already exists
Error from server (AlreadyExists): error when creating "secret.yaml": secrets "*" already exists
Traceback (most recent call last):
File "./deploy.py", line 451, in <module>
main()
File "./deploy.py", line 438, in main
bootstrap_service(service_config)
File "./deploy.py", line 273, in bootstrap_service
dependency_bootstrap(serv, service_config, started_service)
File "./deploy.py", line 253, in dependency_bootstrap
dependency_bootstrap(pre_serv, service_config, started_service)
File "./deploy.py", line 252, in dependency_bootstrap
for pre_serv in service_config['servicelist'][serv]['prerequisite']:
KeyError: 'drivers'
And another error: in "/pai/pai-management/bootstrap/hadoop-service", node-label.sh contains:
kubectl label nodes 192.168.1.* hdfsrole=worker
kubectl label nodes 192.168.1.* yarnrole=worker
kubectl label nodes 192.168.1.* hdfsrole=worker
kubectl label nodes 192.168.1.* yarnrole=worker
This file seems to be missing some labels, for example zookeeper. I can't find how node-label.sh is generated (node-label.sh.template to node-label.sh). Please give me some help, thanks.
Did you label some node with pai-role=master, like this?
bootstrap/hadoop-service/node-label.sh will label the node with zookeeper="true":
https://github.com/Microsoft/pai/blob/master/pai-management/bootstrap/hadoop-service/node-label.sh.template#L36
Currently we have some hardcoded logic that expands the label pai-role=master into other labels, such as zookeeper. This part will be changed in the future; for now our code expands the label in this way.
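As a rough illustration only (the real expansion is Python code in PAI's paiObjectModel.py, not shell), a node marked as master gets expanded into the per-service labels seen in node-label.sh, roughly like this:

```shell
#!/bin/sh
# Illustrative sketch, NOT the actual PAI code: emit the kubectl label
# commands that the master role expands into. The label list is taken
# from the node-label.sh output quoted in this thread.
expand_master_labels() {
    node_ip=$1
    for label in zookeeper jobhistory launcher restserver webportal \
                 prometheus grafana pylon hadoop-name-node node-exporter \
                 hadoop-resource-manager; do
        echo "kubectl label nodes $node_ip $label=true"
    done
    echo "kubectl label nodes $node_ip hdfsrole=master"
    echo "kubectl label nodes $node_ip yarnrole=master"
}
```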
@ydye Hi, thanks for your reply.
@wm10240
The label should be pai-master: "true".
This function is called here. https://github.com/Microsoft/pai/blob/master/pai-management/paiLibrary/clusterObjectModel/paiObjectModel.py#L293
How did you edit node-label.sh? Did you add the node labels yourself?
@ydye Sorry, I gave you the wrong information. The machine-list is:
machine-list:
- hostname: dev-*
hostip: 192.168.1.*
machine-type: D8SV3
etcdid: etcdid1
#sshport: PORT (Optional)
#username: username (Optional)
#password: password (Optional)
k8s-role: master
dashboard: "true"
zkid: "1"
pai-master: "true"
dashboard: "true"
watchdog: "true"
the node-label.sh:
kubectl label nodes 192.168.1.* hdfsrole=master
kubectl label nodes 192.168.1.* yarnrole=master
kubectl label nodes 192.168.1.* zookeeper=true
kubectl label nodes 192.168.1.* jobhistory=true
kubectl label nodes 192.168.1.* launcher=true
kubectl label nodes 192.168.1.* restserver=true
kubectl label nodes 192.168.1.* webportal=true
kubectl label nodes 192.168.1.* prometheus=true
kubectl label nodes 192.168.1.* grafana=true
kubectl label nodes 192.168.1.* pylon=true
kubectl label nodes 192.168.1.* hadoop-name-node=true
kubectl label nodes 192.168.1.* node-exporter=true
kubectl label nodes 192.168.1.* hadoop-resource-manager=true
kubectl label nodes 192.168.1.* hdfsrole=worker
kubectl label nodes 192.168.1.* yarnrole=worker
If I use "./deploy.py -d -p /cluster-configuration", the node-label.sh is:
kubectl label nodes 192.168.1.* hdfsrole=worker
kubectl label nodes 192.168.1.* yarnrole=worker
kubectl label nodes 192.168.1.* hdfsrole=worker
kubectl label nodes 192.168.1.* yarnrole=worker
@wm10240
@ydye I have 2 nodes. Yes, that's right.
@ydye Another error:
docker build -f Dockerfiles/Dockerfile.build.base -t pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04 Dockerfiles/
should be:
docker build -f Dockerfiles/cuda8.0-cudnn6/Dockerfile.build.base -t pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04 Dockerfiles/
COPY copied_file/pylon/src/* /root/
I can't find the directory "copied_file".
@wm10240
@abuccts Can you follow up on this?
This folder is generated by a script when building the image, and cleaned up after the build finishes.
BTW, if you have more issues that may not be related to the current topic, you'd better file another issue. Someone may come across the same problem you found, so a new issue may help others. @wm10240
My GPU is a Tesla K40m. What should I set the gpu type to?
2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_management_start : Begin to generate service drivers's template file
2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_template_generate : Begin to generate the template file in service drivers's configuration.
2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_template_generate : Create template mapper for service drivers.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Done. Template mapper for service drivers is created.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/node-label.sh.template.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/node-label.sh.
2018-07-12 01:11:18,073 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/drivers.yaml.template.
2018-07-12 01:11:18,073 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/drivers.yaml.
2018-07-12 01:11:18,076 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/stop.sh.template.
2018-07-12 01:11:18,076 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/stop.sh.
2018-07-12 01:11:18,079 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/refresh.sh.template.
2018-07-12 01:11:18,080 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/refresh.sh.
2018-07-12 01:11:18,084 [INFO] - paiLibrary.paiService.service_template_generate : The template file of service drivers is generated.
2018-07-12 01:11:18,084 [INFO] - paiLibrary.paiService.service_management_start : Begin to start service: [ drivers ]
2018-07-12 01:11:18,085 [INFO] - paiLibrary.paiService.service_start : Begin to execute service drivers's start script.
error: 'machinetype' already has a value (gpu), and --overwrite is false
Error from server (AlreadyExists): error when creating "drivers.yaml": daemonsets.extensions "drivers-one-shot" already exists
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
Hi, can you paste the log of the drivers pod? You can get it with the following command.
kubectl logs drivers-name
And you can get the drivers pod name this way.
kubectl get pod | grep "drivers"
drivers-one-shot-8x4sp 0/1 NodeLost 0 16h
drivers-one-shot-mlqhw 0/1 CrashLoopBackOff 153 16h
kubectl logs drivers-one-shot-mlqhw
++ uname -r
nvidia_modeset,nvidia_uvm'
@feichaohao
echo 'The driver nvidia is still in use, can'\''t unload it.'
exit 1
The driver nvidia is still in use, can't unload it.
Have you manually installed GPU drivers before? If yes, you will have to uninstall them before installing drivers through PAI.
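For reference, here is a sketch of how the install script detects modules that depend on the nvidia driver (the `tr | grep | cut` pipeline visible in the log above), plus a manual unload order that often works once nothing holds the GPU open. This is a sketch of the technique, not the actual PAI script:

```shell
#!/bin/sh
# Extract column 4 ("Used by") of every nvidia* line in `lsmod` output,
# mirroring the DEP_MODS pipeline from the drivers-one-shot log.
nvidia_dep_mods() {
    tr -s ' ' | grep -E '^nvidia' | cut -f 4 -d ' '
}

# Manual unload in dependency order. rmmod will still fail with
# "Module ... is in use" while an X server or a CUDA process holds the
# device, so stop those first (e.g. `sudo systemctl stop lightdm`).
unload_nvidia() {
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        lsmod | grep -qE "^${mod} " && rmmod "$mod"
    done
}
```

Typical use: stop the display manager, run `lsmod | nvidia_dep_mods` to see what is loaded, then `unload_nvidia` as root before re-running the PAI drivers service.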
@ydye That works. The services also started, but I don't know which port I can visit, since port 9286 shows nothing.
@feichaohao You could visit port 9090 (k8s-dashboard) to investigate what's wrong with webportal and restserver.
I met this error: Error from server (BadRequest): container "nvidia-drivers" in pod "drivers-one-shot-nh4c2" is waiting to start: ContainerCreating
@feichaohao Can you ssh to the target machine and paste the output of the following command?
sudo docker ps -a
I figured out what the error message is about: I need a VPN to pull the image.
close the issue
Hi @ydye, I met the same issue. Uninstalling the nvidia driver does not work for me. More details are shown as follows:
run python paictl.py service start -p ~/pai-config
2018-11-29 14:28:34,320 [INFO] - paiLibrary.paiService.service_management_start : Begin to clean all service's generated template file
2018-11-29 14:28:34,321 [INFO] - paiLibrary.paiService.service_template_clean : Begin to delete the generated template of cluster-configuration's service.
2018-11-29 14:28:34,321 [INFO] - paiLibrary.paiService.service_template_clean : The generated template files of cluster-configuration's service have been cleaned up.
2018-11-29 14:28:34,321 [INFO] - paiLibrary.paiService.service_management_start : Successfully start cluster-configuration
2018-11-29 14:28:34,321 [INFO] - paiLibrary.paiService.service_management_start : -----------------------------------------------------------
2018-11-29 14:28:34,324 [INFO] - paiLibrary.paiService.service_management_start : -----------------------------------------------------------
2018-11-29 14:28:34,324 [INFO] - paiLibrary.paiService.service_management_start : Begin to generate service drivers's template file
2018-11-29 14:28:34,325 [INFO] - paiLibrary.paiService.service_template_generate : Begin to generate the template file in service drivers's configuration.
2018-11-29 14:28:34,325 [INFO] - paiLibrary.paiService.service_template_generate : Create template mapper for service drivers.
2018-11-29 14:28:34,325 [INFO] - paiLibrary.paiService.service_template_generate : Done. Template mapper for service drivers is created.
2018-11-29 14:28:34,325 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/node-label.sh.template.
2018-11-29 14:28:34,325 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/node-label.sh.
2018-11-29 14:28:34,328 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/drivers.yaml.template.
2018-11-29 14:28:34,329 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/drivers.yaml.
2018-11-29 14:28:34,331 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/stop.sh.template.
2018-11-29 14:28:34,331 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/stop.sh.
2018-11-29 14:28:34,335 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/refresh.sh.template.
2018-11-29 14:28:34,335 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/refresh.sh.
2018-11-29 14:28:34,339 [INFO] - paiLibrary.paiService.service_template_generate : The template file of service drivers is generated.
2018-11-29 14:28:34,339 [INFO] - paiLibrary.paiService.service_management_start : Begin to start service: [ drivers ]
2018-11-29 14:28:34,339 [INFO] - paiLibrary.paiService.service_start : Begin to execute service drivers's start script.
node/192.168.1.232 not labeled
daemonset.apps/drivers-one-shot created
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
warnings.warn(warning, RequestsDependencyWarning)
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
run kubectl describe pod driver-one-shot
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 39m kubelet, 192.168.1.232 MountVolume.SetUp succeeded for volume "driver-path"
Normal SuccessfulMountVolume 39m kubelet, 192.168.1.232 MountVolume.SetUp succeeded for volume "device-path"
Normal SuccessfulMountVolume 39m kubelet, 192.168.1.232 MountVolume.SetUp succeeded for volume "kernel-head"
Normal SuccessfulMountVolume 39m kubelet, 192.168.1.232 MountVolume.SetUp succeeded for volume "modules-path"
Normal SuccessfulMountVolume 39m kubelet, 192.168.1.232 MountVolume.SetUp succeeded for volume "drivers-log"
Warning FailedCreatePodSandBox 39m kubelet, 192.168.1.232 Failed create pod sandbox.
Normal Started 38m (x3 over 39m) kubelet, 192.168.1.232 Started container
Normal Pulling 38m (x4 over 39m) kubelet, 192.168.1.232 pulling image "docker.io/openpai/drivers:v0.7.2"
Normal Pulled 38m (x4 over 39m) kubelet, 192.168.1.232 Successfully pulled image "docker.io/openpai/drivers:v0.7.2"
Normal Created 38m (x4 over 39m) kubelet, 192.168.1.232 Created container
Warning BackOff 4m (x151 over 39m) kubelet, 192.168.1.232 Back-off restarting failed container
run kubectl logs drivers-one-shot-hmtzj
failed to open log file "/var/log/pods/0d8c07d9-f3e3-11e8-88ce-ac1f6b9285be/nvidia-drivers_9.log": open /var/log/pods/0d8c07d9-f3e3-11e8-88ce-ac1f6b9285be/nvidia-drivers_9.log: no such file or directory
And I get the message "containers with unready status: [nvidia-drivers]" from http://<master>:9090.
I have removed the nvidia driver, and the worker contains 10 geforce1080ti GPUs.
Thank you!
@ICEORY Please ssh to a node that runs the drivers-one-shot service, then get the service log with the following docker command.
sudo docker logs ${container-id}
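If several drivers containers have accumulated (as the restart counts above suggest), a small sketch to grab all their logs at once. The parsing assumes `docker ps -a`'s default layout, where the container ID is the first whitespace-separated field:

```shell
#!/bin/sh
# Sketch: pick the container IDs of drivers-one-shot containers out of
# `docker ps -a` output (read from stdin). First field = container ID.
drivers_container_ids() {
    grep 'drivers-one-shot' | cut -d ' ' -f 1
}

# Real usage (needs root):
#   sudo docker ps -a | drivers_container_ids | while read -r id; do
#       sudo docker logs "$id" > "driver-$id.log" 2>&1
#   done
```

This is handy when `kubectl logs` fails because the pod's log file is already gone, as in the error above.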
@ydye Thanks. I am trying v0.8.2 now (the previous version was 0.7.2), and fixed this issue by uninstalling the nvidia driver.
@ICEORY Congrats, and I will close this issue.
When running "./deploy.py -d -p /cluster-configuration", I find this error:
I checked the pod; the error is:
@ydye Hi, can you give me some help? Thanks a lot.