microsoft / pai

Resource scheduling and cluster management for AI
https://openpai.readthedocs.io
MIT License

drivers-one-shot is not ready yet. Please wait for a moment! #666

Closed wm10240 closed 5 years ago

wm10240 commented 6 years ago

When I run "./deploy.py -d -p /cluster-configuration", I see this error:

Login Succeeded
2018-06-06 08:31:09,939 [INFO] - __main__ : docker registry login successfully
Error from server (AlreadyExists): configmaps "host-configuration" already exists
Error from server (AlreadyExists): configmaps "docker-credentials" already exists
Error from server (AlreadyExists): configmaps "gpu-configuration" already exists
secret "novumind" created
node "192.168.1.*" labeled
node "192.168.1.*" labeled
daemonset.extensions "drivers-one-shot" created
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!

I checked the pod; the events show this error:

Events:
  Type     Reason                  Age              From                   Message
  ----     ------                  ----             ----                   -------
  Normal   SuccessfulMountVolume   4m               kubelet, 192.168.1.55  MountVolume.SetUp succeeded for volume "data-path"
  Warning  FailedCreatePodSandBox  3s (x9 over 3m)  kubelet, 192.168.1.55  Failed create pod sandbox.

@ydye Hi, could you help? Thanks a lot.

ydye commented 6 years ago

1) Can you paste the error log of drivers-one-shot? 2) Link: how did you configure this field? And btw, the install process will take a little while.
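
For reference, one way to pull that log yourself (the pod name suffix is whatever "kubectl get pod" shows on your cluster; the container name nvidia-drivers appears later in this thread):

    # find the drivers pod
    kubectl get pod | grep drivers-one-shot
    # dump the log of its nvidia-drivers container
    kubectl logs drivers-one-shot-fgkws -c nvidia-drivers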

wm10240 commented 6 years ago

@ydye The error from drivers-one-shot is below. I also found a possible problem: I use a 1080ti, and I am not sure about these parameters in cluster-configuration.yaml, for example:

    gpu:
      # type: gpu{type}
      type: geforce1080ti
      count: 4

    machine-type: D8SV3

    machine-type: NC24R
# docker logs  k8s_nvidia-drivers_drivers-one-shot-fgkws_default_f7b0489d-6963-11e8-8568-2c4d54461f8c_12
++ uname -r
+ KERNEL_FULL_VERSION=4.4.0-87-generic
+ CURRENT_DRIVER=/var/drivers/nvidia/current
+ echo ======== If NVIDIA present exit early =========
+ nvidiaPresent
+ [[ -f /proc/driver/nvidia/version ]]
+ grep -q 384.69 /proc/driver/nvidia/version
======== If NVIDIA present exit early =========
+ lsmod
+ grep -qE '^nvidia'
+ lsmod
+ grep -qE '^nvidia_uvm'
+ [[ -e /dev/nvidia0 ]]
+ [[ -e /var/drivers/nvidia/384.69/lib64/libnvidia-ml.so ]]
+ return 5
+ echo ======== If NVIDIA driver already running uninstall it =========
======== If NVIDIA driver already running uninstall it =========
+ lsmod
+ grep -qE '^nvidia'
++ lsmod
++ tr -s ' '
++ grep -E '^nvidia'
++ cut -f 4 -d ' '
+ DEP_MODS='

nvidia_drm
nvidia_modeset,nvidia_uvm'
+ for mod in '${DEP_MODS//,/ }'
+ rmmod nvidia_drm
rmmod: ERROR: Module nvidia_drm is in use
+ echo 'The driver nvidia_drm is still in use, can'\''t unload it.'
The driver nvidia_drm is still in use, can't unload it.
ydye commented 6 years ago

Configuration

In machine-type, you should fill in the id of a machine-sku which you defined in this field. If your GPU type is geforce1080ti, you should set the gpu type to geforce1080ti.
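
A minimal sketch of how the two fields relate, assuming the cluster-configuration.yaml layout of this release; the cpu and mem values are illustrative, and only the gpu block and the machine-type reference come from this thread:

    machine-sku:
      D8SV3:                    # this id is what machine-type points to
        cpu:
          vcore: 8              # illustrative
        mem: 32GB               # illustrative
        gpu:
          # type: gpu{type}
          type: geforce1080ti   # match your actual GPU model
          count: 4

    machine-list:
      - hostname: dev-*
        hostip: 192.168.1.*
        machine-type: D8SV3     # must be an id defined under machine-sku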

Drivers issue

I assume you installed GPU drivers on your machine before bootstrapping pai. There are two solutions in this situation.

We can see the nvidia environment in the docker image is:

ENV NV_DRIVER=/var/drivers/nvidia/$NVIDIA_VERSION
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$NV_DRIVER/lib:$NV_DRIVER/lib64
ENV PATH=$PATH:$NV_DRIVER/bin

And the install path in the script:

./NVIDIA-Linux-x86_64-$NVIDIA_VERSION/nvidia-installer \
    --utility-prefix=$NV_DRIVER \
    --opengl-prefix=$NV_DRIVER \
    --x-prefix=$NV_DRIVER \
    --compat32-prefix=$NV_DRIVER \
    --opengl-libdir=lib64 \
    --utility-libdir=lib64 \
    --x-library-path=lib64 \
    --compat32-libdir=lib \
    -s -N || exit $?

The mount path in the yaml file:

    - mountPath: /var/drivers
      name: driver-path
    ...
    - name: driver-path
      hostPath:
        path: /var/drivers

So if you don't want to uninstall the GPU drivers, you should do the following work.

Step 1

Delete this folder: https://github.com/Microsoft/pai/tree/pai-0.8.y/src/drivers/deploy

Step 2

Change this field, and set it to the drivers path on your host: https://github.com/Microsoft/pai/blob/pai-0.8.y/src/hadoop-node-manager/deploy/hadoop-node-manager.yaml.template#L109
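
For example, the step-2 edit might look like the following, assuming your pre-installed driver libraries live under /usr/local/nvidia (the path is illustrative; point it at wherever your host actually keeps the driver files):

    # before: the path pai's own driver installer populates
    - name: driver-path
      hostPath:
        path: /var/drivers
    # after: the drivers you installed yourself
    - name: driver-path
      hostPath:
        path: /usr/local/nvidia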

wm10240 commented 6 years ago

@ydye I uninstalled the nvidia driver and it works, thanks a lot. But when I don't uninstall the nvidia driver and instead edit the files as you suggested, I get some errors:

# ./deploy.py -d -p /cluster-configuration
Enter labelExpend!
Login Succeeded
2018-06-11 01:44:16,835 [INFO] - __main__ : docker registry login successfully
Error from server (AlreadyExists): configmaps "host-configuration" already exists
Error from server (AlreadyExists): configmaps "docker-credentials" already exists
Error from server (AlreadyExists): configmaps "gpu-configuration" already exists
Error from server (AlreadyExists): error when creating "secret.yaml": secrets "*" already exists
Traceback (most recent call last):
  File "./deploy.py", line 451, in <module>
    main()
  File "./deploy.py", line 438, in main
    bootstrap_service(service_config)
  File "./deploy.py", line 273, in bootstrap_service
    dependency_bootstrap(serv, service_config, started_service)
  File "./deploy.py", line 253, in dependency_bootstrap
    dependency_bootstrap(pre_serv, service_config, started_service)
  File "./deploy.py", line 252, in dependency_bootstrap
    for pre_serv in service_config['servicelist'][serv]['prerequisite']:
KeyError: 'drivers'

And another error: in "/pai/pai-management/bootstrap/hadoop-service", node-label.sh contains:

    kubectl label nodes 192.168.1.* hdfsrole=worker
    kubectl label nodes 192.168.1.* yarnrole=worker
    kubectl label nodes 192.168.1.* hdfsrole=worker
    kubectl label nodes 192.168.1.* yarnrole=worker

This file seems to be missing some labels, for example zookeeper. I can't find how node-label.sh is generated (node-label.sh.template to node-label.sh). Please help, thanks.

ydye commented 6 years ago

Did you label some node with pai-role=master like this?

bootstrap/hadoop-service/node-label.sh will label the node with zookeeper="true": https://github.com/Microsoft/pai/blob/master/pai-management/bootstrap/hadoop-service/node-label.sh.template#L36 Currently we have some hardcoded logic that expands the label pai-role=master into other labels, such as zookeeper. This part will be changed in the future; for now our code expands the label in this way.
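
To check whether the expansion ran, a hedged sketch (the node IP is illustrative):

    # show the labels each node currently carries
    kubectl get nodes --show-labels
    # apply the master label that the hardcoded expansion starts from
    kubectl label nodes 192.168.1.55 pai-role=master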

wm10240 commented 6 years ago

@ydye Hi, thanks for your reply.

  1. I have labeled pai-role=master.
  2. I think the error comes from the function labelExpend, which is never called, so the generated node-label.sh is wrong and some pods will not be created. Before running "./deploy.py -d -p /cluster-configuration", I edited node-label.sh by hand and ran 'bash node-label.sh', and it works. So I think this may be a bug.
ydye commented 6 years ago

@wm10240

Label

The label should be pai-master: "true".

Label expand.

This function is called here. https://github.com/Microsoft/pai/blob/master/pai-management/paiLibrary/clusterObjectModel/paiObjectModel.py#L293

Question

How did you edit node-label.sh? Did you add the node labels yourself?

wm10240 commented 6 years ago

@ydye Sorry, I gave you the wrong information. The labels in machine-list:

    machine-list:
      - hostname: dev-*
        hostip: 192.168.1.*
        machine-type: D8SV3
        etcdid: etcdid1
        #sshport: PORT (Optional)
        #username: username (Optional)
        #password: password (Optional)
        k8s-role: master
        dashboard: "true"
        zkid: "1"
        pai-master: "true"
        watchdog: "true"

The node-label.sh I edited by hand:

    kubectl label nodes 192.168.1.* hdfsrole=master
    kubectl label nodes 192.168.1.* yarnrole=master
    kubectl label nodes 192.168.1.* zookeeper=true
    kubectl label nodes 192.168.1.* jobhistory=true
    kubectl label nodes 192.168.1.* launcher=true
    kubectl label nodes 192.168.1.* restserver=true
    kubectl label nodes 192.168.1.* webportal=true
    kubectl label nodes 192.168.1.* prometheus=true
    kubectl label nodes 192.168.1.* grafana=true
    kubectl label nodes 192.168.1.* pylon=true
    kubectl label nodes 192.168.1.* hadoop-name-node=true
    kubectl label nodes 192.168.1.* node-exporter=true
    kubectl label nodes 192.168.1.* hadoop-resource-manager=true

    kubectl label nodes 192.168.1.* hdfsrole=worker
    kubectl label nodes 192.168.1.* yarnrole=worker

If I use "./deploy.py -d -p /cluster-configuration", the generated node-label.sh is:


    kubectl label nodes 192.168.1.* hdfsrole=worker
    kubectl label nodes 192.168.1.* yarnrole=worker
    kubectl label nodes 192.168.1.* hdfsrole=worker
    kubectl label nodes 192.168.1.* yarnrole=worker
ydye commented 6 years ago

@wm10240

wm10240 commented 6 years ago

@ydye I have 2 nodes. Yes, that's right.

wm10240 commented 6 years ago

@ydye Other errors:

1. "pai/job-tutorial/README.md"

docker build -f Dockerfiles/Dockerfile.build.base -t pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04 Dockerfiles/
should be:
docker build -f Dockerfiles/cuda8.0-cudnn6/Dockerfile.build.base -t pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04 Dockerfiles/

2. "pai/pai-management/src/pylon/dockerfile"

COPY copied_file/pylon/src/* /root/

I can't find the directory "copied_file".

ydye commented 6 years ago

@wm10240

Question 1

@abuccts Can you follow up on this?

Question 2

This folder is generated by a script when building the image, and cleaned up after the build finishes.

ydye commented 6 years ago

BTW, if you have more issues that may not be related to the current topic, you'd better file another issue. Someone may come across the same problems you found, so a new issue may help others. @wm10240

feichaohao commented 6 years ago

My GPU is a Tesla K40m. What should I set the gpu type to?

feichaohao commented 6 years ago

2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_management_start : Begin to generate service drivers's template file
2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_template_generate : Begin to generate the template file in service drivers's configuration.
2018-07-12 01:11:18,068 [INFO] - paiLibrary.paiService.service_template_generate : Create template mapper for service drivers.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Done. Template mapper for service drivers is created.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/node-label.sh.template.
2018-07-12 01:11:18,069 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/node-label.sh.
2018-07-12 01:11:18,073 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/drivers.yaml.template.
2018-07-12 01:11:18,073 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/drivers.yaml.
2018-07-12 01:11:18,076 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/stop.sh.template.
2018-07-12 01:11:18,076 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/stop.sh.
2018-07-12 01:11:18,079 [INFO] - paiLibrary.paiService.service_template_generate : Generate the template file bootstrap/drivers/refresh.sh.template.
2018-07-12 01:11:18,080 [INFO] - paiLibrary.paiService.service_template_generate : Save the generated file to bootstrap/drivers/refresh.sh.
2018-07-12 01:11:18,084 [INFO] - paiLibrary.paiService.service_template_generate : The template file of service drivers is generated.
2018-07-12 01:11:18,084 [INFO] - paiLibrary.paiService.service_management_start : Begin to start service: [ drivers ]
2018-07-12 01:11:18,085 [INFO] - paiLibrary.paiService.service_start : Begin to execute service drivers's start script.
error: 'machinetype' already has a value (gpu), and --overwrite is false
Error from server (AlreadyExists): error when creating "drivers.yaml": daemonsets.extensions "drivers-one-shot" already exists
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
/usr/local/lib/python2.7/dist-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!
drivers-one-shot is not ready yet. Please wait for a moment!

ydye commented 6 years ago

Hi, can you paste the log of the drivers pod? You can get it with the following command.

kubectl logs drivers-name

And you can get the drivers pod name this way.

kubectl get pod | grep "drivers"
feichaohao commented 6 years ago

drivers-one-shot-8x4sp   0/1   NodeLost           0     16h
drivers-one-shot-mlqhw   0/1   CrashLoopBackOff   153   16h

feichaohao commented 6 years ago

kubectl logs drivers-one-shot-mlqhw
++ uname -r
[...]
nvidia_modeset,nvidia_uvm'

ydye commented 6 years ago

@feichaohao

echo 'The driver nvidia is still in use, can'\''t unload it.'
exit 1
The driver nvidia is still in use, can't unload it.

Have you manually installed GPU drivers before? If yes, you will have to uninstall them before installing drivers via pai.
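
A hedged sketch of the uninstall, depending on how the driver was originally installed (commands assume Ubuntu; reboot afterwards so no nvidia module stays loaded):

    # if the driver came from the NVIDIA .run installer:
    sudo nvidia-uninstall
    # if it came from distribution packages:
    sudo apt-get purge 'nvidia-*'
    sudo reboot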

feichaohao commented 6 years ago

@ydye That works, and the services start now. But I don't know which port I can visit, since nothing is served on 9286.

ydye commented 6 years ago

@feichaohao You could visit port 9090 (the k8s dashboard) to investigate what's wrong with webportal and restserver.
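
If the dashboard itself is hard to reach, a hedged command-line equivalent (the grep patterns are guesses at the pod names; use whatever your cluster shows):

    # list the pods and check their state
    kubectl get pod -o wide | grep -E 'webportal|rest'
    # then inspect a failing one
    kubectl describe pod <pod-name>
    kubectl logs <pod-name>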

feichaohao commented 6 years ago

I am hitting this error: Error from server (BadRequest): container "nvidia-drivers" in pod "drivers-one-shot-nh4c2" is waiting to start: ContainerCreating

ydye commented 6 years ago

@feichaohao Can you ssh to the target machine and paste the output of the following command?

sudo docker ps -a
feichaohao commented 6 years ago

I figured out what the error message is about: I need a VPN to pull the image.

fanyangCS commented 6 years ago

Closing the issue.

ICEORY commented 5 years ago

Hi, @ydye. I met the same issue, and uninstalling the nvidia driver does not work for me. More details are shown as follows:

I have removed the nvidia driver, and the worker contains 10 geforce1080ti GPUs.

Thank you!

ydye commented 5 years ago

@ICEORY Please ssh to a node which runs the drivers-one-shot service, and then get the service log with the following docker command.

sudo docker logs ${container-id}
ICEORY commented 5 years ago

@ydye Thanks, I am trying v0.8.2 now (the previous version was 0.7.2), and fixed this issue by uninstalling the nvidia driver.

ydye commented 5 years ago

@ICEORY Congrats, and I will close this issue.