microsoft / DLWorkspace

Deep Learning Workspace
Other
202 stars 75 forks source link

On-prem Single-Ubuntu #193

Open shizhaojingszj opened 6 years ago

shizhaojingszj commented 6 years ago

Hi guys, I want to test DLworkspace on my ubuntu server WITHOUT connecting with any azure cluster thing.

So I am following SingleUbuntu on-prem this document. Unfortunately I am stuck at the last line.

./deploy.py --verbose scriptblocks ubuntu_uncordon

My ubuntu server has the same setup as indicated in the document:

  1. ubuntu 16.04 x64

Things I have done:

  1. run src/ClusterBootstrap/install_prerequisites.sh as mentioned in DevEnvironment/Readme.md.
  2. I have manually installed docker-ce, build an new GPU-favored kubernetes binary and add the folder to PATH, also DevEnvironment/Readme.md.
  3. I create a pair of ssh-keys, and manually put them inside ./src/ClusterBootstrap/deploy/sshkey folder
  4. I also setup the mssqlserver docker container to run on my Ubuntu server at the port and setup authentication as shown in my config.yaml (docker-compose run db, I am assuming DLworkspace will find the database and setup everything it needs)
  5. I setup an email account at outlook.com and put info inside my config.yaml, follow Auth

Now my whole config.yaml file looks like this: image

Now I keep getting the error shown below: image

It looks like kubernetes cluster is not running. So what should I do to get it work on my own server?

shizhaojingszj commented 6 years ago

Several things to clarify:

  1. My ubuntu server has domain-name GPU2
  2. cluster_name and clusterId are randomly chosen
jinlccs commented 6 years ago

Please uncomment:

useclusterfile : true

and try again.

shizhaojingszj commented 6 years ago

Thanks for your suggestion, I have uncommented the line and tried again, but I keep getting another error: "curl error" as shown in this gist

Besides this curl error, I also found that my docker service was not working after running the deploy.py, when I tried to manually start it, the error message was "Failed to start docker.service: Unit flanneld.service not found."

I cannot get docker service work again, unless I reboot the Ubuntu server. In that gist mensioned above, I'm pretty sure my docker service is running before running deploy.py.

Any suggestions?

hongzhili commented 6 years ago

@shizhaojingszj Please make sure the flannel service is not required by docker service. BTW, which version of codes are you using? It seems your codes are not up-to-date.

shizhaojingszj commented 6 years ago

MY BIG MISTAKE I thought I'm using this repo but actually using another repo master branch, I'm looking forward to retrying the whole thing once the following problem is fixed.

"Please make sure the flannel service is not required by docker service."

I don't know how to restore my docker service to stop making flanneld as a dependency. Could you explain more details?

  1. I have tried several times to reinstall docker-ce, but failed everytime. image

When try to restart docker service I keep getting this image

But after a reboot, docker run --rm -it hello-world looks fine, but failed after running deploy.py again.

shizhaojingszj commented 6 years ago

https://gist.github.com/shizhaojingszj/00ac3f7d57ce0b0aaf1886a906f5185c

The above gist showed my docker info results