Rebase docker on tensorflow (can now use nvidia-docker for GPU version)

dbdean commented 7 years ago

Sorry for dropping a biggish pull request on you unannounced, but I haven't been able to get the GPU code working. I noticed that a lot of the GPU (and CPU) Dockerfile seemed to be based on the tensorflow docker tools at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker, so I thought it might be worth changing the base FROM image from gdal to tensorflow.

This has made the Dockerfiles much simpler (and the GPU one is now automatically generated in the Makefile, as it is only four characters different from the CPU version).
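To illustrate the idea of deriving the GPU Dockerfile from the CPU one, here is a minimal Python sketch. It assumes the four-character difference is a "-gpu" suffix on the base image tag in the FROM line, which the comment above does not spell out. The function name and the example image tag are hypothetical, and the PR itself does this in the Makefile rather than in Python.

```python
import re


def gpu_dockerfile(cpu_dockerfile):
    """Return GPU Dockerfile text derived from the CPU Dockerfile text.

    Assumption: the two files differ only by a "-gpu" suffix on the
    base image named in the first FROM line.
    """
    # Append "-gpu" to the image reference of the first FROM line only;
    # every other line is passed through unchanged.
    return re.sub(
        r"^FROM[ \t]+\S+",
        lambda match: match.group(0) + "-gpu",
        cpu_dockerfile,
        count=1,
        flags=re.MULTILINE,
    )


# Hypothetical CPU Dockerfile contents, for illustration only.
cpu_text = "FROM tensorflow/tensorflow:0.8.0\nRUN echo hello\n"
gpu_text = gpu_dockerfile(cpu_text)
```

Generating one file from the other keeps the two variants from drifting apart, since only the CPU Dockerfile is edited by hand.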

I have also added a very simple set of Python unit tests that just test module importing at the moment, and added them to the Makefile and the Travis CI configuration. All tests currently pass!
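For readers unfamiliar with import-only tests, a minimal sketch of the pattern might look like the following. The module names are standard-library placeholders chosen so the example runs anywhere; they are not the modules tested in this PR.

```python
import importlib
import unittest

# Placeholder module names (assumption); a real suite would list the
# project's own modules here.
MODULES_TO_CHECK = ["json", "os.path"]


class ImportSmokeTest(unittest.TestCase):
    """Fail if any listed module cannot be imported."""

    def test_modules_import(self):
        for name in MODULES_TO_CHECK:
            try:
                importlib.import_module(name)
            except ImportError as error:
                self.fail("could not import %s: %s" % (name, error))
```

A file like this can be run with python -m unittest, which is also easy to call from a Makefile target or a CI configuration. Tests of this kind catch missing dependencies in the image but say nothing about whether the code behaves correctly.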

I can also confirm that the CPU training works everywhere I've tried it, and the GPU training works on AWS EC2 GPU instances.

I would like to update tensorflow and tflearn, but I've left it at 0.8 and the arbitrary git checkout for now. Updating those can be a problem for a different pull request.

Thanks!

andrewljohnson commented 7 years ago

Thanks for the PR, I'll try and build on my GPU box.

andrewljohnson commented 7 years ago

Currently building this on my Mac and Linux GPU box to confirm everything works.

andrewljohnson commented 7 years ago

@dbdean this fails for me on Linux/GPU because I don't have nvidia-docker. This throws command not found for nvidia-docker when running make dev-gpu.

Could you update this pull request to edit the README so there are instructions on how to do the full workflow on a GPU-enabled box?

andrewljohnson commented 7 years ago

Another good tweak for this PR would be to set up the tests to run in travis.yml

dbdean commented 7 years ago

@andrewljohnson, I have updated the README to provide instructions for installing nvidia-docker on Linux hosts.

WRT the tests being in Travis, that should already be set up in the .travis.yml file in this PR.

dbdean commented 7 years ago

I've made some further changes to this PR, mostly about using the same docker run script instead of separate scripts for cpu and gpu usage. I've also made notebooks use the script too, and confirmed that I can access the notebooks over http.
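As a rough illustration of sharing one run script between CPU and GPU usage, the Python sketch below picks the executable based on a flag and fails with a clear message when it is missing. The function names and the image name are hypothetical; the actual script in this PR may be structured quite differently.

```python
import shutil


def docker_binary(use_gpu, which=shutil.which):
    """Return the executable name a shared run script would invoke."""
    name = "nvidia-docker" if use_gpu else "docker"
    # Check up front so a missing install gives a readable error
    # instead of a bare "command not found".
    if which(name) is None:
        raise RuntimeError(name + " was not found on PATH; install it first")
    return name


def run_command(image, use_gpu, which=shutil.which):
    """Build the argument list for an interactive, self-removing container."""
    return [docker_binary(use_gpu, which), "run", "-it", "--rm", image]
```

The resulting list can be passed to subprocess.call. The which parameter exists only so the lookup can be replaced in tests.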

andrewljohnson commented 7 years ago

Sorry to push back on this, but I am having trouble getting this to work, and I think others will too. It seems problematic for getting up and running to take so many more steps, especially if they are not explained clearly.

Can we make this a step-by-step set of instructions that doesn't require reading docs from NVIDIA?
It's not clear to me as written whether I need to install nvidia-modprobe (until I got the error saying that I needed to). We then need to tell the user how to do this (i.e. have a step where we just tell the user to run sudo apt-get install nvidia-modprobe, and say when that needs to happen).
It's not clear to me how I install the NVIDIA drivers.

I feel like the README for downloading and building this project should be distilled to a clear set of steps that someone can mindlessly follow. Maybe some people will have to reference external docs, but they shouldn't have to until they hit a snag in the steps included in the README.

The instructions can be like "this is how you do it on Ubuntu," and should state the sequence of commands to run in the terminal, along with an explanation of any GUI steps.


dbdean commented 7 years ago

@andrewljohnson, no problems. That's a perfectly reasonable request for this PR.

I kept it simple, as I was concerned about the instructions going stale as new versions of the NVIDIA drivers and/or nvidia-docker are released. However, the NVIDIA documentation is so poor that it is probably a good idea to at least provide a currently-working example.

I'll try and put something together over the next few days, outlining how I got it working in AWS GPU instances at least. Hopefully that should cover enough of the pitfalls that most people can get through it without too many changes.

dbdean commented 7 years ago

While I've updated the instructions above, they don't actually work yet as written, at least on Ubuntu 16.04 EC2 instances. I've gotten it to work on 14.04 already, so I think I'll get it to work there again from scratch, and provide those instructions.

dbdean commented 7 years ago

Ok. I think my instructions were correct, but I hadn't made sure to download the latest NVIDIA driver because the NVIDIA download website is confusing. I haven't checked everything yet, though. I'll let you know when it all works for me, and then you can try, @andrewljohnson.

dbdean commented 7 years ago

@andrewljohnson, I have run through those instructions on a fresh AWS EC2 16.04 GPU instance, and confirmed that everything works through to training of the neural network. Can you please have a go and let me know how it goes for you on your box? Thanks!

trailbehind / DeepOSM

Rebase docker on tensorflow (can now use nvidia-docker for GPU version) #76