Open zacsketches opened 7 years ago
After trying six or seven different AMIs and dozens of Julia-version/Mocha-version combinations, the following configuration allows training on AWS:
- Instance: p2.xlarge
- AMI: Bitfusion Deep Learning AMI
- Julia: v0.4.7, built from source
- Mocha: v0.1.2
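For anyone reproducing this, the package side of the setup can be sketched roughly as below (a sketch, not the exact commands from the upcoming tutorial; it assumes Julia v0.4.7 is already built and on the `PATH`, and uses the old Julia 0.4 `Pkg` API plus Mocha's documented `MOCHA_USE_CUDA` switch):

```julia
# Sketch: pin Mocha to the version known to build on Julia v0.4.7
# (run from the Julia v0.4.7 REPL; old Pkg API).
Pkg.add("Mocha")
Pkg.pin("Mocha", v"0.1.2")

# Enable the CUDA backend before loading Mocha.
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha
```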
Results from this setup on the CIFAR10 example:
```
29-Oct 15:22:43:INFO:root: Accuracy (avg over 10000) = 78.8200%
29-Oct 15:22:43:INFO:root:---------------------------------------------------------
29-Oct 15:22:43:INFO:root:
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-train
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-test
29-Oct 15:22:43:INFO:root:Shutting down CuDNN backend...
29-Oct 15:22:43:INFO:root:CuDNN Backend shutdown finished!

real    19m13.617s
user    14m1.893s
sys     5m12.049s
```
There are build specifics required to make this combination work, which I'm going to document in a new tutorial on training Mocha in the cloud. When the new tutorial is up I'll close this comment.
@pluskid there is an unmistakable build error in the compatibility.jl file, related to the way you are trying to identify the BLAS library. Any version of Julia past 0.4.7 will not build Mocha correctly, which is probably the culprit for the failing Travis builds. This might belong in a separate issue, but it was the last hurdle I had to clear in order to find a working combination on AWS.
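For context, a version guard along these lines is one way to handle the relocated BLAS lookup (a hypothetical sketch, not Mocha's actual code; the exact symbol locations before and after v0.5 are my assumption):

```julia
# Hypothetical sketch: query the BLAS vendor in a way that survives
# the Julia 0.4 -> 0.5 API reorganization.
blas_vendor = if VERSION < v"0.5.0-"
    Base.blas_vendor()    # assumed location in Julia 0.4.x
else
    Base.BLAS.vendor()    # assumed location in Julia 0.5+
end
```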
Thanks! I'm a bit busy recently. Will take a look at the blas issue when I have a chance. Could you open an issue for that for tracking?
@pluskid I had a busy week and couldn't get back to Mocha until now. This weekend I'll create a new issue describing the BLAS problem with enough detail that you should be able to get it fixed.
I also found a few minor errors in my last tutorial.
After using lots of deep learning frameworks, I find that I enjoy programming ML in Julia, and I prefer Mocha's syntax over MXNet's. Therefore, I'm interested in growing the Mocha documentation and examples from their current state to the point where Mocha is a platform for learning deep learning.
With this goal in mind, I wrote an extension to the MNIST tutorial earlier this month that provided an example of learning curves.
I'd now like to write an extension of the CIFAR-10 tutorial that shows how to train the model on AWS with a GPU instance. However, I'm having trouble running the GPU backend. I have a p2.xlarge instance provisioned with the NVIDIA tools and an Ubuntu 14.04 OS.
`nvidia-smi` runs correctly. It fails on `Pkg.test("Mocha")` with the following output:

I'm going to keep troubleshooting this, but if anyone has a successful path for running Mocha on AWS, please let me know your setup.
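While debugging, a smoke test of the GPU backend on its own, separate from the full test suite, can help narrow down where the failure lives (a sketch using Mocha's documented backend API; it assumes CUDA and cuDNN are installed on the instance):

```julia
# Enable the CUDA backend before loading Mocha.
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha

# If init/shutdown succeed, the CUDA/cuDNN toolchain is at least
# loadable, and the failure is more likely inside the test suite.
backend = GPUBackend()
init(backend)
shutdown(backend)
```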