Open zacsketches opened 7 years ago
After trying six or seven different AMIs and dozens of Julia-version/Mocha-version combinations, the following configuration allows training on AWS:
- Instance: p2.xlarge
- AMI: Bitfusion Deep Learning AMI
- Julia: v0.4.7, built from source
- Mocha: v0.1.2
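For anyone reproducing this, the package side of the setup can be sketched roughly as below (a sketch, not the exact commands from the upcoming tutorial; it assumes Julia v0.4.7 is already built and on the `PATH`, and uses the old Julia 0.4 `Pkg` API plus Mocha's documented `MOCHA_USE_CUDA` switch):

```julia
# Sketch: pin Mocha to the version known to build on Julia v0.4.7
# (run from the Julia v0.4.7 REPL; old Pkg API).
Pkg.add("Mocha")
Pkg.pin("Mocha", v"0.1.2")

# Enable the CUDA backend before loading Mocha.
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha
```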
Results from this setup on the CIFAR10 example:
```
29-Oct 15:22:43:INFO:root: Accuracy (avg over 10000) = 78.8200%
29-Oct 15:22:43:INFO:root:---------------------------------------------------------
29-Oct 15:22:43:INFO:root:
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-train
29-Oct 15:22:43:DEBUG:root:Destroying network CIFAR10-test
29-Oct 15:22:43:INFO:root:Shutting down CuDNN backend...
29-Oct 15:22:43:INFO:root:CuDNN Backend shutdown finished!

real    19m13.617s
user    14m1.893s
sys     5m12.049s
```
There are build specifics required to make this combination work, which I'm going to document in a new tutorial on training Mocha in the cloud. When the new tutorial is up I'll close this comment.
@pluskid there is an unmistakable build error in the compatibility.jl file, related to the way you are trying to identify the BLAS library. Any version of Julia past 0.4.7 will not build Mocha correctly, which is probably the culprit for the failing Travis builds. This might belong in a separate issue, but it was the last hurdle I had to clear in order to find a working combination on AWS.
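For context, a version guard along these lines is one way to handle the relocated BLAS lookup (a hypothetical sketch, not Mocha's actual code; the exact symbol locations before and after v0.5 are my assumption):

```julia
# Hypothetical sketch: query the BLAS vendor in a way that survives
# the Julia 0.4 -> 0.5 API reorganization.
blas_vendor = if VERSION < v"0.5.0-"
    Base.blas_vendor()    # assumed location in Julia 0.4.x
else
    Base.BLAS.vendor()    # assumed location in Julia 0.5+
end
```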
Thanks! I'm a bit busy recently. Will take a look at the blas issue when I have a chance. Could you open an issue for that for tracking?
@pluskid I had a busy week and couldn't get back to Mocha until now. This weekend I'll create a new issue describing the BLAS problem with enough detail that you should be able to get it fixed.
I also found a few minor errors in my last tutorial.
After using lots of deep learning frameworks, I find that I enjoy programming ML in Julia, and I prefer Mocha's syntax over MXNet's. Therefore, I'm interested in growing the Mocha documentation and examples from their current state to the point where Mocha is a platform for learning deep learning.
With this goal in mind, I wrote an extension to the MNIST tutorial earlier this month that provided an example of learning curves.
I'd now like to write an extension of the CIFAR-10 tutorial that shows how to train the model on AWS with a GPU instance. However, I'm having trouble running the GPU backend. I have a p2.xlarge instance provisioned with the NVIDIA tools and an Ubuntu 14.04 OS.
`nvidia-smi` runs correctly. It fails on `Pkg.test("Mocha")` with the following output:

I'm going to keep troubleshooting this, but if anyone has a successful path for running Mocha on AWS, please let me know your setup.
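While debugging, a smoke test of the GPU backend on its own, separate from the full test suite, can help narrow down where the failure lives (a sketch using Mocha's documented backend API; it assumes CUDA and cuDNN are installed on the instance):

```julia
# Enable the CUDA backend before loading Mocha.
ENV["MOCHA_USE_CUDA"] = "true"
using Mocha

# If init/shutdown succeed, the CUDA/cuDNN toolchain is at least
# loadable, and the failure is more likely inside the test suite.
backend = GPUBackend()
init(backend)
shutdown(backend)
```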