pluskid / Mocha.jl

Deep Learning framework for Julia

Since upgrading to Ubuntu 16.04 LTS, Mocha tests fail #195

Closed phiber1 closed 6 years ago

phiber1 commented 8 years ago

I've recently updated to Ubuntu 16.04 LTS, with CUDA 7.5 and gcc/g++ 5.3.1.

With the backend set to use GPU, Pkg.test("Mocha") fails as follows:

Pkg.test("Mocha")
INFO: Testing Mocha
Configuring Mocha...

failed process: Process(/usr/bin/julia --check-bounds=yes --code-coverage=none --color=yes /home/ENG.pvt/mark/.julia/v0.4/Mocha/test/runtests.jl, ProcessExited(1)) [1]

ERROR: Mocha had test errors
 in test at ./pkg/entry.jl:803
 in anonymous at ./pkg/dir.jl:31
 in cd at ./file.jl:22

phiber1 commented 8 years ago

Sorry, long day. I pasted the wrong error. That previous error was obviously stemming from the fact that I needed to recompile kernels.cu after updating everything.

THIS (below) is the failing error in Pkg.test("Mocha"):

-- Testing convolution layer with shared param on Mocha.GPUBackend{Float64}...
25-Apr 17:25:34:INFO:root:Constructing net test-shared-params on Mocha.GPUBackend...
25-Apr 17:25:34:INFO:root:Topological sorting 5 layers...
25-Apr 17:25:34:INFO:root:Setup layers...
ERROR: LoadError: LoadError: Bad param
 [inlined code] from /home/ENG.pvt/mark/.julia/v0.4/Mocha/src/cuda/cudnn.jl:54
 in set_filter_descriptor at /home/ENG.pvt/mark/.julia/v0.4/Mocha/src/cuda/cudnn.jl:198
while loading /home/ENG.pvt/mark/.julia/v0.4/Mocha/test/layers/shared-parameters.jl, in expression starting on line 50
while loading /home/ENG.pvt/mark/.julia/v0.4/Mocha/test/runtests.jl, in expression starting on line 85
====================================================[ ERROR: Mocha ]====================================================

failed process: Process(/usr/bin/julia --check-bounds=yes --code-coverage=none --color=yes /home/ENG.pvt/mark/.julia/v0.4/Mocha/test/runtests.jl, ProcessExited(1)) [1]

ERROR: Mocha had test errors
 in test at ./pkg/entry.jl:803
 in anonymous at ./pkg/dir.jl:31
 in cd at ./file.jl:22

pluskid commented 8 years ago

What version of cuDNN are you using? If you updated CUDA, it is also recommended to update cuDNN to the latest version (v4).

phiber1 commented 8 years ago

The latest version is 5, not 4. Which is probably the cause of the error. On Ubuntu 16.04 LTS, CUDA SDK is 7.5.18, and cuDNN version is 5.0.4.
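For anyone checking which cuDNN version is actually being loaded, the integer returned by `cudnnGetVersion()` can be decoded into major, minor and patch numbers. The Python sketch below assumes the encoding used by the `CUDNN_VERSION` macro in cuDNN releases of this era (major*1000 + minor*100 + patchlevel, so 5.0.4 would be 5004); the helper names `decode_cudnn_version` and `installed_cudnn_version` and the default library name `libcudnn.so` are illustrative assumptions, not part of Mocha, and the encoding should be confirmed against the installed `cudnn.h`.

```python
import ctypes


def decode_cudnn_version(version):
    """Split a cudnnGetVersion() integer into (major, minor, patch).

    Assumes the encoding major*1000 + minor*100 + patch, as used by the
    CUDNN_VERSION macro in cuDNN v4/v5-era headers.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def installed_cudnn_version(libname="libcudnn.so"):
    """Return (major, minor, patch) of the cuDNN library, or None if it cannot be loaded.

    `libname` is an assumption; the actual file name depends on the system.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


# The version reported in this thread, expressed as the encoded integer.
print(decode_cudnn_version(5004))
```

Calling `installed_cudnn_version()` on a machine without cuDNN simply returns `None`, so the check is safe to run before configuring the GPU backend.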

pluskid commented 8 years ago

Oh, I see. v5 has not been released as stable yet, and it introduces quite a few incompatible changes, so I am not sure whether Mocha.jl runs with cuDNN v5 yet. That might well be the issue.

phiber1 commented 8 years ago

Just checking in... Any word on cuDNN version 5 support? Even though version 5 is still a release candidate, it's been the default installed version in Ubuntu's repository for almost a month now.

davidparks21 commented 8 years ago

I was curious, actually: is there a good reason not to simply include the libraries compiled for the major OSs directly with Mocha and just load the appropriate included native library? I took a peek at the NVIDIA license and it seems to allow it. It seems like that might streamline installation. It seems to be MATLAB's approach, for example. But perhaps I'm not aware of some complexity?

pluskid commented 8 years ago

@davidparks21 The library files are very big; e.g., libcudnn.so.4 by itself is 60 MB. Plus, I am not really sure about the licensing: even to download the files you need to register for an NVIDIA developer account (though registration is free). Also, the dependency is not only on the library but also on the CUDA drivers.

phiber1 commented 8 years ago

At the very least, you should get NVIDIA to list Mocha on their main cuDNN page, and on the cuDNN frameworks page here: https://developer.nvidia.com/deep-learning-frameworks

Julia is a serious language gaining traction with those of us who actually still care about high performance (my original go-to language has traditionally been CUDA Fortran), and its support for CUDA via Mocha is something that should be more widely publicized. People are much more likely to use Mocha because it will seem familiar to Caffe users, much more so than MXNet. Something to think about.

phiber1 commented 8 years ago

I finally had time to actually look into the cause of this issue... Putting aside the new RNN support in cuDNN v5 (which would be nice to add separately), the direct cause of the aforementioned error is the new datatype cudnnTensorFormat_t and its use as an argument in all the get/set descriptor functions. This wasn't present in v4. I'm also noticing the addition of a cudnnScaleTensor(). I'm sure there are some others; this was just a quick glance. I can understand the tedium in having to deal with this, since it basically means poring over the current v5 cudnn_library.pdf docs along with the release notes and comparing them with the v4 docs, looking for changes/additions. I would offer to help, but my time is at a premium at present, and if you're planning on doing this anyway in short order, I can wait a little longer.
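The signature change described above can be sketched as follows. This Python snippet only models the argument lists and does not call cuDNN; it assumes that the v5 form of `cudnnSetFilter4dDescriptor` takes a `cudnnTensorFormat_t` value between the data type and the filter dimensions, while the v4 form does not. The helper `filter4d_args` is hypothetical, and the enum values are assumptions to be checked against the installed `cudnn.h`.

```python
# Enum values as assumed from cudnn.h; verify against the installed header.
CUDNN_DATA_FLOAT = 0
CUDNN_DATA_DOUBLE = 1
CUDNN_TENSOR_NCHW = 0


def filter4d_args(cudnn_major, data_type, k, c, h, w, fmt=CUDNN_TENSOR_NCHW):
    """Build the arguments that follow the descriptor handle.

    Hypothetical helper: for cuDNN v5 and later the tensor format is
    inserted after the data type; for v4 it is omitted.
    """
    if cudnn_major >= 5:
        return (data_type, fmt, k, c, h, w)
    return (data_type, k, c, h, w)


# Same filter (8 output channels, 3 input channels, 5x5 kernel) under both APIs.
print(filter4d_args(4, CUDNN_DATA_DOUBLE, 8, 3, 5, 5))
print(filter4d_args(5, CUDNN_DATA_DOUBLE, 8, 3, 5, 5))
```

If a wrapper passes the shorter v4-style list to a v5 library, every value after the data type lands in the wrong parameter slot, which is one plausible way to end up with the "Bad param" status seen in `set_filter_descriptor`.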

pluskid commented 8 years ago

@phiber1 Thanks! Users are more than welcome to contribute, but personally I will probably not have time to follow up all the breaking changes before they officially release cuDNN v5.

pluskid commented 6 years ago

Mocha is now updated for cuDNN v5.1 and CUDA 8.