pluskid / Mocha.jl

Deep Learning framework for Julia

Cuda exception #199

Open simsim314 opened 8 years ago

simsim314 commented 8 years ago

I get this error any time I try to run CUDA in Mocha. I've confirmed it with several combinations of cuDNN and CUDA versions (specifically cuDNN v3 and v4 with CUDA 7.0 and 7.5). Each time, I recompiled the kernels with the VS2013 compiler, switched the paths in the environment variables, and restarted Windows (I work on Windows 10 with Julia 0.4.5 - everything is updated, and Mocha is compiled).

I've successfully run training with the MOCHA_USE_NATIVE_EXT flag, and I don't think I did anything too stupid in the GPU setup (I set ENV["MOCHA_USE_CUDA"] = "true" before using Mocha). The GPU setup also appears to start correctly (initialization works), but then I get an error.

CUDA enabled [DEV=0](MOCHA_USE_CUDA environment variable detected)
Native Ext disabled by default
Mocha configured, continue loading module...
DefaultBackend = Mocha.GPUBackend
07-May 13:52:21:INFO:root:Initializing CuDNN backend...
07-May 13:52:22:INFO:root:CuDNN backend initialized!
07-May 13:52:23:INFO:root:Constructing net MNIST-train on Mocha.GPUBackend...
. . . .The net params . . .
Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff984bacc0d -- unknown function (ip: 00007FF984BACC0D)
cudnnAddTensor at (unknown line)
add_tensor at C:\Users\SimSim314.julia\v0.4\Mocha\src\cuda/cudnn.jl:54
jlcall_add_tensor_2489 at (unknown line)
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
forward at C:\Users\SimSim314.julia\v0.4\Mocha\src\cuda\layers/convolution.jl:89
jlcall_forward_2481 at (unknown line)
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
forward at C:\Users\SimSim314.julia\v0.4\Mocha\src\net.jl:148
jlcall_forward_2378 at (unknown line)
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
init_solve at C:\Users\SimSim314.julia\v0.4\Mocha\src\solvers.jl:189
solve at C:\Users\SimSim314.julia\v0.4\Mocha\src\solvers.jl:234
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_expr at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
jl_interpret_toplevel_thunk_with at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
jl_eval_with_compiler_p at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
jl_parse_eval_all at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
jlload at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
include at boot.jl:261
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
include_from_node1 at loading.jl:320
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
process_options at client.jl:280
_start at client.jl:378
jlcall__start_505 at (unknown line)
jl_apply_generic at C:\Users\SimSim314\AppData\Local\Julia-0.4.5\bin\libjulia.dll (unknown line)
unknown function (ip: 00000000004018D0)
unknown function (ip: 00000000004028AB)
unknown function (ip: 000000000040140C)
unknown function (ip: 000000000040153B)
BaseThreadInitThunk at C:\WINDOWS\system32\KERNEL32.DLL (unknown line)
RtlUserThreadStart at C:\WINDOWS\SYSTEM32\ntdll.dll (unknown line)

davidparks21 commented 8 years ago

I just got Mocha/CUDA running on Windows 10 recently, using CUDA libraries 7.5.18 and CuDNN v4. I did have issues because Visual Studio 2013 (not 2015) was not installed BEFORE I installed the CUDA libraries, so if that's your situation, go back and re-install VS2013 and then re-install the CUDA libraries.

Also, can you run nvidia-smi successfully? That should be a basic sanity check for the CUDA installation. It sounds like Mocha is finding the libraries on your system, so I'm betting on an issue with the CUDA installation, which was a little finicky for me too.

You might also run Pkg.test("Mocha"); those tests should pass once everything is working well.

Oh, just thought of one other possibility: I think one of the Windows DLLs for CuDNN might have been added after the release version, so if you're on the release version of Mocha, maybe it's finding an unexpected DLL on the path (particularly possible if you have multiple CuDNN installations). src/cuda/cudnn.jl performs the following search for CuDNN DLLs:

Libdl.find_library(["cudnn64_70.dll", "cudnn64_65.dll", "cudnn32_70.dll", "cudnn32_65.dll", "cudnn64_4.dll"], [""])

Make sure that command is finding the library you expect. For me that's cudnn64_4.dll. Double check that the source you're working from includes that dll; if not, check out the latest build with:

Pkg.rm("Mocha")
Pkg.clone("https://github.com/pluskid/Mocha.jl.git")
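Since multiple CuDNN installations on the path are one of the suspects here, it can help to list every cuDNN DLL that PATH exposes. The following is a minimal sketch under stated assumptions, not part of Mocha: `find_cudnn_dlls` is a hypothetical helper name, and the code is written for current Julia (`Sys.iswindows`, `occursin`), so it would need adapting for Julia 0.4.

```julia
# Hypothetical helper (not a Mocha function): list every file on PATH
# whose name looks like a cuDNN DLL, so duplicate installs are easy to spot.
function find_cudnn_dlls(path_var = get(ENV, "PATH", ""))
    hits = String[]
    sep = Sys.iswindows() ? ';' : ':'
    for dir in split(path_var, sep)
        isdir(dir) || continue          # skip empty or stale PATH entries
        files = try
            readdir(dir)
        catch
            continue                    # unreadable directory: ignore it
        end
        for file in files
            # Case-insensitive match on names such as cudnn64_4.dll
            if occursin(r"^cudnn.*\.dll$"i, file)
                push!(hits, joinpath(dir, file))
            end
        end
    end
    return hits
end

for dll in find_cudnn_dlls()
    println(dll)
end
```

If this prints more than one DLL, the order of the directories on PATH (and the order of names in the `find_library` call) decides which one actually gets picked up.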

MichaelSimkinCortica commented 8 years ago

I've installed and reinstalled CUDA several times. Asking me to re-install VS is kind of a big request, as it takes "forever", so I would prefer to avoid doing it just to "make sure". The CUDA add-ons are installed in VS2013, and with VS2015 you can't compile the Mocha kernels because CUDA's nvcc doesn't support it. I've also hacked PATH and CUDA_PATH to make sure only one CUDA version is visible to "outsiders".

Pkg.test("Mocha") passes all the tests with GPU enabled, in all 4 configurations I mentioned (CUDA 7.0 and 7.5 x cudnn v3 and v4).

cudnn is working fine - I even hacked find_library to see if v5 of cudnn would work, and it didn't, so I went back to cudnn v4 and v3. Both worked in the tests (and returned the correct cudnn library) but threw the same exception in the actual run.

I've updated Mocha.jl (and all other jl packages) several times, so everything is brand new. I can try the rm+clone option, but my point is that I've avoided all the basic "stupid" mistakes: I've tried several options and checked that none of them has something dumb in it. The error goes all the way down to the Windows KERNEL32.DLL and ntdll.dll (which doesn't necessarily prove anything; I'm just saying it's something deeper than a "find_library" mistake).

I can only say that CNTK is working fine on my system, with cudnn. So it's probably not related to definitions, hardware, or installation - it's something in the Julia core (or the Mocha core - not sure).

davidparks21 commented 8 years ago

It does sound like you've hit the important points. It sounds like you ran into just about every gotcha I bumped into, and resolved them in a way that sounds logical enough to me.

This line still makes me a bit suspicious though.

Exception: EXCEPTION_ACCESS_VIOLATION at 0x7ff984bacc0d -- unknown function (ip: 00007FF984BACC0D)
cudnnAddTensor at (unknown line)
add_tensor at C:\Users\SimSim314.julia\v0.4\Mocha\src\cuda/cudnn.jl:54

I did a little poking around on Google and found this seemingly related discussion:

https://devtalk.nvidia.com/default/topic/863107/problem-with-cudnnaddtensor-function/

It's certainly possible I'm off base here, but my intuition is that CuDNN v3 is being loaded, and I only know for sure that v4 works for me. I have no experience with v3. I did try v5, and the error I saw there wasn't exactly this one, but it was suspiciously similar.

What exactly is the output of running this command from the Julia REPL? I know you set the path, but this is the end-all verification that the expected dll is being loaded.

Libdl.find_library(["cudnn64_70.dll", "cudnn64_65.dll", "cudnn32_70.dll", "cudnn32_65.dll", "cudnn64_4.dll"], [""])
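`find_library` only returns the name of the first candidate that loads, trying the names in the order given. To also see where that DLL lives on disk, the handle can be opened and its path queried. This is a minimal sketch and not Mocha code: `resolve_library` is a hypothetical helper name, and it is written for current Julia, where `Libdl` is a standard library (on Julia 0.4 the `using Libdl` line should not be needed).

```julia
using Libdl

# Same candidate list as the find_library call quoted above.
candidates = ["cudnn64_70.dll", "cudnn64_65.dll", "cudnn32_70.dll",
              "cudnn32_65.dll", "cudnn64_4.dll"]

# Hypothetical helper: return (name, full_path) of the first candidate that
# loads, or `nothing` if none of them can be found.
function resolve_library(names)
    name = Libdl.find_library(names, [""])
    isempty(name) && return nothing     # find_library returns "" on failure
    handle = Libdl.dlopen(name)
    path = Libdl.dlpath(handle)         # absolute path of the loaded library
    Libdl.dlclose(handle)
    return (name, path)
end

result = resolve_library(candidates)
if result === nothing
    println("No cuDNN DLL from the candidate list could be loaded")
else
    println("Loaded ", result[1], " from ", result[2])
end
```

Because the first match wins, an older DLL listed earlier would be selected even when cudnn64_4.dll is also on the path.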

Also, take a look at src/cuda/cudnn.jl (~line 41) and compare what I pasted above to what you have. I believe there were some recent commits that changed that line, so it's possible your copy differs from mine on that particular line of code. It's certainly the line to be suspicious of.

Incidentally, the configuration I have working is: Windows 10 workstation, CUDA 7.5 toolkit (64-bit Windows), CuDNN v4.