uncomplicate / deep-diamond

A fast Clojure Tensor & Deep Learning library
https://aiprobook.com
Eclipse Public License 1.0

What version of CUDA Toolkit does Deep diamond work with? #12

Closed (qeshi closed this issue 3 years ago)

qeshi commented 3 years ago

Hi!

I'm trying to run Deep Diamond on an NVIDIA GPU.

When I start a REPL in deep-diamond and try to load uncomplicate.diamond.internal.cudnn.factory, I get an exception. The form I run is (use 'uncomplicate.diamond.internal.cudnn.factory).

With the cuda_11.3.0_465.19.01_linux installer, the exception says:

Stack trace from the attempt to load the library as a resource:
java.lang.UnsatisfiedLinkError: /tmp/libJCudnn-11.2.0-linux-x86_64.so: libcudnn.so.8: cannot open shared object file: No such file or directory

If I understand correctly, JCuda depends on a specific version of the NVIDIA toolkit and drivers, so I need version 11.2.0 in order to use JCuda 11.2.0.
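
The error above names libcudnn.so.8 specifically, so before changing toolkit versions it may be worth checking whether the dynamic linker can find that file at all. The following is only a minimal sketch: `soname_in_cache` and `find_soname` are hypothetical helper names, and it assumes `ldconfig` is on the PATH (on some systems it lives in /sbin).

```shell
#!/bin/sh
# soname_in_cache (hypothetical helper): succeed if the soname given as $1
# is the first field of some line in an `ldconfig -p` style listing on stdin.
soname_in_cache() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# find_soname (hypothetical helper): report whether the linker cache
# contains the given soname. Assumes `ldconfig` is available on the PATH.
find_soname() {
    name="$1"
    if ldconfig -p | soname_in_cache "$name"; then
        echo "$name: found in linker cache"
    else
        echo "$name: not in linker cache (check LD_LIBRARY_PATH or install it)"
    fi
}
```

For example, `find_soname libcudnn.so.8` reports whether the library named in the stack trace is visible to the linker cache. A library reachable only through LD_LIBRARY_PATH will not show up here, so a "not in linker cache" result is a hint, not proof.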

But if I install cuda_11.2.0_460.27.04_linux.run or cuda_11.2.2_460.32.03_linux.run, I get a different exception:

Execution error (UnsatisfiedLinkError) at java.lang.ClassLoader$NativeLibrary/load0 (ClassLoader.java:-2).
/home/ubuntu/.javacpp/cache/opencl-3.0-1.5.5-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so: /usr/local/cuda-11.2/targets/x86_64-linux/lib/libOpenCL.so: version `OPENCL_2.2' not found (required by /home/ubuntu/.javacpp/cache/opencl-3.0-1.5.5-linux-x86_64.jar/org/bytedeco/opencl/linux-x86_64/libjniOpenCL.so)

This seems to happen because the libOpenCL.so shipped with CUDA 11.2.0 does not provide the OPENCL_2.2 symbol version.

If I run the nm command on the libOpenCL.so located in /usr/local/cuda-11.2/targets/x86_64-linux/lib, I get this output:

nm -gDC libOpenCL.so

0000000000000000 A OPENCL_1.0
0000000000000000 A OPENCL_1.1
0000000000000000 A OPENCL_1.2
0000000000000000 A OPENCL_2.0
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
                 w _Jv_RegisterClasses
                 w __cxa_finalize
                 w __gmon_start__
00000000000033b0 T clBuildProgram
... truncated output

It doesn't have the OPENCL_2.2 version that the JNI library requires.

But if I run the same command on the CUDA 11.3 version of libOpenCL.so, OPENCL_2.2 is included:

ubuntu@ip-xxx:/usr/local/cuda-11.3/targets/x86_64-linux/lib$ nm -gDC libOpenCL.so
0000000000000000 A OPENCL_1.0
0000000000000000 A OPENCL_1.1
0000000000000000 A OPENCL_1.2
0000000000000000 A OPENCL_2.0
0000000000000000 A OPENCL_2.1
0000000000000000 A OPENCL_2.2
0000000000000000 A OPENCL_3.0
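
The manual comparison of the two nm listings above could be scripted roughly as follows. This is a sketch, not a definitive check: `has_version_tag` and `check_opencl_lib` are hypothetical helper names, and it assumes version tags appear as absolute ("A") symbols exactly as in the output shown in this thread.

```shell
#!/bin/sh
# has_version_tag (hypothetical helper): succeed if an nm listing on stdin
# contains an absolute ("A") symbol whose name equals $1, e.g. OPENCL_2.2.
has_version_tag() {
    awk -v tag="$1" '$2 == "A" && $3 == tag { found = 1 } END { exit !found }'
}

# check_opencl_lib (hypothetical helper): run nm on a library file and
# report whether it defines the requested version tag (default OPENCL_2.2).
check_opencl_lib() {
    lib="$1"
    tag="${2:-OPENCL_2.2}"
    if [ ! -r "$lib" ]; then
        echo "$lib: cannot read file"
        return 2
    fi
    if nm -gDC "$lib" | has_version_tag "$tag"; then
        echo "$lib: $tag present"
    else
        echo "$lib: $tag missing"
    fi
}
```

Running `check_opencl_lib /usr/local/cuda-11.2/targets/x86_64-linux/lib/libOpenCL.so` and then the same command with the cuda-11.3 path would reproduce the comparison made by hand above.
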

blueberry commented 3 years ago

That OpenCL exception was introduced recently by JavaCPP's DNNL bindings, probably because they now enable Intel's GPU support over OpenCL automatically. I solved it (on my system) by installing Intel's OpenCL runtime (for CPU support, in my case) and a generic OpenCL 3.0 loader (the ocl-icd package on Arch Linux), alongside the CUDA OpenCL that you already have.
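
To see which OpenCL pieces are installed after a change like this, something along these lines may help. It is a sketch under assumptions: `list_icds` and `list_opencl_loaders` are hypothetical helper names, /etc/OpenCL/vendors is assumed to be where vendor ICD files are registered (the usual location on Linux, but distributions can differ), and `ldconfig` is assumed to be on the PATH.

```shell
#!/bin/sh
# list_icds (hypothetical helper): print each vendor .icd file in a
# directory together with the first line of the file, which normally names
# the vendor's OpenCL library. The default directory is an assumption;
# pass another one as $1 if your system registers ICDs elsewhere.
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    for f in "$dir"/*.icd; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
}

# list_opencl_loaders (hypothetical helper): show every libOpenCL.so entry
# in the dynamic linker cache. Assumes `ldconfig` is on the PATH.
list_opencl_loaders() {
    ldconfig -p | grep -F 'libOpenCL.so'
}
```

Calling `list_icds` and then `list_opencl_loaders` shows the registered vendor runtimes and the loader libraries the linker cache knows about. Directories listed in LD_LIBRARY_PATH are searched before that cache, so a CUDA lib directory exported there can still shadow a newer loader.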

blueberry commented 3 years ago

BTW, if the problem is caused only by the 2.2/3.0 mismatch, I expect it to resolve itself once JCuda is updated to 11.3, which should happen soon enough.