jcuda / jcuda-main

Summarizes the main JCuda libraries
MIT License

JCublas2 problem with 9.0 on maven (wrong version number somewhere?) #24

Closed luigirocca closed 6 years ago

luigirocca commented 6 years ago

Hi everyone.

I'm trying to make my project work with the new 9.0 maven deployment, since it should be more convenient for us moving onward (as described here https://github.com/jcuda/jcuda-main/issues/21 ).

After solving a bunch of issues (mainly, the fact that the OS and arch must be given manually to my sbt project, as per issue https://github.com/jcuda/jcuda-main/issues/22 ), everything downloads and compiles, but I am stuck with the following error at runtime:

[info]   java.lang.UnsatisfiedLinkError: Error while loading native library "JCublas2-0.9.0-linux-x86_64"
...
[info] Stack trace from the attempt to load the library as a file:
[info] java.lang.UnsatisfiedLinkError: no JCublas2-0.9.0-linux-x86_64 in java.library.path
...
[info] Stack trace from the attempt to load the library as a resource:
[info] java.lang.UnsatisfiedLinkError: /tmp/libJCublas2-0.9.0-linux-x86_64.so: libcublas.so.9.1: cannot open shared object file: No such file or directory

I have the right library in my cache, downloaded from maven:

$ jar -tf .ivy2/cache/org.jcuda/jcublas-natives/jars/jcublas-natives-0.9.0-linux-x86_64.jar
META-INF/MANIFEST.MF
META-INF/
lib/
META-INF/maven/
META-INF/maven/org.jcuda/
META-INF/maven/org.jcuda/jcublas-natives/
lib/libJCublas-0.9.0-linux-x86_64.so
lib/libJCublas2-0.9.0-linux-x86_64.so
META-INF/maven/org.jcuda/jcublas-natives/pom.xml
META-INF/maven/org.jcuda/jcublas-natives/pom.properties

I think that the problem is that JCublas2 searches for a libcublas.so.9.1 on my system, but I only have libcublas.so.9.0 - which is right, given that the current supported version with JCuda should be 9.0 and not 9.1. Everything else is 9.0 and it must be so for everything to keep working. What's happening here?

Any ideas?
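One way to double-check which cuBLAS version the native library was linked against is to list its NEEDED entries. This is only a minimal sketch: it assumes `readelf` from binutils is installed, and `needed_libs` is a made-up helper name, not part of JCuda.

```shell
#!/bin/sh
# Hypothetical helper (not part of JCuda): reads `readelf -d` output on
# stdin and prints the name of each shared library listed as NEEDED.
needed_libs() {
  sed -n 's/.*(NEEDED).*\[\(.*\)\].*/\1/p'
}

# Usage (path taken from the error message above; adjust as needed):
#   readelf -d /tmp/libJCublas2-0.9.0-linux-x86_64.so | needed_libs
```

If the list contains libcublas.so.9.1, the binary was linked on a machine with CUDA 9.1, whatever the version in its file name says.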

jcuda commented 6 years ago

OK, this is bad :-(

This is exactly what I was afraid of when I asked about that in https://github.com/jcuda/jcuda-main/issues/21#issuecomment-357018569

@blueberry Any idea how this could be solved cleanly?

I'm afraid that, in the worst case, I'll have to look for a contributor who can provide the linux binaries for 0.9.0 that are actually linked against the CUDA 9.0 binaries, and create a new release.

At least, now we know for sure that it does not work this way :-/

blueberry commented 6 years ago

@luigirocca Is there a reason you can not install 9.1?

@jcuda @luigirocca I'm not sure what the best way is. If it were my machine, this is an approximate list of fixes, ordered by my preference:

  1. Fix JCublas build to link against libcublas.so instead of the explicit version.
  2. Install CUDA 9.1, which is the latest version that should be fully compatible with 9.0
  3. Create a link named libcublas.so.9.1 that points to libcublas.so.9.0
     2a. Some alternative way of adapting library names that is defined in POSIX/Linux. I am not that knowledgeable, so I am not sure what is the recommended mechanism, but I am sure that there is (at least) one standard way, since this is a fairly common issue.
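The symlink option could be sketched roughly as follows. `make_compat_link` is a hypothetical helper name, and the directory in the usage line is an assumption about where CUDA is installed. Whether a 9.0 binary actually behaves correctly under the 9.1 name is not established here, so treat this as an experiment rather than a fix.

```shell
#!/bin/sh
# Hypothetical helper: inside directory $1, create a symlink named $3
# that points at the existing library file $2.
make_compat_link() {
  dir=$1; target=$2; link=$3
  # Refuse to create a dangling link if the target is not there.
  [ -e "$dir/$target" ] || { echo "missing: $dir/$target" >&2; return 1; }
  ln -s "$target" "$dir/$link"
}

# Usage (directory is an assumption; writing there usually needs root):
#   make_compat_link /usr/local/cuda/lib64 libcublas.so.9.0 libcublas.so.9.1
```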

By explicitly linking to 9.0 we then lose the compatibility with 9.1, which is the version that most people install since this is what Nvidia CUDA download currently offers as a default...

jcuda commented 6 years ago

Any workaround that will have to be done manually at the users' site is not really feasible. People will declare the dependency to JCublas via a POM, and it basically has to work out-of-the-box, without renaming files or creating symbolic links.

Not having a working JCuda version for CUDA 9.0 isn't really desirable either.

In doubt, I'll try to install CUDA 9.0 on a linux VM and see whether I can compile the libraries there. (I'll not be able to try this before Monday, though). One issue is that even if I manage to compile the lib (which may or may not work), I'll not be able to test it on the VM, but maybe someone can test it on his machine before it is uploaded to Maven Central.

blueberry commented 6 years ago

The usual solution to this is that virtually every unix/linux distro has libx.so link in addition to more specific links like libx.so.1 etc.

I can test the lib you build. Even better, I'll test it with CUDA 9.1, so we will check how portable it is.

jcuda commented 6 years ago

Then I wonder why there was a specific reference to libcublas.so.9.1. From what I've read so far, the "SONAME" (as e.g. mentioned on https://cmake.org/cmake/help/v3.3/command/target_link_libraries.html ) should be the "generic" one, namely libcublas.so (with the links to the versioned ones, libcublas.so.9 and libcublas.so.9.1). However, until now, it did not seem necessary to treat this case, and I hope that it won't be necessary to tweak the (intimidatingly complex) https://github.com/jcuda/jcuda-common/blob/master/CMake/FindCUDA.cmake file in any way...

EDIT: I started reading more at http://www.kaizou.org/2015/01/linux-libraries/ , but still hope that someone can give a pointer to how this can be solved nicely (regardless of that, I'll try to set up the VM for compiling for CUDA 9.0 ASAP)

blueberry commented 6 years ago

For the libraries that I compiled, there was nothing special to do: I usually link against libx.so (-lx; no matter whether with plain gcc, or with a more sophisticated build tool such as make or a Java-based Maven plugin), and everything else related to the actual library name+version was handled by the system.

jcuda commented 6 years ago

Well, the libJCublas2-0.9.0-linux-x86_64.so contains a specific reference to libcublas.so.9.1. Regardless of where this comes from (and how it could be avoided in future binaries), the current one does not work for CUDA 9.0.

(So in this case, the fact that I never have enough time is nearly a "good thing": At least the broken version 0.9.0 was not yet mentioned in the README or the website. But it is in Maven Central, and this is bad. I'll try to set up a VM for compiling today)

blueberry commented 6 years ago

I've just skimmed FindCUDA, and it is rather complex indeed. However, most of that file deals with detecting numerous historical combinations of versions and features. On the other hand, does JCuda need all that? As I understand:

  1. JCuda requires a very recent version of CUDA (at least in build time). Right now that means 9+.
  2. All that setup is really needed only on the machines that build JCuda binaries, and it is reasonable to expect that on those machines, CUDA 9 is present or is possible to install. There is no need to test whether there is CUDA 4.1+ present etc.

Is it possible to just test whether there is CUDA on the machine, and set the right include and lib directories where appropriate?

I don't mind even having to specify these directories manually in cmake-gui or wherever is the best place, if that would mean predictable and explicit build. Another option is also to expect of the person that builds the JCuda binary to have those cuda directories included in the path in the environment, so they are generally available.

Anyway, I'm ok with any solution that you think is best. Ping me when there is a build to test.

blueberry commented 6 years ago

This is the output of grepping the jcuda.build directory left from the JCuda build for libcublas.so:

./jcublas/JCublas2JNI/bin/CMakeFiles/JCublas2.dir/build.make:96:/home/dragan/workspace/java/jcuda/jcublas/nativeLibraries/linux/x86_64/lib/libJCublas2-0.9.0-linux-x86_64.so: /usr/local/cuda/lib64/libcublas.so
./jcublas/JCublas2JNI/bin/CMakeFiles/JCublas2.dir/link.txt:1:/usr/bin/g++-6 -fPIC   -shared -Wl,-soname,libJCublas2-0.9.0-linux-x86_64.so -o /home/dragan/workspace/java/jcuda/jcublas/nativeLibraries/linux/x86_64/lib/libJCublas2-0.9.0-linux-x86_64.so CMakeFiles/JCublas2.dir/src/JCublas2.cpp.o -Wl,-rpath,/usr/local/cuda/lib64 /usr/local/cuda/lib64/libcudart_static.a -lpthread -lrt -ldl /usr/local/cuda/lib64/libcublas.so ../../../lib/libJCudaCommonJNI.a 
./jcublas/JCublasJNI/bin/CMakeFiles/JCublas.dir/build.make:96:/home/dragan/workspace/java/jcuda/jcublas/nativeLibraries/linux/x86_64/lib/libJCublas-0.9.0-linux-x86_64.so: /usr/local/cuda/lib64/libcublas.so
./jcublas/JCublasJNI/bin/CMakeFiles/JCublas.dir/link.txt:1:/usr/bin/g++-6 -fPIC   -shared -Wl,-soname,libJCublas-0.9.0-linux-x86_64.so -o /home/dragan/workspace/java/jcuda/jcublas/nativeLibraries/linux/x86_64/lib/libJCublas-0.9.0-linux-x86_64.so CMakeFiles/JCublas.dir/src/JCublas.cpp.o -Wl,-rpath,/usr/local/cuda/lib64 /usr/local/cuda/lib64/libcudart_static.a -lpthread -lrt -ldl /usr/local/cuda/lib64/libcublas.so ../../../lib/libJCudaCommonJNI.a 
./CMakeCache.txt:267:CUDA_cublas_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcublas.so
./CMakeCache.txt:327:JCublas2_LIB_DEPENDS:STATIC=general;/usr/local/cuda/lib64/libcudart_static.a;general;-lpthread;general;/usr/lib/librt.so;general;/usr/lib/libdl.so;general;/usr/local/cuda/lib64/libcublas.so;general;JCudaCommonJNI;
./CMakeCache.txt:336:JCublas_LIB_DEPENDS:STATIC=general;/usr/local/cuda/lib64/libcudart_static.a;general;-lpthread;general;/usr/lib/librt.so;general;/usr/lib/libdl.so;general;/usr/local/cuda/lib64/libcublas.so;general;JCudaCommonJNI;

The only place where 9.1 is ever mentioned is unrelated to this, and is:

./CMakeCache.txt:264:CUDA_VERSION:STRING=9.1

jcuda commented 6 years ago

Until now I treated the FindCUDA script as a convenient black-box: Refer to it in the CMake file, and then just do cuda_add_cublas_to_target and you're done. The hassle of figuring out the CUDA SDK root directory for the different OSes, properly setting up dependencies between the CUDA libraries and setting the required include directories was hidden.

It would be OK if this could be replaced with a simple, 20-line-CMake file that just found the required libraries for JCuda. But CMake can be a b!tch: There are always 5 different ways to achieve the same goal, usually 3 of them do not work, and 2 of them don't work on every OS, so that you still need a 6th way with some specific workarounds.

(Or to put it that way: I think that with a hand-crafted JCuda-CUDA-CMake file, there would be dozens of issues like https://github.com/jcuda/jcuda/issues/3 , causing even more headaches).

In the end, FindCUDA does work as expected and desired, if the compilation is done on a system with the appropriate CUDA version installed. It's hard to figure out where and how the CUDA_VERSION:STRING=9.1 might become relevant in the build process. This is only a guess, but: It might also be that although you're only referring to the .so file, the GCC linker somehow "resolves" this name, locally, on your system, so that so.9.1 eventually ends up in the binary. And also a guess: It might be possible to avoid this sort of name resolution.

But until I have read and digested something like http://www.kaizou.org/2015/01/linux-libraries/ to gain a deeper understanding of what this "ELF" and "SONAME" stuff is about, and unless someone who is an expert with the GCC linking mechanisms proposes a profound, tested(!) and stable solution, I'd prefer the path that already worked for all previous versions - namely, to compile the binary linking against the matching CUDA binary version.
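For what it's worth, the name-resolution guess matches how the GNU linker behaves as far as I know: when it links against a shared library that carries a SONAME, it records that SONAME as the dependency, not the file name given on the command line. The sketch below tries to reproduce this with throwaway files. It assumes `gcc` and `readelf` are installed, and every "demo" name in it is made up for illustration.

```shell
#!/bin/sh
# Sketch: build a throwaway library whose SONAME carries a minor version,
# link a second library through the unversioned symlink, and print which
# name the linker recorded. All "demo" names are hypothetical.
soname_demo() {
  work=$(mktemp -d) || return 1
  (
    cd "$work" || exit 1
    printf 'int demo(void) { return 42; }\n' > demo.c
    printf 'int demo(void);\nint use(void) { return demo(); }\n' > use.c
    # The "real" file embeds the SONAME libdemo.so.9.1.
    gcc -shared -fPIC -Wl,-soname,libdemo.so.9.1 -o libdemo.so.9.1 demo.c || exit 1
    # Unversioned development symlink, analogous to libcublas.so.
    ln -s libdemo.so.9.1 libdemo.so || exit 1
    # Link via -ldemo, i.e. through the unversioned name only.
    gcc -shared -fPIC -o libuse.so use.c -L. -ldemo || exit 1
    # Print the dependency name that ended up in the binary.
    readelf -d libuse.so | sed -n 's/.*(NEEDED).*\[\(libdemo[^]]*\)\].*/\1/p'
  )
  status=$?
  rm -rf "$work"
  return $status
}

# Usage:
#   soname_demo
```

If this prints the versioned name even though only `-ldemo` was passed, then the version in the JCublas binaries would come from the SONAME embedded in the installed libcublas, not from FindCUDA or CUDA_VERSION.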

blueberry commented 6 years ago

@jcuda JCuda properly links to the "generic" cuda lib, while JCublas does not:

ldd libJCudaDriver-0.9.0-linux-x86_64.so 
    linux-vdso.so.1 (0x00007fff5a5ba000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007efd39290000)
    librt.so.1 => /usr/lib/librt.so.1 (0x00007efd39088000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007efd38e84000)
    libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007efd381f6000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007efd37e6f000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007efd37b23000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007efd3790c000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007efd37555000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007efd396e5000)
    libnvidia-fatbinaryloader.so.387.34 => /usr/lib/libnvidia-fatbinaryloader.so.387.34 (0x00007efd37303000)
ldd libJCublas-0.9.0-linux-x86_64.so 
    linux-vdso.so.1 (0x00007ffc483c9000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007ff47d3ba000)
    librt.so.1 => /usr/lib/librt.so.1 (0x00007ff47d1b2000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007ff47cfae000)
    libcublas.so.9.1 => /usr/local/cuda/lib64/libcublas.so.9.1 (0x00007ff479a17000)
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007ff479690000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007ff479344000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007ff47912d000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007ff478d76000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007ff47d801000)

jcuda commented 6 years ago

Interesting. Any idea what might be the reason for this? Do you see any structural differences regarding the libcublas.* vs. libcuda.* symlinks in your installation directory? Maybe this is somehow related to different version numbers being present due to incompatibilities. As far as I know, the main changes in CUDA 9.1 have been related to CUBLAS (including several patches that even came after 9.1 was officially released!)

jcuda commented 6 years ago

OK, I managed to compile the linux binaries for CUDA 9.0 on a VM. Of course, I cannot test them. So attached here is a package that should make it easy to test whether the libraries work in general on Linux with CUDA 9.0 ( @luigirocca ) and maybe also with CUDA 9.1 ( @blueberry )

JCu-0.9.0-LINUX_FIX_2018-03-12.zip

In order to test this, you may have to delete the JCuda .so files for version 0.9.0 from your tmp directory.
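A rough sketch of that cleanup step is below. `clean_extracted_natives` is a hypothetical helper, and the file pattern is an assumption based on names like libJCublas2-0.9.0-linux-x86_64.so, so libraries with a different prefix would need their own pattern.

```shell
#!/bin/sh
# Hypothetical helper: delete extracted JCuda native libraries of version $2
# from directory $1. The pattern is an assumption derived from names such as
# libJCublas2-0.9.0-linux-x86_64.so; other versions are left untouched.
clean_extracted_natives() {
  dir=$1; version=$2
  rm -f "$dir"/libJCu*-"$version"-*.so
}

# Usage:
#   clean_extracted_natives /tmp 0.9.0
```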

The archive contains the Java JARs and the native JARs, and a small JCudaTest.java which hopefully would show whether a library cannot be loaded at all. The compileAndStartLinuxCommandLine.txt contains the command lines for compiling and starting this test.

Note: I know, all this currently feels very crude and brittle. This is really only intended as a very basic test. If it works in general, I'll bump the version to 0.9.0b and perform a new, proper release.

blueberry commented 6 years ago

@jcuda Doesn't work with CUDA 9.1:

ldd: warning: you do not have execution permission for `./libJCublas2-0.9.0-linux-x86_64.so'
    linux-vdso.so.1 (0x00007ffefddc6000)
    libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007fd499b5f000)
    librt.so.1 => /usr/lib/librt.so.1 (0x00007fd499957000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007fd499753000)
    libcublas.so.9.0 => not found
    libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fd4993cc000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007fd499015000)
    /usr/lib64/ld-linux-x86-64.so.2 (0x00007fd49a044000)
    libm.so.6 => /usr/lib/libm.so.6 (0x00007fd498cc9000)
    libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007fd498ab2000)

(ignore the execution permission; the real issue is that it directly links to libcublas.so.9.0)
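To pick out only the unresolved entries from output like the one above, a small filter helps. This is a sketch, and `unresolved_deps` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `ldd` output on stdin and prints the name of
# every dependency that the dynamic loader could not resolve.
unresolved_deps() {
  awk '/=> not found/ { print $1 }'
}

# Usage:
#   ldd ./libJCublas2-0.9.0-linux-x86_64.so | unresolved_deps
```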

jcuda commented 6 years ago

Linking to libcublas.so.9.0 is "ok-ish" for JCuda 0.9.0, because the version numbers have been aligned since 0.2.x. If it was somehow possible to link only to libcublas.so.9 (so that it also worked with 9.1) then that would be great, but right now, I have no idea how to accomplish that.

The main goal right now is to create a release 0.9.0(b) that works properly with CUDA 9.0, to replace the version that currently is in Maven Central.

@luigirocca It would be great to hear whether it works with CUDA 9.0.

I can only try to apologize for my lack of conscientiousness here...

blueberry commented 6 years ago

I've tried manually changing version through cmake-gui from "9.1" to "9", but it doesn't affect the linking: when I build the natives it still gets linked to libcublas.so.9.1...

I hope that by "replace" you mean "release new updated version (0.9.0b)". I do not know whether it is possible to actually replace the same version in maven central, but even if it is, it goes contrary to maven's central goals of repeatable builds, since it would break hundreds of projects that use my clojure libraries, and that worked for people who simply installed the latest CUDA download (9.1)...

blueberry commented 6 years ago

@jcuda Does it work for multiple versions of CUDA on Windows? If not, maybe the (temporary) solution would be to have two versions of JCuda: 0.9.0 that targets CUDA 9.0 and JCuda 0.9.1 that targets CUDA 9.1.

luigirocca commented 6 years ago

Hi all.

Sorry for being late to the discussion, I was away from the office until today. And thanks for all the efforts! I have tried the JCuda-test zip package uploaded by @jcuda and it works on my machine (at least, it throws no error). This is the output:

$ java  -cp ".:jcublas-0.9.0.jar:jcublas-natives-0.9.0-linux-x86_64.jar:jcublas-natives-0.9.0-windows-x86_64.jar:jcuda-0.9.0.jar:jcuda-natives-0.9.0-linux-x86_64.jar:jcuda-natives-0.9.0-windows-x86_64.jar:jcufft-0.9.0.jar:jcufft-natives-0.9.0-linux-x86_64.jar:jcufft-natives-0.9.0-windows-x86_64.jar:jcurand-0.9.0.jar:jcurand-natives-0.9.0-linux-x86_64.jar:jcurand-natives-0.9.0-windows-x86_64.jar:jcusolver-0.9.0.jar:jcusolver-natives-0.9.0-linux-x86_64.jar:jcusolver-natives-0.9.0-windows-x86_64.jar:jcusparse-0.9.0.jar:jcusparse-natives-0.9.0-linux-x86_64.jar:jcusparse-natives-0.9.0-windows-x86_64.jar:jnvgraph-0.9.0.jar:jnvgraph-natives-0.9.0-linux-x86_64.jar:jnvgraph-natives-0.9.0-windows-x86_64.jar" JCudaTest

Pointer: Pointer[nativePointer=0x10206a00000,byteOffset=0]

$ echo $?
0

The situation is not optimal though. The package should work with both 9.0 and 9.1.

Some random thoughts and questions:

What do you think?

blueberry commented 6 years ago

@luigirocca Yes, I can help with testing. BTW, do you perhaps have a macOS in the office? We are currently lacking the macOS build...

luigirocca commented 6 years ago

Unfortunately we have no macOS machine. Are you sure that a macOS build makes sense right now? All recent Apple hardware ships with AMD radeons, AFAIK. This makes a macOS build quite a corner case, for example people with a custom built machine or with old machines. It makes sense to add macOS to the build only if someone explicitly asks for it and is willing to provide support for the compiling and testing processes I think.

I'm attaching a test with the libraries compiled on my machine but I don't really see how anything could change. I have tried to look into the compilation process and files but I didn't notice anything obvious about why jcublas should link against a specific version instead of generic ones like the rest. I fear that making two separate releases in maven (9.0 and 9.1) might be the only way to solve this.

JCu-0.9.0-luigirocca.tar.gz

jcuda commented 6 years ago

@blueberry

I hope that by "replace" you mean "release new updated version (0.9.0b)".

As you said: It is crucial that "What happens in Vegas Maven Central stays in Maven Central". It is not possible to replace an existing version, even if the existing version has a major flaw. So the goal is indeed to offer a newer version 0.9.0b.

This is in line with your comment:

to have two versions of JCuda: 0.9.0 that targets CUDA 9.0 and JCuda 0.9.1 that targets CUDA 9.1

This is the goal. The problem right now is that there is a release in Maven Central that behaves unexpectedly: It's version 0.9.0, supposed to work with CUDA 9.0, and it does so on Windows, but not on Linux.

I haven't yet tried whether JCuda 0.9.0 works with CUDA 9.1 on Windows. According to what luigirocca said as his last point, it works, but the dependency resolution on Windows is obviously quite different from that on Linux.

I also hesitate to update to CUDA 9.1 until this issue is resolved. There should be a version 0.9.0 for CUDA 9.0, regardless of whether it works with 9.1 or not.

Although, of course, I agree with @luigirocca :

The package should work with both 9.0 and 9.1.

This would be preferable, although I'm not familiar enough with the Linux/GCC linker magic to know for sure whether or how this can be achieved.


Regarding your questions:

  • It wasn't clear to us that CUDA 9.1 and CUDNN 7.1 are completely backward compatible with 9.0 and 7.0. In fact we were holding back the upgrade because we feared dependency problems. Can you confirm that this is not the case, and that we should be able to safely upgrade to cuda 9.1 and cudnn 7.1 and continue to use JCuda? Is this what you are currently doing on your machines? (sorry, it may be that this is a stupid question and that I'm missing something obvious here).

As mentioned above, there are some caveats regarding "compatibility". (Maybe blueberry can give a more specific answer for Linux). Intuitively, one would expect a program that is compiled for 9.0 to also run on 9.1. But the devil is in the detail, and the detail may just be this:

  • There's something strange going on here with the compilation of the library. As you all have noticed, the jcublas should be linked against the generic jcublas-9 and not against a specific minor version, as all the other libs do.

That's right. All this seems to be caused by an overly specific ...so.9.0 library dependency of which nobody knows where it comes from or how to avoid it. I'll read more about that, but would really appreciate help from someone who is more familiar with GCC/Linux/glink/ELF etc.

  • Right now we are working on locally compiled libraries, foregoing maven. I can try to run the test with our local libs and upload the zip package again. @blueberry, if you had the patience of testing again on your machine, I could then start to try modifying the compilation process and cmake stuff in the quest to make it compile right.

That's also correct. This quick fix/test was only done to see whether a JCublas library that is linked on a linux VM under CUDA 9.0 works for CUDA 9.0 (which seems to be the case), and whether it works on CUDA 9.1 (which is not the case).

Any help or hints regarding the CMake magic that might be necessary to compile it in a way that works for 9.0 and 9.1 would be highly appreciated. (As mentioned above: The "Library versioning and compatibility" section of http://www.kaizou.org/2015/01/linux-libraries/ may be helpful here, but I still have to read it more thoroughly)

  • If we aren't able to make it work with both versions, I do agree that the only solution would be to provide a JCuda-9.0 targeting Cuda-9.0/Cudnn-7.0 and a JCuda-9.1 targeting Cuda-9.1/Cudnn-7.1. I will avoid upgrading to 9.1 for now so that the option of compiling against 9.0 on my machine remains open.

Only a side note here: The CUDNN version number and the CUDA version number are, to my knowledge, largely independent. So you could, for example, mix CUDNN 7.1 and CUDA 9.0 or even CUDA 8.0. E.g. the release notes at http://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_704.html#rel_704 talk about bugfixes for "cudnn v7 with CUDA 8.0".

However, of course, JCuda adds another layer here, so one has to be extra careful (particularly until this .so.X.Y thingy is resolved...)


EDIT: Still, my gut feeling is that this specific .so.9.0 version number somehow stems from the fact that the update from CUDA 9.0 to 9.1 mainly targeted CUBLAS. That's a very vague guess, but I could imagine that this specificity comes from some version information that is actually contained in the underlying libcublas.so/a, and omitting it might cause difficulties in other setups. But let's see what we can figure out here.

luigirocca commented 6 years ago

Just to clarify: we confirm that on windows JCuda 9.0 maven works with Cuda 9.0, with maven libraries (not a big surprise I know). Nobody has installed 9.1 here. It would still be interesting to know whether that would work or not on windows (jcuda 9.0 with cuda 9.1), but I can't make this kind of test now, unfortunately.

@jcuda, thanks for all your answers and clarifications. I have some knowledge about the GCC toolchain and linking on Linux, but not a lot on CMake. I will take a look at some additional info on this issue if I can, but the time I can dedicate to it is kind of limited.

My idea is that we could try in the following days to understand together if this can be solved successfully (that is, having a jcuda library that works with both minor versions) or not. If not, then the only solution would be having two jcuda minor versions, one for each cuda minor version. BTW, I am starting to think that this could be the safest solution in any case, probably avoiding other possible headaches for good. For sure I would be more confident upgrading to cuda 9.1 knowing that I am changing to a JCuda version that has been explicitly built against cuda 9.1, and the same for cudnn (even though I understand what you're saying about cudnn being independent from specific cuda versions).

Regarding your last edit @jcuda: I somewhat agree with your feeling. That is why knowing what happens in windows would be interesting (but I am not an expert at all in windows linking and if the dependency resolution is similar or less strict compared to unix, or whatever).

blueberry commented 6 years ago

@jcuda I wrote a wall here, but deleted it to save you from reading a million words that do not help much.

The gist is that CMake combined with CUDA-fu makes a mess where my (limited) knowledge of how C toolchain works on POSIX is not enough. If I was to write a Linux-only CUDA wrapper from scratch I would simply state minimal requirements for the building machine (let's say GCC 6+, CUDA 9+, make, etc.) and link to appropriate libraries by -llibx -lliby -llibz and the standard tooling combined with the package manager would make sure that the resulting binary is quite generic. No need for CMake beast there. If I wanted to make it work on Windows, I would have created a separate build script for Visual C++, or used a Maven plugin (like with neanderthal), but the linking would have stayed similarly simple: -llibx -lliby -llibz somewhere in the not-so-huge build config.

Now, CMake in general seems to be rather complex and seems to add many features that do not follow the Unix philosophy. This particular CMake+CUDA build is beyond words :) It seems to me that only a CMake veteran can hope to penetrate this and find what might be the culprit. I ran out of ideas here pretty quickly...

Have you thought about, as a long-time option, to explore using Nar Maven Plugin as a build system for JCuda and other multiplatform Java wrappers? I have similar situation with MKL as we have here with CUDA and it has served me rather well...

blueberry commented 6 years ago

PS. I understand that this is also a mid-to-long-term issue but it is related: Is there a reason to keep all JCuda binaries separated? Why not keep jcuda, jcublas, jsparse, etc. together, and have one native JCuda binary that links to CUDA binaries provided by the system? Then I hope it would be easier to handle all jcuda inter-dependencies, and vastly simplify the build system.

blueberry commented 6 years ago

...and another smallish addition regarding versioning. Maybe it's a good time to think about transferring from 0.9.X to 9.X.Y for JCuda. That way, the relationship between JCuda and CUDA versions will be more pronounced, just like with JOCLBlast and JOCL, and there is the Y for patches and cases like we have now. Current version uses too many numbers that do not carry much information, and in cases like this, when some patch is needed, you have to rely on suffixes such as b that you have to explain in many words...

jcuda commented 6 years ago

Some points could be discussed in separate issues, but to address them here quickly:

Note: Most of the following is not directly related to this issue. I'll further investigate the .so.X.Y issue and write about any insights here

Maybe it's a good time to think about transferring from 0.9.X to 9.X.Y for JCuda.

This, in fact, is something that I also wrote above, but removed it before posting, also to reduce the (still rather big) "wall of words" ;-) Yes, I think that there may be a "JCuda 10.0.0", rather than a "JCuda 0.10.0".

Is there a reason to keep all JCuda binaries separated?

On the one hand, I tend to keep separate things separated. It's usually easier to combine small blocks than to tear apart one monolithic codebase. On the other hand, I agree that e.g. the bunch of git clone... calls and the issues of tagging and branching can be cumbersome. To some extent, this is due to "historical" reasons. The directory structure is not what you would do when starting something like this from scratch. And I think it would be reasonable to at least combine the "core" projects (which may not include JNvgraph, and certainly not JCudnn).

But I think that the jcuda-main project with the build infrastructure that combines everything mitigates the problems here. Further restructurings might be considered, but I'm not sure whether this would make the build process itself so much easier.

Have you thought about, as a long-time option, to explore using Nar Maven Plugin as a build system for JCuda and other multiplatform Java wrappers?

This has been considered for JCuda as well as JOCL. It has also been discussed in some forum thread about bringing JCuda to GitHub (sorry, there are some formatting issues there due to a change in the forum software).

In the good ol' days, I did have separate build files: Visual Studio Project files for Windows, and make files for Linux+Mac. But this was also cumbersome to update. They had been created by using the "CUDA samples" files as some sort of "template". During the transition to GitHub, I also considered (and IIRC tried out) to compile the native part with Maven (also the "intermediate" solution, https://github.com/cmake-maven-project/cmake-maven-project ).

But the only tool that seems to be commonly accepted and reliable for cross-platform-C-builds seems to be CMake - even when it sometimes can be a pain in the back to make it do what it should do. Manually figuring out things like the CUDA_SDK_ROOT_DIR, handling the different CUDA versions (particularly: Knowing what has to be changed in the build files when a new version pops up!), setting the compiler flags, and more importantly, the right linker flags is something that I would not try to do manually for the plethora of possible target OSes and compilers out there.

In the end, one argument to not use Maven was that the native part and the Java part can be built separately, the native libraries always have to be built first, CMake is the standard tool for C builds, and eventually, the natives have to be compiled on different OSes anyhow before they could be assembled by Maven. But I won't deny that, to some extent, one reason to not change the current structure is simply that I'm afraid of sinking a lot of time in the attempt to radically change (and "simplify") the build, only to figure out that there is no simpler solution. (Pull requests are welcome, though ;-))

jcuda commented 6 years ago

Back to on-topic: From a quick glance at the makefiles, it at least seems like the specific name does not appear in the JCublas2 link stage:

/usr/bin/c++ -fPIC -shared -Wl,-soname,libJCublas2-0.9.0-linux-x86_64.so -o ../../nativeLibraries/linux/x86_64/lib/libJCublas2-0.9.0-linux-x86_64.so CMakeFiles/JCublas2.dir/src/JCublas2.cpp.o /usr/local/cuda/lib64/libcudart_static.a -lpthread -lrt -ldl /usr/local/cuda/lib64/libcublas.so ../../../lib/libJCudaCommonJNI.a -Wl,-rpath,/usr/local/cuda/lib64

... but again, I'm not an expert here, and just started browsing, so this may not mean much (and even less explains why the specific name ends up in the binary)

EDIT: The latter may be related to the -soname parameter. I'll try this out, but first have to juggle a bit with the VMs. Doing this solely on the command line is hardly feasible.

blueberry commented 6 years ago

OTOH, notice how -lpthread -lrt are linked from the general path, while /usr/local/cuda/lib64/libcublas.so is listed as a specific file. Is there a way to link it as -lcublas?

jcuda commented 6 years ago

The difference between using -lcublas and the full path is, to my understanding, only whether the linker relies on the "standard discovery procedure" for libraries. That is, when adding the directory to the LD_LIBRARY_PATH, one can use -lcublas, but it does not make a difference for the result.

Two (possibly irrelevant) things that I noticed:

1. There are the following files in the CUDA library directory: libcublas.so, which is a link to libcublas.so.9.0, which is a link to libcublas.so.9.0.176. Note that there is no libcublas.so.9 (which is what we would "need").

2. Compiling the "simpleCUBLAS" sample results in an executable that also links to libcublas.so.9.0.

For me, these are strong signs that 9.0 is the "least specific version that works". Or to put it that way: I'm now at the point where I have to anticipate that it's simply not possible to link against .so.9.

(At least, not dynamically. One could link it statically, but I certainly won't put several >50MB libraries into a JAR...)
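
To make the link chain from point 1 easier to inspect, here is a minimal Java sketch that follows symbolic links from a starting file and lists every name it passes through. The class name SymlinkChain and the temporary directory with empty stand-in files are assumptions for illustration only; nothing here reads a real CUDA installation. On Linux, the name that ends up as a dependency in a binary is usually the SONAME stored inside the library itself, which for this layout would presumably correspond to the middle link (libcublas.so.9.0).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class SymlinkChain {

    /** Follows symbolic links from 'start' and returns every file name visited, in order. */
    static List<String> resolve(Path start) {
        List<String> chain = new ArrayList<>();
        Path current = start;
        chain.add(current.getFileName().toString());
        try {
            // The hop limit guards against cyclic links
            for (int hops = 0; Files.isSymbolicLink(current) && hops < 40; hops++) {
                Path target = Files.readSymbolicLink(current);
                // Relative link targets are interpreted relative to the link's directory
                current = current.resolveSibling(target);
                chain.add(current.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chain;
    }

    /** Rebuilds the layout described above in a temporary directory (hypothetical stand-in files). */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("cuda-lib64-demo");
            Path real = Files.createFile(dir.resolve("libcublas.so.9.0.176"));
            Path versioned = Files.createSymbolicLink(
                dir.resolve("libcublas.so.9.0"), real.getFileName());
            Files.createSymbolicLink(
                dir.resolve("libcublas.so"), versioned.getFileName());
            return resolve(dir.resolve("libcublas.so"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Calling SymlinkChain.resolve with a path like /usr/local/cuda/lib64/libcublas.so should list the same kind of chain for a real installation, and shows at a glance whether a libcublas.so.9 link exists.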


So unless there are objections or other findings, I'll schedule a release for "0.9.0b". (Unfortunately, time is once more a limiting factor, and I still have to sort out a few other things, but hope that I can at least prepare everything (also for a "final" test) this week)

luigirocca commented 6 years ago

Most other libraries in my cuda lib64 directory have the symlinks with the same structure. Yet they do not cause problems, it seems.

BTW the week before I had also tried to add a symlink called 9.1 but it didn't work.

Personally I agree with the move of having a 0.9.0b release built and tested for cuda 9.0 (and hopefully in the future a 0.9.1 release built and tested with cuda 9.1). I'm not in a hurry, and whenever it's ready I can participate in testing it.

About the other things you discussed:

(EDIT: sorry, I accidentally closed the issue while commenting. I opened it again...)

jcuda commented 6 years ago

Until now, the version numbers had been in sync, as in CUDA X.Y -> JCuda 0.X.Y. The current problems mainly stem from the fact that the JCuda binaries are linked against a newer version of the CUDA binaries. However, the primary goal now is to have a working "0.9.0(b)" version as well.
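
The numbering convention above (CUDA X.Y -> JCuda 0.X.Y, plus a letter suffix for fix releases such as "b") can be written down as a tiny helper. This is only a sketch of the convention as described in this thread; the class and method names are made up for illustration and are not part of JCuda.

```java
final class JCudaVersions {

    /** Maps a CUDA version "X.Y" to the matching JCuda version "0.X.Y". */
    static String forCuda(String cudaVersion) {
        return forCuda(cudaVersion, "");
    }

    /** Same mapping, with a fix-release suffix such as "b" appended. */
    static String forCuda(String cudaVersion, String fixSuffix) {
        if (!cudaVersion.matches("\\d+\\.\\d+")) {
            throw new IllegalArgumentException(
                "Expected a CUDA version of the form X.Y but got: " + cudaVersion);
        }
        return "0." + cudaVersion + fixSuffix;
    }

    public static void main(String[] args) {
        System.out.println(forCuda("9.0"));      // base release for CUDA 9.0
        System.out.println(forCuda("9.0", "b")); // fix release for CUDA 9.0
    }
}
```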

@blueberry

JCuda properly links to the "generic" cuda lib, while JCublas does not

I also had a short look at this yesterday. The main reason here is probably that the CUDA driver library is handled differently. Roughly speaking, I think it accesses a library that is "part of the Graphics Card Driver". (I mean, it says "libcuda.so.1", which does not make any sense, conceptually...). The other libraries seemed to have the specific .9.0 dependency. (I did not check all of them, but even one would be enough)

I'll probably open some issues for the other points (versioning, CMake or not, directory structure), but rather as a place for discussion. It's unlikely that this will be changed radically in the near future. There are (too many) more pressing (albeit smaller) tasks in the queue.

jcuda commented 6 years ago

JCuda 0.9.0b should soon be available in Maven Central, at

<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcuda</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcublas</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcufft</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusparse</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusolver</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcurand</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jnvgraph</artifactId>
    <version>0.9.0b</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcudnn</artifactId>
    <version>0.9.0b</version>
</dependency>

I would appreciate it if @luigirocca could test this on Linux/CUDA 9.0/CUDNN7.0

I'll test the Windows part ASAP.

If everything works as expected, I'd publicly announce this update (in READMEs, the website etc).

Afterwards, I'll tackle the update for 9.1 and the latest version of CUDNN as soon as possible.

The build and release process - including the building on a Linux VM - is becoming a bit more "streamlined" now, although there are still several manual steps involved, and I'd still like to improve and simplify all this...

luigirocca commented 6 years ago

Hi @jcuda, I was away until now. I will check this ASAP and let you know if it works here. Thanks!

luigirocca commented 6 years ago

OK. The situation is strange to say the least. I have added back the following lines in my sbt project:

lazy val mavenProps = settingKey[Unit]("workaround for Maven properties")
lazy val jcudaOs = settingKey[String]("")
lazy val jcudaArch = settingKey[String]("")

jcudaOs := "linux"
jcudaArch := "x86_64"
mavenProps := {
  sys.props("jcuda.os") = jcudaOs.value
  sys.props("jcuda.arch") = jcudaArch.value
  ()
}

libraryDependencies += "org.jcuda" % "jcuda" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcuda-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcuda-common" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcublas" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcurand" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcurand-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcusolver" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcusolver-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcusparse" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcusparse-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jnvgraph" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jnvgraph-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcublas-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcufft" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcufft-natives" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcudnn" % "0.9.0b"
libraryDependencies += "org.jcuda" % "jcudnn-natives" % "0.9.0b"

And removed the local lib directory with the jars I had compiled myself from the git repos.

The situation is the following one:

[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcufft-natives;0.9.0b!jcufft-natives.jar (273ms).
[warn] ==== public: tried
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcublas-natives;0.9.0b!jcublas-natives.jar (931ms).
[warn] ==== public: tried
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcuda-natives;0.9.0b!jcuda-natives.jar (1015ms).
[warn] ==== public: tried
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcurand-natives;0.9.0b!jcurand-natives.jar (1021ms).
[warn] ==== public: tried
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcusolver-natives;0.9.0b!jcusolver-natives.jar (1011ms).
[warn] ==== public: tried
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcudnn-natives;0.9.0b!jcudnn-natives.jar (988ms).
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/jcuda/jcudnn-natives/0.9.0b/jcudnn-natives-0.9.0b-${jcuda.os}-${jcuda.arch}.jar
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jnvgraph-natives;0.9.0b!jnvgraph-natives.jar (998ms).
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/jcuda/jnvgraph-natives/0.9.0b/jnvgraph-natives-0.9.0b-${jcuda.os}-${jcuda.arch}.jar
[warn]  Detected merged artifact: [NOT FOUND  ] org.jcuda#jcusparse-natives;0.9.0b!jcusparse-natives.jar (1023ms).
[warn] ==== public: tried

BUT, when I run sbt compile a second time, it finds them and the compile works. This is consistent: if I do rm -fr ~/.ivy2/cache/org.jcuda, then it downloads the non-native libs the first time and fails to find the natives; the second time I run it, it finds and downloads the natives.
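
The URLs in the warnings still contain the literal placeholders ${jcuda.os} and ${jcuda.arch}, so they cannot point to an existing file. The following Java sketch only illustrates that effect: it substitutes ${...} placeholders from JVM system properties and leaves unknown ones untouched. It is not how sbt, Ivy, or Maven implement property interpolation, and the class name PlaceholderDemo is an assumption for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderDemo {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with the system property 'name'; unknown names stay as they are. */
    static String resolve(String template) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = System.getProperty(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            // quoteReplacement keeps '$' and '\' in the replacement literal
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String template = "jcudnn-natives-0.9.0b-${jcuda.os}-${jcuda.arch}.jar";
        System.out.println(resolve(template)); // placeholders stay unless the properties are set
        System.setProperty("jcuda.os", "linux");
        System.setProperty("jcuda.arch", "x86_64");
        System.out.println(resolve(template));
    }
}
```

With both properties set, the file name matches the classifier pattern of the published artifacts; without them, the placeholder text is passed through unchanged, which is what the "tried" URLs look like.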

I am at a loss here and have a feeling that something really stupid might be happening - probably on my part. If I go back to using local libs everything works perfectly, obviously. Also, all the libs are present in my ivy2 cache, jcudnn included (I checked). Unfortunately, this week and the next one I won't have a lot of time for debugging. If you have any suggestion though, I will be very happy to try them out.

jcuda commented 6 years ago

There is something stupid happening, but not on your side: After another look at the jcudnn-natives-0.9.0b-linux-x86_64.jar, it seems like it simply does not contain the native library. Obviously, this part went wrong and went unnoticed because I had to disable the tests - they cannot be executed on a VM :-(

The question of why the natives are not found during the first build is still open, though. It could be interesting to see whether this is somehow related to SBT. If it is not too much effort, and if your time allows it, it would be good to see whether the same behavior is observed with pure Maven. (There had been some hiccups with SBT occasionally, and I don't know whether this might cause some glitches here).

Now, (assuming that the other issue is related to SBT), I'm a bit unsure how to handle the JCudnn issue. The "core" JCuda part seems to work, basically. So I could either leave JCuda at 0.9.0b and create JCudnn 0.9.0c, or update all of them. I'd prefer the latter, to have a consistent version number for the whole release, but still have to see what's the best way to proceed here.

luigirocca commented 6 years ago

No worries! I should have checked myself what was inside the cudnn jar with jar -tf...
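
For anyone who wants to automate that check, here is a minimal Java sketch that does roughly what jar -tf plus a filter would do: it lists the entries under lib/ that look like native libraries. The class name NativeJarCheck, the helper writeJar (which only builds a small test archive with empty entries), and the set of file extensions are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class NativeJarCheck {

    /** Returns the names of all entries under lib/ that look like native libraries. */
    static List<String> nativeEntries(Path jar) {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                boolean nativeSuffix = name.endsWith(".so")
                    || name.endsWith(".dll") || name.endsWith(".dylib");
                if (name.startsWith("lib/") && nativeSuffix) {
                    found.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    /** Test helper: writes a temporary JAR containing empty entries with the given names. */
    static Path writeJar(String... entryNames) {
        try {
            Path jar = Files.createTempFile("natives-demo", ".jar");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                for (String name : entryNames) {
                    out.putNextEntry(new ZipEntry(name));
                    out.closeEntry();
                }
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path jar = writeJar("META-INF/MANIFEST.MF", "lib/",
            "lib/libJCublas2-0.9.0-linux-x86_64.so");
        System.out.println(nativeEntries(jar));
    }
}
```

An empty result for a "-natives" JAR would be a cheap warning sign before uploading a release.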

I'm happy with the consistent version bump, if you ask me (i.e. everything gets bumped up to 0.9.0c), but whatever works for you is fine, I think.

I can try with pure maven, but I have never used it before and I don't have time to dive into documentation now. If you can point me to short instructions/command lines/incantations/whatever, I will be happy to try.

BTW, would it make sense to have a stub project on GitHub that depends on all the libraries and tries to load them and perform some basic usage? With something like this, I would be happy to try and run it on a Linux machine with NVIDIA CUDA installed every time you release, time allowing.

jcuda commented 6 years ago

The last point sounds like a good idea, and could also make it easier to test: It could be a project that just contains one trivial class (similar to the one in the ZIP that I uploaded here a while ago) and shows how the dependencies can be declared.

It could contain the Maven POM, and/or gradle/SBT build files. Probably in one GitHub repo, with subdirectories like

jcuda-maven
jcuda-sbt
jcuda-gradle

or so, each containing the respective build files.

jcuda commented 6 years ago

Version 0.9.0c has been uploaded to Maven Central.

Inspired by the comment from @luigirocca , I also created a jcuda-examples repo:

https://github.com/jcuda/jcuda-examples

(Hoping that it will not cause too much confusion with the jcuda-samples repo...). The goal here is to show the basic setup using different build tools.

Right now, it only contains the Maven project. After cloning, it should be possible to run a very basic test with

mvn clean package exec:exec

which will show (whether the examples project works and) whether the 0.9.0c binaries for Linux now work as expected.

Examples for Gradle, SBT and others will likely be added later. Pull requests are welcome, as always.


Dependencies for 0.9.0c:

<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcuda</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcublas</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcufft</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusparse</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcusolver</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcurand</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jnvgraph</artifactId>
    <version>0.9.0c</version>
</dependency>
<dependency>
    <groupId>org.jcuda</groupId>
    <artifactId>jcudnn</artifactId>
    <version>0.9.0c</version>
</dependency>

luigirocca commented 6 years ago

I've cloned the repo, entered the directory jcuda-example-maven, then issued the command mvn clean package exec:exec. It downloads a lot of stuff, then it fails with the following error:

Exception in thread "main" java.lang.UnsatisfiedLinkError: Error while loading native library "JCudaDriver-0.9.0b-linux-x86_64"
Operating system name: Linux
Architecture         : amd64
Architecture bit size: 64
---(start of nested stack traces)---
Stack trace from the attempt to load the library as a file:
java.lang.UnsatisfiedLinkError: no JCudaDriver-0.9.0b-linux-x86_64 in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1867)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at jcuda.LibUtils.loadLibrary(LibUtils.java:143)
    at jcuda.driver.JCudaDriver.<clinit>(JCudaDriver.java:296)
    at org.jcuda.example.maven.JCudaExampleMaven.main(JCudaExampleMaven.java:21)
Stack trace from the attempt to load the library as a resource:
java.io.IOException: No resource found with name '/lib/libJCudaDriver-0.9.0b-linux-x86_64.so'
    at jcuda.LibUtils.writeResourceToFile(LibUtils.java:323)
    at jcuda.LibUtils.loadLibraryResource(LibUtils.java:255)
    at jcuda.LibUtils.loadLibrary(LibUtils.java:158)
    at jcuda.driver.JCudaDriver.<clinit>(JCudaDriver.java:296)
    at org.jcuda.example.maven.JCudaExampleMaven.main(JCudaExampleMaven.java:21)
---(end of nested stack traces)---

    at jcuda.LibUtils.loadLibrary(LibUtils.java:193)
    at jcuda.driver.JCudaDriver.<clinit>(JCudaDriver.java:296)
    at org.jcuda.example.maven.JCudaExampleMaven.main(JCudaExampleMaven.java:21)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.987 s
[INFO] Finished at: 2018-04-05T11:17:49+02:00
[INFO] Final Memory: 21M/300M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.3.2:exec (default-cli) on project example-maven: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Is there something that I'm missing? What I notice is that there seems to be a dependency on something 0.9.0b. Shouldn't it be 0.9.0c?
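
Judging from the log, the name of the native library is composed of a base name, a version string, the OS, and the architecture, and the resource is then looked up as /lib/lib<name>.so. The sketch below reproduces that naming scheme only as far as it is visible in the error message; it is not the actual jcuda.LibUtils code, and the class name, the method names, and the OS/architecture normalization are assumptions. It shows one way a stale version string on the Java side could lead to a lookup for a file that a newer natives JAR would not contain.

```java
import java.util.Locale;

final class NativeLibName {

    /** Normalizes an "os.name" value; only Linux and Windows are handled in this sketch. */
    static String os(String osName) {
        String s = osName.toLowerCase(Locale.ENGLISH);
        if (s.startsWith("linux")) return "linux";
        if (s.startsWith("windows")) return "windows";
        return s.replace(' ', '_');
    }

    /** Normalizes an "os.arch" value; "amd64" is reported as "x86_64", as in the log. */
    static String arch(String osArch) {
        String s = osArch.toLowerCase(Locale.ENGLISH);
        return (s.equals("amd64") || s.equals("x86_64")) ? "x86_64" : s;
    }

    /** Builds a name such as "JCudaDriver-0.9.0b-linux-x86_64". */
    static String libraryName(String base, String version, String osName, String osArch) {
        return base + "-" + version + "-" + os(osName) + "-" + arch(osArch);
    }

    /** Resource path in the Linux form seen in the stack trace. */
    static String linuxResourcePath(String libraryName) {
        return "/lib/lib" + libraryName + ".so";
    }

    public static void main(String[] args) {
        // The version string is part of the file name, so "0.9.0b" and "0.9.0c"
        // refer to two different resources.
        String stale = libraryName("JCudaDriver", "0.9.0b", "Linux", "amd64");
        String fresh = libraryName("JCudaDriver", "0.9.0c", "Linux", "amd64");
        System.out.println(linuxResourcePath(stale));
        System.out.println(linuxResourcePath(fresh));
    }
}
```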

jcuda commented 6 years ago

OK, now I've passed the point where this is becoming really embarrassing. That was a stupid, stupid mistake on my side. I definitely have to automate some of the steps that now require a sort of conscientiousness that is hard to maintain in the long run. At least I'll improve my internal checklist so that something like this never happens again. I'll do another update tomorrow. Sorry for the hassle.

luigirocca commented 6 years ago

No worries @jcuda. FYI, I'll be mostly away until the end of next week. Maybe I will be able to try something every now and then, but only occasionally until I'm back :-).

jcuda commented 6 years ago

So let's see whether the embarrassment continues: Version 0.9.0d has been released.

It may still take 1-2 hours before it is promoted into Maven Central.

But the example repo at https://github.com/jcuda/jcuda-examples/tree/master/jcuda-example-maven has been updated, so it should be rather quick & easy to test.

Thanks again for your patience and support!

jcuda commented 6 years ago

(BTW: I just tried it on Windows, and it works (i.e. it is already promoted to Central), but this does not tell much about this issue in general...)

luigirocca commented 6 years ago

Hi @jcuda ! Sorry for the long wait. I tried the jcuda-example-maven test on my linux machine and it seems to me that it works as expected:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.201 s
[INFO] Finished at: 2018-04-20T15:21:16+02:00
[INFO] Final Memory: 20M/253M
[INFO] ------------------------------------------------------------------------

I do not have time to test right now if the 0.9.0d maven release works in my sbt project too, but I don't see why it shouldn't. My idea is that we can close the issue for now and hope that it will stay closed. I will give you some feedback about sbt too in the future, as soon as I can find the spare time to set it up again.

Many many thanks!

jcuda commented 6 years ago

Great, thanks for testing @luigirocca !

Since you mentioned JCudnn specifically, you may also try to un-comment the line https://github.com/jcuda/jcuda-examples/blob/master/jcuda-example-maven/src/main/java/org/jcuda/example/maven/JCudaExampleMaven.java#L31 and see whether this works. (Note that the NVIDIA CUDNN library will have to be in a "visible" path then. Basically in the LD_LIBRARY_PATH, or (for testing) simply in the root directory of the project)

If this also works, then this issue can indeed be closed.

luigirocca commented 6 years ago

It works for me (my cudnn installation is in the same path as cuda).

I've only uncommented the line you mentioned and then ran the mvn clean package exec:exec command again in the jcuda-example-maven directory. I hope that is enough!

jcuda commented 6 years ago

Great, then I'll close this one, prepare the uploads/README updates, and (finally) do the update for CUDA 9.1.

Thanks again!