The update for CUDA 11.7 will be tracked here.
I'll probably not be able to do the update this week, but I will have some time starting the middle of next week, so "stay tuned"...
The update for CUDA 11.7.0 is done. The state that is used for creating the native libraries is tagged as version-11.7.0-RC00, corresponding to the current state in master.
@blueberry I had some difficulties with cuDNN here....
These details may not be immediately relevant for you, but provide some context:
The latest cuDNN binaries that are available for Windows are cudnn-windows-x86_64-8.4.1.50_cuda11.6-archive: version 8.4.1.50, supposedly for CUDA 11.6. There are no dedicated cuDNN libraries for CUDA 11.7. Using these latest binaries, most of the cuDNN calls caused a plain crash. I mean, not an error or so, but a plain termination of the JVM. Something seems to be awfully wrong there. But... using the older DLLs from the cuDNN package cudnn-11.4-windows-x64-v8.2.2.26 worked smoothly, even with the latest CUDA version. There have not been any API changes between these versions, FWIW, so I'm not entirely sure where I should start digging to investigate this further...
It would be great to know whether you encounter any difficulties with cuDNN when the Maven build runs the "binding tests", and if so (or even if not), which version of cuDNN you are using, and for which CUDA version that cuDNN version is intended...
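In case it helps with reporting: here is a minimal, hypothetical smoke test (not part of the Maven binding tests; the class name is made up, the JCudnn calls are the ones I'd expect to work) that only checks whether the JCudnn natives load, which cuDNN version they report, and whether a handle can be created:

import jcuda.jcudnn.JCudnn;
import jcuda.jcudnn.cudnnHandle;

// Minimal, hypothetical smoke test (not part of the Maven binding tests):
// check that the JCudnn natives can be loaded and report the cuDNN version.
public class JCudnnSmokeTest
{
    public static void main(String[] args)
    {
        // Throw exceptions instead of returning error codes
        JCudnn.setExceptionsEnabled(true);

        // The version reported by the native cuDNN library, e.g. 8401 for 8.4.1
        System.out.println("cudnnGetVersion: " + JCudnn.cudnnGetVersion());

        // Creating and destroying a handle exercises the actual library
        cudnnHandle handle = new cudnnHandle();
        JCudnn.cudnnCreate(handle);
        JCudnn.cudnnDestroy(handle);
        System.out.println("Created and destroyed a cuDNN handle");
    }
}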
Hi @jcuda, I can do this in 2 weeks. Lots of different stuff on my plate right now.
Hi! I just tried to compile this on Ubuntu 22 with CUDA 11.7.0 and CUDNN 8.4.1 and can confirm your finding, @jcuda , that it is not working :-/ The 11.6 in the file name of the cudnn package should not be a problem, as the download page says "11.x".
I fixed the mentioned cuDNN issues in jcuda/jcudnn#6
For one of the fixes I'm not sure if the real function call would work now. It just seems that cudnnCTCLoss_v8() does not like being called with a cudnn handle that is null.
I'll review https://github.com/jcuda/jcudnn/pull/6 ASAP. For now, it would not make sense to compile 11.7 (except for "trying out whether everything else works", maybe). I'll drop a note here when the fixes have been applied.
Correction one more time: Since all the parameters are 0 or null pointers in the binding test, any of these parameters could be the problem. In this quick fix I only test whether the handle is null and avoid calling the function altogether.
I wrote a comment in the PR. The 'BasicBindingTest' is really just checking whether the native version of the functions exists. Usually, calling such a function with nulls/0s should cause some error or exception, but should not crash (and that's sufficient for this test). More details have to be investigated to figure out what's wrong with that in the latest versions of CUDA+cuDNN.
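To illustrate that idea, here is a rough sketch of the concept (emphatically not the actual BasicBindingTest code; the class and helper names are made up): invoke each public static binding method with null/0 arguments via reflection, and only treat a missing native counterpart, i.e. an UnsatisfiedLinkError, as a failure. With the problematic cuDNN versions, some of these calls terminate the JVM instead of returning an error, which is exactly what such a test cannot recover from:

import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import jcuda.jcudnn.JCudnn;

// Rough sketch of the *idea* behind a basic binding test (not the real test
// code): call each public static binding with null/0 arguments and only treat
// a missing native counterpart (UnsatisfiedLinkError) as a failure.
public class BasicBindingSketch
{
    public static void main(String[] args)
    {
        for (Method method : JCudnn.class.getDeclaredMethods())
        {
            int mod = method.getModifiers();
            if (!Modifier.isStatic(mod) || !Modifier.isPublic(mod)) continue;

            Class<?>[] types = method.getParameterTypes();
            Object[] defaults = new Object[types.length];
            for (int i = 0; i < types.length; i++)
            {
                // null for objects/arrays, 0 for primitives
                defaults[i] = types[i].isPrimitive() ? zeroOf(types[i]) : null;
            }
            try
            {
                method.invoke(null, defaults);
            }
            catch (Throwable t)
            {
                if (t.getCause() instanceof UnsatisfiedLinkError)
                {
                    System.out.println("Missing native method: " + method.getName());
                }
                // Any other error/exception is acceptable here: the native
                // method exists, it just (rightly) rejects the dummy arguments.
            }
        }
    }

    private static Object zeroOf(Class<?> type)
    {
        if (type == boolean.class) return false;
        if (type == char.class) return (char) 0;
        if (type == byte.class) return (byte) 0;
        if (type == short.class) return (short) 0;
        if (type == int.class) return 0;
        if (type == long.class) return 0L;
        if (type == float.class) return 0f;
        return 0.0;
    }
}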
A heads-up:
The bug at https://github.com/jcuda/jcusparse/issues/3 has not been fixed yet, but I will try to do this as soon as possible (together with investigating the current cuDNN issues).
For everybody who considered creating the Linux binaries: hold on for a moment. I'd really like to get these bug fixes in for the 11.7 release...
I think the issue with cuDNN that caused some delays here is now ... "resolved" via https://github.com/jcuda/jcudnn/commit/5104edb30b1000934bceeb93842587133f7c0b75
@corepointer The issues that you wanted to solve in https://github.com/jcuda/jcudnn/pull/6 may have been caused by my failure to update the JCudnn.hpp header file for the latest API. And this, in turn, was caused by the fact that I could not run the 'basic binding test' with the latest version of cuDNN without causing segfaults.
Details:
Most of the cuDNN functions apparently 'did not work' in the latest version. But it seems that this was only caused by a missing 'ZLIB'. The effect of that was that cuDNN printed a message ...
Could not locate zlibwapi.dll. Please make sure it is in your library path!
and then simply terminated the process. But this message did not appear in the Eclipse console or in the Maven build output! Only after some pain-in-the-back debugging steps did I see it, and a web search eventually led to this: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-linux . Yeah. Just change the installation procedure between 8.1.1 and 8.4.1 in a way that breaks everything unless you re-read the whole installation instructions for each release from top to bottom. Why not.
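As an aside, a heuristic pre-flight check from the Java side could at least turn this into a readable message. This is only a sketch (the class name is made up), and note that System.loadLibrary searches java.library.path, which is not necessarily the same search path that cuDNN uses to resolve zlibwapi.dll:

// Heuristic pre-flight check (sketch only): try to load zlib before using
// cuDNN, so that a missing zlibwapi.dll yields a hint instead of a silent
// JVM termination. System.loadLibrary searches java.library.path, which may
// differ from the DLL search path that cuDNN itself uses.
public class ZlibCheck
{
    public static void main(String[] args)
    {
        String os = System.getProperty("os.name").toLowerCase();
        String libraryName = os.contains("win") ? "zlibwapi" : "z";
        try
        {
            System.loadLibrary(libraryName);
            System.out.println("zlib (" + libraryName + ") could be loaded");
        }
        catch (UnsatisfiedLinkError e)
        {
            System.err.println("Could not load zlib (" + libraryName
                + "), see the cuDNN installation guide for how to install it");
        }
    }
}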
There is still one function that causes a segfault, namely cudnnCTCLoss_v8. This function is now simply skipped in the test. I'm pretty sure that this will bite back at some point. I just cannot sort out this one right now.
If somebody wants to give building the JCuda binaries for Linux a try: the state to use is tagged as version-11.7.0-RC01, which is the same as master at the time of writing this.
Just a short ping @blueberry or @corepointer, in case someone wants to give the current state a try.
I didn't forget. I just need my system with CUDA 11.6 for several more weeks, because I'm in the middle of a development job. I thought that some of the people who need an older clib would prefer to build it themselves. If not, I'll build it in a few weeks (but the clib is something I can't control).
I created binaries for cud{a,nn} 11.7.0/8.4.1 on Ubuntu 18.04 last night. Seems to run OK, but I'm not done testing. If all is well, I'll make them available tonight.
If you intend to upload them to the jcuda-binaries directory, maybe blueberry can have a look, too: he has several larger libraries built on top of JCuda, with considerable test coverage (thus, implicitly testing JCuda, which is a good thing. In JCuda itself, the testing is a bit shallow: beyond the 'basic binding tests', I usually just try out the samples, which is not nearly enough test coverage...)
Sorry for the delay. Binaries are online. Remaining issues are most probably in SystemDS and not JCuda (not done testing everything though).
Could you release a beta version, so I can try them as-is with tests from all of my libraries right away? I have to upgrade my system to CUDA 11.7 to try JCuda 0.11.7, and I am not sure whether I can easily switch back, so we'd have to go all-in.
You can install CUDA 11.7 (or any version) manually and set some environment variables to point a shell to it. This way you prevent interference with your system:
Download the "runfile" version of CUDA and make that file executable.
./cuda_11.7.0_515.43.04_linux.run --silent --toolkit --no-drm --no-man-page --no-opengl-libs --override --installpath=/opt/cuda-11.7
Unpack the cuDNN archive you downloaded to /opt/cuda-11.7/lib64.
Modify and export PATH to contain /opt/cuda-11.7/bin, and LD_LIBRARY_PATH to contain /opt/cuda-11.7/lib64.
Now you should be good to go from within the shell that contains the modified environment variables. The method described above does not install the driver, so the only system-wide modification is a possible driver upgrade (which is not necessarily needed if your driver is not too outdated). I guess it's best to leave that to your system packages.
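As a quick sanity check from the JCuda side, a small sketch like the following (class name made up) can print the CUDA runtime and driver versions that are actually picked up in such a shell, so one can verify that the local /opt/cuda-11.7 installation is the one being used (11070 corresponds to CUDA 11.7):

import jcuda.runtime.JCuda;

// Sketch: print the CUDA runtime/driver versions that JCuda actually sees,
// e.g. to verify that the locally installed CUDA 11.7 toolkit is picked up.
public class CudaVersionCheck
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);
        int[] runtimeVersion = { 0 };
        int[] driverVersion = { 0 };
        JCuda.cudaRuntimeGetVersion(runtimeVersion);
        JCuda.cudaDriverGetVersion(driverVersion);
        System.out.println("Runtime version: " + runtimeVersion[0]);
        System.out.println("Driver version : " + driverVersion[0]);
    }
}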
@blueberry I understand the hesitation. Upgrading CUDA always feels like a one-way road.
But I'm not entirely sure what you mean by
Could you release a beta version, so I can try them as-is with tests from all of my libraries right away?
It sounds like you wanted to try the JCuda 11.7 binaries when you still have CUDA 11.6 installed, but that is not supposed to work. I don't know whether the instructions by corepointer are a solution for that in your case.
So I'd rather create a 11.7.0 release from the binaries that corepointer provided (a bit later today). You could try to test them with CUDA 11.6 (even though that will likely not work), or with CUDA 11.7 installed 'manually', as described by corepointer. In any case: Iff there is something fundamentally wrong with this release, I'd try to fix it and create a 11.7.0b as quickly as possible.
Or to put it that way: there are the options of creating a dedicated beta release that you then test, or creating the actual 11.7.0 release that you then test. I don't see the technical difference between them when it comes to testing a certain release...
JCuda 11.7.0 is on its way into Maven Central, and should soon be available under the usual coordinates:
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcuda</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcublas</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcufft</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcusparse</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcusolver</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcurand</artifactId>
<version>11.7.0</version>
</dependency>
<dependency>
<groupId>org.jcuda</groupId>
<artifactId>jcudnn</artifactId>
<version>11.7.0</version>
</dependency>
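After updating the dependencies, a minimal, hypothetical smoke test like the following (not an official example; class name made up) can verify that the natives load and that a device is reachable:

import jcuda.Pointer;
import jcuda.runtime.JCuda;

// Minimal, hypothetical smoke test: after updating to the 11.7.0 artifacts,
// check that the natives load, a device is visible, and a trivial
// allocation works.
public class JCuda1170SmokeTest
{
    public static void main(String[] args)
    {
        JCuda.setExceptionsEnabled(true);

        int[] deviceCount = { 0 };
        JCuda.cudaGetDeviceCount(deviceCount);
        System.out.println("Devices: " + deviceCount[0]);

        // Allocate and free a small buffer on the current device
        Pointer devicePointer = new Pointer();
        JCuda.cudaMalloc(devicePointer, 1024);
        JCuda.cudaFree(devicePointer);
        System.out.println("cudaMalloc/cudaFree worked");
    }
}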
@corepointer Thanks for providing the binaries for this release, and @blueberry Thanks for the continued support and tests!
One detail, @corepointer:
I really appreciate your contribution of the native libraries. But for the release, I actually need the JAR files that contain the libraries. When you do the (jcuda-parent) mvn clean install and the (jcuda-main) mvn clean package after compiling the native libraries, there should be a bunch of JARs in the jcuda-main/output directory. Among them should be 7 (seven) JARs for the native libraries, like jcublas-natives-11.7.0-linux-x86_64.jar. These are the JARs that I need.
I can of course take the native libraries and pack them into JARs myself, and I just did that for this release, but... in contrast to just using the JARs that are created during the automated build, that's a manual process and thus more error-prone. (Fingers crossed that I didn't make some stupid mistake here...)
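For what it's worth, here is a small sketch (file name just an example, nothing official) of how such a manually packed natives JAR could be sanity-checked by listing its entries and confirming that a native library file is present:

import java.util.Collections;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Sketch: list the entries of a natives JAR (e.g. one that was packed
// manually) to verify that a native library file is actually contained in it.
// The JAR name below is just an example.
public class NativesJarCheck
{
    public static void main(String[] args) throws Exception
    {
        String jarName = "jcublas-natives-11.7.0-linux-x86_64.jar";
        try (JarFile jar = new JarFile(jarName))
        {
            boolean found = false;
            for (JarEntry entry : Collections.list(jar.entries()))
            {
                System.out.println(entry.getName());
                if (entry.getName().endsWith(".so")) found = true;
            }
            System.out.println(found
                ? "Native library found"
                : "WARNING: no .so file in " + jarName);
        }
    }
}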
@blueberry Depending on your intended test strategy (i.e. testing this with CUDA 11.6, or doing a 'local' installation of CUDA 11.7, or doing a 'global' installation of CUDA 11.7): When you encounter any errors, just drop me a note, and I'll try to fix them ASAP.
@jcuda I've pushed the jars in corepointer/jcuda-binaries@51db3399885b1b5f303eb9d076dc4554fd1a0426
Thanks. Iff it is necessary to create a new release (with unmodified natives), I'll use these. For the release above, I created the JARs manually (hopefully without messing something up in the process).
Hi Marco,
I have just tried JCuda 11.7.0, and, surprisingly, it DOES work on my existing CUDA 11.6 platform! I'll update Deep Diamond for the breaking changes introduced with this JCudnn release (device params and a few other things), then I'll first make sure everything works with CUDA 11.6, and finally update to CUDA 11.7 and try everything again.
I don't know what you've changed but now JCuda is much more flexible! Thank you.
And JCuda 11.7.0 works well with CUDA 11.7 too! Thank you @jcuda