jcuda / jcuda-main

Summarizes the main JCuda libraries
MIT License

Broken cudnnRNNForward function signature in jcuda (possibly all cudnn v8 versions) #49

Closed: blueberry closed this issue 1 week ago

blueberry commented 2 years ago

Hi @jcuda

I'm in the middle of trying out the RNN portion of the JCudnn library. As usual, everything is well thought out, and I managed to find my way through Nvidia's relatively convoluted API to build working examples with JCuda.

However, it seems that the automated JCuda generator misses a showstopping aspect of the cudnnRNNForward function. Namely, the argument devSeqLengths (https://github.com/jcuda/jcudnn/blob/157f7b4631eb5292f81cd7635cf35a07c1c11985/JCudnnJava/src/main/java/jcuda/jcudnn/JCudnn.java#L3554) is declared as int[] in JCudnn, which follows the cudnn signature, where it is declared as const int32_t devSeqLengths[].

HOWEVER, the cudnn documentation states that this argument "must be stored in GPU memory as it is accessed asynchronously by GPU kernels" (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnRNNForward).

My code throws a CUDNN_STATUS_BAD_PARAM exception from cudnn at this point, and no matter what variations I try for the other parameters, I can't get it to work. Could you please check whether this is a broken call in JCudnn, or whether JCuda transfers that array to GPU memory under the hood? I'm out of ideas regarding my code.
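
For reference, this is a minimal sketch (with made-up sizes and names) of what I understand "stored in GPU memory" to mean in JCuda terms; the open question is whether the current int[] binding does something like this under the hood:

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.cudaMemcpyHostToDevice;

public class DevSeqLengthsSketch
{
    public static void main(String[] args)
    {
        setExceptionsEnabled(true);

        // Hypothetical batch of 4 sequences, each 10 time steps long
        int[] seqLengthsHost = { 10, 10, 10, 10 };

        // What "must be stored in GPU memory" would mean: copy the lengths
        // into a device buffer and hand a device Pointer to cuDNN
        Pointer devSeqLengths = new Pointer();
        cudaMalloc(devSeqLengths, (long) seqLengthsHost.length * Sizeof.INT);
        cudaMemcpy(devSeqLengths, Pointer.to(seqLengthsHost),
            (long) seqLengthsHost.length * Sizeof.INT, cudaMemcpyHostToDevice);

        // This device Pointer is what cudnnRNNForward seems to expect for
        // devSeqLengths, but the current JCudnn binding only accepts an int[]

        cudaFree(devSeqLengths);
    }
}
```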

blueberry commented 2 years ago

Update: by setting the hx, hy, cx, and cy arguments I got it to execute (I still have to make sure the results are correct, but if they are not, that's my mistake), so the error was in my code and JCudnn makes this call properly.

Maybe it would be helpful if you could comment on the "must be stored in GPU memory" part, so that I, or other future users of JCuda, can more easily follow similar remarks in the cuda documentation and understand how JCuda generally handles them.

Thank you and sorry for the false alarm.

jcuda commented 2 years ago

I'm surprised that you closed this. The documentation that you linked to indeed seems to indicate that this should be a Pointer to GPU memory. So even if it does not crash immediately, the function might access wrong data, and might only seem to work (and produce wrong results), or cause inexplicable crashes (maybe at a later point). Maybe it works only because the function was called in a form where it does not require (and therefore does not really access) the given parameter.

(Unfortunately, the function is one of the hyper-complex ones: it's nearly impossible to understand what the function is doing without spending hours reading the documentation, and even then, it's essentially impossible to try out in a standalone fashion (even less so in some "unit test")).

I'll reopen it for now, because it should probably be investigated. I'd like to verify this by means of a sample (as small as possible) that involves this function and this parameter. From a quick search, I didn't find any sample that actually uses this function, but I will have a closer look at the samples in the latest cuDNN release. If you have some "minimal sample" that involves cudnnRNNForward and that you could provide here (maybe even going into jcuda-samples eventually...?), that would be great. The "test coverage" of many JCuda libraries is obviously far too low...

In some way, this question is similar to https://github.com/jcuda/jcuda-main/issues/48 , because the general issue often is: "Which memory type does a parameter have?". I'll try to allocate some time tomorrow for this and the other issue.

blueberry commented 2 years ago

Hi Marco. So far I've made it work as intended, for the minimal example (in Clojure). Here's the test example: https://github.com/uncomplicate/deep-diamond/blob/4edc61ef52a4d6c44e0da0d9c2980fbd1b4f6713/test/uncomplicate/diamond/internal/cudnn/core_test.clj#L476

You are right that there's no example anywhere on the internet that demonstrates how to use this function. It's probably because it was newly introduced in cudnn v8. There is typically one example available for the deprecated pre-v8 function. Nevertheless, RNN examples are very, very scarce. I wouldn't be surprised if the test I've created is the most concise and straightforward one you'll find anywhere. There is also an equivalent test that uses Intel's DNNL library in the repository I've linked. The results match.

Anyway, now that I've explored the cudnn RNN API, the next task is to integrate it into the existing Deep Diamond infrastructure. Then I'll be able to compare it with the existing working DNNL backend and see whether there is any problem with that memory being on the device or not. Perhaps cudnn transparently maps the memory you give it. I don't know, but in any case we'll be able to test this.

BTW my code is in Clojure, but I believe you'll be able to follow it with your Java knowledge.

blueberry commented 2 years ago

Update: You seem to be right that this argument gets ignored. I deliberately provided incorrect dimensions there, and my example still produces correct results. I suspect that this is because I use the common case where all sequences in the batch have the same length (the only case that is supported by DNNL). So, this might not be a showstopper for now, but it is something that should be handled generally if the full cudnn capability is the goal.
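
To make that concrete, here is a tiny (hypothetical) illustration of the two cases: with uniform lengths the array carries no extra information, which is presumably why incorrect values go unnoticed, while with variable lengths its contents would actually matter.

```java
import java.util.Arrays;

public class SeqLengthsCases
{
    public static void main(String[] args)
    {
        int batchSize = 4;
        int seqLength = 10;

        // Common case (and the only one DNNL supports): every sequence in
        // the batch has the same length, so the per-sequence lengths add
        // nothing beyond what the descriptors already say
        int[] uniformLengths = new int[batchSize];
        Arrays.fill(uniformLengths, seqLength);

        // Variable-length case: here the contents of devSeqLengths do
        // matter, and host vs. device memory would make a real difference
        int[] variableLengths = { 10, 7, 10, 3 };

        System.out.println(Arrays.toString(uniformLengths));
        System.out.println(Arrays.toString(variableLengths));
    }
}
```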

jcuda commented 2 years ago

Just a short heads-up: I have changed the type of all devSeqLengths.* parameters from int[] to (device) Pointer in https://github.com/jcuda/jcudnn/commit/d99fe32434630a5d9734c2d29939e558ca890b88

I have not yet tried it out. (I'll consider porting your sample to Java, or ... maybe just try out the Clojure version. I tried out neanderthal a while ago, and have seen that you have worked extensively in that area, so I might have a look at deep-diamond at some point...). But maybe you have the chance to give it a try.

blueberry commented 2 years ago

Sure, I'll try it out, but in the next few days/weeks I have to wrap up the RNN support. Once I make it work with the current JCudnn, I'll build the new one and hopefully it will work. In the meantime, it seems that CUDA 11.7 is out. Maybe that will be a good opportunity to introduce this change in a release?

jcuda commented 2 years ago

Sure, this one and the fix from https://github.com/jcuda/jcuda-main/issues/48#issuecomment-1140268730 should preferably both go into the next release. I don't know exactly when I'll tackle the update for 11.7, but I just skimmed the Changelog, and it seems like there are not sooo many API changes.

jcuda commented 1 week ago

I think that this is resolved. Feel free to reopen if not.