jcuda / jcuda-main

Summarizes the main JCuda libraries
MIT License
99 stars 20 forks source link

JCuSparse SpGEMM wrong parameter handling #48

Open corepointer opened 2 years ago

corepointer commented 2 years ago

Hello!

When using the latest JCuda to do a sparse matrix multiplication, the parameters externalBuffer1 of cusparseSpGEMM_workEstimation(...) and externalBuffer2 of cusparseSpGEMM_compute(...) are checked for being NULL [1], [2] and an error is thrown. According to the sample code here [3], it is intended for these parameters to be NULL. The same issue most likely occurs with other variants of JCuda's SpGEMM handling (cusparseSpGEMMreuse_compute(...) etc). Removing the check for NULL and guarding the subsequent getPointer() with an 'if' works just fine.

Regards, Mark

[1] jcusparse/JCusparseJNI/src/JCusparse.cpp:39520 [2] jcusparse/JCusparseJNI/src/JCusparse.cpp:39641 [3] https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSE/spgemm/spgemm_example.c

corepointer commented 2 years ago

PS: just noticed that I could have added this issue more appropriately to the jcusparse repository - sorry :blush:

jcuda commented 2 years ago

Several parts of the more sophisticated libraries (like JCuSparse and JCusolver) certainly have a poor test coverage. It's hard or impossible to create automated tests from scratch. And there are certainly some function bindings that are just plainly wrong in terms of parameter handling, and this may remain unnoticed, as long as there is no actual sample that uses these functions. I generally have to rely on the ... let's call it 'integration test' level: Porting some of the CUDA samples to JCuda, and see whether these work.

I try to get the parameter handling correct. But sometimes, the CUDA API documentation is everything but clear. They usually say whether something points to host or device memory. But whether something may or may not be NULL is often not stated explicitly.

In any case: Thank you for the pointers, I'll have a closer look at this ASAP. But in the meantime: There recently has been a forum thread about a very similar issue, and in that context, I actually made some fixes for the parameter handling and ported the spgemm_example.c to JCuda, and posted it at https://forum.byte-welt.net/t/on-entry-to-cusparsespgemm-createdescr-parameter-number-1-descr-had-an-illegal-value-null-pointer/23472/2 . (It might be that I inserted some new Pointer() there at one place or the other, that serves as "pointer that is NULL for CUDA, but not null for Java", but I'll have to check that again...).

Maybe you want to check it out (and I'd be curious whether it works for you), but I'll check more closely whether this refers to the same functions/parameters that you referred to here, or whether these are further instances of the same problem that I overlooked during the previous fix.

corepointer commented 2 years ago

I have tried out your SpGEMM example and pushed it in a git repo [1]. This works fine with the empty Pointer you provided to the library calls. So I might have missed the documentation about java null needing replacement with new Pointer(). Supporting invocation of the library calls with null from Java would be a nice to have. To test jcuda/jcusparse/issues/3, I have also added a dummy invocation to cusparseSpMM_bufferSize() which makes the whole thing crash with a segfault.

[1] https://github.com/corepointer/jcuda-binaries/tree/main/jcusparse_spgemm_example

jcuda commented 2 years ago

So I might have missed the documentation about java null needing replacement with new Pointer().

No, that's indeed something that I consider as a "bug". It should (in the best case) never be necessary to use new Pointer() as a replacement for a C NULL pointer. When CUDA allows NULL, then the corresponding JCuda argument should be allowed to be null.

I'll try to fix this for the functions in question, but it's hard or impossible to derive a general rule here, I even considered to simply never do any null checks on Java side: When someone passes in a null pointer where CUDA expects a non-NULL pointer, then it should just be left to CUDA to do the proper error reporting (i.e. return an error code). The problem is... CUDA is not very good at doing this sort of sanity checks. (I'm sure they justify this somehow, most likely with ~"performance" 🙄 ). At least, I remember many cases where passing in a NULL pointer just caused a segfault, and that's something that I'd like to avoid on Java side...

jcuda commented 2 years ago

(An aside: I'll also add the JCusparseSgemmExample to https://github.com/jcuda/jcuda-samples/ at some point. I have a few of these samples (usually ported from the CUDA samples) that should sooner or later make their way into jcuda-samples...)

corepointer commented 2 years ago

Thx for looking into it :+1: I can imagine that it's not easy keeping up with all those API changes.

jcuda commented 2 years ago

The null checks for all parameters called externalBuffer.* are now omitted as of https://github.com/jcuda/jcusparse/commit/9e7808d7c40b7478f3683bfc747e57968c113cab

According to a search in the documentation, all of them are device pointers, and even though the documentation does not say whether they may be NULL or not, I'm just omitting the check. Iff such a pointer may not be NULL under whatever conditions, I'll have to leave that check to the CUDA side. (And if it causes a segfault, then it causes a segfault - I simply cannot carry this responsibility here)