cuda 10 - Githubissues

hrheydarian commented 5 years ago

Dear @benvanwerkhoven,

Recently, we got a new GPU card and an update for the CUDA drivers. We now have CUDA 10.1. Fortunately, running the Makefile for the 3D code works fine without any errors (only some warnings about the gcc version). However, when I run our MATLAB script I get the following error:

invalid MEX-file '/home/.../MATLAB/all2all/mex_expdist.mexa64': lib//expdist.so: undefined symbol: cudaSetupArgument.

Could you please tell me if we need to change something in the Makefile to adapt it to CUDA 10?

Thanks

hrheydarian commented 5 years ago

@benvanwerkhoven Dear Ben,

Do you have time to look at this issue?

Bests, Hamidreza

benvanwerkhoven commented 5 years ago

Hi Hamidreza,

It would help me a lot to have access to the HPC servers to be able to reproduce the problem. I finally have a TUDelft guest account, but it seems Ronald still needs to add me to the hpc servers. Did you check my suggestion that it may be the case that you are still using the old mexfile from matlab? Given this error, I would expect that the build system sends the compiled mexfile to a location that is different from where matlab is looking for it.

Best, Ben

hrheydarian commented 5 years ago

Hi @ronligt

Would it be possible for you to give access to Ben for the HPC servers?

@benvanwerkhoven Yes, I did that. On the same machine, if I load CUDA 8.0, a fresh copy of the code works fine. With CUDA 10.1, the build also works fine, but I get this error at runtime.

Bests, Hamidreza

ronligt commented 5 years ago

@hrheydarian, the account for @benvanwerkhoven has been created and he should be able to log in to hpc24, hpc29 and hpc30.

benvanwerkhoven commented 5 years ago

Hi Hamidreza,

I'm currently failing to reproduce the error that you are receiving. I've built everything on the hpc18 machine under cuda80 (typing cmake ., make, and make install). Then I log in to hpc29 (typing module load cuda/10.1) and run the demo script using:

matlab -nodesktop -nodisplay -nosplash -r "demo_all2all"

And I get the output:

all2all registration started !
There are 255 rows !
row 1 started!
Starting parallel pool (parpool) using the 'local' profile ...
connected to 12 workers.
row 1 done in 34.0614 seconds
row 2 started!
row 2 done in 6.9226 seconds
row 3 started!
row 3 done in 7.1196 seconds
...

I did have to add the shared libraries generated by make to my LD_LIBRARY_PATH variable, a step that is currently missing in the documentation on the README. Perhaps that's where things go wrong. Could it be that the shared library loader (which follows LD_LIBRARY_PATH) picks up an old version of the shared library somewhere on your system?
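
One way to check which library the loader resolves, and whether it still depends on the old runtime, is sketched below. It assumes a Linux system with binutils (nm) installed. The helper names prepend_lib_dir and needs_symbol and the /path/to/... placeholders are made up for illustration and are not part of the repository. As far as I know, cudaSetupArgument was removed from the CUDA runtime in 10.1, so a library that still lists it as undefined was most likely compiled by an older nvcc.

```shell
# Put a directory at the front of LD_LIBRARY_PATH so that its libraries
# take precedence over older copies elsewhere on the search path.
prepend_lib_dir() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Report whether a shared library lists a symbol as undefined, i.e. it
# expects the loader to resolve that symbol from another library.
needs_symbol() {
    lib="$1"
    sym="$2"
    if [ ! -f "$lib" ]; then
        echo "missing: $lib"
        return 2
    fi
    # nm -D prints the dynamic symbol table; --undefined-only keeps "U" entries
    if nm -D --undefined-only "$lib" 2>/dev/null | grep -qw "$sym"; then
        echo "undefined: $sym"
        return 0
    fi
    echo "not referenced: $sym"
    return 1
}

# Example usage (placeholder paths, adjust to your own build):
#   prepend_lib_dir /path/to/build/lib
#   ldd /path/to/mex_expdist.mexa64   # shows which expdist.so and libcudart are resolved
#   needs_symbol /path/to/build/lib/expdist.so cudaSetupArgument
```

If needs_symbol reports the symbol as undefined for the library that ldd resolves, that copy probably predates the CUDA 10.1 build and is the one to rebuild or remove from the search path.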

hrheydarian commented 5 years ago

Hi @benvanwerkhoven ,

Thanks for checking that.

I followed the same procedure as you did and there is no problem with that. However, the problem appears when you also compile the code with cuda/10.1. In that case, the code again compiles without error, but when you run the script the error that I mentioned occurs.

Bests, Hamidreza

benvanwerkhoven commented 5 years ago

Hi @hrheydarian,

I noticed that when I ran CMake for the first time on hpc29 (with module cuda/10.1 loaded), CMake still found and used the CUDA 8 installation instead of the CUDA 10 installation. You can force CMake to use a specific version by specifying the path to the CUDA root dir:

cmake -D CUDA_TOOLKIT_ROOT_DIR=//usr/local/cuda-10.1 .

Make sure to also run make and make install.

If you build the code like that, and make sure that only the newly built shared libraries can be loaded using LD_LIBRARY_PATH, do you still run into the error? Because for me it runs on hpc29 if I build like this with CUDA 10.
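
To confirm which toolkit CMake picked up, one can look at the cache file it writes. This is a minimal sketch, assuming the build uses CMake's FindCUDA module (which the CUDA_TOOLKIT_ROOT_DIR option suggests) and an in-source build as with cmake . in this thread. cached_cuda_root is a made-up helper name, not something the repository provides.

```shell
# Print the CUDA toolkit root recorded in a CMake cache file.
# Cache entries have the form NAME:TYPE=VALUE.
cached_cuda_root() {
    cache="${1:-CMakeCache.txt}"
    [ -f "$cache" ] || { echo "no cache: $cache"; return 1; }
    sed -n 's/^CUDA_TOOLKIT_ROOT_DIR:[A-Z]*=//p' "$cache"
}

# Example usage, run from the directory where cmake was invoked:
#   cached_cuda_root            # prints the toolkit path CMake settled on
#   nvcc --version              # shows the compiler version the loaded module provides
```

If the printed path still points at the CUDA 8 installation after loading cuda/10.1, the library was compiled against the older toolkit, which would be consistent with the undefined symbol error at load time.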

Best, Ben

imphys / smlm_datafusion2d

cuda 10 #5