Bug in CCMpred (CUDA)

soedinglab / CCMpred

Protein Residue-Residue Contacts from Correlated Mutations predicted quickly and accurately.

http://www.ncbi.nlm.nih.gov/pubmed/25064567

GNU Affero General Public License v3.0

107 stars 25 forks

Bug in CCMpred (CUDA) #6

Open fsimkovic opened 7 years ago

fsimkovic commented 7 years ago

Running CCMpred on a sequence alignment with a CUDA-compiled build sometimes crashes. The error given:

adenine: felix > ccmpred alignments/1bdo.jones 1bdo.mat
Found 1 CUDA devices, using device #0: Quadro K4000
Total GPU RAM:      3,217,752,064
Free GPU RAM:       2,617,708,544
Needed GPU RAM:       792,606,940 ✓
CUDA error No. 0 in /opt/CCMpred/src/evaluate_cuda_kernels.cu at line 819

Running the same command with the flag -t 2 works fine.

sseemayer commented 7 years ago

Hi Felix, I don't have access to a suitable GPU/computer combination to debug this at the moment, so I'm afraid I will not be able to help 😞

fsimkovic commented 7 years ago

No worries, the CPU version works fine so there's no rush. Just thought I'd report it ...

tianmingzhou commented 7 years ago

I encountered a similar error. The reason seems to be that I fed CCMpred too many sequences (~70k). (The error code I got was 6.)

Besides, the macro CHECK_ERR(err) defined in include/evaluate_cuda_kernels.h and lib/libconjugrad/include/conjugrad_kernels.h (and maybe other files) may call cudaGetLastError() multiple times after expansion, as happens with the calls in src/evaluate_cuda_kernels.cu. The problem is that, according to http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__ERROR.html#group__CUDART__ERROR_1g3529f94cb530a83a76613616782bd233, the error code has already been reset to cudaSuccess by the time it is printed. So we always get "CUDA error No. 0". Something like https://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/ may be a solution.
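One way to avoid the repeated evaluation, roughly in the spirit of the linked article, is to let the macro forward its argument to a small function. The following is only a sketch under stated assumptions: plain int codes with 0 as success stand in for cudaError_t and cudaSuccess so that it compiles without the CUDA headers, check_err_impl is a hypothetical helper name, and the helper returns the code rather than exiting so the behaviour can be inspected.

```c
#include <stdio.h>

/* Assumption: error codes are plain ints with 0 meaning success, standing in
   for cudaError_t / cudaSuccess so the sketch compiles without CUDA headers. */
static int check_err_impl(int err, const char *file, int line)
{
    /* `err` is an ordinary function parameter, so the caller's expression
       has already been evaluated exactly once when we get here. */
    if (err != 0) {
        fprintf(stderr, "CUDA error No. %d in %s at line %d\n", err, file, line);
    }
    return err; /* the original macro calls exit(EXIT_FAILURE) at this point */
}

/* The macro only forwards its argument and the call-site location. */
#define CHECK_ERR(err) check_err_impl((err), __FILE__, __LINE__)
```

With the real CUDA headers the parameter type would presumably be cudaError_t, and the message could also include the text returned by cudaGetErrorString().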

kWeissenow commented 4 years ago

This issue is still present: the actual error code is hidden and the message always shows No. 0. The cause is the error checking via CHECK_ERR(cudaGetLastError());, where CHECK_ERR is not a function but a preprocessor macro, defined in evaluate_cuda_kernels.h, line 9, as #define CHECK_ERR(err) {if (cudaSuccess != (err)) { printf("CUDA error No. %d in %s at line %d\n", (err), __FILE__, __LINE__); exit(EXIT_FAILURE); } }. Because err appears twice in the macro body, the statement expands to two calls to cudaGetLastError(): the first, in the comparison, consumes the actual error code, and the second, in the printf, returns cudaSuccess.

I suggest changing the macro to #define CHECK_ERR(err) { int e = (err); if (cudaSuccess != e) { printf("CUDA error No. %d in %s at line %d\n", e, __FILE__, __LINE__); exit(EXIT_FAILURE); } }
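To make the difference between the two macro shapes concrete, here is a small sketch that can be compiled without a GPU. Everything in it is an assumption for illustration: mock_get_last_error is a hypothetical stand-in that mimics the return-and-reset behaviour of cudaGetLastError(), and the two REPORT_* macros record the code they would print instead of calling printf and exit.

```c
/* Hypothetical stand-in for cudaGetLastError(): returns the pending
   error code and resets it to 0 (success). */
static int pending_error = 0;

static int mock_get_last_error(void)
{
    int e = pending_error;
    pending_error = 0;
    return e;
}

/* Shape of the current macro: `err` appears twice, so a call expression
   passed as the argument is evaluated twice. */
#define REPORT_BUGGY(err, out) \
    do { if (0 != (err)) { *(out) = (err); } } while (0)

/* Shape of the suggested macro: evaluate once into a local variable. */
#define REPORT_FIXED(err, out) \
    do { int e_ = (err); if (0 != e_) { *(out) = e_; } } while (0)

/* Each helper returns the code that would be reported, or -1 if the
   macro reports nothing. */
int reported_by_buggy(int code)
{
    pending_error = code;
    int reported = -1;
    REPORT_BUGGY(mock_get_last_error(), &reported);
    return reported;
}

int reported_by_fixed(int code)
{
    pending_error = code;
    int reported = -1;
    REPORT_FIXED(mock_get_last_error(), &reported);
    return reported;
}
```

Under these assumptions, reported_by_buggy(6) yields 0 while reported_by_fixed(6) yields 6, which matches the "CUDA error No. 0" symptom described above.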