lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

cuda-memcheck errors #222

Closed alexstrel closed 9 years ago

alexstrel commented 9 years ago

The cuda-memcheck utility reports CUDA_ERROR_INVALID_VALUE even though the application itself executes successfully; this is probably a QMP-related issue. Here is an example from a single-GPU run under cuda-memcheck (the code was built with QMP):

========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so.1 (cuPointerGetAttribute + 0x174) [0x13d374]
========= Host Frame:./tests/invert_test_orig [0xb2b9d0]
========= Host Frame:./tests/invert_test_orig [0xca660d]
========= Host Frame:./tests/invert_test_orig [0xca6051]
========= Host Frame:./tests/invert_test_orig (mca_coll_self_allreduce_intra + 0x6f) [0xb601bf]
========= Host Frame:./tests/invert_test_orig [0xac009c]
========= Host Frame:./tests/invert_test_orig [0xaa54ab]
========= Host Frame:./tests/invert_test_orig [0x30e1be]
========= Host Frame:./tests/invert_test_orig [0x2ca6a2]
========= Host Frame:./tests/invert_test_orig [0x2d282e]
========= Host Frame:./tests/invert_test_orig [0x1e6123]
========= Host Frame:./tests/invert_test_orig [0x78062e]
========= Host Frame:./tests/invert_test_orig [0x7a401]
========= Host Frame:./tests/invert_test_orig [0x32a3f]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ed1d]
========= Host Frame:./tests/invert_test_orig [0x315f1]

mathiaswagner commented 9 years ago

Some questions:

alexstrel commented 9 years ago

Yes, I used the current master (quda-0.7 release) and the invert_test application. I'll check other options, i.e., pure MPI and pure single-GPU builds. And yes, there are no errors without cuda-memcheck.

mathiaswagner commented 9 years ago

A single GPU build completes cuda-memcheck ./invert_test without errors for me (using CUDA 7.0).

I have not yet tried MPI or QMP. It might also help to enable HOST_DEBUG when compiling to track down the location of the error.

nmrcardoso commented 9 years ago

Did you run your program with MPI? Was QMP built with MPI? Is the CUDA-aware MPI environment flag active?

I also got that kind of error, but only when using MPI and setting MV2_USE_CUDA in MVAPICH2. Likewise, running a non-CUDA-aware MPI program with MV2_USE_CUDA active produces many of these errors in cuda-memcheck, and of course with MV2_USE_CUDA=0 there is no CUDA error.

nmrcardoso commented 9 years ago

I made a simple test using only cuPointerGetAttribute, passing a device pointer and a host pointer, with no MPI involved (a minimal sketch of such a test is shown below). If cuPointerGetAttribute is called on a pointer that is not a device pointer, cuda-memcheck always reports an error. This is very annoying if we want to run cuda-memcheck and somewhere in the code there is a call to this function.
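To illustrate, here is a minimal sketch along the lines of the test described above (not the original code; the file name and layout are assumptions). Querying a plain malloc'd host pointer with cuPointerGetAttribute returns CUDA_ERROR_INVALID_VALUE, which cuda-memcheck reports even though the application handles it:

```c
// check_ptr.cu -- build with: nvcc check_ptr.cu -o check_ptr -lcuda
// Illustrative reproduction only, not the test actually used in this thread.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

int main(void) {
  void *d_ptr = NULL;
  void *h_ptr = malloc(64);   // ordinary host allocation, unknown to CUDA
  cudaMalloc(&d_ptr, 64);     // device allocation (also initializes the context)

  CUmemorytype type;
  CUresult err;

  // Device pointer: succeeds and reports CU_MEMORYTYPE_DEVICE.
  err = cuPointerGetAttribute(&type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                              (CUdeviceptr)d_ptr);
  printf("device pointer: err=%d type=%d\n", (int)err, (int)type);

  // Plain host pointer: returns CUDA_ERROR_INVALID_VALUE (error 1).
  // The application can treat this as "not a device pointer", but
  // cuda-memcheck still logs the failed driver API call.
  err = cuPointerGetAttribute(&type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                              (CUdeviceptr)h_ptr);
  printf("host pointer:   err=%d\n", (int)err);

  cudaFree(d_ptr);
  free(h_ptr);
  return 0;
}
```

Running this binary under cuda-memcheck prints the same "Program hit CUDA_ERROR_INVALID_VALUE ... on CUDA API call to cuPointerGetAttribute" message as in the report above, even though the program itself exits normally.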

mathiaswagner commented 9 years ago

I have not checked in detail, but does

http://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__UNIFIED.html#group__CUDA__UNIFIED_1g0c28ed0aff848042bc0533110e45820c

perhaps explain the issue?


nmrcardoso commented 9 years ago

cuda-memcheck errors from cuPointerGetAttribute are benign. cuPointerGetAttribute is used to test whether a pointer belongs to CUDA unified memory or to a CUDA managed-memory object; if the pointer passed in is not a CUDA pointer, cuda-memcheck flags the call as an error. I don't think there is a way to tell cuda-memcheck to ignore this kind of error, so just ignore it.
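For reference, the kind of check described here can be wrapped so the failing return code is interpreted rather than propagated. This is a hedged sketch of such a helper (illustrative only, not QUDA's actual code); cuda-memcheck will still log the failed driver call, but the caller simply treats CUDA_ERROR_INVALID_VALUE as "plain host memory":

```c
#include <stdint.h>
#include <cuda.h>

// Classify a pointer via the driver API. A return of CUDA_ERROR_INVALID_VALUE
// means the pointer is not known to CUDA, i.e. ordinary host memory; that is
// exactly the "benign" error cuda-memcheck complains about.
static int is_device_pointer(const void *ptr) {
  CUmemorytype type;
  CUresult err = cuPointerGetAttribute(&type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                       (CUdeviceptr)(uintptr_t)ptr);
  if (err == CUDA_ERROR_INVALID_VALUE) return 0;  // not a CUDA pointer
  if (err != CUDA_SUCCESS) return 0;              // conservative fallback
  return type == CU_MEMORYTYPE_DEVICE || type == CU_MEMORYTYPE_UNIFIED;
}
```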