alexstrel closed this issue 9 years ago
Some questions:
Yes, I used the current master (quda-0.7 release) and the invert_test application. I'll check other options, i.e., pure MPI and pure single-GPU builds. Yes, there are no errors without cuda-memcheck.
A single GPU build completes
cuda-memcheck ./invert_test
without errors for me (using CUDA 7.0).
I have not yet tried MPI or QMP.
It might also help to enable HOST_DEBUG during compilation to track down the location of the error.
Did you run your program with MPI? Was QMP built with MPI? Is the CUDA-aware MPI environment flag active?
I also got those kinds of errors, but only when using MPI and only if I set MV2_USE_CUDA in MVAPICH2. Likewise, running a non-CUDA-aware MPI program with MV2_USE_CUDA active produces many of these errors under cuda-memcheck, and of course with MV2_USE_CUDA=0 there are no CUDA errors.
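To illustrate the toggle described above (the launcher flags and binary path are assumptions on my part, not taken from the thread):

```shell
# Sketch only: binary path and mpirun flags are assumptions.
# With MVAPICH2's CUDA-aware path enabled, cuda-memcheck reports the
# cuPointerGetAttribute "invalid argument" errors:
MV2_USE_CUDA=1 mpirun -np 1 cuda-memcheck ./invert_test
# With it disabled, no CUDA errors are reported:
MV2_USE_CUDA=0 mpirun -np 1 cuda-memcheck ./invert_test
```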
I made a simple test using only cuPointerGetAttribute, passing it a device pointer and a host pointer; no MPI here. If cuPointerGetAttribute is called and the pointer is not a device pointer, then cuda-memcheck always reports errors. This is very annoying if we want to run cuda-memcheck and somewhere in the code there is a call to this function.
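A minimal sketch of such a test might look like the following (this is my reconstruction, not the exact code used above; it needs a CUDA-capable machine to run):

```c
/* Sketch: query a real device pointer and a plain host pointer with
 * cuPointerGetAttribute. The host-pointer query returns
 * CUDA_ERROR_INVALID_VALUE, which cuda-memcheck flags as an API error
 * even though the program handles the return code. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <cuda.h>

int main(void) {
  CUcontext ctx;
  CUdevice dev;
  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  CUdeviceptr dptr;
  cuMemAlloc(&dptr, 256);        /* a genuine device pointer */
  void *hptr = malloc(256);      /* a plain host pointer     */

  unsigned int memtype = 0;
  CUresult r1 = cuPointerGetAttribute(&memtype,
                  CU_POINTER_ATTRIBUTE_MEMORY_TYPE, dptr);
  CUresult r2 = cuPointerGetAttribute(&memtype,
                  CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                  (CUdeviceptr)(uintptr_t)hptr);

  /* Expectation: r1 == CUDA_SUCCESS, r2 == CUDA_ERROR_INVALID_VALUE
   * (error 1) -- the call cuda-memcheck reports even though the error
   * is checked and handled here. */
  printf("device ptr: %d, host ptr: %d\n", (int)r1, (int)r2);

  cuMemFree(dptr);
  free(hptr);
  cuCtxDestroy(ctx);
  return 0;
}
```

Running this under cuda-memcheck reproduces the "Program hit CUDA_ERROR_INVALID_VALUE" report for the second call, with no MPI involved.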
I have not checked in detail, but maybe that explains the issue?
On Apr 22, 2015, at 14:20, nmrcardoso wrote:

> I made a simple test only using cuPointerGetAttribute and passing a device and host pointers, no MPI here. if cuPointerGetAttribute is called and if if the pointer is not a device pointer then cuda-memcheck always returns errors. This is very annoying if we want to run cuda-memcheck and somewhere in the code there is a call to this function.
Mathias Wagner, Department of Physics, SW 117, Indiana University, Bloomington, IN 47405
The cuda-memcheck errors from cuPointerGetAttribute are benign. cuPointerGetAttribute is used to test whether a pointer belongs to a CUDA unified-memory or managed-memory object; if the pointer passed in is not a CUDA pointer, cuda-memcheck flags the call as an error. I don't think there is a way to tell cuda-memcheck to ignore this kind of error, so just ignore it.
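The benign pattern described above can be sketched as a helper like this (a hypothetical `is_device_pointer`, not QUDA's actual code): the CUDA_ERROR_INVALID_VALUE return simply means "the driver does not know this pointer", and is treated as "host pointer" rather than as a failure.

```c
/* Sketch of the benign pattern: treat CUDA_ERROR_INVALID_VALUE from
 * cuPointerGetAttribute as "ordinary host pointer", not a real error.
 * cuda-memcheck still flags the API call, but it is harmless. */
#include <stdbool.h>
#include <stdint.h>
#include <cuda.h>

static bool is_device_pointer(const void *ptr) {
  unsigned int memtype = 0;
  CUresult err = cuPointerGetAttribute(&memtype,
                   CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                   (CUdeviceptr)(uintptr_t)ptr);
  if (err == CUDA_ERROR_INVALID_VALUE)
    return false;  /* unknown to the CUDA driver: a plain host pointer */
  return err == CUDA_SUCCESS && memtype == CU_MEMORYTYPE_DEVICE;
}
```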
The cuda-memcheck utility reports CUDA_ERROR_INVALID_VALUE while the application itself executes successfully; this is probably a QMP-related issue. This is an example of a single-GPU execution under cuda-memcheck (the code was built with QMP):

========= Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so.1 (cuPointerGetAttribute + 0x174) [0x13d374]
========= Host Frame:./tests/invert_test_orig [0xb2b9d0]
========= Host Frame:./tests/invert_test_orig [0xca660d]
========= Host Frame:./tests/invert_test_orig [0xca6051]
========= Host Frame:./tests/invert_test_orig (mca_coll_self_allreduce_intra + 0x6f) [0xb601bf]
========= Host Frame:./tests/invert_test_orig [0xac009c]
========= Host Frame:./tests/invert_test_orig [0xaa54ab]
========= Host Frame:./tests/invert_test_orig [0x30e1be]
========= Host Frame:./tests/invert_test_orig [0x2ca6a2]
========= Host Frame:./tests/invert_test_orig [0x2d282e]
========= Host Frame:./tests/invert_test_orig [0x1e6123]
========= Host Frame:./tests/invert_test_orig [0x78062e]
========= Host Frame:./tests/invert_test_orig [0x7a401]
========= Host Frame:./tests/invert_test_orig [0x32a3f]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ed1d]
========= Host Frame:./tests/invert_test_orig [0x315f1]