lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Multi-GPU tests with nonzero "partition" fail when run on a single GPU #93

Closed rbabich closed 11 years ago

rbabich commented 11 years ago

I'm guessing that this was introduced by Mike's comms merge. With an MPI build, running "dslash_test --partition 8" gives:

ERROR: (CUDA) invalid resource handle (rank 0, host blast, dslash_quda.cu:1434 in wilsonDslashCuda())

maddyscientist commented 11 years ago

OK, I have reproduced this now. Ron, I think you introduced the bug when you altered the makefile to always enable texture objects on Kepler. You added a "USE_TEXTURE_OBJECTS" macro, but didn't update the library to use it uniformly.

rbabich commented 11 years ago

Thanks for fixing that, but I'm still getting this error with the latest master (MPI build).

maddyscientist commented 11 years ago

Ron, what type of system are you testing on?

rbabich commented 11 years ago

This is on blast (GTX 480, Harpertown Xeon). I guess you're not seeing it?

maddyscientist commented 11 years ago

Nope, I've never reproduced this problem. I did hit a problem with the macro definition, but I didn't get this partition error. Can you paste your configure parameters?

rbabich commented 11 years ago

I just noticed that I have Dslash profiling turned on. Maybe that's it.

./configure \
    --enable-cpu-arch=x86_64 \
    --enable-gpu-arch=sm_20 \
    --disable-twisted-mass-dirac \
    --disable-domain-wall-dirac \
    --with-cuda=/usr/local/cuda \
    --enable-dslash-profiling \
    --enable-multi-gpu \
    --with-mpi=/usr/lib64/openmpi/1.4-gcc

maddyscientist commented 11 years ago

Can you try with profiling disabled, then? I was actually considering removing the profiling, since nvvp (the NVIDIA Visual Profiler) does such a good job now and the profiling code gets in the way of the rewrite required to use GPU Direct RDMA / P2P.

rbabich commented 11 years ago

That fixed it. Now to figure out why...
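
For reference, a sketch of the configure invocation from earlier in this thread with the profiling option dropped. It assumes that simply omitting "--enable-dslash-profiling" leaves Dslash profiling off; the thread does not say exactly how the rebuild was configured, and the paths are the ones from the original report.

```shell
# Same options as the configure command posted above, minus --enable-dslash-profiling.
# Assumption: Dslash profiling stays disabled when the flag is not passed.
./configure \
    --enable-cpu-arch=x86_64 \
    --enable-gpu-arch=sm_20 \
    --disable-twisted-mass-dirac \
    --disable-domain-wall-dirac \
    --with-cuda=/usr/local/cuda \
    --enable-multi-gpu \
    --with-mpi=/usr/lib64/openmpi/1.4-gcc
```

After rebuilding, rerunning "dslash_test --partition 8" is the check used in this thread.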

maddyscientist commented 11 years ago

We could just delete the profiling and call it a day.

rbabich commented 11 years ago

No objections here.

maddyscientist commented 11 years ago

I'm closing this issue with commit 074a130119fc2af953a9b1e0f5936d6dc5b9f851. The underlying cause is still there, but this is addressed with #95.