lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

problem with staggered_dslash_test #200

Closed stevengottlieb closed 9 years ago

stevengottlieb commented 9 years ago

Rich has requested a weak scaling study, so I am trying staggered_dslash_test. I am having trouble running multiple GPU jobs. I compiled QUDA for multi-GPU. Here is how I launch a job on a Cray:

aprun -n 4 -N 1 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test --prec double --xdim $nx --ydim $ny --zdim $nz --tdim $nt --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2 >> out_4procs.$ns

Here is what shows up in the output:

Tue Dec 16 09:11:07 EST 2014
-rwxr-xr-x 1 sg phys 192887697 Dec 4 02:25 /N/u/sg/BigRed2/quda-0.7.0/tests/staggered_dslash_test
running the following test:
prec    recon   test_type   dagger   S_dim      T_dimension
double  18      0           0        24/24/48   48
Grid partition info:  X  Y  Z  T
                      0  0  1  1
Found device 0: Tesla K20
Using device 0: Tesla K20
Setting NUMA affinity for device 0 to CPU core 0
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...

mathiaswagner commented 9 years ago

Strange. It seems to work for me when using:

mathwagn@aprun7:~/quda/tests> aprun -n4 -N1 ./staggered_dslash_test --prec double --xdim 24 --ydim 24 --zdim 48 --tdim 48 --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2
AlexVaq commented 9 years ago

You should set QUDA_RESOURCE_PATH to any path you wish (and to which you have write permission); otherwise QUDA will retune all the kernels every time you launch the job, and this can be extremely slow, so your results for the scaling test will be completely spoilt. Actually, if you don't see anything for a while, it can mean that QUDA is tuning some kernels, but I would expect to see this a bit later, not right after "Randomizing fields…".
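For example, a minimal sketch of a launch that keeps the tuning cache might look like the following (the cache directory name is just an illustration; any writable path will do):

```sh
# Point QUDA at a writable directory so tuned kernel parameters are cached
# on the first run and reused afterwards, instead of being re-tuned every job.
export QUDA_RESOURCE_PATH=$HOME/quda_tunecache   # illustrative path
mkdir -p "$QUDA_RESOURCE_PATH"

aprun -n 4 -N 1 ./staggered_dslash_test \
    --prec double \
    --xdim 24 --ydim 24 --zdim 48 --tdim 48 \
    --xgridsize 1 --ygridsize 1 --zgridsize 2 --tgridsize 2
```

It is also worth doing one throwaway run per problem size first, so the timings you actually record are not contaminated by tuning overhead.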


maddyscientist commented 9 years ago

So Steve, are you finding that it hangs, or is this the last thing it prints before it exits?

stevengottlieb commented 9 years ago

Hi Mike,

I have been traveling back from Mountain View. The job was not hanging, it just did not seem to use more than one GPU. Some of the larger volumes did fail when memory could not be allocated.

Here is a snippet from the compilation. If it looks like some flag is missing or if there are incompatible ones, please let me know.

CC -Wall -O3 -D__COMPUTE_CAPABILITY__=350 -DMULTI_GPU -DGPU_STAGGERED_DIRAC -DGPU_FATLINK -DGPU_UNITARIZE -DGPU_GAUGE_TOOLS -DGPU_GAUGE_FORCE -DGPU_GAUGE_TOOLS -DGPU_HISQ_FORCE -DGPU_STAGGERED_OPROD -DGPU_GAUGE_TOOLS -DGPU_DIRECT -DBUILD_QDP_INTERFACE -DBUILD_MILC_INTERFACE -DNUMA_AFFINITY -I/opt/nvidia/cudatoolkit/default/include -DMPI_COMMS -I/opt/cray/mpt/7.0.4/gni/mpich2-cray/83/include -I../include -Idslash_core -I. gauge_field.cpp -c -o gauge_field.o

Thanks, Steve


maddyscientist commented 9 years ago

Hi Steve,

I didn't see any reported dslash GFLOPS, which is why I asked if it was hanging. If you're telling me it is running to completion, then I suspect all is running fine. One thing to note is that the test only prints out the performance for one GPU and not the aggregate performance. Could this be why you think it is running on one GPU only? I guess we could update the tests to print the aggregate performance as well, and also the number of GPUs it is running on.


stevengottlieb commented 9 years ago

Hi Mike,

Thanks for getting back to me. I was only showing a snippet of the output. I don't have a lot of experience running this test. What makes me suspicious is this part of the output:

running the following test:
prec    recon   test_type   dagger   S_dim      T_dimension
double  18      0           0        24/24/48   48
Grid partition info:  X  Y  Z  T
                      0  0  1  1

I expected grid partition info to have 2 under Z and T, but now that I look and see that X and Y are zero, I was probably misinterpreting the meaning. Is the only possible value 0 or 1 depending on whether that dimension is cut?

The other issue that is causing me to worry is that several of the jobs are failing because they are running out of memory. On one node, I can run 40^4, with this information at the end of the run:

Device memory used = 3155.7 MB
Page-locked host memory used = 1591.2 MB
Total host memory used >= 1986.8 MB

If I try to run 40 x 80^3 on 8 nodes, the job runs out of memory:

running the following test:
prec    recon   test_type   dagger   S_dim      T_dimension
double  18      0           0        40/80/80   80
Grid partition info:  X  Y  Z  T
                      0  1  1  1
Found device 0: Tesla K20
Using device 0: Tesla K20
Setting NUMA affinity for device 0 to CPU core 0
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...
Fat links sending...ERROR: Aborting (rank 6, host nid00193, malloc.cpp:156 in devicemalloc())
last kernel called was (name=,volume=,aux=)
ERROR: Failed to allocate device memory (cuda_gauge_field.cu:42 in cudaGaugeField())

In fact, on 8 nodes, only 24 x 48^3 runs; the larger volumes fail. Here is the report about memory usage for this run:

Device memory used = 3488.3 MB
Page-locked host memory used = 1974.7 MB
Total host memory used >= 2384.7 MB

When I look at the 24^4 run on a single GPU, the memory report is:

Device memory used = 437.6 MB
Page-locked host memory used = 231.3 MB
Total host memory used >= 306.6 MB

This makes me wonder whether the job is running on only 1 GPU, or whether I don't understand the input parameters, i.e., are --xdim, --ydim, etc. the per-GPU grid size or the total grid size?

I can make all the scripts and output available if we can find a convenient space. That might result in faster time to solution. I can put them on my webserver, dropbox (or IU's version of it) or Blue Waters. Any preference?

Thanks, Steve


maddyscientist commented 9 years ago

Is the only possible value 0 or 1 depending on whether that dimension is cut?

Yes.

With respect to memory, when running 40^4 on a single GPU, are you partitioning the dimensions (loopback communication)? You need to do this to have the same effective per-GPU memory usage as the multi-GPU runs. The force routines in particular use so-called extended fields, where we allocate a local field size that includes the halo regions. I suspect you are running afoul of these issues.

Having said that, I believe there is optimization that can be done with respect to memory usage, particularly in multi-GPU mode: I believe some extended gauge fields allocate a non-zero halo region even in dimensions that are not partitioned, which obviously leaves scope for improvement.

We can open additional bugs where necessary to reduce memory consumption (post 0.7.0) if this is a priority for you.

mathiaswagner commented 9 years ago

I never happened to use loopback communication. Could you give me some instructions on how to use it with QUDA?

maddyscientist commented 9 years ago

With the unit tests it's easy: you just use the --partition flag to force communication on. This is a 4-bit integer where each bit signifies communication in a different dimension, with X the least significant bit and T the most significant bit. E.g., --partition 11 (binary 1011) has X, Y, and T partitioning switched on.
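For concreteness, here is a sketch of how this could be applied to the single-node 40^4 case discussed above to force loopback communication (the lattice sizes are just taken from the runs in this thread; the bit values follow from the encoding described here):

```sh
# --partition is a 4-bit mask: bit 0 = X, bit 1 = Y, bit 2 = Z, bit 3 = T.
#   --partition 11  ->  binary 1011  ->  X, Y, T partitioned, Z not
#   --partition 15  ->  binary 1111  ->  all four dimensions partitioned
#
# Single-GPU 40^4 run with every dimension partitioned (loopback
# communication), so halos and extended fields are allocated just as
# they would be in a real multi-GPU run:
aprun -n 1 -N 1 ./staggered_dslash_test \
    --prec double \
    --xdim 40 --ydim 40 --zdim 40 --tdim 40 \
    --partition 15
```

By the same encoding, --partition 12 (binary 1100) should correspond to partitioning only Z and T, matching the 2x2 process grid in the original 4-process launch.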

mathiaswagner commented 9 years ago

Will try that. Thanks.

maddyscientist commented 9 years ago

Closing.