lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Improved staggered dslash is broken #230

Closed maddyscientist closed 9 years ago

maddyscientist commented 9 years ago

For certain volumes and grid partitionings, the asqtad/HISQ dslash fails to work correctly. This bug is present in the 0.7.0 release and in the current develop branch. To reproduce the bug, a multi-GPU build is required, though only a single GPU is needed.

./staggered_dslash_test --sdim 6 --tdim 6 --partition 3 --dslash_type asqtad --prec single

So far I have only tested this on a K40/K80. Observations so far:

Once this bug is fixed we should issue a 0.7.1 release, as this is a critical bug that must be fixed.

mathiaswagner commented 9 years ago

I am trying to reproduce this but have had no luck so far. My build uses QMP and I get:

Results: CPU = 1404591.946479, CUDA=1404591.938202, CPU-CUDA = 1404591.937778
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
1.000000e-01 Failures: 0 / 3888  = 0.000000e+00
1.000000e-02 Failures: 0 / 3888  = 0.000000e+00
1.000000e-03 Failures: 0 / 3888  = 0.000000e+00
1.000000e-04 Failures: 0 / 3888  = 0.000000e+00
1.000000e-05 Failures: 0 / 3888  = 0.000000e+00
1.000000e-06 Failures: 1555 / 3888  = 3.999486e-01
1.000000e-07 Failures: 3604 / 3888  = 9.269547e-01
1.000000e-08 Failures: 3855 / 3888  = 9.915123e-01
1.000000e-09 Failures: 3886 / 3888  = 9.994856e-01
1.000000e-10 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-11 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-12 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-13 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-14 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-15 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-16 Failures: 3888 / 3888  = 1.000000e+00
[       OK ] dslash.verify (8 ms)
[----------] 1 test from dslash (8 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (8 ms total)
[  PASSED  ] 1 test.

I ran the test on an XK node (K20) in Bloomington. I will try on our K40 and also the Titan X later; I don't have a multi-GPU build ready there yet.

mathiaswagner commented 9 years ago

Using CUDA 7.0 on a K40 with an MPI build I get:

[mwagner@cream tests]$ ./staggered_dslash_test --sdim 6 --tdim 6 --partition 3 --dslash_type asqtad --prec single
running the following test:
prec recon   test_type     dagger   S_dim         T_dimension
single   18       0           0       6/6/6        6 
Grid partition info:     X  Y  Z  T
                         1  1  0  0
QUDA 0.7.0 (git v0.7.0-4-gb675841-dirty)
Found device 0: Tesla K40c
Using device 0: Tesla K40c
WARNING: Failed to determine NUMA affinity for device 0 (possibly not applicable)
Loaded 5 sets of cached parameters from ./mpi//tunecache.tsv
Randomizing fields ...
Fat links sending...Fat links sent
Long links sending...Long links sent...
Sending fields to GPU...Creating cudaSpinor
Creating cudaSpinorOut
Sending spinor field to GPU
Source CPU = 1268.360232, CUDA=1268.360233
Creating a DiracImprovedStaggeredPC operator
Tuning...
Executing 100 kernel loops...
10.547232ms per loop
Calculating reference implementation...done.
GFLOPS = 7.111194
GB/s = 7.077649

Results: CPU = 1404591.946479, CUDA=1404591.938202, CPU-CUDA = 1404591.937778
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
1.000000e-01 Failures: 0 / 3888  = 0.000000e+00
1.000000e-02 Failures: 0 / 3888  = 0.000000e+00
1.000000e-03 Failures: 0 / 3888  = 0.000000e+00
1.000000e-04 Failures: 0 / 3888  = 0.000000e+00
1.000000e-05 Failures: 0 / 3888  = 0.000000e+00
1.000000e-06 Failures: 1555 / 3888  = 3.999486e-01
1.000000e-07 Failures: 3604 / 3888  = 9.269547e-01
1.000000e-08 Failures: 3855 / 3888  = 9.915123e-01
1.000000e-09 Failures: 3886 / 3888  = 9.994856e-01
1.000000e-10 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-11 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-12 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-13 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-14 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-15 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-16 Failures: 3888 / 3888  = 1.000000e+00
[       OK ] dslash.verify (5 ms)
[----------] 1 test from dslash (5 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (6 ms total)
[  PASSED  ] 1 test.

               initQuda Total time = 1.57358 secs

          loadGaugeQuda Total time = 0.003076 secs
              download     = 0.002340 secs (  76.1%), with        2 calls at 1.170000e+03 us per call
                  init     = 0.000713 secs (  23.2%), with        2 calls at 3.565000e+02 us per call
               compute     = 0.000000 secs (     0%), with        2 calls at 0.000000e+00 us per call
                  free     = 0.000011 secs ( 0.358%), with        2 calls at 5.500000e+00 us per call
     total accounted       = 0.003064 secs (  99.6%)
     total missing         = 0.000012 secs (  0.39%)

                endQuda Total time = 0.36417 secs

Device memory used = 19.2 MB
Page-locked host memory used = 19.1 MB
Total host memory used >= 20.0 MB

accuracy_level =0
maddyscientist commented 9 years ago

Ok, that's a good data point. I tested using MVAPICH2, gcc 4.8, and CUDA 6.5. This may be a real pain to debug. :(

mathiaswagner commented 9 years ago

Can you share your make.inc? This really looks like a pain. Of course my setup differed in gcc, CUDA, and MPI; it would have been boring otherwise ...

azrael417 commented 9 years ago

I have a CUDA 6.5 machine available, but I don't know its MPI version offhand. I can run the test as well as soon as I am in my office.

azrael417 commented 9 years ago

Hi all,

the staggered test works for me as well:

tkurth@webhipgisaxs:~/src/USQCD/enrico_src/quda-code-12132013/install/quda/tests$ ./staggered_dslash_test --sdim 6 --tdim 6 --partition 3 --dslash_type asqtad --prec single
running the following test:
prec recon   test_type     dagger   S_dim         T_dimension
single   18       0           0       6/6/6        6
Grid partition info:     X  Y  Z  T
                         1  1  0  0
Found device 0: Tesla M2090
Found device 1: Tesla M2090
Using device 0: Tesla M2090
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
Randomizing fields ...
Fat links sending...Fat links sent
Long links sending...Long links sent...
Sending fields to GPU...Creating cudaSpinor
Creating cudaSpinorOut
Sending spinor field to GPU
Source CPU = 1268.360232, CUDA=1268.360233
Creating a DiracImprovedStaggeredPC operator
Tuning...
Tuned block=(128,1,1), grid=(11,1,1), shared=24577 giving 0.00 Gflop/s, 7.33 GB/s for N4quda17PackFaceStaggeredI6float2fEE with vol=648,stride=864,precision=4,comm=1100
Tuned block=(64,1,1), shared=12289 giving 20.87 Gflop/s, 0.00 GB/s for N4quda19StaggeredDslashCudaI6float2S1_6float4fEE with type=interior,comm=1100,ghost=1100,reconstruct=18
Tuned block=(96,1,1), shared=7022 giving 49.27 Gflop/s, 0.00 GB/s for N4quda19StaggeredDslashCudaI6float2S1_6float4fEE with type=exterior_y,comm=1100,reconstruct=18
Tuned block=(96,1,1), shared=7022 giving 48.55 Gflop/s, 0.00 GB/s for N4quda19StaggeredDslashCudaI6float2S1_6float4fEE with type=exterior_x,comm=1100,reconstruct=18
Executing 100 kernel loops...
11.909440ms per loop
Calculating reference implementation...done.
GFLOPS = 6.297811
GB/s = 6.268103

Tuned block=(64,1,1), grid=(26,1,1), shared=512 giving 0.57 Gflop/s, 1.15 GB/s for N4quda5Norm2Id6float2S1_EE with vol=648,stride=864,precision=4
Results: CPU = 1404591.946479, CUDA=1404591.938202, CPU-CUDA = 1404591.937778
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
1.000000e-01 Failures: 0 / 3888  = 0.000000e+00
1.000000e-02 Failures: 0 / 3888  = 0.000000e+00
1.000000e-03 Failures: 0 / 3888  = 0.000000e+00
1.000000e-04 Failures: 0 / 3888  = 0.000000e+00
1.000000e-05 Failures: 0 / 3888  = 0.000000e+00
1.000000e-06 Failures: 1555 / 3888  = 3.999486e-01
1.000000e-07 Failures: 3604 / 3888  = 9.269547e-01
1.000000e-08 Failures: 3855 / 3888  = 9.915123e-01
1.000000e-09 Failures: 3886 / 3888  = 9.994856e-01
1.000000e-10 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-11 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-12 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-13 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-14 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-15 Failures: 3888 / 3888  = 1.000000e+00
1.000000e-16 Failures: 3888 / 3888  = 1.000000e+00
[       OK ] dslash.verify (5 ms)
[----------] 1 test from dslash (5 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5 ms total)
[  PASSED  ] 1 test.

           initQuda Total time = 1.03479 secs

      loadGaugeQuda Total time = 0.002315 secs
          download     = 0.001579 secs (  68.2%), with        2 calls at 7.895000e+02 us per call
              init     = 0.000718 secs (    31%), with        2 calls at 3.590000e+02 us per call
           compute     = 0.000001 secs (0.0432%), with        2 calls at 5.000000e-01 us per call
              free     = 0.000009 secs ( 0.389%), with        2 calls at 4.500000e+00 us per call
 total accounted       = 0.002307 secs (  99.7%)
 total missing         = 0.000008 secs ( 0.346%)

            endQuda Total time = 0.384564 secs

Device memory used = 19.2 MB
Page-locked host memory used = 19.1 MB
Total host memory used >= 20.0 MB

accuracy_level =0

I used CUDA 6.0 and OpenMPI 1.6.1. The two cards I tested on are ancient M2090s, though.

maddyscientist commented 9 years ago

Thanks for the data points. I won't be able to get the make.inc until tomorrow now. I'll plan on doing more tests comparing different toolkits and MPI versions to see if I can help narrow this down.

maddyscientist commented 9 years ago

Ok, getting back to this now. The make.inc was generated from the following configure invocation:

./configure --enable-multi-gpu --disable-domain-wall-dirac --enable-staggered-dirac --disable-wilson-dirac --disable-twisted-mass-dirac --disable-clover-dirac --with-cuda=$CUDA_HOME --with-mpi=$MPI_HOME cc=mpicc CC=mpicxx --disable-milc-interface --disable-gpu-comms

[mclark@dt02 quda]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12

[mclark@dt02 quda]$ gcc --version
gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

[mclark@dt02 quda]$ mpichversion
MVAPICH2 Version:       2.0
MVAPICH2 Release date:  Fri Jun 20 20:00:00 EDT 2014
MVAPICH2 Device:        ch3:mrail
MVAPICH2 configure:     --prefix=/shared/devtechapps/mpi/mvapich2-2.0/gnu --enable-shared --enable-cuda --with-cuda=/shared/apps/cuda/CUDA-v6.5.14
MVAPICH2 CC:    /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:   /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++ -DNDEBUG -DNVALGRIND
MVAPICH2 F77:   gfortran -L/lib -L/lib -O2
MVAPICH2 FC:    /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran

maddyscientist commented 9 years ago

The bug is confirmed with OpenMPI as well, so the MPI version is not the issue.

maddyscientist commented 9 years ago

Ok, progress in diagnosing this strange bug:

tests/staggered_dslash_test .... --tune true

FAIL

tests/staggered_dslash_test .... --tune false

PASS

tests/staggered_dslash_test .... --tune true

PASS

If I change the partitioning or the volume, we get the same behaviour again. I would hazard a guess that the problem is related to uninitialized memory in the tuner. Continuing to investigate.

maddyscientist commented 9 years ago

Ok, I've found the problem. The number of faces per direction packed by the dslash face packer was not included in the label of the tuned parameters. As a result, the same parameters were used for unimproved staggered and improved staggered fermions, giving the wrong answer for improved staggered fermions if the unimproved operator had been tuned first.

I've fixed this bug in the branch hotfix/dslash_pack_nface by simply adding the nFace parameter to the tuning label.
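
To illustrate the mechanism, here is a minimal, self-contained C++ sketch of that kind of cache-key collision. The names (TuneParam, keyWithoutNFace, PackFaceStaggered as a bare string, etc.) are hypothetical illustrations, not QUDA's actual classes or tuning code; the real fix simply appends nFace to the tuning label used by the face packer.

// Sketch only: a tuning cache keyed without nFace lets the nFace=3 (asqtad/HISQ)
// pack kernel silently reuse parameters tuned for the nFace=1 (naive staggered)
// pack kernel. Adding nFace to the key keeps the entries distinct.
#include <iostream>
#include <map>
#include <sstream>
#include <string>

struct TuneParam { int block_x; };            // stand-in for tuned launch parameters

std::map<std::string, TuneParam> tunecache;   // stand-in for the persistent tunecache.tsv

// Before the fix: the key identifies the kernel only by name and volume.
std::string keyWithoutNFace(const std::string &kernel, int vol) {
  std::ostringstream k;
  k << kernel << ",vol=" << vol;
  return k.str();
}

// After the fix: nFace is part of the key, so naive (nFace=1) and improved
// (nFace=3) staggered pack kernels no longer share cached parameters.
std::string keyWithNFace(const std::string &kernel, int vol, int nFace) {
  std::ostringstream k;
  k << kernel << ",vol=" << vol << ",nFace=" << nFace;
  return k.str();
}

int main() {
  // Suppose a naive-staggered run tuned the packer first:
  tunecache[keyWithoutNFace("PackFaceStaggered", 648)] = {64};

  // Without nFace in the key, the improved-staggered packer hits the stale entry (prints 1);
  // with nFace in the key, it misses and is forced to retune (prints 0):
  std::cout << tunecache.count(keyWithoutNFace("PackFaceStaggered", 648)) << "\n";
  std::cout << tunecache.count(keyWithNFace("PackFaceStaggered", 648, 3)) << "\n";
  return 0;
}

This also matches the observed symptom that the failure disappears with --tune false, since the stale cached parameters are only consulted when tuning is enabled.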

maddyscientist commented 9 years ago

Fixed with pull #236 .