lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Multi-dimensional Wilson parallelization appears buggy #30

Closed maddyscientist closed 12 years ago

maddyscientist commented 12 years ago

Using multiple GPUs in any dimension other than T seems to give the wrong answer, and wilson_dslash_test fails, e.g.:

$mpirun -np 2 ./wilson_dslash_test --xdim 16 --ydim 16 --zdim 8 --tdim 64 --recon 12 --zgridsize 2 
running the following test:
prec recon   test_type     dagger   S_dim         T_dimension   dslash_type
single   12       1           0       16/16/8        64            wilson
Grid partition info:     X  Y  Z  T
                         0  0  1  0
Randomizing fields... done.
QUDA: Found device 0: GeForce GTX 480
QUDA: Found device 1: GeForce GTX 480
QUDA: Found device 2: GeForce GTX 480
QUDA: Found device 3: GeForce GTX 480
QUDA: Using device 0: GeForce GTX 480
Sending gauge field to GPU
Creating cudaSpinor
Creating cudaSpinorOut
Sending spinor field to GPU
Source: CPU = 1.0497e+06, CUDA = 1.0497e+06
Source: CPU = 1.0497e+06, CUDA = 1.0497e+06
Creating a DiracWilsonPC operator

Spinor mem: 0.006 GiB
Gauge mem: 0.026 GiB
Calculating reference implementation...done.
Executing 100 kernel loops...
done.

206.379265ms per loop
GFLOPS = 85.357785
GiB/s = 70.978243

Results: CPU = 875986.745504, CUDA=876005.116754, CPU-CUDA = 876005.116689
0 fails = 15690
1 fails = 15707
2 fails = 15680
3 fails = 15756
4 fails = 15685
5 fails = 15749
6 fails = 15888
7 fails = 15838
8 fails = 15838
9 fails = 15862
10 fails = 15863
11 fails = 15848
12 fails = 15870
13 fails = 15828
14 fails = 15832
15 fails = 15852
16 fails = 15844
17 fails = 15839
18 fails = 15681
19 fails = 15725
20 fails = 15697
21 fails = 15719
22 fails = 15717
23 fails = 15761
1.000000e-01 Failures: 9630 / 1572864  = 6.122589e-03
1.000000e-02 Failures: 253078 / 1572864  = 1.609027e-01
1.000000e-03 Failures: 378769 / 1572864  = 2.408148e-01
1.000000e-04 Failures: 392425 / 1572864  = 2.494971e-01
1.000000e-05 Failures: 393840 / 1572864  = 2.503967e-01
1.000000e-06 Failures: 393965 / 1572864  = 2.504762e-01
1.000000e-07 Failures: 394184 / 1572864  = 2.506154e-01
1.000000e-08 Failures: 1168774 / 1572864  = 7.430865e-01
1.000000e-09 Failures: 1530791 / 1572864  = 9.732507e-01
1.000000e-10 Failures: 1568667 / 1572864  = 9.973316e-01
1.000000e-11 Failures: 1572459 / 1572864  = 9.997425e-01
1.000000e-12 Failures: 1572821 / 1572864  = 9.999727e-01
1.000000e-13 Failures: 1572862 / 1572864  = 9.999987e-01
1.000000e-14 Failures: 1572864 / 1572864  = 1.000000e+00
1.000000e-15 Failures: 1572864 / 1572864  = 1.000000e+00
1.000000e-16 Failures: 1572864 / 1572864  = 1.000000e+00

This problem is present in the latest master commit a64abf95ba52eefab659 on CUDA 4.0, and is likely the same issue that Balint reported; the bug was probably introduced around commit 99b16e1058ecfb3458e7.

maddyscientist commented 12 years ago

This bug only appears when the local spatial sizes differ, e.g., a local volume of 16^3x64 is fine, but 16^2x8x64 fails. This suggests that a lattice dimension has accidentally been swapped somewhere.

gshi commented 12 years ago

Yeah, there was that type of bug in the staggered code before. It turned out to be something like X1 being used as X2 in the kernel core file.