lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

staggered_dslash_test with operator type `Mat` currently fails in develop #1391

Closed by weinbe2 1 year ago

weinbe2 commented 1 year ago

Reproducer:

> ./staggered_dslash_test --test Mat
[...]
[ RUN      ] StaggeredDslashTest.verify
Sending fat links to GPU
Sending long links to GPU
running the following test:
prec recon   test_type     dagger   S_dim         T_dimension
single   18       Mat           0       24/24/24        24
Grid partition info:     X  Y  Z  T
                         0  0  0  0
Calculating reference implementation...Tuning...
Results: reference = 398874120.259645, QUDA = 751573605.663725, L2 relative deviation = -3.726753e-01, max deviation = 8.152082e+01
0 fails = 330879
1 fails = 331776
2 fails = 330875
3 fails = 331776
4 fails = 330915
5 fails = 331776
1.000000e-01 Failures: 1685314 / 1990656  = 8.466124e-01
1.000000e-02 Failures: 1964152 / 1990656  = 9.866858e-01
1.000000e-03 Failures: 1987997 / 1990656  = 9.986643e-01
1.000000e-04 Failures: 1990398 / 1990656  = 9.998704e-01
1.000000e-05 Failures: 1990621 / 1990656  = 9.999824e-01
1.000000e-06 Failures: 1990651 / 1990656  = 9.999975e-01
1.000000e-07 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-08 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-09 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-10 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-11 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-12 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-13 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-14 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-15 Failures: 1990656 / 1990656  = 1.000000e+00
1.000000e-16 Failures: 1990656 / 1990656  = 1.000000e+00
/quda/tests/staggered_dslash_test.cpp:54: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
reference and QUDA implementations do not agree
[  FAILED  ] StaggeredDslashTest.verify (2356 ms)
[...]
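
A note on reading the log above: the totals are consistent with a 24^4 lattice (331776 sites) and six real numbers per site (three complex colors), since 6 × 331776 = 1990656. The sketch below shows one way such a tolerance ladder could be tallied. The helper names and the use of an absolute elementwise difference are assumptions made for illustration; this is not QUDA's actual verification code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not QUDA's API): count the elements whose absolute
// difference from the reference exceeds the given tolerance.
std::size_t count_failures(const std::vector<double> &ref,
                           const std::vector<double> &test, double tol)
{
  assert(ref.size() == test.size());
  std::size_t fails = 0;
  for (std::size_t i = 0; i < ref.size(); i++) {
    if (std::fabs(ref[i] - test[i]) > tol) fails++;
  }
  return fails;
}

// Failure counts for tolerances 1e-1, 1e-2, ..., 1e-16, mirroring the
// sixteen "Failures:" lines printed by the test.
std::vector<std::size_t> failure_ladder(const std::vector<double> &ref,
                                        const std::vector<double> &test)
{
  std::vector<std::size_t> counts;
  for (int k = 1; k <= 16; k++) {
    counts.push_back(count_failures(ref, test, std::pow(10.0, -k)));
  }
  return counts;
}

// Number of real values compared, assuming 6 reals per lattice site
// (3 colors x real/imaginary part) on an L^3 x T lattice.
constexpr std::size_t total_reals(std::size_t L, std::size_t T)
{
  return L * L * L * T * 6;
}
```

With this counting, a result that is wrong almost everywhere fails at nearly every element even at the loosest tolerance, which matches the 85% failure rate at 1e-1 in the log.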

Other tests (Dslash, MatPC) are passing without issue.

I'm actively investigating now, but I wanted this down in writing.

weinbe2 commented 1 year ago

~...potential false alarm, was pointing my build to the wrong version of QUDA... testing now~

Confirmed: the issue is indeed in develop.

weinbe2 commented 1 year ago

A fairly minimal cmake configuration suffices to reproduce it:

cmake ../quda -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON -DQUDA_PRECISION=4 -DQUDA_RECONSTRUCT=4 -DQUDA_GPU_ARCH=sm_80 -DQUDA_FAST_COMPILE_DSLASH=ON -DQUDA_FAST_COMPILE_REDUCE=ON

weinbe2 commented 1 year ago

last good commit: 103c4ff25
first bad commit: 931680a5003b135d4222cb0c1737e2516a9774a6

Unfortunately, the first bad commit is also the one that introduced the max deviation check, so it's still TBD where exactly things went awry...

weinbe2 commented 1 year ago

The L2 deviation from the good commit is sane:

Results: CPU = 1343272.597656, QUDA = 1343272.605499, L2 relative deviation = -2.919503e-09

while it has issues in the bad commit:

Results: reference = 1458908.264166, QUDA = 1342092.890616, L2 relative deviation = 4.087040e-02, max deviation = 8.428711e+04

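The sources aren't guaranteed to be the same between the two runs, but their norms should be consistent. The QUDA norm is nearly unchanged across the two commits while the reference norm jumps in the bad one, so it looks like something went wrong in the host verify. Tentative phew.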
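
The arithmetic behind that reading can be checked directly from the printed norms. The formula below, 1 - sqrt(QUDA / reference), is inferred from the logged numbers rather than taken from the QUDA source, so treat it as an assumption; it does reproduce all three "L2 relative deviation" values quoted in this thread. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// L2 relative deviation as inferred from the logs (an assumption, not
// QUDA's confirmed definition): 1 - sqrt(|QUDA|^2 / |reference|^2).
double l2_relative_deviation(double ref_norm2, double quda_norm2)
{
  assert(ref_norm2 > 0.0);
  return 1.0 - std::sqrt(quda_norm2 / ref_norm2);
}

// Relative change of a squared norm between two runs, used here to see
// which side (host reference or GPU) moved between commits.
double relative_change(double before, double after)
{
  assert(before != 0.0);
  return (after - before) / before;
}

// Squared norms copied verbatim from the log lines above.
constexpr double good_ref = 1343272.597656, good_quda = 1343272.605499;
constexpr double bad_ref = 1458908.264166, bad_quda = 1342092.890616;
```

Comparing across commits, `relative_change(good_quda, bad_quda)` is roughly -0.09%, a difference that random sources could plausibly account for, whereas `relative_change(good_ref, bad_ref)` is roughly +8.6%. The side that moved is the host reference.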

weinbe2 commented 1 year ago

If I incrementally add bits of the "bad" commit into the last "good" commit, everything seems fine... I'm a bit confused

weinbe2 commented 1 year ago

Looks like I found it: a misplaced curly bracket in the update to the host reference.
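
To illustrate why this class of bug is easy to miss, here is a toy example, entirely hypothetical and unrelated to the actual QUDA host reference code. Both functions contain the same statements and compile cleanly; the only difference is where the inner loop's closing brace sits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy setup: four hopping contributions per site.
constexpr std::size_t kDirs = 4;

// Intended behavior: sum the hopping terms, then add the mass term
// once per site.
std::vector<double> toy_reference(const std::vector<double> &in,
                                  const std::vector<double> &hop, double mass)
{
  assert(hop.size() == in.size() * kDirs);
  std::vector<double> out(in.size(), 0.0);
  for (std::size_t site = 0; site < in.size(); site++) {
    for (std::size_t mu = 0; mu < kDirs; mu++) {
      out[site] += hop[site * kDirs + mu];
    }
    out[site] += mass * in[site]; // runs once per site
  }
  return out;
}

// Same statements, but the inner loop's closing brace is one line too
// late, so the mass term is added once per direction.
std::vector<double> toy_reference_misplaced_brace(const std::vector<double> &in,
                                                  const std::vector<double> &hop,
                                                  double mass)
{
  assert(hop.size() == in.size() * kDirs);
  std::vector<double> out(in.size(), 0.0);
  for (std::size_t site = 0; site < in.size(); site++) {
    for (std::size_t mu = 0; mu < kDirs; mu++) {
      out[site] += hop[site * kDirs + mu];
      out[site] += mass * in[site]; // now runs kDirs times per site
    }
  }
  return out;
}
```

Because only the reference path changes in a case like this, the GPU result stays correct while the comparison fails, which would also explain why porting pieces of a commit by hand can make the problem disappear.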