lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

`staggered_dslash_test` asqtad verify fails with recon-9/13, partitioning enabled, and computing the fat/long gauge links #1394

Open weinbe2 opened 1 year ago

weinbe2 commented 1 year ago

Minimal CMake configuration needed to reproduce, for now:

cmake ../quda/ -DQUDA_DIRAC_DEFAULT_OFF=ON -DQUDA_DIRAC_STAGGERED=ON -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_FAST_COMPILE_DSLASH=ON -DQUDA_FAST_COMPILE_REDUCE=ON

Representative command:

mpirun -np 1 ./staggered_dslash_test --verbosity verbose --dim 16 16 16 16 --niter 100 --dslash-type asqtad --partition 4 --prec double --compute-fat-long true --recon 9
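To check how the failure depends on the reconstruction type and the partitioned direction, the representative command can be expanded into a small sweep. The sketch below only builds the command lines (it does not run them); the helper name `build_sweep` and the chosen grid of `--recon` and `--partition` values are illustrative assumptions, and all flags are copied from the command above.

```python
from itertools import product

# Arguments shared by every run, taken from the representative command.
BASE_ARGS = [
    "mpirun", "-np", "1", "./staggered_dslash_test",
    "--verbosity", "verbose",
    "--dim", "16", "16", "16", "16",
    "--niter", "100",
    "--dslash-type", "asqtad",
    "--prec", "double",
    "--compute-fat-long", "true",
]


def build_sweep(recons=(9, 13), partitions=(1, 2, 4, 8)):
    """Return one argv list per (recon, partition) combination.

    The default grid is a hypothetical choice for illustration: both
    failing reconstruction types crossed with four single-value
    partition settings.
    """
    commands = []
    for recon, partition in product(recons, partitions):
        commands.append(
            BASE_ARGS + ["--partition", str(partition), "--recon", str(recon)]
        )
    return commands


if __name__ == "__main__":
    for cmd in build_sweep():
        print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or the argv lists can be handed to `subprocess.run` to collect pass/fail results per combination.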

I note that this isn't specific to partitioning (or not partitioning) in the t direction, so temporal boundary conditions aren't the (only?) issue.
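For reference when reading the `--partition` value: as far as I understand the QUDA test utilities, it is a bitmask over directions with X=1, Y=2, Z=4, T=8, so `--partition 4` in the command above would partition z only. The decoder below is a minimal sketch under that assumption; `partitioned_dims` is a hypothetical helper, not part of QUDA.

```python
# Assumed bit order for --partition: bit 0 = x, bit 1 = y, bit 2 = z, bit 3 = t.
DIRECTIONS = ("x", "y", "z", "t")


def partitioned_dims(mask):
    """Return the list of directions whose bit is set in the mask."""
    if not 0 <= mask <= 15:
        raise ValueError("partition mask must be in the range 0..15")
    return [d for bit, d in enumerate(DIRECTIONS) if mask & (1 << bit)]


if __name__ == "__main__":
    for mask in (4, 8, 15):
        print(mask, partitioned_dims(mask))
```

Under this reading, a failure with mask 4 involves no partitioning in t at all, which is consistent with temporal boundary conditions not being the whole story.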

This was missed because the recon-13 and recon-9 tests are skipped in `staggered_dslash_ctest`, for various reasons downstream of headaches I should've solved a long time ago. `--compute-fat-long true`, on the other hand, is indeed included in the ctest commands. The likely solution is to begin homogenizing the logic for loading gauge fields and verifying staggered dslash calls between the dslash test and the invert test, where there doesn't seem to be an issue.

weinbe2 commented 1 year ago

Incremental progress is being made in https://github.com/lattice/quda/tree/hotfix/stag-dslash-test-recon-partition-failure, but there is no clear resolution yet.