lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

`dslash_test` failing with `QUDA_REORDER_LOCATION=CPU` #1466

Closed: sbacchio closed this issue 1 month ago

sbacchio commented 1 month ago

In the current version of the `develop` branch, `dslash_test` fails when `QUDA_REORDER_LOCATION=CPU` is set.

Reproducible with:

```shell
export QUDA_REORDER_LOCATION=CPU
dslash_test --xdim 4 --ydim 4 --zdim 4 --tdim 4
```
Output:

```
Disabling GPU-Direct RDMA access
Enabling peer-to-peer copy engine and direct load/store access
Rank order is column major (t running fastest)
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from DslashTest
QUDA 1.1.0 (git 1.1.0-f8855bbec-sm_80)
CUDA Driver version = 12010
CUDA Runtime version = 11080
Graphic driver version = 530.30.02
Found device 0: NVIDIA A100-SXM-64GB
Using device 0: NVIDIA A100-SXM-64GB
WARNING: Data reordering done on CPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Environment variable QUDA_RESOURCE_PATH is not set.
WARNING: Caching of tuned parameters will be disabled.
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
cublasCreated successfully
[ RUN      ] DslashTest.benchmark
Randomizing fields...
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec   recon  dtest_type  matpc_type  dagger  S_dim    T_dimension  Ls_dimension  dslash_type  niter
single 18     Dslash      even_even   0       4/ 4/ 4  8            16            wilson       100
Grid partition info:  X  Y  Z  T
                      0  0  0  0
Tuning...
Executing 100 kernel loops...
done.
10.659840us per kernel call
337920 flops per kernel call, 1320 flops per site
1344 bytes per site
GFLOPS = 31.700288
GBYTES = 32.276656
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
[       OK ] DslashTest.benchmark (128 ms)
[ RUN      ] DslashTest.verify
Sending gauge field to GPU
Creating cudaSpinor with nParity = 1
Creating cudaSpinorOut with nParity = 1
Sending spinor field to GPU
Source: CPU = 2.032196e+03, CUDA = 2.032196e+03
running the following test:
prec   recon  dtest_type  matpc_type  dagger  S_dim    T_dimension  Ls_dimension  dslash_type  niter
single 18     Dslash      even_even   0       4/ 4/ 4  8            16            wilson       100
Grid partition info:  X  Y  Z  T
                      0  0  0  0
Calculating reference implementation...done.
Tuning...
Executing 2 kernel loops...
done.
13.312000us per kernel call
337920 flops per kernel call, 1320 flops per site
1344 bytes per site
GFLOPS = 25.384615
GBYTES = 25.846153
Effective halo bi-directional bandwidth (GB/s) GPU = 0.000000 ( CPU = 0.000000, min = 0.000000 , max = 0.000000 ) for aggregate message size 0 bytes
Results: reference = 49469.949399, QUDA = 0.000000, L2 relative deviation = 1.000000e+00, max deviation = 9.964920e+00
 0 fails = 255
 1 fails = 254
 2 fails = 256
 3 fails = 256
 4 fails = 256
 5 fails = 255
 6 fails = 256
 7 fails = 256
 8 fails = 254
 9 fails = 255
10 fails = 256
11 fails = 255
12 fails = 255
13 fails = 256
14 fails = 253
15 fails = 256
16 fails = 255
17 fails = 255
18 fails = 254
19 fails = 254
20 fails = 255
21 fails = 256
22 fails = 255
23 fails = 254
1.000000e-01 Failures: 4494 / 6144 = 7.314453e-01
1.000000e-02 Failures: 5961 / 6144 = 9.702148e-01
1.000000e-03 Failures: 6122 / 6144 = 9.964193e-01
1.000000e-04 Failures: 6140 / 6144 = 9.993490e-01
1.000000e-05 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-06 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-07 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-08 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-09 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-10 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-11 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-12 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-13 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-14 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-15 Failures: 6144 / 6144 = 1.000000e+00
1.000000e-16 Failures: 6144 / 6144 = 1.000000e+00
/leonardo/home/userexternal/sbacchio/src/quda_new/tests/dslash_test.cpp:78: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] DslashTest.verify (14 ms)
          initQuda Total time = 2.729 secs
               init     = 2.729 secs (100.000%), with   2 calls at 1.364e+06 us per call
  total accounted       = 2.729 secs (100.000%)
  total missing         = 0.000 secs (  0.000%)
     loadGaugeQuda Total time = 0.048 secs
           download     = 0.045 secs ( 94.961%), with   2 calls at 2.264e+04 us per call
               init     = 0.000 secs (  0.369%), with  10 calls at 1.760e+01 us per call
            compute     = 0.000 secs (  0.002%), with   2 calls at 5.000e-01 us per call
               free     = 0.000 secs (  0.006%), with  73 calls at 4.110e-02 us per call
  total accounted       = 0.045 secs ( 95.339%)
  total missing         = 0.002 secs (  4.661%)
           endQuda Total time = 0.002 secs
               free     = 0.000 secs (  0.120%), with  63 calls at 3.175e-02 us per call
  total accounted       = 0.000 secs (  0.120%)
  total missing         = 0.002 secs ( 99.880%)
  initQuda-endQuda Total time = 2.874 secs
              QUDA Total time = 2.778 secs
           download     = 0.045 secs (  1.630%), with   2 calls at 2.264e+04 us per call
               init     = 2.729 secs ( 98.230%), with  12 calls at 2.274e+05 us per call
            compute     = 0.000 secs (  0.000%), with   2 calls at 5.000e-01 us per call
               free     = 0.000 secs (  0.000%), with 136 calls at 4.412e-02 us per call
  total accounted       = 2.774 secs ( 99.860%)
  total missing         = 0.004 secs (  0.140%)
Device memory used = 0.4 MiB
Pinned device memory used = 0.0 MiB
Managed memory used = 0.0 MiB
Shmem memory used = 0.0 MiB
Page-locked host memory used = 0.3 MiB
Total host memory used >= 1.1 MiB
[----------] 2 tests from DslashTest (143 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (2875 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DslashTest.verify

 1 FAILED TEST
```