Closed kostrzewa closed 3 years ago
Now I replace the call to the test function tmc_ndeg_mat in dslash_test.cpp with a call to tm_ndeg_mat (for which we've just verified that the GPU and host reference match):
case dslash_test_type::Mat:
  if (inv_param.twist_flavor == QUDA_TWIST_SINGLET)
    tmc_mat(spinorRef->V(), hostGauge, hostClover, spinor->V(), inv_param.kappa, inv_param.mu, inv_param.twist_flavor, dagger, inv_param.cpu_prec, gauge_param);
  else {
    // tmc_ndeg_mat(spinorRef->V(), hostGauge, hostClover, spinor->V(), inv_param.kappa, inv_param.mu, inv_param.epsilon,
    //              inv_param.twist_flavor, dagger, inv_param.cpu_prec, gauge_param);
    printf("\nHacked test of NdegTwistedClover with tiny value of csw\n");
    tm_ndeg_mat(spinorRef->V(), hostGauge, spinor->V(), inv_param.kappa, inv_param.mu, inv_param.epsilon,
                dagger, inv_param.cpu_prec, gauge_param);
  }

  break;
and run the test with a tiny value of csw, as indicated:
np=2
exe=/home/bartek/build/quda-ndeg_twisted_clover-with_tests/tests/dslash_test

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 ${np} \
  --Lsdim 1 \
  --dslash-type 'twisted-clover' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.125 \
  --clover-coeff 0.000000000001 \
  --test 'Mat'
tm_ndeg_mat tested against NdegTwistedClover with tiny value of csw
Hacked test of NdegTwistedClover with tiny value of csw
done.
Tuning...
Executing 100 kernel loops...
done.
6998.313069us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 191.149041
Effective halo bi-directional bandwidth (GB/s) GPU = 0.758528 ( CPU = 0.758536, min = 0.625624 , max = 0.767667 ) for aggregate message size 5308416 bytes
Results: CPU = 7080866.574979, CUDA=7153895.066485, CPU-CUDA = 7153895.066489
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN ] dslash.verify
0 fails = 1140393
1 fails = 1139795
2 fails = 1140617
3 fails = 1139917
4 fails = 1140450
5 fails = 1140447
6 fails = 1140577
7 fails = 1140456
8 fails = 1140695
9 fails = 1140546
10 fails = 1140257
11 fails = 1140542
12 fails = 1140591
13 fails = 1139136
14 fails = 1140831
15 fails = 1140868
16 fails = 1141182
17 fails = 1139895
18 fails = 1139846
19 fails = 1140399
20 fails = 1140899
21 fails = 1139845
22 fails = 1140702
23 fails = 1140516
1.000000e-01 Failures: 165645 / 31850496 = 5.200704e-03
1.000000e-02 Failures: 3768843 / 31850496 = 1.183292e-01
1.000000e-03 Failures: 27369402 / 31850496 = 8.593085e-01
1.000000e-04 Failures: 31398277 / 31850496 = 9.858018e-01
1.000000e-05 Failures: 31805198 / 31850496 = 9.985778e-01
1.000000e-06 Failures: 31846017 / 31850496 = 9.998594e-01
1.000000e-07 Failures: 31850055 / 31850496 = 9.999862e-01
1.000000e-08 Failures: 31850462 / 31850496 = 9.999989e-01
1.000000e-09 Failures: 31850493 / 31850496 = 9.999999e-01
1.000000e-10 Failures: 31850495 / 31850496 = 1.000000e+00
1.000000e-11 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-12 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-13 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-14 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-15 Failures: 31850496 / 31850496 = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496 = 1.000000e+00
/home/bartek/code/quda.ndeg-twisted-clover/tests/dslash_test.cpp:931: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[ FAILED ] dslash.verify (3890 ms)
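To make the failure table above easier to read, here is a minimal sketch of how such a tolerance sweep can be computed. It assumes the count at each tolerance is the number of entries whose absolute difference between reference and device result exceeds that tolerance; this is an illustration only, not QUDA's actual comparison code, and the function name tolerance_sweep is hypothetical.

```python
def tolerance_sweep(ref, out, max_exp=16):
    """Count, for tol = 1e-1 ... 1e-max_exp, how many entries differ by more than tol.

    Assumption: the deviation is the absolute difference |ref - out|.
    Returns a list of (tol, failures) pairs, loosest tolerance first.
    """
    diffs = [abs(r - o) for r, o in zip(ref, out)]
    sweep = []
    for exp in range(1, max_exp + 1):
        tol = 10.0 ** -exp
        failures = sum(1 for d in diffs if d > tol)
        sweep.append((tol, failures))
    return sweep


# Toy data: one exact match, one entry off by ~5e-4, one off by ~5e-8.
ref = [1.0, 2.0, 3.0]
out = [1.0, 2.0 + 5e-4, 3.0 + 5e-8]

for tol, failures in tolerance_sweep(ref, out):
    print(f"{tol:e} Failures: {failures} / {len(ref)} = {failures / len(ref):e}")
```

Read this way, a healthy single-precision comparison shows zero failures at the loose tolerances and only starts failing somewhere near 1e-7, whereas in the log above roughly 0.5% of the entries already differ by more than 1e-1.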
Alright, I must have been doing something stupid before: the test now passes, as we expected from our investigation yesterday:
np=2
exe=/home/bartek/build/quda-ndeg_twisted_clover-with_tests/tests/dslash_test

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 ${np} \
  --Lsdim 1 \
  --dslash-type 'twisted-clover' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.127 \
  --clover-coeff 1.0 \
  --test 'Mat'
prec recon dtest_type matpc_type dagger S_dim T_dimension Ls_dimension dslash_type niter
single 18 Mat even-even 0 24/ 24/ 24 24 1 twisted-clover 100
Grid partition info: X Y Z T
0 0 0 1
QUDA 1.0.0 (git v0.9.0-3955-g4f356b5b3-dirty-sm_61)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: GeForce GTX 1060 6GB
Found device 1: GeForce GTX 1060 6GB
Using device 0: GeForce GTX 1060 6GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Autotuning disabled
cublasCreated successfully
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
Kappa = 0.12700000 Mass = -0.06299213
Randomizing fields... Sending gauge field to GPU
Sending clover field to GPU
Creating cudaSpinor with nParity = 2
Creating cudaSpinorOut with nParity = 2
Sending spinor field to GPU
Source: CPU = 1.061711e+07, CUDA = 1.061711e+07
Calculating reference implementation...done.
Tuning...
Executing 100 kernel loops...
done.
7018.352747us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 190.603248
Effective halo bi-directional bandwidth (GB/s) GPU = 0.756362 ( CPU = 0.756356, min = 0.665883 , max = 0.769559 ) for aggregate message size 5308416 bytes
Results: CPU = 7103629.564046, CUDA=7103629.377404, CPU-CUDA = 7103629.377406
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
6 fails = 0
7 fails = 0
8 fails = 0
9 fails = 0
10 fails = 0
11 fails = 0
12 fails = 0
13 fails = 0
14 fails = 0
15 fails = 0
16 fails = 0
17 fails = 0
18 fails = 0
19 fails = 0
20 fails = 0
21 fails = 0
22 fails = 0
23 fails = 0
1.000000e-01 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-02 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-03 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-04 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-05 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-06 Failures: 0 / 31850496 = 0.000000e+00
1.000000e-07 Failures: 189867 / 31850496 = 5.961194e-03
1.000000e-08 Failures: 22006361 / 31850496 = 6.909268e-01
1.000000e-09 Failures: 30826912 / 31850496 = 9.678629e-01
1.000000e-10 Failures: 31748126 / 31850496 = 9.967859e-01
1.000000e-11 Failures: 31840404 / 31850496 = 9.996831e-01
1.000000e-12 Failures: 31849430 / 31850496 = 9.999665e-01
1.000000e-13 Failures: 31850386 / 31850496 = 9.999965e-01
1.000000e-14 Failures: 31850481 / 31850496 = 9.999995e-01
1.000000e-15 Failures: 31850495 / 31850496 = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496 = 1.000000e+00
[ OK ] dslash.verify (3877 ms)
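As a quick sanity check on two of the numbers printed in the log above, the following sketch recomputes them by hand. It assumes the usual Wilson relation between hopping parameter and bare mass, m = 1/(2κ) − 4, and that GFLOPS is simply flops per kernel call divided by time per kernel call; the helper names are hypothetical and not part of QUDA.

```python
def wilson_bare_mass(kappa):
    """Bare mass from the hopping parameter, assuming kappa = 1 / (2 * (m + 4))."""
    return 1.0 / (2.0 * kappa) - 4.0


def gflops(flops_per_call, microseconds_per_call):
    """Sustained rate in GFLOPS from flops and time per kernel call."""
    seconds = microseconds_per_call * 1e-6
    return flops_per_call / seconds / 1e9


# Values copied from the passing run above.
mass = wilson_bare_mass(0.127)
rate = gflops(1337720832, 7018.352747)

print(f"Kappa = 0.127 -> Mass = {mass:.8f}")
print(f"GFLOPS = {rate:.6f}")
```

Both values should agree with the "Kappa = ... Mass = ..." and "GFLOPS = ..." lines of the log to the printed precision.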
@sbacchio DiracTwistedClover::M for the doublet is ready to be checked against the tmLQCD version if Lyncs has reached that stage.
I begin by ensuring that the host reference implementation of NdegTwistedMass and the corresponding QUDA kernel for M work correctly:
input
output Mat
output MatPC