qcdcode / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
http://lattice.github.com/quda

progress of testing the non-PC ndeg twisted clover operator #6

Closed kostrzewa closed 3 years ago

kostrzewa commented 3 years ago

I begin by checking that the host reference implementation of NdegTwistedMass and the corresponding QUDA kernel for M agree, using the Mat and MatPC tests:

input

np=2
exe=/home/bartek/build/quda-ndeg_twisted_clover-with_tests/tests/dslash_test

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 2 \
  --Lsdim 1 \
  --dslash-type 'twisted-mass' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.125 \
  --test 'Mat'

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 2 \
  --Lsdim 1 \
  --dslash-type 'twisted-mass' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.125 \
  --test 'MatPC'

output Mat

[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
6 fails = 0
7 fails = 0
8 fails = 0
9 fails = 0
10 fails = 0
11 fails = 0
12 fails = 0
13 fails = 0
14 fails = 0
15 fails = 0
16 fails = 0
17 fails = 0
18 fails = 0
19 fails = 0
20 fails = 0
21 fails = 0
22 fails = 0
23 fails = 0
1.000000e-01 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-02 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-03 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-04 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-05 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-06 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-07 Failures: 1005 / 31850496  = 3.155367e-05
1.000000e-08 Failures: 18861436 / 31850496  = 5.921866e-01
1.000000e-09 Failures: 30476947 / 31850496  = 9.568751e-01
1.000000e-10 Failures: 31713543 / 31850496  = 9.957001e-01
1.000000e-11 Failures: 31836662 / 31850496  = 9.995657e-01
1.000000e-12 Failures: 31849072 / 31850496  = 9.999553e-01
1.000000e-13 Failures: 31850372 / 31850496  = 9.999961e-01
1.000000e-14 Failures: 31850484 / 31850496  = 9.999996e-01
1.000000e-15 Failures: 31850493 / 31850496  = 9.999999e-01
1.000000e-16 Failures: 31850496 / 31850496  = 1.000000e+00
[       OK ] dslash.verify (3925 ms)
[----------] 1 test from dslash (3925 ms total)

output MatPC

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
6 fails = 0
7 fails = 0
8 fails = 0
9 fails = 0
10 fails = 0
11 fails = 0
12 fails = 0
13 fails = 0
14 fails = 0
15 fails = 0
16 fails = 0
17 fails = 0
18 fails = 0
19 fails = 0
20 fails = 0
21 fails = 0
22 fails = 0
23 fails = 0
1.000000e-01 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-02 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-03 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-04 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-05 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-06 Failures: 0 / 15925248  = 0.000000e+00
1.000000e-07 Failures: 18153 / 15925248  = 1.139888e-03
1.000000e-08 Failures: 10492920 / 15925248  = 6.588858e-01
1.000000e-09 Failures: 15359181 / 15925248  = 9.644547e-01
1.000000e-10 Failures: 15868324 / 15925248  = 9.964256e-01
1.000000e-11 Failures: 15919636 / 15925248  = 9.996476e-01
1.000000e-12 Failures: 15924688 / 15925248  = 9.999648e-01
1.000000e-13 Failures: 15925192 / 15925248  = 9.999965e-01
1.000000e-14 Failures: 15925242 / 15925248  = 9.999996e-01
1.000000e-15 Failures: 15925247 / 15925248  = 9.999999e-01
1.000000e-16 Failures: 15925248 / 15925248  = 1.000000e+00
[       OK ] dslash.verify (1949 ms)
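
For readers unfamiliar with the layout of these tables, the sketch below shows one way such a summary could be produced: an element-wise comparison of the host and GPU results, counted against tolerances from 1e-1 down to 1e-16. It is an illustration only, not QUDA's comparison routine. The names CompareSummary and compare_fields are hypothetical, and the layout (24 real components per spinor, stored contiguously) is an assumption. The per-component tolerance of 1e-3 is also an assumption, chosen because in the failing run further down the 24 "fails" counts add up exactly to the 1e-3 row (27369402).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary type for illustration; not a QUDA type.
struct CompareSummary {
  std::array<std::size_t, 24> component_failures{}; // per real spinor component, at 1e-3
  std::array<std::size_t, 16> failures{};           // failures[k]: |ref - gpu| > 10^-(k+1)
  std::size_t total = 0;                            // number of real values compared
};

// Compare two fields element by element.
// Assumption: each spinor is stored as 24 consecutive reals (4 spins x 3 colors x re/im).
inline CompareSummary compare_fields(const std::vector<double> &ref, const std::vector<double> &gpu)
{
  assert(ref.size() == gpu.size());
  assert(ref.size() % 24 == 0);

  const double component_tol = 1e-3; // assumed, see the lead-in
  CompareSummary s;
  s.total = ref.size();

  for (std::size_t i = 0; i < ref.size(); i++) {
    const double diff = std::fabs(ref[i] - gpu[i]);
    const bool is_nan = std::isnan(diff); // a NaN counts as a failure at every tolerance

    if (diff > component_tol || is_nan) s.component_failures[i % 24]++;

    for (std::size_t k = 0; k < s.failures.size(); k++) {
      const double tol = std::pow(10.0, -static_cast<double>(k + 1));
      if (diff > tol || is_nan) s.failures[k]++;
    }
  }
  return s;
}
```

Read this way, the two runs above agree everywhere to 1e-6 and start to differ at 1e-7.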

kostrzewa commented 3 years ago

Now I replace the call to the host reference function tmc_ndeg_mat in dslash_test.cpp with a call to tm_ndeg_mat (for which we have just verified that the GPU and host reference results match):

    case dslash_test_type::Mat:
      if (inv_param.twist_flavor == QUDA_TWIST_SINGLET)
        tmc_mat(spinorRef->V(), hostGauge, hostClover, spinor->V(), inv_param.kappa, inv_param.mu,
                inv_param.twist_flavor, dagger, inv_param.cpu_prec, gauge_param);
      else {
        // original reference call, disabled for this experiment:
        //tmc_ndeg_mat(spinorRef->V(), hostGauge, hostClover, spinor->V(), inv_param.kappa, inv_param.mu, inv_param.epsilon,
        //  inv_param.twist_flavor, dagger, inv_param.cpu_prec, gauge_param);
        printf("\nHacked test of NdegTwistedClover with tiny value of csw\n");
        tm_ndeg_mat(spinorRef->V(), hostGauge, spinor->V(), inv_param.kappa, inv_param.mu, inv_param.epsilon,
                    dagger, inv_param.cpu_prec, gauge_param);
      }

      break;

and run the test with a tiny value of csw, as indicated, the intent being that the clover term becomes negligible:

np=2
exe=/home/bartek/build/quda-ndeg_twisted_clover-with_tests/tests/dslash_test

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 ${np} \
  --Lsdim 1 \
  --dslash-type 'twisted-clover' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.125 \
  --clover-coeff 0.000000000001 \
  --test 'Mat'

output tm_ndeg_mat tested against NdegTwistedClover with tiny value of csw

Hacked test of NdegTwistedClover with tiny value of csw
done.
Tuning...
Executing 100 kernel loops...
done.

6998.313069us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 191.149041
Effective halo bi-directional bandwidth (GB/s) GPU = 0.758528 ( CPU = 0.758536, min = 0.625624 , max = 0.767667 ) for aggregate message size 5308416 bytes
Results: CPU = 7080866.574979, CUDA=7153895.066485, CPU-CUDA = 7153895.066489
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 1140393
1 fails = 1139795
2 fails = 1140617
3 fails = 1139917
4 fails = 1140450
5 fails = 1140447
6 fails = 1140577
7 fails = 1140456
8 fails = 1140695
9 fails = 1140546
10 fails = 1140257
11 fails = 1140542
12 fails = 1140591
13 fails = 1139136
14 fails = 1140831
15 fails = 1140868
16 fails = 1141182
17 fails = 1139895
18 fails = 1139846
19 fails = 1140399
20 fails = 1140899
21 fails = 1139845
22 fails = 1140702
23 fails = 1140516
1.000000e-01 Failures: 165645 / 31850496  = 5.200704e-03
1.000000e-02 Failures: 3768843 / 31850496  = 1.183292e-01
1.000000e-03 Failures: 27369402 / 31850496  = 8.593085e-01
1.000000e-04 Failures: 31398277 / 31850496  = 9.858018e-01
1.000000e-05 Failures: 31805198 / 31850496  = 9.985778e-01
1.000000e-06 Failures: 31846017 / 31850496  = 9.998594e-01
1.000000e-07 Failures: 31850055 / 31850496  = 9.999862e-01
1.000000e-08 Failures: 31850462 / 31850496  = 9.999989e-01
1.000000e-09 Failures: 31850493 / 31850496  = 9.999999e-01
1.000000e-10 Failures: 31850495 / 31850496  = 1.000000e+00
1.000000e-11 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-12 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-13 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-14 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-15 Failures: 31850496 / 31850496  = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496  = 1.000000e+00
/home/bartek/code/quda.ndeg-twisted-clover/tests/dslash_test.cpp:931: Failure
Expected: (deviation) <= (tol), actual: 1 vs 0.0001
CPU and CUDA implementations do not agree
[  FAILED  ] dslash.verify (3890 ms)
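
The "actual: 1 vs 0.0001" in the failure message is consistent with a deviation of the form 10^-k, where k is the number of leading tolerance rows (1e-1, 1e-2, ...) that show zero failures. Here the 1e-1 row already has failures, so k = 0 and the deviation is 1. In the earlier passing runs the first six rows are clean, which would give 1e-6. The sketch below encodes that reading. It is inferred from the printed output rather than taken from the QUDA sources, and the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Number of leading tolerance rows (10^-1, 10^-2, ...) without any failure.
inline int accuracy_level(const std::array<std::size_t, 16> &failures)
{
  int level = 0;
  for (std::size_t k = 0; k < failures.size(); k++) {
    if (failures[k] != 0) break;
    level++;
  }
  return level;
}

// Assumed definition of the reported deviation: 10^-(accuracy level).
inline double deviation(const std::array<std::size_t, 16> &failures)
{
  return std::pow(10.0, -static_cast<double>(accuracy_level(failures)));
}

// The check printed by the test: (deviation) <= (tol).
inline bool passes(const std::array<std::size_t, 16> &failures, double tol)
{
  assert(tol > 0.0);
  return deviation(failures) <= tol;
}
```

With the tolerance of 1e-4 shown in the message, any run with failures in the first four rows would be rejected under this reading.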

kostrzewa commented 3 years ago

Alright, I must have made a mistake earlier, because the test now passes, as we expected from yesterday's investigation:

np=2
exe=/home/bartek/build/quda-ndeg_twisted_clover-with_tests/tests/dslash_test

QUDA_ENABLE_TUNING=0 \
mpirun -np ${np} ${exe} \
  --gridsize 1 1 1 ${np} \
  --Lsdim 1 \
  --dslash-type 'twisted-clover' \
  --flavor 'nondeg-doublet' \
  --mu 0.1 \
  --epsilon 0.2 \
  --kappa 0.127 \
  --clover-coeff 1.0 \
  --test 'Mat'

output

prec    recon   dtest_type     matpc_type   dagger   S_dim         T_dimension   Ls_dimension dslash_type    niter
single   18       Mat              even-even    0     24/ 24/ 24         24              1   twisted-clover   100
Grid partition info:     X  Y  Z  T
                         0  0  0  1
QUDA 1.0.0 (git v0.9.0-3955-g4f356b5b3-dirty-sm_61)
CUDA Driver version = 11000
CUDA Runtime version = 11000
Found device 0: GeForce GTX 1060 6GB
Found device 1: GeForce GTX 1060 6GB
Using device 0: GeForce GTX 1060 6GB
WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU)
WARNING: Autotuning disabled
cublasCreated successfully
WARNING: Using device memory pool allocator
WARNING: Using pinned memory pool allocator
Kappa = 0.12700000 Mass = -0.06299213
Randomizing fields... Sending gauge field to GPU
Sending clover field to GPU
Creating cudaSpinor with nParity = 2
Creating cudaSpinorOut with nParity = 2
Sending spinor field to GPU
Source: CPU = 1.061711e+07, CUDA = 1.061711e+07
Calculating reference implementation...done.
Tuning...
Executing 100 kernel loops...
done.

7018.352747us per kernel call
1337720832 flops per kernel call, 2016 flops per site
GFLOPS = 190.603248
Effective halo bi-directional bandwidth (GB/s) GPU = 0.756362 ( CPU = 0.756356, min = 0.665883 , max = 0.769559 ) for aggregate message size 5308416 bytes
Results: CPU = 7103629.564046, CUDA=7103629.377404, CPU-CUDA = 7103629.377406
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 0
1 fails = 0
2 fails = 0
3 fails = 0
4 fails = 0
5 fails = 0
6 fails = 0
7 fails = 0
8 fails = 0
9 fails = 0
10 fails = 0
11 fails = 0
12 fails = 0
13 fails = 0
14 fails = 0
15 fails = 0
16 fails = 0
17 fails = 0
18 fails = 0
19 fails = 0
20 fails = 0
21 fails = 0
22 fails = 0
23 fails = 0
1.000000e-01 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-02 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-03 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-04 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-05 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-06 Failures: 0 / 31850496  = 0.000000e+00
1.000000e-07 Failures: 189867 / 31850496  = 5.961194e-03
1.000000e-08 Failures: 22006361 / 31850496  = 6.909268e-01
1.000000e-09 Failures: 30826912 / 31850496  = 9.678629e-01
1.000000e-10 Failures: 31748126 / 31850496  = 9.967859e-01
1.000000e-11 Failures: 31840404 / 31850496  = 9.996831e-01
1.000000e-12 Failures: 31849430 / 31850496  = 9.999665e-01
1.000000e-13 Failures: 31850386 / 31850496  = 9.999965e-01
1.000000e-14 Failures: 31850481 / 31850496  = 9.999995e-01
1.000000e-15 Failures: 31850495 / 31850496  = 1.000000e+00
1.000000e-16 Failures: 31850496 / 31850496  = 1.000000e+00
[       OK ] dslash.verify (3877 ms)
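
Several numbers in this log can be cross-checked against each other, which helps when reading similar output. The sketch below does so with hypothetical helper functions. It assumes that 24^4 is the local volume per rank (which the flop count implies: 1337720832 / 2016 = 663552 = 2 x 24^4), that each flavor contributes 24 real numbers per site, and the usual relation m = 1/(2 kappa) - 4 between kappa and the mass.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Lattice sites summed over all ranks: local volume times number of ranks.
constexpr std::int64_t global_sites(std::int64_t x, std::int64_t y, std::int64_t z, std::int64_t t,
                                    std::int64_t ranks)
{
  return x * y * z * t * ranks;
}

// Real numbers compared: 4 spins x 3 colors x 2 (re, im) per flavor and site.
constexpr std::int64_t compared_reals(std::int64_t sites, std::int64_t flavors)
{
  return sites * flavors * 4 * 3 * 2;
}

// Mass from the hopping parameter, m = 1/(2 kappa) - 4.
inline double mass_from_kappa(double kappa)
{
  assert(kappa > 0.0);
  return 1.0 / (2.0 * kappa) - 4.0;
}

// GFLOPS from flops per kernel call and time per call in microseconds.
inline double gflops(double flops_per_call, double microseconds_per_call)
{
  assert(microseconds_per_call > 0.0);
  return flops_per_call / microseconds_per_call * 1e-3;
}

// Values taken from the log above.
static_assert(global_sites(24, 24, 24, 24, 2) == 663552, "sites on two ranks");
static_assert(global_sites(24, 24, 24, 24, 2) * 2016 == 1337720832, "flops per kernel call");
static_assert(compared_reals(663552, 2) == 31850496, "denominator of the Mat failure table");
```

The MatPC denominator, 15925248, is half the Mat value, as expected when only one parity of the lattice is compared. mass_from_kappa(0.127) gives about -0.06299213, and gflops(1337720832, 7018.352747) gives about 190.603, both matching the printed values.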

kostrzewa commented 3 years ago

@sbacchio DiracTwistedClover::M for the doublet is ready to be checked against the tmLQCD version if Lyncs has reached that stage.