Closed SaltyChiang closed 5 months ago
This bug may have been introduced by #1434. I ran invert_test with the following parameters:
QUDA_ENABLE_TUNING=0 ./invert_test \
    --prec double --prec-sloppy single --prec-precondition half --prec-null half \
    --dim 16 16 16 16 --dslash-type clover --tol 2e-15 \
    --solve-type direct --solution-type mat --inv-type gcr --inv-multigrid true
Result of commit a979a4d:
Disabling GPU-Direct RDMA access Enabling peer-to-peer copy engine and direct load/store access Rank order is column major (t running fastest) running the following test: prec prec_sloppy multishift matpc_type recon recon_sloppy solve_type S_dimension T_dimension Ls_dimension dslash_type normalization double single 1 even_even 18 18 direct 16/ 16/ 16 16 16 clover kappa MG parameters - number of levels 2 - level 1 number of null-space vectors 0 - level 1 number of pre-smoother applications 2 - level 1 number of post-smoother applications 2 MG Eigensolver parameters Grid partition info: X Y Z T 0 0 0 0 QUDA 1.1.0 (git 1.1.0-a979a4d69-sm_60) CUDA Driver version = 12000 CUDA Runtime version = 11080 Graphic driver version = 525.147.05 Found device 0: Tesla P100-PCIE-16GB Found device 1: Tesla P100-PCIE-16GB Found device 2: Tesla P100-PCIE-16GB Found device 3: Tesla P100-PCIE-16GB Using device 0: Tesla P100-PCIE-16GB WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU) WARNING: Autotuning disabled WARNING: Autotuning disabled WARNING: Using device memory pool allocator WARNING: Using pinned memory pool allocator cublasCreated successfully Computed plaquette is 1.223368e-01 (spatial = 1.224580e-01, temporal = 1.222156e-01) Solution = mat, Solve = direct, Solver = gcr, Sloppy precision = single MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.055661e-06, true = 4.055661e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.853343e-06, true = 3.853343e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.675424e-06, true = 4.675424e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.190422e-06, true = 4.190422e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 
relative residual: iterated = 4.105885e-06, true = 4.105885e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.707929e-06, true = 3.707929e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.662438e-06, true = 4.662438e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.711373e-06, true = 3.711373e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.176259e-06, true = 4.176259e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.005173e-06, true = 4.005173e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.998309e-06, true = 3.998309e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.101013e-06, true = 4.101013e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.715315e-06, true = 3.715315e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.816766e-06, true = 3.816766e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.744014e-06, true = 4.744014e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.293564e-06, true = 4.293564e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.802038e-06, true = 3.802038e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 
3.917147e-06, true = 3.917147e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.867327e-06, true = 4.867327e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.107894e-06, true = 4.107894e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.279538e-06, true = 4.279538e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.296545e-06, true = 4.296545e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.825819e-06, true = 3.825819e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.644150e-06, true = 4.644150e-06 (requested = 5.000000e-06) MG level 0 (GPU): Computing Y field...... MG level 0 (GPU): ....done computing Y field MG level 0 (GPU): Computing Yhat field...... 
MG level 0 (GPU): ....done computing Yhat field MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors MG level 0 (GPU): Checking 0 = (1 - P^\dagger P) eta_c MG level 0 (GPU): Checking 0 = (D_c - P^\dagger D P) (native coarse operator to emulated operator) MG level 0 (GPU): Checking normality of residual operator MG Setup Done: 0.321643 secs, 411.946 Gflops MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 1.003112e-02 (requested = 2.500000e-01) MG level 1 (GPU): GCR: Convergence at 2 iterations, L2 relative residual: iterated = 1.320388e-01 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 1.780842e-04 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 2.481098e-02 (requested = 2.500000e-01) MG level 1 (GPU): GCR: Convergence at 2 iterations, L2 relative residual: iterated = 1.428037e-01 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 4.510930e-04 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 2.752705e-02 (requested = 2.500000e-01) MG level 1 (GPU): GCR: Convergence at 2 iterations, L2 relative residual: iterated = 1.459642e-01 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 4.868628e-04 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 2.839459e-02 (requested = 2.500000e-01) MG level 1 (GPU): GCR: Convergence at 2 iterations, L2 relative residual: iterated = 1.471898e-01 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 4.927556e-04 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 2.893779e-02 (requested = 2.500000e-01) MG level 1 
(GPU): GCR: Convergence at 2 iterations, L2 relative residual: iterated = 1.482350e-01 (requested = 2.500000e-01) MG level 0 (GPU): MR: Convergence at 4 iterations, L2 relative residual: iterated = 4.943840e-04 (requested = 2.500000e-01) GCR: Convergence at 5 iterations, L2 relative residual: iterated = 2.514618e-16, true = 2.514618e-16 (requested = 2.000000e-15) Done: 5 iter / 0.037155 secs = 303.209 Gflops Residuals: (L2 relative) tol 2.000000e-15, QUDA = 2.514618e-16, host = 3.145970e-16; (heavy-quark) tol 0.000000e+00, QUDA = 0.000000e+00 initQuda Total time = 0.739 secs init = 0.739 secs ( 99.998%), with 2 calls at 3.697e+05 us per call total accounted = 0.739 secs ( 99.998%) total missing = 0.000 secs ( 0.002%) loadGaugeQuda Total time = 0.012 secs download = 0.010 secs ( 88.409%), with 1 calls at 1.029e+04 us per call init = 0.001 secs ( 6.754%), with 7 calls at 1.123e+02 us per call compute = 0.000 secs ( 2.921%), with 5 calls at 6.800e+01 us per call free = 0.000 secs ( 0.026%), with 28 calls at 1.071e-01 us per call total accounted = 0.011 secs ( 98.110%) total missing = 0.000 secs ( 1.890%) loadCloverQuda Total time = 0.011 secs download = 0.004 secs ( 37.433%), with 1 calls at 4.136e+03 us per call upload = 0.004 secs ( 35.515%), with 1 calls at 3.924e+03 us per call init = 0.001 secs ( 11.657%), with 8 calls at 1.610e+02 us per call compute = 0.001 secs ( 5.195%), with 1 calls at 5.740e+02 us per call free = 0.000 secs ( 0.018%), with 10 calls at 2.000e-01 us per call total accounted = 0.010 secs ( 89.818%) total missing = 0.001 secs ( 10.182%) invertQuda Total time = 0.359 secs download = 0.002 secs ( 0.518%), with 1 calls at 1.858e+03 us per call upload = 0.001 secs ( 0.388%), with 1 calls at 1.393e+03 us per call init = 0.026 secs ( 7.221%), with 1328 calls at 1.951e+01 us per call preamble = 0.000 secs ( 0.054%), with 31 calls at 6.258e+00 us per call compute = 0.141 secs ( 39.381%), with 1909 calls at 7.402e+01 us per call epilogue = 0.001 secs ( 
0.233%), with 32 calls at 2.616e+01 us per call free = 0.000 secs ( 0.056%), with 3111 calls at 6.429e-02 us per call total accounted = 0.172 secs ( 47.852%) total missing = 0.187 secs ( 52.148%) plaqQuda Total time = 0.001 secs init = 0.000 secs ( 30.820%), with 1 calls at 1.880e+02 us per call compute = 0.000 secs ( 33.607%), with 1 calls at 2.050e+02 us per call comms = 0.000 secs ( 25.574%), with 1 calls at 1.560e+02 us per call free = 0.000 secs ( 0.000%), with 1 calls at 0.000e+00 us per call total accounted = 0.001 secs ( 90.000%) total missing = 0.000 secs ( 10.000%) endQuda Total time = 0.035 secs free = 0.000 secs ( 0.020%), with 50 calls at 1.400e-01 us per call total accounted = 0.000 secs ( 0.020%) total missing = 0.035 secs ( 99.980%) initQuda-endQuda Total time = 1.433 secs QUDA Total time = 1.156 secs download = 0.016 secs ( 1.409%), with 3 calls at 5.428e+03 us per call upload = 0.005 secs ( 0.460%), with 2 calls at 2.658e+03 us per call init = 0.768 secs ( 66.388%), with 1346 calls at 5.702e+02 us per call preamble = 0.000 secs ( 0.017%), with 31 calls at 6.355e+00 us per call compute = 0.142 secs ( 12.320%), with 1916 calls at 7.434e+01 us per call comms = 0.000 secs ( 0.014%), with 1 calls at 1.570e+02 us per call epilogue = 0.001 secs ( 0.072%), with 32 calls at 2.616e+01 us per call free = 0.000 secs ( 0.021%), with 3200 calls at 7.562e-02 us per call total accounted = 0.933 secs ( 80.700%) total missing = 0.223 secs ( 19.300%) Device memory used = 750.0 MiB Pinned device memory used = 0.0 MiB Managed memory used = 0.0 MiB Shmem memory used = 0.0 MiB Page-locked host memory used = 9.5 MiB Total host memory used >= 68.1 MiB
Result of commit ed6160e:
Disabling GPU-Direct RDMA access Enabling peer-to-peer copy engine and direct load/store access Rank order is column major (t running fastest) running the following test: prec prec_sloppy multishift matpc_type recon recon_sloppy solve_type S_dimension T_dimension Ls_dimension dslash_type normalization double single 1 even_even 18 18 direct 16/ 16/ 16 16 16 clover kappa MG parameters - number of levels 2 - level 1 number of null-space vectors 0 - level 1 number of pre-smoother applications 2 - level 1 number of post-smoother applications 2 MG Eigensolver parameters Grid partition info: X Y Z T 0 0 0 0 QUDA 1.1.0 (git 1.1.0-ed6160eb5-sm_60) CUDA Driver version = 12000 CUDA Runtime version = 11080 Graphic driver version = 525.147.05 Found device 0: Tesla P100-PCIE-16GB Found device 1: Tesla P100-PCIE-16GB Found device 2: Tesla P100-PCIE-16GB Found device 3: Tesla P100-PCIE-16GB Using device 0: Tesla P100-PCIE-16GB WARNING: Data reordering done on GPU (set with QUDA_REORDER_LOCATION=GPU/CPU) WARNING: Autotuning disabled WARNING: Autotuning disabled WARNING: Using device memory pool allocator WARNING: Using pinned memory pool allocator cublasCreated successfully Computed plaquette is 1.223368e-01 (spatial = 1.224580e-01, temporal = 1.222156e-01) Solution = mat, Solve = direct, Solver = gcr, Sloppy precision = single MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.055661e-06, true = 4.055661e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.853343e-06, true = 3.853343e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.675424e-06, true = 4.675424e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.190422e-06, true = 4.190422e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 
relative residual: iterated = 4.105885e-06, true = 4.105885e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.707929e-06, true = 3.707929e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.662438e-06, true = 4.662438e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.711373e-06, true = 3.711373e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.176259e-06, true = 4.176259e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.005173e-06, true = 4.005173e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.998309e-06, true = 3.998309e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.101013e-06, true = 4.101013e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.715315e-06, true = 3.715315e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.816766e-06, true = 3.816766e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.744014e-06, true = 4.744014e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.293564e-06, true = 4.293564e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.802038e-06, true = 3.802038e-06 (requested = 5.000000e-06) MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 
3.917147e-06, true = 3.917147e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.867327e-06, true = 4.867327e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.107894e-06, true = 4.107894e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.279538e-06, true = 4.279538e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.296545e-06, true = 4.296545e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 3.825819e-06, true = 3.825819e-06 (requested = 5.000000e-06)
MG level 0 (GPU): BiCGstab: Convergence at 6 iterations, L2 relative residual: iterated = 4.644150e-06, true = 4.644150e-06 (requested = 5.000000e-06)
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors
MG level 0 (GPU): ERROR: Precisions 4 2 do not match (/home/jiangxy/quda/build/lib/restrictor_3_24.cu:190 in Restrict<3, 24>()) (rank 0, host LQCD, lattice_field.h:885 in QudaPrecision quda::Precision_(const char*, const char*, int, const T1&, const T2&) [with T1 = quda::ColorSpinorField; T2 = quda::ColorSpinorField; QudaPrecision = QudaPrecision_s]())
MG level 0 (GPU): last kernel called was (name=N4quda7RNGInitE,volume=4x4x4x4,aux=GPU-offline,vol=256,parity=2,precision=4,order=2,Ns=2,Nc=24)
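For anyone decoding the error message: the integers in "Precisions 4 2 do not match" are most likely QudaPrecision values, which to my understanding equal the number of bytes per real number (1 = quarter, 2 = half, 4 = single, 8 = double). That would make this a single vs. half mismatch, consistent with `--prec-sloppy single` and `--prec-null half` in the command above, although the log does not say which field carries which precision. The sketch below only illustrates that decoding; `Precision`, `precision_name` and `check_precision` are hypothetical names, not QUDA's API.

```cpp
#include <stdexcept>
#include <string>

// Assumption: these values mirror QUDA's QudaPrecision enum, where each
// value is the number of bytes used per real number.
enum class Precision : int { Quarter = 1, Half = 2, Single = 4, Double = 8 };

// Hypothetical helper: map the integer printed in the error to a name.
std::string precision_name(int p) {
  switch (static_cast<Precision>(p)) {
  case Precision::Quarter: return "quarter";
  case Precision::Half:    return "half";
  case Precision::Single:  return "single";
  case Precision::Double:  return "double";
  }
  return "invalid";
}

// Hypothetical stand-in for a precision consistency check: returns the
// common precision, or throws with a message shaped like the one in the log.
int check_precision(int a, int b) {
  if (a != b) {
    throw std::runtime_error("Precisions " + std::to_string(a) + " " +
                             std::to_string(b) + " do not match");
  }
  return a;
}
```

Calling `precision_name(4)` and `precision_name(2)` gives "single" and "half", and `check_precision(4, 2)` throws with the same wording as the failing run.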
Thanks @SaltyChiang for the bug report. I'll fix this today.