kostrzewa closed this issue 3 years ago
This is with 3ef47faf55481520baa87f52ec934f9652db9639 of the develop branch.
Persists with 127153033a85cf03f8b7f5f9785933807c662282
@cpviolator: Would there be any interest in the Temple group in sharing experiences with the tm and tmclover mg? Perhaps you've come across the same issue?
(Looking on my phone, so not ideal)
I don't see above where you are setting the number of null-space vectors. By default, the correct templates are only instantiated for 24 and 32 vectors; from the error message it looks like you're using either 4 or 8 vectors, which is what triggers this error.
Also, set smoother_tol to 0.2 for each level.
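In terms of QUDA's C interface, these two suggestions correspond to fields of `QudaMultigridParam`; a minimal sketch (field names from quda.h, values as suggested above — not a complete setup):

```cpp
#include <quda.h>

// Sketch: the two settings discussed above, applied per level.
void set_mg_suggestions(QudaMultigridParam &mg_param) {
  mg_param = newQudaMultigridParam();
  mg_param.n_level = 2;
  for (int i = 0; i < mg_param.n_level; i++) {
    mg_param.n_vec[i] = 24;          // only 24 and 32 null vectors are instantiated by default
    mg_param.smoother_tol[i] = 0.2;  // smoother tolerance on each level
  }
}
```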
@maddyscientist Ah, I see, thanks. I used 4 vectors to make the setup fast and make everything fit on a single node. I might just try with a smaller gauge configuration and use 24 vectors instead, so I guess I'll close this issue then. Could I just call extractGhostMG<Float, 8>(u, Ghost, extract) without ill effects?
Ah, I see you are setting n_vec to 4: that's the problem. Change this to 24.
Also, best to use DIRECTPC_SOLVE for the smoother solve type. In general, the best solver performance comes from using e/o preconditioning everywhere.
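As a sketch of what that advice looks like in the interface (enum and field names from quda.h; the loop bound assumes `n_level` is already set):

```cpp
#include <quda.h>

// Use an even/odd preconditioned direct solve for the smoother on every level.
void set_smoother_solve_type(QudaMultigridParam &mg_param) {
  for (int i = 0; i < mg_param.n_level; i++)
    mg_param.smoother_solve_type[i] = QUDA_DIRECT_PC_SOLVE;  // e/o preconditioned smoother
}
```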
It's on my todo list to update the wiki page outlining all these parameters and good default settings.
One other thing: a few of us are getting together in a few weeks to work on optimizing the MG setup to get it into a form suitable for HMC. So I expect to see huge improvements here (most of the setup is on the CPU at the moment, which is why it is s..l..o..w).
@kostrzewa I'll keep you in the loop when it comes to our code development for TM multigrid. We have lots of code that computes several variations of 2pt and 3pt correlation functions using a deflated solver. I'm just concentrating on upgrading one of those functions to multigrid and comparing data so I know I have all the settings right. If I come across errors or subtleties I'll let you know.
Looks like Kate has already spotted a couple of things.
To make it work with 4 vectors you just need to instantiate the correct templates. There are a bunch of places where this is needed, so it will likely take a few edit/recompile iterations to stop the runtime errors.
It's restricted to 24/32 vectors to limit compile time and binary size. The long-term plan is runtime compilation to make the choice fully dynamic, but this is likely still some months off.
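The restriction comes from the compile-time dispatch pattern used throughout the MG code: a runtime switch over the number of null vectors calls explicitly instantiated templates, and any value without a case hits the error path. A self-contained sketch of the pattern (stand-in names and bodies, not the actual QUDA code):

```cpp
#include <stdexcept>
#include <string>

// Stand-in for a templated MG kernel; in QUDA the template parameter fixes
// the number of null-space vectors at compile time.
template <int Nvec> std::string extractGhostMG() {
  return "extracting ghost with Nvec=" + std::to_string(Nvec);
}

// Runtime dispatch: only the listed cases are compiled in, so e.g. nVec = 4
// fails at runtime unless a "case 4:" is added here (and the matching
// instantiations are added at every other dispatch site).
std::string extractGhostDispatch(int nVec) {
  switch (nVec) {
  case 24: return extractGhostMG<24>();
  case 32: return extractGhostMG<32>();
  default: throw std::runtime_error("nVec = " + std::to_string(nVec) + " not instantiated");
  }
}
```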
Regarding memory usage, I have some changes I'll work on including when working on HMC in a few months; these should halve the memory needed.
@maddyscientist Thanks a lot for your help. I've gotten it to launch the first gcrNkrylov GCR iterations. The setup time is quite tolerable, at least for the two levels that I've tried so far. I stop BiCGstab at 200 iterations because it more or less stops converging after that; I plan to study the subspace quality later, depending on which parameters are used in the setup.
Unfortunately, I get a segmentation fault. I can provide more output if necessary, the segfault occurs after iteration 10. Not sure if the behaviour of the GCR residual is normal...
MG level 1 (GPU): BiCGstab: 198 iterations, <r,r> = 3.220807e-01, |r|/|b| = 7.989212e-05
MG level 1 (GPU): BiCGstab: 199 iterations, <r,r> = 3.210173e-01, |r|/|b| = 7.976013e-05
MG level 1 (GPU): BiCGstab: 200 iterations, <r,r> = 3.208646e-01, |r|/|b| = 7.974115e-05
MG level 1 (GPU): WARNING: Exceeded maximum iterations 200
MG level 1 (GPU): BiCGstab: Reliable updates = 4
MG level 1 (GPU): BiCGstab: Convergence at 200 iterations, L2 relative residual: iterated = 7.974115e-05, true = 7.974243e-05
MG level 1 (GPU): Solution = 966.229
MG level 1 (GPU): smoother has operator N4quda6DiracME
MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): start creating transfer operator
MG level 1 (GPU): Transfer: using block size 4 x 4 x 4 x 4
MG level 1 (GPU): Transfer: creating V field with basis 0 with location 1
MG level 1 (GPU): Transfer: filling V field with zero
MG level 1 (GPU): Transfer: block orthogonalizing
MG level 1 (GPU): Block Orthogonalizing 5184 blocks of 1536 length and width 24
MG level 1 (GPU): Transferred prolongator to GPU
MG level 1 (GPU): end creating transfer operator
MG level 1 (GPU): Kappa = 1.563610e-01
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): Doing uni-directional link coarsening
MG level 1 (GPU): Computing forward 0 UV and VUV
MG level 1 (GPU): UAV2[0] = 1.243987e+05
MG level 1 (GPU): Y2[0] = 1.229086e+04
MG level 1 (GPU): Computing forward 1 UV and VUV
MG level 1 (GPU): UAV2[1] = 1.243810e+05
MG level 1 (GPU): Y2[1] = 1.231800e+04
MG level 1 (GPU): Computing forward 2 UV and VUV
MG level 1 (GPU): UAV2[2] = 1.244160e+05
MG level 1 (GPU): Y2[2] = 1.229529e+04
MG level 1 (GPU): Computing forward 3 UV and VUV
MG level 1 (GPU): UAV2[3] = 1.244830e+05
MG level 1 (GPU): Y2[3] = 1.232037e+04
MG level 1 (GPU): X2 = 5.602725e+05
MG level 1 (GPU): Reversing links
MG level 1 (GPU): Computing coarse local
MG level 1 (GPU): X2 = 4.761809e+04
MG level 1 (GPU): Summing diagonal contribution to coarse clover
MG level 1 (GPU): X2 = 2.383024e+04
MG level 1 (GPU):
MAGMA will use device architecture 370.
MG level 1 (GPU): BatchInvertMatrix with n=48 and batch=2592
MG level 1 (GPU): LU factorization completed in 0.140552 seconds with GFLOPS = 5.399986
MG level 1 (GPU): Matrix inversion completed in 0.033394 seconds with GFLOPS = 45.806095
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Creating coarse null-space vectors
MG level 1 (GPU): Creating next multigrid level
MG level 2 (GPU): Creating level 2 of 2 levels
MG level 2 (GPU): smoother has operator N4quda6DiracME
MG level 2 (GPU): Creating a GCR solver
MG level 2 (GPU): setup completed
MG level 1 (GPU): Assigned coarse solver to coarse MG operator
MG level 1 (GPU): setup completed
MG level 1 (GPU): MG level 1 Total time = 47.8651 secs
[...]
calling mg solver # invertQuda is called here
QUDA Inverter Parameters:
dslash_type = 7
inv_type = 2
kappa = 0.156361
mu = 0.015
twist_flavor = -1
tol = 1e-08
residual_type = 2
maxiter = 20000
reliable_delta = 0.01
use_sloppy_partial_accumulator = 0
max_res_increase = 1
max_res_increase_total = 10
heavy_quark_check = 10
pipeline = 0
num_offset = 0
num_src = 1
overlap = 0
solution_type = 0
solve_type = 0
matpc_type = 2
dagger = 0
mass_normalization = 0
solver_normalization = 0
preserve_source = 1
cpu_prec = 8
cuda_prec = 8
cuda_prec_sloppy = 4
input_location = 1
output_location = 1
clover_location = 1
gamma_basis = 2
dirac_order = 1
sp_pad = 0
tune = 0
gcrNkrylov = 10
use_init_guess = 0
omega = 1
verbosity = 3
iter = 0
spinorGiB = 0
gflops = 0
secs = 0
nev = 0
max_search_dim = 0
rhs_idx = 0
deflation_grid = 0
cg_iterref_tol = 0.05
eigcg_max_restarts = 2
max_restart_num = 3
inc_tol = 0.01
eigenval_tol = 0.1
use_resident_solution = 0
make_resident_solution = 0
Creating a DiracTwistedMass operator (1 flavor(s))
Creating a DiracTwistedMass operator (1 flavor(s))
typedid = N4quda19cpuColorSpinorFieldE
nColor = 3
nSpin = 4
twistFlavor = -1
nDim = 4
x[0] = 12
x[1] = 24
x[2] = 48
x[3] = 24
volume = 331776
precision = 8
pad = 0
stride = 331776
real_length = 7962624
length = 7962624
ghost_length = 0
ghost_norm_length = 0
bytes = 63700992
norm_bytes = 0
siteSubset = 1
siteOrder = 1
fieldOrder = 5
gammaBasis = 2
Is composite = 0
Is component = 0
PC type = 0
[...]
note: gcrNkrylov = 10
GCR: 0 iterations, <r,r> = 2.547952e+08, |r|/|b| = 1.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (9.747375e-01, -5.136634e-05)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.354015e-01, 2.136232e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 2.165519e-05
MG level 2 (GPU): WARNING: GCR with pipeline length 5 is experimental
MG level 2 (GPU): GCR: 0 iterations, <r,r> = 9.204328e+05, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 1 iterations, <r,r> = 1.924604e+05, |r|/|b| = 4.572720e-01
MG level 2 (GPU): GCR: 2 iterations, <r,r> = 3.728666e+04, |r|/|b| = 2.012708e-01
MG level 2 (GPU): GCR: 3 iterations, <r,r> = 7.841015e+03, |r|/|b| = 9.229753e-02
MG level 2 (GPU): GCR: 4 iterations, <r,r> = 1.716295e+03, |r|/|b| = 4.318171e-02
MG level 2 (GPU): GCR: 5 iterations, <r,r> = 3.844369e+02, |r|/|b| = 2.043697e-02
MG level 2 (GPU): GCR: 6 iterations, <r,r> = 8.746939e+01, |r|/|b| = 9.748370e-03
MG level 2 (GPU): GCR: 7 iterations, <r,r> = 2.019664e+01, |r|/|b| = 4.684287e-03
MG level 2 (GPU): GCR: 8 iterations, <r,r> = 4.703644e+00, |r|/|b| = 2.260587e-03
MG level 2 (GPU): GCR: 9 iterations, <r,r> = 1.104278e+00, |r|/|b| = 1.095325e-03
MG level 2 (GPU): GCR: 10 iterations, <r,r> = 2.607408e-01, |r|/|b| = 5.322411e-04
MG level 2 (GPU): GCR: 11 iterations, <r,r> = 6.186445e-02, |r|/|b| = 2.592534e-04
MG level 2 (GPU): GCR: 12 iterations, <r,r> = 1.469351e-02, |r|/|b| = 1.263475e-04
MG level 2 (GPU): GCR (restart): 1 iterations, <r,r> = 9.829608e+04, |r|/|b| = 3.267925e-01
MG level 2 (GPU): GCR: 13 iterations, <r,r> = 1.852264e+04, |r|/|b| = 1.418585e-01
MG level 2 (GPU): GCR: 14 iterations, <r,r> = 3.467377e+03, |r|/|b| = 6.137683e-02
MG level 2 (GPU): GCR: 15 iterations, <r,r> = 7.144977e+02, |r|/|b| = 2.786149e-02
MG level 2 (GPU): GCR: 16 iterations, <r,r> = 1.544110e+02, |r|/|b| = 1.295219e-02
MG level 2 (GPU): GCR: 17 iterations, <r,r> = 3.436934e+01, |r|/|b| = 6.110680e-03
MG level 2 (GPU): GCR: 18 iterations, <r,r> = 7.807142e+00, |r|/|b| = 2.912393e-03
MG level 2 (GPU): GCR: 19 iterations, <r,r> = 1.789579e+00, |r|/|b| = 1.394374e-03
MG level 2 (GPU): GCR: 20 iterations, <r,r> = 4.146161e-01, |r|/|b| = 6.711615e-04
MG level 2 (GPU): GCR: 21 iterations, <r,r> = 9.650391e-02, |r|/|b| = 3.237997e-04
MG level 2 (GPU): GCR: 22 iterations, <r,r> = 2.265651e-02, |r|/|b| = 1.568919e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 3.640993e+02 is greater than previous reliable residual norm 3.135221e+02 (total #inc 1)
MG level 2 (GPU): GCR (restart): 2 iterations, <r,r> = 1.325683e+05, |r|/|b| = 3.795105e-01
MG level 2 (GPU): GCR: 23 iterations, <r,r> = 2.494767e+04, |r|/|b| = 1.646338e-01
MG level 2 (GPU): GCR: 24 iterations, <r,r> = 4.670534e+03, |r|/|b| = 7.123398e-02
MG level 2 (GPU): GCR: 25 iterations, <r,r> = 9.654547e+02, |r|/|b| = 3.238694e-02
MG level 2 (GPU): GCR: 26 iterations, <r,r> = 2.089576e+02, |r|/|b| = 1.506722e-02
MG level 2 (GPU): GCR: 27 iterations, <r,r> = 4.650421e+01, |r|/|b| = 7.108044e-03
MG level 2 (GPU): GCR: 28 iterations, <r,r> = 1.057462e+01, |r|/|b| = 3.389506e-03
MG level 2 (GPU): GCR: 29 iterations, <r,r> = 2.444309e+00, |r|/|b| = 1.629604e-03
MG level 2 (GPU): GCR: 30 iterations, <r,r> = 5.679288e-01, |r|/|b| = 7.855085e-04
MG level 2 (GPU): GCR: 31 iterations, <r,r> = 1.332599e-01, |r|/|b| = 3.804992e-04
MG level 2 (GPU): GCR: 32 iterations, <r,r> = 3.146561e-02, |r|/|b| = 1.848937e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 3.679512e+02 is greater than previous reliable residual norm 3.640993e+02 (total #inc 2)
MG level 2 (GPU): WARNING: GCR: solver exiting due to too many true residual norm increases
MG level 2 (GPU): GCR: number of restarts = 2
MG level 2 (GPU): GCR: Convergence at 32 iterations, L2 relative residual: iterated = 3.835254e-01, true = 0.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (5.889364e-01, 1.418358e-04)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (2.047453e-01, -1.488353e-06)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 4.254054e-05
[...]
GCR debug iter=1: Ap2=7.942181e+06, p2=6.998764e+07, rPre2=5.380420e+06
GCR debug iter=1: Apr=(5.637887e+06,-5.053312e+02,7.941173e+06)
GCR debug iter=1: beta[0][0] = (0.000000e+00,0.000000e+00)
GCR debug iter=1: beta[0][1] = (3.176022e+01,-6.630973e-02)
GCR: 2 iterations, <r,r> = 1.377765e+06, |r|/|b| = 7.353464e-02
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (2.083481e-01, 7.163521e-05)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.307704e-01, 2.378019e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 7.291394e-04
MG level 2 (GPU): WARNING: GCR with pipeline length 5 is experimental
MG level 2 (GPU): GCR: 0 iterations, <r,r> = 2.197209e+05, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 1 iterations, <r,r> = 5.708147e+04, |r|/|b| = 5.096967e-01
MG level 2 (GPU): GCR: 2 iterations, <r,r> = 1.276572e+04, |r|/|b| = 2.410388e-01
MG level 2 (GPU): GCR: 3 iterations, <r,r> = 2.874830e+03, |r|/|b| = 1.143853e-01
MG level 2 (GPU): GCR: 4 iterations, <r,r> = 6.569313e+02, |r|/|b| = 5.467947e-02
MG level 2 (GPU): GCR: 5 iterations, <r,r> = 1.516918e+02, |r|/|b| = 2.627516e-02
MG level 2 (GPU): GCR: 6 iterations, <r,r> = 3.535873e+01, |r|/|b| = 1.268565e-02
MG level 2 (GPU): GCR: 7 iterations, <r,r> = 8.267209e+00, |r|/|b| = 6.134000e-03
MG level 2 (GPU): GCR: 8 iterations, <r,r> = 1.938526e+00, |r|/|b| = 2.970299e-03
MG level 2 (GPU): GCR: 9 iterations, <r,r> = 4.554805e-01, |r|/|b| = 1.439790e-03
MG level 2 (GPU): GCR: 10 iterations, <r,r> = 1.073920e-01, |r|/|b| = 6.991176e-04
MG level 2 (GPU): GCR: 11 iterations, <r,r> = 2.539955e-02, |r|/|b| = 3.399987e-04
MG level 2 (GPU): GCR (restart): 1 iterations, <r,r> = 4.209115e+04, |r|/|b| = 4.376830e-01
MG level 2 (GPU): GCR: 12 iterations, <r,r> = 7.984969e+03, |r|/|b| = 1.906342e-01
MG level 2 (GPU): GCR: 13 iterations, <r,r> = 1.498760e+03, |r|/|b| = 8.259054e-02
MG level 2 (GPU): GCR: 14 iterations, <r,r> = 3.096383e+02, |r|/|b| = 3.753977e-02
MG level 2 (GPU): GCR: 15 iterations, <r,r> = 6.716912e+01, |r|/|b| = 1.748433e-02
MG level 2 (GPU): GCR: 16 iterations, <r,r> = 1.495349e+01, |r|/|b| = 8.249651e-03
MG level 2 (GPU): GCR: 17 iterations, <r,r> = 3.399394e+00, |r|/|b| = 3.933372e-03
MG level 2 (GPU): GCR: 18 iterations, <r,r> = 7.810924e-01, |r|/|b| = 1.885452e-03
MG level 2 (GPU): GCR: 19 iterations, <r,r> = 1.814576e-01, |r|/|b| = 9.087657e-04
MG level 2 (GPU): GCR: 20 iterations, <r,r> = 4.250959e-02, |r|/|b| = 4.398532e-04
MG level 2 (GPU): GCR: 21 iterations, <r,r> = 1.000394e-02, |r|/|b| = 2.133781e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 2.392168e+02 is greater than previous reliable residual norm 2.051613e+02 (total #inc 1)
MG level 2 (GPU): GCR (restart): 2 iterations, <r,r> = 5.722470e+04, |r|/|b| = 5.103358e-01
MG level 2 (GPU): GCR: 22 iterations, <r,r> = 1.085010e+04, |r|/|b| = 2.222190e-01
MG level 2 (GPU): GCR: 23 iterations, <r,r> = 2.036403e+03, |r|/|b| = 9.627115e-02
MG level 2 (GPU): GCR: 24 iterations, <r,r> = 4.205175e+02, |r|/|b| = 4.374781e-02
MG level 2 (GPU): GCR: 25 iterations, <r,r> = 9.111480e+01, |r|/|b| = 2.036380e-02
MG level 2 (GPU): GCR: 26 iterations, <r,r> = 2.023839e+01, |r|/|b| = 9.597369e-03
MG level 2 (GPU): GCR: 27 iterations, <r,r> = 4.602356e+00, |r|/|b| = 4.576721e-03
MG level 2 (GPU): GCR: 28 iterations, <r,r> = 1.060197e+00, |r|/|b| = 2.196633e-03
MG level 2 (GPU): GCR: 29 iterations, <r,r> = 2.463396e-01, |r|/|b| = 1.058842e-03
MG level 2 (GPU): GCR: 30 iterations, <r,r> = 5.738701e-02, |r|/|b| = 5.110590e-04
MG level 2 (GPU): GCR: 31 iterations, <r,r> = 1.345183e-02, |r|/|b| = 2.474315e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 2.419803e+02 is greater than previous reliable residual norm 2.392168e+02 (total #inc 2)
MG level 2 (GPU): WARNING: GCR: solver exiting due to too many true residual norm increases
MG level 2 (GPU): GCR: number of restarts = 2
MG level 2 (GPU): GCR: Convergence at 31 iterations, L2 relative residual: iterated = 5.162312e-01, true = 0.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (4.990603e-01, -2.459406e-06)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.428030e-01, 8.311549e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 5.928231e-04
[...]
GCR debug iter=9: beta[8][0] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][1] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][2] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][3] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][4] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][5] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][6] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][7] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][8] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][9] = (1.159733e+01,1.119411e-01)
GCR: 10 iterations, <r,r> = 2.459163e+04, |r|/|b| = 9.824219e-03
quda::qudaMemcpy_ bytes = 31850496
[jrc0468:mpi_rank_8][error_sighandler] Caught error: Segmentation fault (signal 11)
[...]
Program received signal SIGSEGV, Segmentation fault.
__cxxabiv1::__dynamic_cast (src_ptr=0x2aad1c67f530,
src_type=0xd9a0f60 <typeinfo for quda::ColorSpinorField>,
dst_type=0x1bdfc20 <typeinfo for quda::cudaColorSpinorField>, src2dst=0)
at ../../../../libstdc++-v3/libsupc++/dyncast.cc:56
Will try to get some more debug information tomorrow...
Is this something that can be reproduced with the multigrid_invert_test that comes with QUDA? If so, if you send me a modified version of that, I can debug from here to see what's going on.
I'll give it a try, but I will first need to compile a version of QUDA with all the dependencies necessary for running the *_invert_test programs. Note that I went for 24 vectors now and simply ran on more nodes. I have already reproduced the problem with different lattice sizes on different numbers of nodes.
I don't think there are any extra dependencies for the tests. You don't need to use QMP/QIO, it just means you can't load a lattice, but a seg-fault shouldn't care if the data is random or a real config 😉 , so just a plain MPI + MAGMA build should be fine.
Thanks, that's true of course! Both the "tmLQCD-orchestrated" MG setup as well as the one from multigrid_invert_test fail MG::verify(). I don't know if that's already a hint or just a side-effect of the MG implementation being "work in progress". I will disable it and report back.
I can't reproduce the segmentation fault with multigrid_invert_test. I will try to see if I messed something up on the tmLQCD side with respect to reproducing the way the setup and solver are called.
Maybe try building with CMAKE_BUILD_TYPE=HOSTDEBUG. This will enable debug information in the CPU code and may help pin down where the segfault occurs using a backtrace.
Thanks. I seem to have resolved the segmentation fault, which appears to have resulted from unfortunate combinations of matpc_type, solution_type and solve_type, together with a mismatch between the outer and inner parameters.
After I've committed my changes, I'll revert to the segfaulting version to try to pin down the problematic combination, since this should probably be caught (as other unsupported combinations are).
The next step is getting the solver to actually perform well, because currently GCR gets stuck after reducing the residual by about one order of magnitude. This could be related to the fact that the gamma bases are different in the outer and inner parameter set.
As a side note: the two level solver "works" for plain Wilson quarks on a twisted mass sea, as long as the pion mass is well in excess of around 300 MeV or so. For twisted mass quarks on a twisted mass sea at a pion mass of around 300 MeV, GCR gets stuck as mentioned above.
Attempting to use a three-level setup fails for me for both Wilson and TM: the orthogonalisation of the null vectors breaks down at some early stage (vector 7, 8 or 9 out of 32).
MG level 2 (GPU): BiCGstab: Convergence at 352 iterations, L2 relative residual: iterated = 4.647049e-06, true = 2.319178e+00
MG level 2 (GPU): ERROR:
Cannot orthogonalize 7 vector
(rank 0, host jrc0002, multigrid.cpp:695 in quda::MG::generateNullVectors())
It's a bit odd that the iterated residual is so far off the true one, although I've noticed that on the coarser levels, delta is relaxed significantly (in multigrid.cpp), if I read the code correctly.
@kostrzewa I'm not sure if it's possible, but can you change the BiCGstab to CG? I noticed in my studies of low mass pions with Wilson Clover fermions using QUDA that BiCGstab would get stuck, but switching to CG would allow the solver to converge.
@cpviolator CG would only work for DD^dag and I don't think it would work for the homogeneous case anyway... or did you work on the MG setup phase? As far as I discussed with @maddyscientist, the fact that BiCGStab doesn't converge very well during null vector generation is not that important. Do I remember correctly, Kate? I seem to recall that you mentioned that the resulting solver still converges well for TM.
The segfaulting combination of parameters was:
QudaInvertParam for outer solver:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
QudaInvertParam passed via QudaMultigridParam->invert_param
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
QudaMultigridParam:
# QUDA: QUDA Multigrid Parameters:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: n_level = 2
# QUDA: smoother[i] = 3
# QUDA: smoother_solve_type[i] = 2
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: spin_block_size[i] = 2
# QUDA: n_vec[i] = 24
# QUDA: cycle_type[i] = 3
# QUDA: nu_pre[i] = 2
# QUDA: nu_post[i] = 2
# QUDA: coarse_grid_solution_type[i] = 0
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: smoother[i] = 2
# QUDA: smoother_solve_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: compute_null_vector = 1
# QUDA: generate_all_levels = 1
# QUDA: run_verify = 0
# QUDA: gflops = 0
# QUDA: secs = 0
Not sure if this helps...
A working parameter set is: outer QudaInvertParam:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 2
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
QudaMultigridParam->invert_param
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 0
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 1
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
QudaMultigridParam:
# QUDA: QUDA Multigrid Parameters:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 0
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 1
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: n_level = 2
# QUDA: smoother[i] = 3
# QUDA: smoother_solve_type[i] = 2
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: spin_block_size[i] = 2
# QUDA: n_vec[i] = 24
# QUDA: cycle_type[i] = 3
# QUDA: nu_pre[i] = 2
# QUDA: nu_post[i] = 2
# QUDA: coarse_grid_solution_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: smoother[i] = 2
# QUDA: smoother_solve_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: compute_null_vector = 1
# QUDA: generate_all_levels = 1
# QUDA: run_verify = 0
# QUDA: gflops = 0
# QUDA: secs = 0
As far as I can tell, since gcrNkrylov is irrelevant for the segfault, the only difference is whether the outer solve is QUDA_DIRECT_PC_SOLVE or not. If I run the outer solver with QUDA_DIRECT_SOLVE, I get the segfault again.
As a consequence, when setting up the MG parameters, I will set (inv_param is the param struct for the outer solver):
mg_param->coarse_grid_solution_type[i] = inv_param.solve_type == QUDA_DIRECT_PC_SOLVE ? QUDA_MATPC_SOLUTION : QUDA_MAT_SOLUTION;
so the other difference will be the coarse grid solution type.
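Put differently, the workaround ties the MG coarse-grid solution type to the outer solve type so the two stay consistent; a sketch of that assignment applied to every level (enum and field names from quda.h, mirroring the line above):

```cpp
#include <quda.h>

// Keep the coarse-grid solution type consistent with the outer solve type:
// a preconditioned outer solve pairs with a MATPC coarse solution, a full
// solve with a MAT solution. The mismatched combination segfaulted here.
void match_mg_to_outer(QudaMultigridParam *mg_param, const QudaInvertParam &inv_param) {
  for (int i = 0; i < mg_param->n_level; i++)
    mg_param->coarse_grid_solution_type[i] =
      (inv_param.solve_type == QUDA_DIRECT_PC_SOLVE) ? QUDA_MATPC_SOLUTION : QUDA_MAT_SOLUTION;
}
```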
@kostrzewa Apologies I've not had a chance to look at this for the past few weeks (too many things ongoing at the moment). Next week I'll be focussed on MG at our hackathon, so will look at this issue then.
Thanks a lot! I've essentially gotten everything to work for a two-level solve, but the segfault still exists for certain combinations of parameters, as described above. In addition, I can't get the three-level solver to work due to the orthogonalisation problem on the coarser level. I was thinking of relaxing the orthogonalisation condition in multigrid.cpp for this purpose, but would first like to understand if there isn't some other problem, which could also explain the poor performance that I get.
I'm having convergence issues with the two-level solver on twisted mass gauge configurations with low-mass Wilson fermions (I suspect this is due to exceptional eigenvalues; I can provide configurations for testing [ILDG format, so easily read via QIO]). The solver gets stuck after reducing the residual by some factor x, where x can be between 1e2 and 1e6 or so. The situation is similar, but worse, for twisted mass quarks on the same configurations. For somewhat heavy Wilson fermions I get convergence, but the iteration counts are quite high (600+ iterations of two-level MG-GCR at am_0^bare ~ 0.0065, a ~ 0.07 fm).
If you want, I can provide a detailed list of the parameters used (in a more condensed and readable format than above) together with the gauge configurations.
I have now re-enabled the verification step and get this kind of result (for the 600+ iteration solver mentioned above):
[...]
MG level 1 (GPU): Vector 23: norms v_k = 1.000000e+00 P^\dagger v_k = 9.999999e-01 P P^\dagger v_k = 9.999999e-01
MG level 1 (GPU): L2 relative deviation = 2.263896e-05
MG level 1 (GPU):
MG level 1 (GPU): Checking 0 = (1 - P^\dagger P) eta_c
MG level 1 (GPU): Vector norms 1.326953e+06 1.326953e+06 (fine tmp 1.326953e+06)
MG level 1 (GPU): L2 relative deviation = 9.367346e-06
MG level 1 (GPU):
MG level 1 (GPU): Comparing native coarse operator to emulated operator
MG level 1 (GPU): Vector norms Emulated=1.761139e+05 Native=1.761139e+05
MG level 1 (GPU): L2 relative deviation = 2.570257e-05
I've gotten the three-level solver to work by setting mg_param->generate_all_levels = QUDA_BOOLEAN_NO and instantiating a few more block sizes for the restrictor. Unfortunately, although this roughly halves the number of iterations for the converging case, it does not seem to help with the stagnation problem described above, although it does increase the mentioned factor x by which the residual is reduced before the solver stagnates.
The good news is that for tm+clover, the solver behaves quite well even with just two levels and converges even at the physical pion mass in around 1000 MG-GCR iterations, which is quite interesting. The setup time is quite substantial, however, as you mentioned.
@kostrzewa Excellent. So the main issue that is left to be resolved is the segmentation fault?
@maddyscientist Well, it would be nice if it converged also for tm without a clover term. It would probably also be interesting to understand if accidental zero eigenvalues are really to blame for the non-convergence of low-mass Wilson quarks on a twisted mass sea. The former point is more important though. I'm also not sure why mg_param->generate_all_levels = QUDA_BOOLEAN_YES leads to orthogonalisation problems for the coarse null vectors in a three-level setup.
Another thing I've encountered is an illegal memory access in the tuning phase in a three-level setup in which I use 4^4 blocks on the fine grid and 3^4 blocks on the coarse grid (the configuration is 48^3x96, so this seems to be a good setup for the 16 cards that I use presently). The error I get is as follows:
MAGMA will use device architecture 370.
MG level 2 (GPU): BatchInvertMatrix with n=64 and batch=32
MG level 2 (GPU): LU factorization completed in 0.003755 seconds with GFLOPS = 5.924977
MG level 2 (GPU): Matrix inversion completed in 0.002197 seconds with GFLOPS = 20.369750
MG level 2 (GPU): ....done computing Y field
MG level 2 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
(rank 11, host jrc0461, tune.cpp:600 in quda::tuneLaunch())
MG level 2 (GPU): last kernel called was (name=N4quda14RestrictLaunchIfNS_11RestrictArgINS_11colorspinor12FieldOrderCBIfLi2ELi32ELi1EL16QudaFieldOrder_s2EEES5_NS3_IfLi2ELi32ELi32ELS4_2EEELi2ELi2EEELi2ELi32ELi2ELi32ELi2EEE,volume=4x2x2x2,12x6x6x6,aux=vol=32,stride=16,precision=4,TwistFlavour=-1,vol=2592,stride=1296,precision=4,TwistFlavour=-1)
MG level 2 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
(rank 9, host jrc0461, tune.cpp:600 in quda::tuneLaunch())
MG level 2 (GPU): last kernel called was (name=N4quda14RestrictLaunchIfNS_11RestrictArgINS_11colorspinor12FieldOrderCBIfLi2ELi32ELi1EL16QudaFieldOrder_s2EEES5_NS3_IfLi2ELi32ELi32ELS4_2EEELi2ELi2EEELi2ELi32ELi2ELi32ELi2EEE,volume=4x2x2x2,12x6x6x6,aux=vol=32,stride=16,precision=4,TwistFlavour=-1,vol=2592,stride=1296,precision=4,TwistFlavour=-1)
I don't get any warnings about odd block counts in any dimension on the coarsest level, but I'm not sure what happens once this is e/o preconditioned. If I understand things correctly and didn't make any mistakes, I should have (xyzt) N=4x2x2x2 aggregates of geometric volume 3^4 on each card in this setup.
@kostrzewa Ok, multiple things to get fixed then 😄
One thing I've been meaning to ask you: I see in the tmLQCD code that you have BiCGStab(l) support. You may have noticed that @weinbe2 has been adding this to QUDA (I think it should help reduce the null-space generation cost). How does this solver behave with twisted mass and twisted clover? I would have naively thought that it would behave a whole lot better than BiCGStab, since it should treat complex eigenvalues better.
@maddyscientist Regarding plain BiCGStab, we know that it doesn't converge well (at all) for tm or tmclover. What I've tried is moving away from maximal twist in the setup phase (kappa_setup ~ kappa_c - 0.003*kappa_c or kappa_c - 0.006*kappa_c), which seems to produce null vectors that are good enough at least for tmclover. This is essential for the prolongator to pass MG::verify(), as far as I can tell. The unconverged null vectors produced at kappa_c (500 iterations of BiCGStab) result in a failing MG::verify():
MG level 1 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 32 vectors
MG level 1 (GPU): Vector 0: norms v_k = 1.000000e+00 P^\dagger v_k = 1.000000e+00 P P^\dagger v_k = 1.000000e+00
MG level 1 (GPU): L2 relative deviation = 1.213198e-04
For Wilson quarks, doing this also seems to help in making outer iterations cheaper by making the coarse operator better conditioned without really affecting the outer iteration count, as far as I can see and if I have understood everything correctly. This is in line with the findings on the DDalphaAMG side that increasing the coarse twisted mass parameter improves overall time to solution.
As for BiCGStab-L in tmLQCD, I don't know if that was ever used. Maybe @urbach and I can revive it and run some tests. (I don't think it's really wired up for most operators in tmLQCD, but it should be easy enough to get working again)
I can confirm the segfaults when using a vanilla quda-develop and multigrid_invert_test. When using a twisted-mass dslash and direct solve type there is a segfault. When using twisted-clover and direct, verify() fails in multigrid.cpp. Same errors for normop and normop-pc solves too. I can only get things working with direct-pc.
Ok I will try to reproduce this in the next few days. Does the problem occur for both single and multi GPU?
On progress last week at the hackathon: we now have an interface which allows the MG state to persist for gauge evolution, with the introduction of a new interface call updateMultigridQuda. Also, I've just about got most of the setup offloaded to the GPU, so there will soon be a big speedup in this regard.
Same story for single and multi-GPU. Right now I'm studying the working test combo --dslash-type twisted-mass --solve-type direct-pc and trying to get our code to work with it. We used to do normop-pc to get the eigenvectors for deflation, but I think that we can do a direct-pc solve if we upgrade to MG. I'll report back when a working setup is found.
OK, I'm also seeing with our code that the level 1 BiCGStab solver has trouble getting past 1e-5 after 200 iterations using direct-pc and twisted mass, and that the GCR solver then has trouble getting past 1e-1 for |r|/|b|. I'm relaxing the solver tolerance way down to 5e-1 just to get it to 'converge' and produce data. I'm using the same MG setup as given in multigrid_invert_test: a two-level solve with 24 null vectors. I did change the maxiter on GCR level 1 to 200, as prescribed by @kostrzewa, just for a little speed-up.
Next task is to utilise @weinbe2's BiCGStab(L) code and look for improvements. @maddyscientist, would it be worthwhile to keep the first level(s) very loose on the tolerance, then dial down at, say, level 3 for better convergence? Kind of like using the lower (1,2) MG levels as a sloppy preconditioner for the tighter (3,4) levels.
The code we are developing will be used for twisted-clover, so once I have some physical sized test lattices I'll report back with findings.
@cpviolator see here for a working setup for tmclover: https://github.com/kostrzewa/tmLQCD/blob/quda_interface_multigrid/quda_interface.c#L1084
1) lower kappa by 2.5 per-mille (away from kappa_c, to get a positive PCAC mass)
2) increase mu in the MG to 5.2*target_mu
Works for 2 and 3 level solves. I used direct_pc as that is the only setup where I don't experience segfaults, as you said above.
ps: in this setup, some of the homogeneous BiCGstab solves converge with maxiter=500
@cpviolator regarding the lattices, can Martha get these or would you like me to provide some?
The ideal situation would be to generate the test vectors iteratively: run O(100) iterations of BiCGstab on the homogeneous systems to build a first approximation MG solver. Then use this solver on the homogeneous systems to improve the test vectors. Finally, at least according to the DDalphaAMG people (and in line with what I see in the QUDA-MG), the coarsest level mu should be increased relative to the finer levels, such that the coarse operator is better conditioned. An alternative would be to stringently limit the number of iterations on the coarsest level, I guess, but I feel that it would be difficult to find an ideal setting for the latter approach.
@kostrzewa Thank you very much! I'll run with that setup and see what it brings. We have some lattices knocking around in Cyprus so I should be OK there.
With regards to the convergence problem, I think that this adaptive approach is what's needed too. I can design a small testing program to see what sorts of changes give the best improvements, if any at all.
Ok, just reporting back on progress.
I was using an L=48 lattice with 3 and 4 GPUs but ran into an error about uninstantiated block sizes. @kostrzewa, can you explain how you instantiated different block sizes? I see the code in the function void apply(const cudaStream_t &stream) and the various permissible block sizes. Does one simply add more conditionals? I also see in your setMultigridParams() setup that there are some conditionals in the i,j loop over mg_levels not present in the current develop branch. I'll copy/paste that into my code.
I'm switching to L=64 T=128 as these are the lattices which ultimately will be used in production so it makes sense to work from there and generalise later.
Yes, just add the correct block sizes instantiations to restrictor.cu. Block size template is half of multigrid aggregate size (parity is on the y axis).
@cpviolator Yes, I simply added some more block sizes to restrictor.cu. As for the aggregation parameters in "my" setMultigridParams(), I wanted to make the coarser blocking automatic (for level n, say) by choosing either 3, 2 or 1 block aggregates on the coarser levels depending on the dimensions of the local block lattice on level n-1. I get issues as described above with the odd blocking dimensions, so something is still amiss.
@kostrzewa I'm not sure that the code you have in https://github.com/kostrzewa/tmLQCD/blob/quda_interface_multigrid/quda_interface.c#L1084 does what it should. From what I've seen, the \mu parameter should remain unchanged on the finest levels, while on the coarsest level it gets rescaled. In your code (as far as I can see) the \mu parameter is scaled up at setup time to 5.2 times its input value, and remains that way throughout the calculation.
I'm writing code to change the \mu parameter at the nth MG level so that one inputs a \mu_0 and C_mu. One then specifies which value of \mu is to be used at the nth level. For the three level solves we do using DDalpha, we have
level 1: mu = mu_0
level 2: mu = mu_0
level 3: mu = mu_0 * C_mu
As a baseline, I'll emulate this setup in QUDA, but I think a worthwhile experiment would be to use
level 1: mu = mu_0
level 2: mu = mu_0 * (C_mu)^0.5
level 3: mu = mu_0 * C_mu
and similar power law scaling for higher level solves, eg
level 1: mu = mu_0 * (C_mu)^{0/3}
level 2: mu = mu_0 * (C_mu)^{1/3}
level 3: mu = mu_0 * (C_mu)^{2/3}
level 4: mu = mu_0 * (C_mu)^{3/3}
Please let me know if I've read your code incorrectly.
@cpviolator You are of course correct, I simply haven't had the time to poke into the QUDA innards to copy the DDalphaAMG-style mu-rescaling. Increasing mu on all coarse levels seems to kind of work as well and it was a first step towards getting a solver that converges. Especially during the setup phase, mu should ideally be the target mass, although this will likely affect the convergence of BiCGstab.
I'm working on code to declare an array for mu which can be accessed by the solvers at various levels. The conventional thought is to increase mu on the coarsest level, but this method should allow for experimentation too. I'll share it when I get convergence.
we've addressed this
I'm currently extending the tmLQCD QUDA interface to provide access to the multigrid solver and its parameters. While the gamma basis problem still exists, I currently "fix" this by forcing QUDA_DEGRAND_ROSSI_GAMMA_BASIS in the QudaInvertParam struct which is pointed to in the QudaMultigridParam struct when it is passed to newMultigridQuda. The problem I encounter now is that, after setting up the null vectors and running the MAGMA routines, the code fails with the following error:

I set the MG parameters as follows, where mg_param->invert_param is already populated with parameters appropriate for the operator being inverted. I use few vectors for testing purposes. Is there anything that you can spot which you would consider responsible for the error?