lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

QUDA twisted mass multigrid interface progress #512

Closed kostrzewa closed 3 years ago

kostrzewa commented 8 years ago

I'm currently extending the tmLQCD QUDA interface to provide access to the multigrid solver and its parameters. While the gamma-basis problem still exists, I currently "fix" it by forcing QUDA_DEGRAND_ROSSI_GAMMA_BASIS in the QudaInvertParam struct that the QudaMultigridParam struct points to when it is passed to newMultigridQuda. The problem I encounter now is that after setting up the null vectors and running the MAGMA routines, the code fails with the following error:

MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): start creating transfer operator
MG level 1 (GPU): Transfer: using block size 4 x 4 x 4 x 4
MG level 1 (GPU): Transfer: creating V field with basis 0 with location 1
MG level 1 (GPU): Transfer: filling V field with zero
MG level 1 (GPU): Transfer: block orthogonalizing
MG level 1 (GPU): Block Orthogonalizing 20736 blocks of 1536 length and width 4
MG level 1 (GPU): Transferred prolongator to GPU
MG level 1 (GPU): end creating transfer operator
MG level 1 (GPU): Kappa = 1.563610e-01
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): Doing uni-directional link coarsening
MG level 1 (GPU): Computing forward 0 UV and VUV
MG level 1 (GPU): UAV2[0] = 8.294400e+04
MG level 1 (GPU): Y2[0] = 4.523735e+03
MG level 1 (GPU): Computing forward 1 UV and VUV
MG level 1 (GPU): UAV2[1] = 8.294400e+04
MG level 1 (GPU): Y2[1] = 4.529617e+03
MG level 1 (GPU): Computing forward 2 UV and VUV
MG level 1 (GPU): UAV2[2] = 8.294400e+04
MG level 1 (GPU): Y2[2] = 4.533525e+03
MG level 1 (GPU): Computing forward 3 UV and VUV
MG level 1 (GPU): UAV2[3] = 8.302104e+04
MG level 1 (GPU): Y2[3] = 4.529258e+03
MG level 1 (GPU): X2 = 4.461438e+05
MG level 1 (GPU): Reversing links
MG level 1 (GPU): Computing coarse local
MG level 1 (GPU): X2 = 4.113432e+04
MG level 1 (GPU): Summing diagonal contribution to coarse clover
MG level 1 (GPU): X2 = 8.461884e+03
MG level 1 (GPU): 
MAGMA will use device architecture 370.
MG level 1 (GPU): BatchInvertMatrix with n=8 and batch=10368
MG level 1 (GPU): LU factorization completed in 0.889723 seconds with GFLOPS = 0.015568
MG level 1 (GPU): ERROR: Ncolor = 8 not supported (rank 1, host jrc0002, extract_gauge_ghost_mg.cu:59 in extractGhostMG())
MG level 1 (GPU):        last kernel called was (name=cudaMemcpyDeviceToHost,volume=bytes=5308416,aux=BatchInvertMatrix,blas_magma.cu:1130)
MG level 1 (GPU): ERROR: Ncolor = 8 not supported (rank 3, host jrc0002, extract_gauge_ghost_mg.cu:59 in extractGhostMG())
MG level 1 (GPU):        last kernel called was (name=cudaMemcpyDeviceToHost,volume=bytes=5308416,aux=BatchInvertMatrix,blas_magma.cu:1130)
MG level 1 (GPU): ERROR: Ncolor = 8 not supported (rank 2, host jrc0002, extract_gauge_ghost_mg.cu:59 in extractGhostMG())
MG level 1 (GPU):        last kernel called was (name=cudaMemcpyDeviceToHost,volume=bytes=5308416,aux=BatchInvertMatrix,blas_magma.cu:1130)
MG level 1 (GPU): Matrix inversion completed in 0.036952 seconds with GFLOPS = 0.781135
MG level 1 (GPU): ERROR: Ncolor = 8 not supported (rank 0, host jrc0002, extract_gauge_ghost_mg.cu:59 in extractGhostMG())
MG level 1 (GPU):        last kernel called was (name=cudaMemcpyDeviceToHost,volume=bytes=5308416,aux=BatchInvertMatrix,blas_magma.cu:1130)

I set the MG parameters as follows, where mg_param->invert_param is already populated with parameters appropriate for the operator being inverted. I use only a few vectors for testing purposes.

void _setMultigridParam(QudaMultigridParam* mg_param) {
  QudaInvertParam *mg_inv_param = mg_param->invert_param;

  mg_inv_param->sp_pad = 0;
  mg_inv_param->cl_pad = 0;

  mg_inv_param->preserve_source = QUDA_PRESERVE_SOURCE_NO;
  mg_inv_param->gamma_basis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;
  mg_inv_param->dirac_order = QUDA_DIRAC_ORDER;

  mg_inv_param->input_location = QUDA_CPU_FIELD_LOCATION;
  mg_inv_param->output_location = QUDA_CPU_FIELD_LOCATION;

  mg_param->n_level = 2;
  for (int i=0; i<mg_param->n_level; i++) {
    for (int j=0; j<QUDA_MAX_DIM; j++) {
      mg_param->geo_block_size[i][j] = 4;
    }
    mg_param->spin_block_size[i] = 1;
    mg_param->n_vec[i] = 4;
    mg_param->nu_pre[i] = 2;
    mg_param->nu_post[i] = 2;

    mg_param->cycle_type[i] = QUDA_MG_CYCLE_RECURSIVE;
    mg_param->smoother[i] = QUDA_MR_INVERTER;
    // set the smoother / bottom solver tolerance (for MR smoothing this will be ignored)
    mg_param->smoother_tol[i] = 10; // repurpose heavy-quark tolerance for now

    mg_param->global_reduction[i] = QUDA_BOOLEAN_YES;

    // set to QUDA_DIRECT_SOLVE for no even/odd preconditioning on the smoother
    // set to QUDA_DIRECT_PC_SOLVE to enable even/odd preconditioning on the smoother
    mg_param->smoother_solve_type[i] = QUDA_DIRECT_SOLVE;//QUDA_DIRECT_PC_SOLVE; // EVEN-ODD

    // set to QUDA_MAT_SOLUTION to inject a full field into coarse grid
    // set to QUDA_MATPC_SOLUTION to inject single parity field into coarse grid

    // if we are using an outer even-odd preconditioned solve, then we
    // use single parity injection into the coarse grid
    mg_param->coarse_grid_solution_type[i] = mg_inv_param->solve_type == QUDA_DIRECT_PC_SOLVE ? QUDA_MATPC_SOLUTION : QUDA_MAT_SOLUTION;

    mg_param->omega[i] = 0.85; // over/under relaxation factor

    mg_param->location[i] = QUDA_CUDA_FIELD_LOCATION;
  }

  // only coarsen the spin on the first restriction
  mg_param->spin_block_size[0] = 2;

  // coarse grid solver is GCR
  mg_param->smoother[mg_param->n_level-1] = QUDA_GCR_INVERTER;

  mg_param->compute_null_vector = QUDA_COMPUTE_NULL_VECTOR_YES;
  mg_param->generate_all_levels = QUDA_BOOLEAN_YES;
  mg_param->run_verify = QUDA_BOOLEAN_YES;

  // set file i/o parameters
  strcpy(mg_param->vec_infile, "");
  strcpy(mg_param->vec_outfile,"");

  // these need to be set for now but are actually ignored by the MG setup
  // needed to make it pass the initialization test
  // in tmLQCD, these are set elsewhere, except for inv_type
  mg_inv_param->inv_type = QUDA_MR_INVERTER;
  //inv_param->tol = 1e-10;
  //inv_param.maxiter = 1000;
  //inv_param.reliable_delta = 1e-10;
  mg_inv_param->gcrNkrylov = 10;

  mg_inv_param->verbosity = QUDA_VERBOSE;
  mg_inv_param->verbosity_precondition = QUDA_VERBOSE;
}
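
For context, the function above is invoked roughly as follows on the tmLQCD side (a sketch of the intended call sequence as I understand the QUDA C interface; the surrounding tmLQCD plumbing is omitted, and spinor_out/spinor_in stand for the host spinor fields):

  QudaMultigridParam mg_param = newQudaMultigridParam();
  QudaInvertParam mg_inv_param = newQudaInvertParam();
  mg_param.invert_param = &mg_inv_param;
  // populate mg_inv_param with the operator parameters, then:
  _setMultigridParam(&mg_param);

  // build the MG preconditioner once
  void *mg_preconditioner = newMultigridQuda(&mg_param);

  // hook it into the outer solve (inv_param is the outer QudaInvertParam)
  inv_param.inv_type = QUDA_GCR_INVERTER;
  inv_param.inv_type_precondition = QUDA_MG_INVERTER;
  inv_param.preconditioner = mg_preconditioner;
  invertQuda(spinor_out, spinor_in, &inv_param);  // solution first, then source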

Is there anything that you can spot, which you would consider responsible for the error?

kostrzewa commented 8 years ago

This is with 3ef47faf55481520baa87f52ec934f9652db9639 of the develop branch.

kostrzewa commented 8 years ago

Persists with 127153033a85cf03f8b7f5f9785933807c662282

kostrzewa commented 8 years ago

@cpviolator: Would there be any interest in the Temple group in sharing experiences with the tm and tmclover mg? Perhaps you've come across the same issue?

maddyscientist commented 8 years ago

(Looking on my phone, so not ideal)

I don't see above where you are setting the number of null-space vectors. By default, the correct templates are only instantiated for 24 and 32 vectors; from the error message it looks like you're using either 4 or 8 vectors, which is what leads to this message.

maddyscientist commented 8 years ago

Also, set smoother_tol to 0.2 for each level.

kostrzewa commented 8 years ago

@maddyscientist Ah, I see, thanks. I used 4 vectors to make the setup fast and make everything fit on a single node. I might just try with a smaller gauge configuration and use 24 vectors instead. I guess I'll close this issue then. Could I just call extractGhostMG<Float, 8>(u, Ghost, extract) without ill effects?

maddyscientist commented 8 years ago

Ah, I see you are setting n_vec to 4: that's the problem. Change this to 24.

Also, it's best to use DIRECT_PC_SOLVE for the smoother solve type. In general, the best solver comes from using e/o preconditioning everywhere.
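
In terms of the setup function posted above, the suggested changes amount to something like this (a sketch; only the lines that change in the per-level loop are shown):

  mg_param->n_vec[i] = 24;                                  // only 24 and 32 are instantiated by default
  mg_param->smoother_tol[i] = 0.2;                          // per-level smoother tolerance
  mg_param->smoother_solve_type[i] = QUDA_DIRECT_PC_SOLVE;  // e/o preconditioned smoother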

It's on my todo list to update the wiki page outlining all these parameters and good default settings.

One other thing: a few of us are getting together in a few weeks to work on optimizing the MG setup to get it into a form suitable for HMC. So I expect to see huge improvements in this (most of the setup is on the CPU at the moment, which is why it is s..l..o..w).

cpviolator commented 8 years ago

@kostrzewa I'll keep you in the loop when it comes to our code development for TM multigrid. We have lots of code that performs several variations of 2pt and 3pt correlation functions using a deflated solver. I'm just concentrating on upgrading one of those functions to multigrid and comparing data so I know I have all the settings right. If I come across errors or subtleties I'll let you know.

Looks like Kate has already spotted a couple of things.

maddyscientist commented 8 years ago

To make it work with 4 vectors you just need to instantiate the correct templates. There are a bunch of places where this is needed, so it will likely take a few edit/recompile iterations to eliminate the runtime errors.

It's restricted to 24/32 vectors to keep compile time and binary size down. The long-term plan is runtime compilation to make the choice fully dynamic, but this is likely still some months off.

Regarding memory usage, I have some changes I'll work on including when working on HMC in a few months; this should halve the memory needed.

kostrzewa commented 8 years ago

@maddyscientist Thanks a lot for your help. I've gotten it to launch the first gcrNkrylov GCR iterations. The setup time is quite tolerable, at least for the two levels that I've tried now. I stop BiCGStab at 200 iterations because it more or less stops converging after that; I plan to study the subspace quality later, depending on which parameters are used in the setup.

Unfortunately, I get a segmentation fault. I can provide more output if necessary; the segfault occurs after iteration 10. I'm not sure whether the behaviour of the GCR residual is normal...

MG level 1 (GPU): BiCGstab: 198 iterations, <r,r> = 3.220807e-01, |r|/|b| = 7.989212e-05
MG level 1 (GPU): BiCGstab: 199 iterations, <r,r> = 3.210173e-01, |r|/|b| = 7.976013e-05
MG level 1 (GPU): BiCGstab: 200 iterations, <r,r> = 3.208646e-01, |r|/|b| = 7.974115e-05
MG level 1 (GPU): WARNING: Exceeded maximum iterations 200
MG level 1 (GPU): BiCGstab: Reliable updates = 4
MG level 1 (GPU): BiCGstab: Convergence at 200 iterations, L2 relative residual: iterated = 7.974115e-05, true = 7.974243e-05
MG level 1 (GPU): Solution = 966.229
MG level 1 (GPU): smoother has operator N4quda6DiracME
MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): Creating a MR solver
MG level 1 (GPU): start creating transfer operator
MG level 1 (GPU): Transfer: using block size 4 x 4 x 4 x 4
MG level 1 (GPU): Transfer: creating V field with basis 0 with location 1
MG level 1 (GPU): Transfer: filling V field with zero
MG level 1 (GPU): Transfer: block orthogonalizing
MG level 1 (GPU): Block Orthogonalizing 5184 blocks of 1536 length and width 24
MG level 1 (GPU): Transferred prolongator to GPU
MG level 1 (GPU): end creating transfer operator
MG level 1 (GPU): Kappa = 1.563610e-01
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): Doing uni-directional link coarsening
MG level 1 (GPU): Computing forward 0 UV and VUV
MG level 1 (GPU): UAV2[0] = 1.243987e+05
MG level 1 (GPU): Y2[0] = 1.229086e+04
MG level 1 (GPU): Computing forward 1 UV and VUV
MG level 1 (GPU): UAV2[1] = 1.243810e+05
MG level 1 (GPU): Y2[1] = 1.231800e+04
MG level 1 (GPU): Computing forward 2 UV and VUV
MG level 1 (GPU): UAV2[2] = 1.244160e+05
MG level 1 (GPU): Y2[2] = 1.229529e+04
MG level 1 (GPU): Computing forward 3 UV and VUV
MG level 1 (GPU): UAV2[3] = 1.244830e+05
MG level 1 (GPU): Y2[3] = 1.232037e+04
MG level 1 (GPU): X2 = 5.602725e+05
MG level 1 (GPU): Reversing links
MG level 1 (GPU): Computing coarse local
MG level 1 (GPU): X2 = 4.761809e+04
MG level 1 (GPU): Summing diagonal contribution to coarse clover
MG level 1 (GPU): X2 = 2.383024e+04
MG level 1 (GPU): 
MAGMA will use device architecture 370.
MG level 1 (GPU): BatchInvertMatrix with n=48 and batch=2592
MG level 1 (GPU): LU factorization completed in 0.140552 seconds with GFLOPS = 5.399986
MG level 1 (GPU): Matrix inversion completed in 0.033394 seconds with GFLOPS = 45.806095
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Creating coarse null-space vectors
MG level 1 (GPU): Creating next multigrid level
MG level 2 (GPU): Creating level 2 of 2 levels
MG level 2 (GPU): smoother has operator N4quda6DiracME
MG level 2 (GPU): Creating a GCR solver
MG level 2 (GPU): setup completed
MG level 1 (GPU): Assigned coarse solver to coarse MG operator
MG level 1 (GPU): setup completed
MG level 1 (GPU):              MG level 1 Total time = 47.8651 secs

[...]
calling mg solver  # invertQuda is called here
QUDA Inverter Parameters:
dslash_type = 7
inv_type = 2
kappa = 0.156361
mu = 0.015
twist_flavor = -1
tol = 1e-08
residual_type = 2
maxiter = 20000
reliable_delta = 0.01
use_sloppy_partial_accumulator = 0
max_res_increase = 1
max_res_increase_total = 10
heavy_quark_check = 10
pipeline = 0
num_offset = 0
num_src = 1
overlap = 0
solution_type = 0
solve_type = 0
matpc_type = 2
dagger = 0
mass_normalization = 0
solver_normalization = 0
preserve_source = 1
cpu_prec = 8
cuda_prec = 8
cuda_prec_sloppy = 4
input_location = 1
output_location = 1
clover_location = 1
gamma_basis = 2
dirac_order = 1
sp_pad = 0
tune = 0
gcrNkrylov = 10
use_init_guess = 0
omega = 1
verbosity = 3
iter = 0
spinorGiB = 0
gflops = 0
secs = 0
nev = 0
max_search_dim = 0
rhs_idx = 0
deflation_grid = 0
cg_iterref_tol = 0.05
eigcg_max_restarts = 2
max_restart_num = 3
inc_tol = 0.01
eigenval_tol = 0.1
use_resident_solution = 0
make_resident_solution = 0
Creating a DiracTwistedMass operator (1 flavor(s))
Creating a DiracTwistedMass operator (1 flavor(s))
typedid = N4quda19cpuColorSpinorFieldE
nColor = 3
nSpin = 4
twistFlavor = -1
nDim = 4
x[0] = 12
x[1] = 24
x[2] = 48
x[3] = 24
volume = 331776
precision = 8
pad = 0
stride = 331776
real_length = 7962624
length = 7962624
ghost_length = 0
ghost_norm_length = 0
bytes = 63700992
norm_bytes = 0
siteSubset = 1
siteOrder = 1
fieldOrder = 5
gammaBasis = 2
Is composite = 0
Is component = 0
PC type = 0

[...]

note: gcrNkrylov = 10

GCR: 0 iterations, <r,r> = 2.547952e+08, |r|/|b| = 1.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (9.747375e-01, -5.136634e-05)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.354015e-01, 2.136232e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 2.165519e-05
MG level 2 (GPU): WARNING: GCR with pipeline length 5 is experimental
MG level 2 (GPU): GCR: 0 iterations, <r,r> = 9.204328e+05, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 1 iterations, <r,r> = 1.924604e+05, |r|/|b| = 4.572720e-01
MG level 2 (GPU): GCR: 2 iterations, <r,r> = 3.728666e+04, |r|/|b| = 2.012708e-01
MG level 2 (GPU): GCR: 3 iterations, <r,r> = 7.841015e+03, |r|/|b| = 9.229753e-02
MG level 2 (GPU): GCR: 4 iterations, <r,r> = 1.716295e+03, |r|/|b| = 4.318171e-02
MG level 2 (GPU): GCR: 5 iterations, <r,r> = 3.844369e+02, |r|/|b| = 2.043697e-02
MG level 2 (GPU): GCR: 6 iterations, <r,r> = 8.746939e+01, |r|/|b| = 9.748370e-03
MG level 2 (GPU): GCR: 7 iterations, <r,r> = 2.019664e+01, |r|/|b| = 4.684287e-03
MG level 2 (GPU): GCR: 8 iterations, <r,r> = 4.703644e+00, |r|/|b| = 2.260587e-03
MG level 2 (GPU): GCR: 9 iterations, <r,r> = 1.104278e+00, |r|/|b| = 1.095325e-03
MG level 2 (GPU): GCR: 10 iterations, <r,r> = 2.607408e-01, |r|/|b| = 5.322411e-04
MG level 2 (GPU): GCR: 11 iterations, <r,r> = 6.186445e-02, |r|/|b| = 2.592534e-04
MG level 2 (GPU): GCR: 12 iterations, <r,r> = 1.469351e-02, |r|/|b| = 1.263475e-04
MG level 2 (GPU): GCR (restart): 1 iterations, <r,r> = 9.829608e+04, |r|/|b| = 3.267925e-01
MG level 2 (GPU): GCR: 13 iterations, <r,r> = 1.852264e+04, |r|/|b| = 1.418585e-01
MG level 2 (GPU): GCR: 14 iterations, <r,r> = 3.467377e+03, |r|/|b| = 6.137683e-02
MG level 2 (GPU): GCR: 15 iterations, <r,r> = 7.144977e+02, |r|/|b| = 2.786149e-02
MG level 2 (GPU): GCR: 16 iterations, <r,r> = 1.544110e+02, |r|/|b| = 1.295219e-02
MG level 2 (GPU): GCR: 17 iterations, <r,r> = 3.436934e+01, |r|/|b| = 6.110680e-03
MG level 2 (GPU): GCR: 18 iterations, <r,r> = 7.807142e+00, |r|/|b| = 2.912393e-03
MG level 2 (GPU): GCR: 19 iterations, <r,r> = 1.789579e+00, |r|/|b| = 1.394374e-03
MG level 2 (GPU): GCR: 20 iterations, <r,r> = 4.146161e-01, |r|/|b| = 6.711615e-04
MG level 2 (GPU): GCR: 21 iterations, <r,r> = 9.650391e-02, |r|/|b| = 3.237997e-04
MG level 2 (GPU): GCR: 22 iterations, <r,r> = 2.265651e-02, |r|/|b| = 1.568919e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 3.640993e+02 is greater than previous reliable residual norm 3.135221e+02 (total #inc 1)
MG level 2 (GPU): GCR (restart): 2 iterations, <r,r> = 1.325683e+05, |r|/|b| = 3.795105e-01
MG level 2 (GPU): GCR: 23 iterations, <r,r> = 2.494767e+04, |r|/|b| = 1.646338e-01
MG level 2 (GPU): GCR: 24 iterations, <r,r> = 4.670534e+03, |r|/|b| = 7.123398e-02
MG level 2 (GPU): GCR: 25 iterations, <r,r> = 9.654547e+02, |r|/|b| = 3.238694e-02
MG level 2 (GPU): GCR: 26 iterations, <r,r> = 2.089576e+02, |r|/|b| = 1.506722e-02
MG level 2 (GPU): GCR: 27 iterations, <r,r> = 4.650421e+01, |r|/|b| = 7.108044e-03
MG level 2 (GPU): GCR: 28 iterations, <r,r> = 1.057462e+01, |r|/|b| = 3.389506e-03
MG level 2 (GPU): GCR: 29 iterations, <r,r> = 2.444309e+00, |r|/|b| = 1.629604e-03
MG level 2 (GPU): GCR: 30 iterations, <r,r> = 5.679288e-01, |r|/|b| = 7.855085e-04
MG level 2 (GPU): GCR: 31 iterations, <r,r> = 1.332599e-01, |r|/|b| = 3.804992e-04
MG level 2 (GPU): GCR: 32 iterations, <r,r> = 3.146561e-02, |r|/|b| = 1.848937e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 3.679512e+02 is greater than previous reliable residual norm 3.640993e+02 (total #inc 2)
MG level 2 (GPU): WARNING: GCR: solver exiting due to too many true residual norm increases
MG level 2 (GPU): GCR: number of restarts = 2
MG level 2 (GPU): GCR: Convergence at 32 iterations, L2 relative residual: iterated = 3.835254e-01, true = 0.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (5.889364e-01, 1.418358e-04)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (2.047453e-01, -1.488353e-06)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 4.254054e-05

[...]

GCR debug iter=1: Ap2=7.942181e+06, p2=6.998764e+07, rPre2=5.380420e+06
GCR debug iter=1: Apr=(5.637887e+06,-5.053312e+02,7.941173e+06)
GCR debug iter=1: beta[0][0] = (0.000000e+00,0.000000e+00)
GCR debug iter=1: beta[0][1] = (3.176022e+01,-6.630973e-02)
GCR: 2 iterations, <r,r> = 1.377765e+06, |r|/|b| = 7.353464e-02
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (2.083481e-01, 7.163521e-05)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.307704e-01, 2.378019e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 7.291394e-04
MG level 2 (GPU): WARNING: GCR with pipeline length 5 is experimental
MG level 2 (GPU): GCR: 0 iterations, <r,r> = 2.197209e+05, |r|/|b| = 1.000000e+00
MG level 2 (GPU): GCR: 1 iterations, <r,r> = 5.708147e+04, |r|/|b| = 5.096967e-01
MG level 2 (GPU): GCR: 2 iterations, <r,r> = 1.276572e+04, |r|/|b| = 2.410388e-01
MG level 2 (GPU): GCR: 3 iterations, <r,r> = 2.874830e+03, |r|/|b| = 1.143853e-01
MG level 2 (GPU): GCR: 4 iterations, <r,r> = 6.569313e+02, |r|/|b| = 5.467947e-02
MG level 2 (GPU): GCR: 5 iterations, <r,r> = 1.516918e+02, |r|/|b| = 2.627516e-02
MG level 2 (GPU): GCR: 6 iterations, <r,r> = 3.535873e+01, |r|/|b| = 1.268565e-02
MG level 2 (GPU): GCR: 7 iterations, <r,r> = 8.267209e+00, |r|/|b| = 6.134000e-03
MG level 2 (GPU): GCR: 8 iterations, <r,r> = 1.938526e+00, |r|/|b| = 2.970299e-03
MG level 2 (GPU): GCR: 9 iterations, <r,r> = 4.554805e-01, |r|/|b| = 1.439790e-03
MG level 2 (GPU): GCR: 10 iterations, <r,r> = 1.073920e-01, |r|/|b| = 6.991176e-04
MG level 2 (GPU): GCR: 11 iterations, <r,r> = 2.539955e-02, |r|/|b| = 3.399987e-04
MG level 2 (GPU): GCR (restart): 1 iterations, <r,r> = 4.209115e+04, |r|/|b| = 4.376830e-01
MG level 2 (GPU): GCR: 12 iterations, <r,r> = 7.984969e+03, |r|/|b| = 1.906342e-01
MG level 2 (GPU): GCR: 13 iterations, <r,r> = 1.498760e+03, |r|/|b| = 8.259054e-02
MG level 2 (GPU): GCR: 14 iterations, <r,r> = 3.096383e+02, |r|/|b| = 3.753977e-02
MG level 2 (GPU): GCR: 15 iterations, <r,r> = 6.716912e+01, |r|/|b| = 1.748433e-02
MG level 2 (GPU): GCR: 16 iterations, <r,r> = 1.495349e+01, |r|/|b| = 8.249651e-03
MG level 2 (GPU): GCR: 17 iterations, <r,r> = 3.399394e+00, |r|/|b| = 3.933372e-03
MG level 2 (GPU): GCR: 18 iterations, <r,r> = 7.810924e-01, |r|/|b| = 1.885452e-03
MG level 2 (GPU): GCR: 19 iterations, <r,r> = 1.814576e-01, |r|/|b| = 9.087657e-04
MG level 2 (GPU): GCR: 20 iterations, <r,r> = 4.250959e-02, |r|/|b| = 4.398532e-04
MG level 2 (GPU): GCR: 21 iterations, <r,r> = 1.000394e-02, |r|/|b| = 2.133781e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 2.392168e+02 is greater than previous reliable residual norm 2.051613e+02 (total #inc 1)
MG level 2 (GPU): GCR (restart): 2 iterations, <r,r> = 5.722470e+04, |r|/|b| = 5.103358e-01
MG level 2 (GPU): GCR: 22 iterations, <r,r> = 1.085010e+04, |r|/|b| = 2.222190e-01
MG level 2 (GPU): GCR: 23 iterations, <r,r> = 2.036403e+03, |r|/|b| = 9.627115e-02
MG level 2 (GPU): GCR: 24 iterations, <r,r> = 4.205175e+02, |r|/|b| = 4.374781e-02
MG level 2 (GPU): GCR: 25 iterations, <r,r> = 9.111480e+01, |r|/|b| = 2.036380e-02
MG level 2 (GPU): GCR: 26 iterations, <r,r> = 2.023839e+01, |r|/|b| = 9.597369e-03
MG level 2 (GPU): GCR: 27 iterations, <r,r> = 4.602356e+00, |r|/|b| = 4.576721e-03
MG level 2 (GPU): GCR: 28 iterations, <r,r> = 1.060197e+00, |r|/|b| = 2.196633e-03
MG level 2 (GPU): GCR: 29 iterations, <r,r> = 2.463396e-01, |r|/|b| = 1.058842e-03
MG level 2 (GPU): GCR: 30 iterations, <r,r> = 5.738701e-02, |r|/|b| = 5.110590e-04
MG level 2 (GPU): GCR: 31 iterations, <r,r> = 1.345183e-02, |r|/|b| = 2.474315e-04
MG level 2 (GPU): WARNING: GCR: new reliable residual norm 2.419803e+02 is greater than previous reliable residual norm 2.392168e+02 (total #inc 2)
MG level 2 (GPU): WARNING: GCR: solver exiting due to too many true residual norm increases
MG level 2 (GPU): GCR: number of restarts = 2
MG level 2 (GPU): GCR: Convergence at 31 iterations, L2 relative residual: iterated = 5.162312e-01, true = 0.000000e+00
MG level 1 (GPU): MR: 0 iterations, r2 = 1.000000e+00
MG level 1 (GPU): MR: 1 iterations, <r|A|r> = (4.990603e-01, -2.459406e-06)
MG level 1 (GPU): MR: 2 iterations, <r|A|r> = (1.428030e-01, 8.311549e-05)
MG level 1 (GPU): MR: Converged after 2 iterations, relative residual: iterated = 5.928231e-04

[...]

GCR debug iter=9: beta[8][0] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][1] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][2] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][3] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][4] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][5] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][6] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][7] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][8] = (0.000000e+00,0.000000e+00)
GCR debug iter=9: beta[8][9] = (1.159733e+01,1.119411e-01)
GCR: 10 iterations, <r,r> = 2.459163e+04, |r|/|b| = 9.824219e-03
quda::qudaMemcpy_ bytes = 31850496
[jrc0468:mpi_rank_8][error_sighandler] Caught error: Segmentation fault (signal 11)
[...]
kostrzewa commented 8 years ago
Program received signal SIGSEGV, Segmentation fault.
__cxxabiv1::__dynamic_cast (src_ptr=0x2aad1c67f530, 
    src_type=0xd9a0f60 <typeinfo for quda::ColorSpinorField>, 
    dst_type=0x1bdfc20 <typeinfo for quda::cudaColorSpinorField>, src2dst=0)
    at ../../../../libstdc++-v3/libsupc++/dyncast.cc:56

Will try to get some more debug information tomorrow...

maddyscientist commented 8 years ago

Is this something that can be reproduced with the multigrid_invert_test that comes with QUDA? If so, if you send me a modified version of that, I can debug from here to see what's going on.

kostrzewa commented 8 years ago

I'll give it a try, but I will first need to compile a version of QUDA with all the dependencies necessary for running the *_invert_test stuff. Note that I've gone for 24 vectors now and simply run on more nodes. I've already reproduced it with different lattice sizes on different numbers of nodes.

maddyscientist commented 8 years ago

I don't think there are any extra dependencies for the tests. You don't need to use QMP/QIO, it just means you can't load a lattice, but a seg-fault shouldn't care if the data is random or a real config 😉 , so just a plain MPI + MAGMA build should be fine.

kostrzewa commented 8 years ago

Thanks, that's true of course! Both the "tmLQCD-orchestrated" MG setup as well as that from multigrid_invert_test fail MG::verify(). Don't know if that's already a hint or just a side-effect of the MG implementation being "work in progress". I will disable it and report back.

kostrzewa commented 8 years ago

I can't reproduce the segmentation fault with multigrid_invert_test. I will try to see if I messed something up on the tmLQCD side with respect to reproducing the way the setup and solver are called.

mathiaswagner commented 8 years ago

Maybe try building with CMAKE_BUILD_TYPE=HOSTDEBUG. This will enable debug information for the CPU code and may help pin down where the segfault occurs using a backtrace.

kostrzewa commented 8 years ago

Thanks. I seem to have resolved the segmentation fault, which appears to have resulted from unfortunate combinations of matpc_type, solution_type and solve_type and a mismatch between the outer and inner parameters.

After I've committed my changes, I'll revert to the segfaulting version to try to pin down the problematic combination, since this should probably be caught (as other unsupported combinations are).

The next step is getting the solver to actually perform well, because currently GCR gets stuck after reducing the residual by about one order of magnitude. This could be related to the fact that the gamma bases are different in the outer and inner parameter set.

kostrzewa commented 8 years ago

> The next step is getting the solver to actually perform well, because currently GCR gets stuck after reducing the residual by about one order of magnitude. This could be related to the fact that the gamma bases are different in the outer and inner parameter set.

As a side note: the two level solver "works" for plain Wilson quarks on a twisted mass sea, as long as the pion mass is well in excess of around 300 MeV or so. For twisted mass quarks on a twisted mass sea at a pion mass of around 300 MeV, GCR gets stuck as mentioned above.

kostrzewa commented 8 years ago

Attempting to use a three level setup seems to fail for me both for Wilson and TM with the orthogonalisation of the null vectors not working at some early stage (vector 7, 8 or 9 out of 32).

MG level 2 (GPU): BiCGstab: Convergence at 352 iterations, L2 relative residual: iterated = 4.647049e-06, true = 2.319178e+00
MG level 2 (GPU): ERROR: 
Cannot orthogonalize 7 vector
 (rank 0, host jrc0002, multigrid.cpp:695 in quda::MG::generateNullVectors())

It's a bit odd that the iterated residual is so far off the true one, although I've noticed that on the coarser levels, delta is relaxed significantly (in multigrid.cpp), if I read the code correctly.

cpviolator commented 8 years ago

@kostrzewa I'm not sure if it's possible, but can you change the BiCGstab to CG? I noticed in my studies of low mass pions with Wilson Clover fermions using QUDA that BiCGstab would get stuck but switching to CG would allow the solver to converge.

kostrzewa commented 8 years ago

@cpviolator CG would only work for DD^dag and I don't think it would work for the homogeneous case anyway... or did you work on the MG setup phase? As far as I discussed with @maddyscientist, the fact that BiCGStab doesn't converge very well during null vector generation is not that important. Do I remember correctly, Kate? I seem to recall that you mentioned that the resulting solver still converges well for TM.

kostrzewa commented 8 years ago

The segfaulting combination of parameters was:

QudaInvertParam for outer solver:

# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0

QudaInvertParam passed via QudaMultigridParam->invert_param

# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0

QudaMultigridParam:

# QUDA: QUDA Multigrid Parameters:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.01
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 10
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: n_level = 2
# QUDA: smoother[i] = 3
# QUDA: smoother_solve_type[i] = 2
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: spin_block_size[i] = 2
# QUDA: n_vec[i] = 24
# QUDA: cycle_type[i] = 3
# QUDA: nu_pre[i] = 2
# QUDA: nu_post[i] = 2
# QUDA: coarse_grid_solution_type[i] = 0
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: smoother[i] = 2
# QUDA: smoother_solve_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: compute_null_vector = 1
# QUDA: generate_all_levels = 1
# QUDA: run_verify = 0
# QUDA: gflops = 0
# QUDA: secs = 0

Not sure if this helps...

kostrzewa commented 8 years ago

A working parameter set is the following. Outer QudaInvertParam:

# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 2
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 1
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 2
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 2
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0

QudaMultigridParam->invert_param

# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 0
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 1
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0

QudaMultigridParam:

# QUDA: QUDA Multigrid Parameters:
# QUDA: QUDA Inverter Parameters:
# QUDA: dslash_type = 0
# QUDA: inv_type = 2
# QUDA: kappa = 0.156048
# QUDA: tol = 1e-08
# QUDA: residual_type = 2
# QUDA: maxiter = 20000
# QUDA: reliable_delta = 0.0001
# QUDA: use_sloppy_partial_accumulator = 0
# QUDA: max_res_increase = 1
# QUDA: max_res_increase_total = 10
# QUDA: heavy_quark_check = 10
# QUDA: pipeline = 0
# QUDA: num_offset = 0
# QUDA: num_src = 1
# QUDA: overlap = 0
# QUDA: solution_type = 0
# QUDA: solve_type = 0
# QUDA: matpc_type = 0
# QUDA: dagger = 0
# QUDA: mass_normalization = 0
# QUDA: solver_normalization = 0
# QUDA: preserve_source = 0
# QUDA: cpu_prec = 8
# QUDA: cuda_prec = 8
# QUDA: cuda_prec_sloppy = 4
# QUDA: input_location = 1
# QUDA: output_location = 1
# QUDA: clover_location = 1
# QUDA: gamma_basis = 0
# QUDA: dirac_order = 1
# QUDA: sp_pad = 0
# QUDA: tune = 0
# QUDA: gcrNkrylov = 20
# QUDA: use_init_guess = 0
# QUDA: omega = 1
# QUDA: verbosity = 1
# QUDA: iter = 0
# QUDA: spinorGiB = 0
# QUDA: gflops = 0
# QUDA: secs = 0
# QUDA: nev = 0
# QUDA: max_search_dim = 0
# QUDA: rhs_idx = 0
# QUDA: deflation_grid = 0
# QUDA: cg_iterref_tol = 0.05
# QUDA: eigcg_max_restarts = 2
# QUDA: max_restart_num = 3
# QUDA: inc_tol = 0.01
# QUDA: eigenval_tol = 0.1
# QUDA: use_resident_solution = 0
# QUDA: make_resident_solution = 0
# QUDA: n_level = 2
# QUDA: smoother[i] = 3
# QUDA: smoother_solve_type[i] = 2
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: geo_block_size[i][j] = 4
# QUDA: spin_block_size[i] = 2
# QUDA: n_vec[i] = 24
# QUDA: cycle_type[i] = 3
# QUDA: nu_pre[i] = 2
# QUDA: nu_post[i] = 2
# QUDA: coarse_grid_solution_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: smoother[i] = 2
# QUDA: smoother_solve_type[i] = 2
# QUDA: smoother_tol[i] = 0.2
# QUDA: global_reduction[i] = 1
# QUDA: omega[i] = 0.85
# QUDA: location[i] = 2
# QUDA: compute_null_vector = 1
# QUDA: generate_all_levels = 1
# QUDA: run_verify = 0
# QUDA: gflops = 0
# QUDA: secs = 0

As far as I can tell, since gcrNkrylov is irrelevant for the segfault, the only difference is whether the outer solve is QUDA_DIRECT_PC_SOLVE or not. If I run the outer solver with QUDA_DIRECT_SOLVE, I get the segfault again.

kostrzewa commented 8 years ago

As a consequence, when setting up the MG parameters, I will set (inv_param is the param struct for the outer solver):

mg_param->coarse_grid_solution_type[i] = inv_param.solve_type == QUDA_DIRECT_PC_SOLVE ? QUDA_MATPC_SOLUTION : QUDA_MAT_SOLUTION;

so the other difference will be the coarse grid solution type.

maddyscientist commented 7 years ago

@kostrzewa Apologies I've not had a chance to look at this for the past few weeks (too many things ongoing at the moment). Next week I'll be focussed on MG at our hackathon, so will look at this issue then.

kostrzewa commented 7 years ago

Thanks a lot! I've essentially gotten everything to work for a two-level solve, but the segfault still exists for certain combinations of parameters, as described above. In addition, I can't get the three-level solver to work due to the orthogonalisation problem on the coarser level, as described above. I was thinking of relaxing the orthogonalisation condition in multigrid.cpp for this purpose, but would first like to understand whether there's some other problem that could also explain the poor performance I'm getting.

I'm having convergence issues with the two-level solver on twisted mass gauge configurations with low-mass Wilson fermions (I suspect this is due to exceptional eigenvalues; I can provide configurations for testing [ILDG format, so easily read via QIO]). The solver gets stuck after reducing the residual by some factor x, where x can be between 1e2 and 1e6 or so. The situation is similar, but worse, for twisted mass quarks on the same configurations. For somewhat heavy Wilson fermions, I get convergence, but the iteration counts are quite high (600+ iterations of 2-level MG-GCR at am_0^bare ~ 0.0065, a ~ 0.07 fm).

If you want, I can provide a detailed list of the parameters used (in a more condensed and readable format than above) together with the gauge configurations.

I have now re-enabled the verification step and get this kind of result (for the 600+ iteration solver mentioned above):

[...]
MG level 1 (GPU): Vector 23: norms v_k = 1.000000e+00 P^\dagger v_k = 9.999999e-01 P P^\dagger v_k = 9.999999e-01
MG level 1 (GPU): L2 relative deviation = 2.263896e-05
MG level 1 (GPU): 
MG level 1 (GPU): Checking 0 = (1 - P^\dagger P) eta_c
MG level 1 (GPU): Vector norms 1.326953e+06 1.326953e+06 (fine tmp 1.326953e+06)
MG level 1 (GPU): L2 relative deviation = 9.367346e-06
MG level 1 (GPU): 
MG level 1 (GPU): Comparing native coarse operator to emulated operator
MG level 1 (GPU): Vector norms Emulated=1.761139e+05 Native=1.761139e+05
MG level 1 (GPU): L2 relative deviation = 2.570257e-05

kostrzewa commented 7 years ago

I've gotten the three-level solver to work by setting mg_param->generate_all_levels = QUDA_BOOLEAN_NO and instantiating a few more block sizes for the restrictor. Unfortunately, although this roughly halves the number of iterations for the converging case, it does not seem to help with the stagnation problem described above, although it does increase the mentioned factor x by which the residual is reduced before the solver stagnates.

kostrzewa commented 7 years ago

The good news is that for tm+clover, the solver behaves quite well even with just two levels and converges even at the physical pion mass in around 1000 MG-GCR iterations, which is quite interesting. The setup time is quite substantial, however, as you mentioned.

maddyscientist commented 7 years ago

@kostrzewa Excellent. So the main issue that is left to be resolved is the segmentation fault?

kostrzewa commented 7 years ago

@maddyscientist Well, it would be nice if it converged also for tm without a clover term. It would probably also be interesting to understand if accidental zero eigenvalues are really to blame for the non-convergence of low-mass Wilson quarks on a twisted mass sea. The former point is more important though. I'm also not sure why mg_param->generate_all_levels = QUDA_BOOLEAN_YES leads to orthogonalisation problems for the coarse null vectors in a three-level setup.

Another thing I've encountered is an illegal memory access in the tuning phase in a three-level setup in which I use 4^4 blocks on the fine grid and 3^4 blocks on the coarse grid (the configuration is 48^3x96, so this seems to be a good setup for the 16 cards that I use presently). The error I get is as follows:

MAGMA will use device architecture 370.
MG level 2 (GPU): BatchInvertMatrix with n=64 and batch=32
MG level 2 (GPU): LU factorization completed in 0.003755 seconds with GFLOPS = 5.924977
MG level 2 (GPU): Matrix inversion completed in 0.002197 seconds with GFLOPS = 20.369750
MG level 2 (GPU): ....done computing Y field
MG level 2 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
 (rank 11, host jrc0461, tune.cpp:600 in quda::tuneLaunch())
MG level 2 (GPU):        last kernel called was (name=N4quda14RestrictLaunchIfNS_11RestrictArgINS_11colorspinor12FieldOrderCBIfLi2ELi32ELi1EL16QudaFieldOrder_s2EEES5_NS3_IfLi2ELi32ELi32ELS4_2EEELi2ELi2EEELi2ELi32ELi2ELi32ELi2EEE,volume=4x2x2x2,12x6x6x6,aux=vol=32,stride=16,precision=4,TwistFlavour=-1,vol=2592,stride=1296,precision=4,TwistFlavour=-1)
MG level 2 (GPU): ERROR: Failed to clear error state an illegal memory access was encountered
 (rank 9, host jrc0461, tune.cpp:600 in quda::tuneLaunch())
MG level 2 (GPU):        last kernel called was (name=N4quda14RestrictLaunchIfNS_11RestrictArgINS_11colorspinor12FieldOrderCBIfLi2ELi32ELi1EL16QudaFieldOrder_s2EEES5_NS3_IfLi2ELi32ELi32ELS4_2EEELi2ELi2EEELi2ELi32ELi2ELi32ELi2EEE,volume=4x2x2x2,12x6x6x6,aux=vol=32,stride=16,precision=4,TwistFlavour=-1,vol=2592,stride=1296,precision=4,TwistFlavour=-1)

I don't get any warnings about odd block counts in any dimension on the coarsest level, but I'm not sure what happens once this is e/o preconditioned. If I understand things correctly and didn't make any mistakes, I should have (xyzt) N=4x2x2x2 aggregates of geometric volume 3^4 on each card in this setup.
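
For what it's worth, the block arithmetic seems consistent with the kernel aux string above: assuming a 1x2x2x4 process grid (a guess on my part), the local fine lattice is 48x24x24x24 per GPU, 4^4 blocking gives a local level-2 lattice of 12x6x6x6, and 3^4 blocking on top of that gives 4x2x2x2 aggregates per GPU, which matches the volume=4x2x2x2,12x6x6x6 reported by the failing Restrict kernel.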

maddyscientist commented 7 years ago

@kostrzewa Ok, multiple things to get fixed then 😄

One thing I've been meaning to ask you: I see in the tmLQCD code that you have BiCGStab(l) support. You may have noticed that @weinbe2 has been adding this to QUDA (I think it should help reduce the null-space generation cost). How does this solver behave with twisted mass and twisted clover? I would have naively thought that it would behave a whole lot better than BiCGstab, since it should treat complex eigenvalues better.

kostrzewa commented 7 years ago

@maddyscientist Regarding plain BiCGStab, we know that it doesn't converge well (at all) for tm or tmclover. What I've tried is moving away from maximal twist in the setup phase (kappa_setup ~ kappa_c - 0.003*kappa_c or kappa_c - 0.006*kappa_c), which seems to produce null vectors that are good enough, at least for tmclover. This is essential for the prolongator to pass MG::verify(), as far as I can tell. The unconverged null vectors produced at kappa_c (500 iterations of BiCGstab) result in a failing MG::verify():

MG level 1 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 32 vectors
MG level 1 (GPU): Vector 0: norms v_k = 1.000000e+00 P^\dagger v_k = 1.000000e+00 P P^\dagger v_k = 1.000000e+00
MG level 1 (GPU): L2 relative deviation = 1.213198e-04

For Wilson quarks, doing this also seems to help in making outer iterations cheaper by making the coarse operator better conditioned without really affecting the outer iteration count, as far as I can see and if I have understood everything correctly. This is in line with the findings on the DDalphaAMG side that increasing the coarse twisted mass parameter improves overall time to solution.

As for BiCGStab-L in tmLQCD, I don't know if that was ever used. Maybe @urbach and I can revive it and run some tests. (I don't think it's really wired up for most operators in tmLQCD, but it should be easy enough to get working again)

cpviolator commented 7 years ago

I can confirm the segfaults when using a vanilla quda-develop and multigrid_invert_test. When using a twisted-mass dslash and the direct solve type there is a segfault. When using twisted-clover and direct, verify() fails in multigrid.cpp. The same errors occur for normop and normop-pc solves too. I can only get things working with direct-pc.

maddyscientist commented 7 years ago

Ok I will try to reproduce this in the next few days. Does the problem occur for both single and multi GPU?

Regarding progress last week at the hackathon: we now have an interface that allows the MG state to persist during gauge evolution, with the introduction of a new interface call, updateMultigridQuda. Also, I've just about got most of the setup offloaded to the GPU, so there will soon be a big speedup in this regard.

cpviolator commented 7 years ago

Same story for single and multi-GPU. Right now I'm studying the working test combos --dslash-type twisted-mass --solve-type direct-pc and trying to get our code to work with it. We used to do normop-pc to get the eigenvectors for deflation, but I think that we can do a direct-pc solve if we upgrade to MG. I'll report back when a working setup is found.

cpviolator commented 7 years ago

OK, I'm also seeing with our code that the level 1 BiCGStab solver has trouble getting past 1e-5 after 200 iterations using direct-pc and twisted mass, and that the GCR solver then has trouble getting past 1e-1 for |r|/|b|. I'm relaxing the solver tolerance way down to 5e-1 just to get it to 'converge' and produce data. I'm using the same MG setup as given in multigrid_invert_test: a two-level solve with 24 null vectors. I did change the maxiter on GCR level 1 to 200, as prescribed by @kostrzewa, just for a little speed-up.

The next task is to utilise @weinbe2's BiCGStab(l) code and look for improvements. @maddyscientist, would it be worthwhile to keep the first level(s) very loose on the tolerance, then dial down at, say, level 3 for better convergence? Kind of like using the lower (1,2) MG levels as a sloppy preconditioner for the tighter (3,4) levels.

The code we are developing will be used for twisted-clover, so once I have some physical sized test lattices I'll report back with findings.

kostrzewa commented 7 years ago

@cpviolator see here for a working setup for tmclover: https://github.com/kostrzewa/tmLQCD/blob/quda_interface_multigrid/quda_interface.c#L1084

1) lower kappa by 2.5 per mille (away from kappa_c, to get a positive PCAC mass)
2) increase mu in the MG to 5.2*target_mu

Works for 2 and 3 level solves. I used direct_pc as that is the only setup where I don't experience segfaults, as you said above.

ps: in this setup, some of the homogeneous BiCGstab solves converge with maxiter=500
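
When filling the MG QudaInvertParam, those two adjustments amount to something like the following (a sketch; kappa_target and mu_target are placeholders for the physical target values):

  // move the setup kappa 2.5 per mille below the target (away from kappa_c)
  double kappa_setup = kappa_target * (1.0 - 2.5e-3);
  mg_inv_param->kappa = kappa_setup;
  // boost the twisted mass used by the MG relative to the target mu
  mg_inv_param->mu = 5.2 * mu_target;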

kostrzewa commented 7 years ago

@cpviolator regarding the lattices, can Martha get these or would you like me to provide some?

kostrzewa commented 7 years ago

The ideal situation would be to generate the test vectors iteratively: run O(100) iterations of BiCGstab on the homogeneous systems to build a first approximation MG solver. Then use this solver on the homogeneous systems to improve the test vectors. Finally, at least according to the DDalphaAMG people (and in line with what I see in the QUDA-MG), the coarsest level mu should be increased relative to the finer levels, such that the coarse operator is better conditioned. An alternative would be to stringently limit the number of iterations on the coarsest level, I guess, but I feel that it would be difficult to find an ideal setting for the latter approach.
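
In pseudocode, the idea would be something along these lines (all helper functions are hypothetical placeholders, not QUDA API):

  /* pass 1: rough null vectors from ~100 BiCGstab iterations on D v = 0 */
  generate_null_vectors_bicgstab(vecs, n_vec, 100);
  void *mg = build_mg_from_vectors(vecs, n_vec);   /* first-approximation MG solver */

  /* subsequent passes: refine each vector with the MG-preconditioned solver,
     then rebuild the coarse operators from the improved vectors */
  for (int pass = 0; pass < n_refine; pass++) {
    for (int k = 0; k < n_vec; k++) solve_homogeneous_with_mg(mg, vecs[k]);
    mg = rebuild_mg_from_vectors(mg, vecs, n_vec);
  }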

cpviolator commented 7 years ago

@kostrzewa Thank you very much! I'll run with that setup and see what it brings. We have some lattices knocking around in Cyprus so I should be OK there.

With regards to the convergence problem, I think that this adaptive approach is what's needed too. I can design a small testing program to see what sorts of changes give the best improvements, if any at all.

cpviolator commented 7 years ago

Ok, just reporting back on progress.

I was using an L=48 lattice with 3 and 4 GPUs but ran into an error about uninstantiated block sizes. @kostrzewa, can you explain how you instantiated different block sizes? I see the code in the function

void apply(const cudaStream_t &stream)

and the various permissible block sizes. Does one simply add more conditionals? I also see in your setMultigridParams() setup that there are some conditionals in the i,j loop over mg_levels that are not present in the current develop branch. I'll copy/paste those into my code.

I'm switching to L=64 T=128 as these are the lattices which ultimately will be used in production so it makes sense to work from there and generalise later.

maddyscientist commented 7 years ago

Yes, just add the correct block-size instantiations to restrictor.cu. The block-size template parameter is half the multigrid aggregate size (parity is on the y axis).
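
To illustrate the pattern (a hypothetical sketch only, not the actual restrictor.cu code), each supported block size needs its own dispatch branch so that the corresponding template specialization gets compiled:

  // hypothetical dispatch sketch: block size = aggregate volume / 2 (parity on y)
  if      (blockSize == 128) launchRestrictor<Float, 128>(arg, stream);  // 4x4x4x4 aggregate
  else if (blockSize ==  64) launchRestrictor<Float,  64>(arg, stream);  // e.g. 4x4x4x2 aggregate
  else errorQuda("Block size %d not instantiated", blockSize);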

kostrzewa commented 7 years ago

@cpviolator Yes, I simply added some more block sizes to restrictor.cu.

As for the aggregation parameters in "my" setMultigridParams(), I wanted to make the coarser blocking automatic (for level n, say) by choosing aggregates of 3, 2 or 1 sites per dimension on the coarser levels, depending on the dimensions of the local block lattice on level n-1 (sketched below). I get issues as described above with the odd blocking dimensions, so something is still amiss.
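
The automatic choice is roughly the following (a sketch; prev_extent[] is assumed to hold the local lattice extents of level i-1):

  for (int i = 1; i < mg_param->n_level; i++) {
    for (int d = 0; d < 4; d++) {
      int b = 1;                               // fall back to no blocking
      if      (prev_extent[d] % 3 == 0) b = 3;
      else if (prev_extent[d] % 2 == 0) b = 2;
      mg_param->geo_block_size[i][d] = b;
      prev_extent[d] /= b;                     // extent of the level-i local lattice
    }
  }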

cpviolator commented 7 years ago

@kostrzewa I'm not sure that the code you have in https://github.com/kostrzewa/tmLQCD/blob/quda_interface_multigrid/quda_interface.c#L1084 does what it should. From what I've seen, the \mu parameter remains unchanged on the finest levels, but on the coarsest level it gets rescaled. In your code (as far as I can see) the \mu parameter is scaled up at setup time to 5.2 times its input value, and remains that way throughout the calculation.

I'm writing code to change the \mu parameter at the nth MG level so that one inputs a \mu_0 and C_mu. One then specifies which value of \mu is to be used at the nth level. For the three level solves we do using DDalpha, we have

level 1: mu = mu_0
level 2: mu = mu_0
level 3: mu = mu_0 * C_mu

As a baseline, I'll emulate this setup in QUDA, but I think a worthwhile experiment would be to use

level 1: mu = mu_0
level 2: mu = mu_0 * (C_mu)^0.5
level 3: mu = mu_0 * C_mu

and similar power law scaling for higher level solves, eg

level 1: mu = mu_0 * (C_mu)^{0/3}
level 2: mu = mu_0 * (C_mu)^{1/3}
level 3: mu = mu_0 * (C_mu)^{2/3}
level 4: mu = mu_0 * (C_mu)^{3/3}
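
A minimal sketch of such a power-law schedule (using 0-indexed levels; mu_0 and C_mu as defined above):

  #include <math.h>

  // mu at level l (0-indexed) of an n_level-deep hierarchy: mu_0 * C_mu^(l/(n_level-1))
  double mu_at_level(double mu_0, double C_mu, int level, int n_level) {
    double exponent = (n_level > 1) ? (double)level / (double)(n_level - 1) : 0.0;
    return mu_0 * pow(C_mu, exponent);
  }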

Please let me know if I've read your code incorrectly.

kostrzewa commented 7 years ago

@cpviolator You are of course correct, I simply haven't had the time to poke into the QUDA innards to copy the DDalphaAMG-style mu-rescaling. Increasing mu on all coarse levels seems to kind of work as well and it was a first step towards getting a solver that converges. Especially during the setup phase, mu should ideally be the target mass, although this will likely affect the convergence of BiCGstab.

cpviolator commented 7 years ago

I'm working on code to declare an array for mu which can be accessed by the solvers at various levels. The conventional thought is to increase mu on the coarsest level, but this method should allow for experimentation too. I'll share it when I get convergence.

kostrzewa commented 3 years ago

we've addressed this