sunpho84 / nissa

Nissa Is a Set of SU(N) Algorithms
GNU General Public License v3.0

Wrong update of quda multigrid when QUDA_DIRECT_PC_SOLVE is used? #48

Closed sunpho84 closed 6 months ago

sunpho84 commented 6 months ago

In dee6b93956ef94862dbfd1b083ab443b897d6880 I changed QUDA_DIRECT_SOLVE into QUDA_DIRECT_PC_SOLVE, following the discussion in https://github.com/sunpho84/nissa/pull/43 with @kostrzewa and @Marcogarofalo.

Today I noticed that if the solver is updated (for example after changing the mass), the second solve does not work correctly.

In particular the nissa estimate of the residue deviates from the internal estimate, and the deviation grows with the mass difference.

sunpho84 commented 6 months ago

B64.3329085.out.txt

In this file the first four inversions, at lines 916, 1186, 1459 and 1729, are done with the same mass parameter 0.0169 (strange mass); then at line 1773 it switches to the charm mass (0.2368) and an update is issued, and the residue that follows, at line 2040, is wrong.

The issue always arises when the mass is changed, regardless of whether the second mass is lighter or heavier.

sunpho84 commented 6 months ago

I tried to add QUDA_DIRECT_PC_SOLVE also in the multigrid parameters, in 4aaa0f175a7598feff0ff0:

https://github.com/sunpho84/nissa/blob/4aaa0f175a7598feff0ff00cbdabdee418083f45/src/base/quda_bridge.cpp#L578

but QUDA complains:

# QUDA: ERROR: Outer MG solver can only use QUDA_DIRECT_SOLVE at present 
kostrzewa commented 6 months ago

You have to be a bit careful where you specify the PC_SOLVE. I think what we do in https://github.com/etmc/tmLQCD/blob/quda_work/quda_interface.c is correct, at least it appears to be so from testing different combinations.

The places where "PC_SOLVE" can be specified are:

1) the outer inv_param
2) the smoother_solve_type in the MG param

It cannot be set in the internal QudaInvertParam contained within the QudaMultigridParam.

In tmLQCD this is handled in the quda_interface.c linked above; schematically it amounts to the following.
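A minimal sketch of those two placements, assuming only the standard quda.h parameter structs (the actual tmLQCD code is at the link above):

  #include <quda.h>

  /* Sketch only: where QUDA_DIRECT_PC_SOLVE may be requested when the MG is
     used as preconditioner; the params are assumed to be pre-filled with the
     newQudaInvertParam()/newQudaMultigridParam() defaults. */
  void set_solve_types(QudaInvertParam *inv_param, QudaMultigridParam *mg_param)
  {
    /* 1) the outer solve may be even-odd preconditioned */
    inv_param->solve_type = QUDA_DIRECT_PC_SOLVE;

    /* 2) the smoothers inside the MG may be even-odd preconditioned */
    for (int level = 0; level < mg_param->n_level; level++)
      mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;

    /* the invert_param embedded in QudaMultigridParam must remain a full
       solve, otherwise QUDA aborts with the error quoted above */
    mg_param->invert_param->solve_type = QUDA_DIRECT_SOLVE;
  }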

sunpho84 commented 6 months ago

Thanks, I have to admit I don't fully understand what this flag implies. However, the strange thing is that (forgetting 4aaa0f175a, which was an experiment) the solver works fine with DIRECT_PC_SOLVE until an update of the setup is issued, meaning that the results pass the nissa validation test before the update and fail it afterwards.

The failure gets more and more severe the larger the mass change, but it is never as bad as if the update had not been carried out at all (I believe). Might it be some "minor" part of the operator that is not fully updated? Like the clover... or the inverse clover... in the PC operator?

kostrzewa commented 6 months ago

The solver will always work fine: preconditioning either the even-odd or the non-even-odd solve using the MG is supported.

Using even-odd precon for the outer operator (which is what this flag does) significantly reduces device memory usage and speeds up the solver as a whole, at least with the MG parameters that the autotuner in tmLQCD provides. One can find MG parameters that are optimal also for the non-even-odd solve.

The behaviour that you see with the update is odd. I'm just now trying to reproduce it, give me a while and I'll get back with a sequence of outputs with several updates in between.

When you perform a change of the quark mass, do you propagate this change to all of the instances of mu between the outer inv_param and the one on the interior of the QudaMultigridParam?
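Concretely, something along these lines (a sketch assuming the standard quda.h structs, with the twisted mass carried in the mu field):

  /* Sketch: after changing the twisted quark mass, the same value has to land
     both in the outer QudaInvertParam and in the QudaInvertParam embedded in
     the QudaMultigridParam, before the MG setup update is issued. */
  void propagate_mu(const double mu, QudaInvertParam *inv_param, QudaMultigridParam *mg_param)
  {
    inv_param->mu = mu;              /* outer solve */
    mg_param->invert_param->mu = mu; /* interior of the MG preconditioner */
  }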

kostrzewa commented 6 months ago

In tmLQCD, the params are set by _setOneFlavourSolverParam (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1074 -> https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1290)

If the MG is in use, this also triggers the MG setup to be created or updated -> https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1475

First the params are set: https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1480C5-L1480C27 and then _updateQudaMultigridPreconditioner is called which decides what needs to be done: https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1485C6-L1485C40
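Schematically, the decision logic amounts to the sketch below (not the actual tmLQCD code; only newMultigridQuda/updateMultigridQuda from quda.h are assumed):

  /* Sketch: the params are set first, then it is decided whether the MG
     preconditioner setup must be created, refreshed or simply reused. */
  void update_mg_preconditioner(void **mg_handle, QudaMultigridParam *mg_param,
                                const int gauge_changed, const int mass_changed)
  {
    if (*mg_handle == NULL)
      *mg_handle = newMultigridQuda(mg_param);   /* first use: full setup */
    else if (gauge_changed || mass_changed)
      updateMultigridQuda(*mg_handle, mg_param); /* refresh the existing setup */
    /* otherwise the existing setup is reused unchanged */
  }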

kostrzewa commented 6 months ago

Here are the results of a test on a 96c192 lattice on 8x4 A100s with 80GB of device memory each.

The solves below are using volume sources with |r|^2/|b|^2 set to 1e-20. I thus expect a final absolute r^2 of ~ 1e-20 × 96^3 × 192 × 24 ~ 4e-11.
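Spelled out, using the same per-site factor of 24 as above:

  |r|^2 ~ 1e-20 × (96^3 × 192) × 24 ≈ 1e-20 × 4.08e9 ≈ 4.1e-11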

UseEvenOdd = no set in tmLQCD operator definition -> inv_param.solve_type = QUDA_DIRECT_SOLVE

Memory usage:

  67725MiB / 81920MiB 

Solve history:

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.000149009948
# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Done: 91 iter / 16.4819 secs = 24164.3 Gflops
# Inversion done in 91 iterations, squared residue = 2.906976e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.000149009948
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.488882e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 92 iter / 16.1698 secs = 24945.3 Gflops
# Inversion done in 92 iterations, squared residue = 2.763540e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.471178e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 56 iter / 9.80139 secs = 24169.4 Gflops
# Inversion done in 56 iterations, squared residue = 2.779393e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.484471e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 56 iter / 9.48808 secs = 24963.7 Gflops
# Inversion done in 56 iterations, squared residue = 2.849896e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.483676e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.96745 secs = 27175.7 Gflops
# Inversion done in 46 iterations, squared residue = 2.544515e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.609029e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 6.06212 secs = 26721.7 Gflops
# Inversion done in 46 iterations, squared residue = 2.655446e-11!

So this is working fine even with mu = 100 mu_l.

In the next comment I'll have the UseEvenOdd = yes -> inv_param.solve_type = QUDA_DIRECT_PC_SOLVE.

kostrzewa commented 6 months ago

Do you have QUDA compiled with QUDA_DYNAMIC_CLOVER enabled? (ON is the default.) Disabling this explicitly might mess with things, but I've never tried.

sunpho84 commented 6 months ago

Do you have QUDA compiled with QUDA_DYNAMIC_CLOVER enabled? (ON is the default.) Disabling this explicitly might mess with things, but I've never tried.

In fact I've noticed this is off, I don't know why!

[screenshot of the build configuration showing QUDA_DYNAMIC_CLOVER set to OFF]

sunpho84 commented 6 months ago

This was the original command given to cmake:

cmake -DCMAKE_INSTALL_PREFIX:PATH=$PWD/install -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_MULTIGRID=ON -DQUDA_BUILD_SHAREDLIB=ON -DQUDA_JITIFY=OFF ../

I have to say that the quda version is rather outdated, corresponding to release 1.1.x but lacking all updates since July 2021.
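To switch the flag on explicitly, the same invocation would just gain one more option (keeping everything else as above):

  cmake -DCMAKE_INSTALL_PREFIX:PATH=$PWD/install -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_MULTIGRID=ON -DQUDA_BUILD_SHAREDLIB=ON -DQUDA_JITIFY=OFF -DQUDA_DYNAMIC_CLOVER=ON ../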

sunpho84 commented 6 months ago

The solver will always work fine: preconditioning either the even-odd or the non-even-odd solve using the MG is supported.

Using even-odd precon for the outer operator (which is what this flag does) significantly reduces device memory usage and speeds up the solver as a whole, at least with the MG parameters that the autotuner in tmLQCD provides. One can find MG parameters that are optimal also for the non-even-odd solve.

The behaviour that you see with the update is odd. I'm just now trying to reproduce it, give me a while and I'll get back with a sequence of outputs with several updates in between.

When you perform a change of the quark mass, do you propagate this change to all of the instances of mu between the outer inv_param and the one on the interior of the QudaMultigridParam?

I've spent some time looking at the code and I would be tempted to say yes; it would also be strange for things to work fine if this were not the case, no matter the flag.

kostrzewa commented 6 months ago

UseEvenOdd = yes set in tmLQCD operator definition -> inv_param.solve_type = QUDA_DIRECT_PC_SOLVE

Memory usage:

55865MiB / 81920MiB

(for this ensemble on this machine this makes the difference between being able to run on just 6 nodes or not)

Solve history:

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.000149009948
# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Done: 91 iter / 13.9933 secs = 20065.7 Gflops
# Inversion done in 91 iterations, squared residue = 1.746967e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.000149009948
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.225770e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 92 iter / 13.7781 secs = 20617.9 Gflops
# Inversion done in 92 iterations, squared residue = 2.372779e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.241568e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 55 iter / 7.70625 secs = 22123 Gflops
# Inversion done in 55 iterations, squared residue = 1.154781e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 4.379318e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 55 iter / 8.02102 secs = 21254.8 Gflops
# Inversion done in 55 iterations, squared residue = 1.129516e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.253719e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.02598 secs = 21326.3 Gflops
# Inversion done in 46 iterations, squared residue = 1.102640e-11!

# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.196641e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.23218 secs = 20485.8 Gflops
# Inversion done in 46 iterations, squared residue = 1.160401e-11!

So there we are. By setting inv_param.solve_type = QUDA_DIRECT_PC_SOLVE we save around 20% of device memory and get a slight speedup (with the same MG setup). The MG setup for QUDA_DIRECT_SOLVE could be tuned to perform similarly, but the tmLQCD autotuner always uses QUDA_DIRECT_PC_SOLVE, since it is hacked on top of the force calculation for CLOVERDET. (The "dream" is to eventually be able to launch an HMC without specifying an MG setup and have a good MG setup found automatically.)

sunpho84 commented 6 months ago

So with your release and/or with DYNAMIC_CLOVER switched on, everything works fine. I'll try to switch on DYNAMIC_CLOVER.

kostrzewa commented 6 months ago

Switching on DYNAMIC_CLOVER might help, yes. There are other good reasons for using it with twisted mass fermions (double-half CG is more stable, for example).

sunpho84 commented 6 months ago

Then I can try with a newer release of quda; which one would be better to use?

kostrzewa commented 6 months ago

I've been using 02391b12 for production a lot but this is quite old already. The tests above were with b930e9.

sunpho84 commented 6 months ago

Switching on DYNAMIC_CLOVER might help, yes. There are other good reasons for using it with twisted mass fermions (double-half CG is more stable, for example).

switching on the flag, the issue is gone! B64.3355447.out.txt

So is this something expected? Maybe I should try with the new version, to see if the flag is not needed any longer?

kostrzewa commented 6 months ago

switching on the flag, the issue is gone!

Very good!

You would need to explicitly reupload or trigger the inverse of the clover field to be regenerated after changing parameters if you are not using DYNAMIC_CLOVER. This is because the twisted quark mass enters the inverse. When using DYNAMIC_CLOVER the inverse is computed on the fly as it is being applied and so the problem does not arise.
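Roughly, the kind of call that would be needed after a mass change when DYNAMIC_CLOVER is off (a sketch against the plain quda.h interface; new_mu is just a placeholder and the exact bookkeeping depends on the interface code):

  /* Sketch: regenerate the clover term and, crucially, its inverse on the
     device after the twisted mass has changed, instead of reusing the stale one. */
  inv_param.mu = new_mu;
  inv_param.compute_clover = 1;         /* rebuild the clover field from the resident gauge field */
  inv_param.compute_clover_inverse = 1; /* rebuild its inverse, which mu enters */
  loadCloverQuda(NULL, NULL, &inv_param);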

kostrzewa commented 6 months ago

maybe I should try with the new version, to see if the flag is not needed any longer?

This would certainly be a nice test of whether the defaults are applied correctly (the old default was OFF).

sunpho84 commented 6 months ago

switching on the flag, the issue is gone!

Very good!

You would need to explicitly reupload or trigger the inverse of the clover field to be regenerated after changing parameters if you are not using DYNAMIC_CLOVER. This is because the twisted quark mass enters the inverse. When using DYNAMIC_CLOVER the inverse is computed on the fly as it is being applied and so the problem does not arise.

That's in fact what I had in mind as a possible interpretation of the issue. I see that the flag is indeed activated by default in the new release that @Marcogarofalo is using on JB; I need to check whether the issue was arising in his recent inclusive work.

On the other hand, I would say that the update of the setup should trigger the update of the inverse if the clover has to be recomputed...

kostrzewa commented 6 months ago

On the other hand, I would say that the update of the setup should trigger the update of the inverse if the clover has to be recomputed...

This would require much more logic in QUDA (in addition to the logic already present for keeping the various field copies in different precisions in sync). I'm pretty sure the devs would welcome a PR in that direction, but it's nontrivial, I think. What if you want to keep multiple clover fields resident?

sunpho84 commented 6 months ago

On the other hand, I would say that the update of the setup should trigger the update of the inverse if the clover has to be recomputed...

This would require much more logic in QUDA (in addition to the logic already present for keeping the various field copies in different precisions in sync). I'm pretty sure the devs would welcome a PR in that direction, but it's nontrivial, I think. What if you want to keep multiple clover fields resident?

From the way I see it working, at the moment there is only one internal clover field, right?

sunpho84 commented 6 months ago

Good! The new version of quda enables the flag by default, and the problem does not arise. We shall discuss whether there is a way to make the update more robust; for the time being I will find a workaround, like probing the quda version to give at least a warning, or catching the flag value to issue an error.

sunpho84 commented 6 months ago

Ok, the flag can be caught:

https://github.com/sunpho84/nissa/blob/65e201ee687b64ddc41a9c536c02329f1d5c99f5/src/base/quda_bridge.cpp#L571-L573

so the issue is "closed"

sunpho84 commented 6 months ago

In case quda is not compiled with DYNAMIC_CLOVER, it raises an error.