Closed: sunpho84 closed this issue 6 months ago.
In this file the first four inversions, at lines 916, 1186, 1459 and 1729, are done with the same mass parameter 0.0169 (the strange mass). At line 1773 the mass switches to the charm value (0.2368) and a setup update is issued; the residue that follows at line 2040 is wrong.
The issue always arises when the mass is changed, no matter whether the second mass is lighter or heavier.
I tried to add QUDA_DIRECT_PC_SOLVE also in the multigrid parameters (4aaa0f175a7598feff0ff0), but QUDA complains:
# QUDA: ERROR: Outer MG solver can only use QUDA_DIRECT_SOLVE at present
You have to be a bit careful about where you specify the PC_SOLVE. I think what we do in https://github.com/etmc/tmLQCD/blob/quda_work/quda_interface.c is correct; at least it appears to be so from testing different combinations.
The places where "PC_SOLVE" can be specified are:
1) the outer inv_param
2) the smoother_solve_type in the MG param
It cannot be set in the internal QudaInvertParam contained within the QudaMultigridParam.
In tmLQCD this is handled like so:
inv_param.solve_type = QUDA_DIRECT_PC_SOLVE (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1448)
QUDA_DIRECT_PC_SOLVE for smoother_solve_type (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1946)
The coarse grid solution type is MATPC_SOLUTION (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1958)
QUDA_DIRECT_SOLVE for the internal QudaInvertParam that is inside of QudaMultigridParam (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1760)
The matpc_type of the internal operator is always symmetric (https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1777)
Thanks, I have to admit I don't fully understand what this flag implies. However, the strange thing is that (forget 4aaa0f175a, which was an experiment) the solver works fine with QUDA_DIRECT_PC_SOLVE
until an update of the setup is issued, meaning that the result passes the nissa validation test before the update and fails it afterwards.
The failure becomes more severe as the mass difference grows, but it is never as bad as if the update had not been carried out at all (I believe). Might some "minor" part of the operator not be fully updated? Like the clover... or the inverse clover... in the PC operator?
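For reference, the combination of solve types described above can be sketched roughly as follows. This is a hedged sketch against QUDA's public C API (the helper function and loop are illustrative, not tmLQCD's actual code; the field names are those of QudaInvertParam and QudaMultigridParam):

```c
/* Sketch only: how the PC_SOLVE choices described above fit together.
   Requires QUDA's quda.h; not taken verbatim from tmLQCD. */
#include <quda.h>

void sketch_mg_solve_types(QudaInvertParam *inv_param,
                           QudaMultigridParam *mg_param)
{
  /* 1) outer solve: even-odd preconditioned */
  inv_param->solve_type = QUDA_DIRECT_PC_SOLVE;

  for (int level = 0; level < mg_param->n_level; level++) {
    /* 2) the smoothers may also be even-odd preconditioned ... */
    mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;
    /* ... with the coarse-grid problem posed on the preconditioned system */
    mg_param->coarse_grid_solution_type[level] = QUDA_MATPC_SOLUTION;
  }

  /* the internal QudaInvertParam must remain a direct (non-PC) solve */
  mg_param->invert_param->solve_type = QUDA_DIRECT_SOLVE;
  /* matpc_type of the internal operator is always symmetric (assumption:
     symmetric even-even preconditioning) */
  mg_param->invert_param->matpc_type = QUDA_MATPC_EVEN_EVEN;
}
```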
Preconditioning either the even-odd or the non-even-odd solve with the MG is supported; the solver will work fine either way.
Using even-odd precon for the outer operator (which is what this flag does) significantly reduces device memory usage and speeds up the solver as a whole, at least with the MG parameters that the autotuner in tmLQCD provides. One can find MG parameters that are optimal also for the non-even-odd solve.
The behaviour that you see with the update is odd. I'm just now trying to reproduce it, give me a while and I'll get back with a sequence of outputs with several updates in between.
When you perform a change of the quark mass, do you propagate this change to all of the instances of mu between the outer inv_param and the one on the interior of the QudaMultigridParam?
In tmLQCD, the params are set by _setOneFlavourSolverParam
(https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1074 -> https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1290)
If the MG is in use, this also triggers the MG setup to be created or updated -> https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1475
First the params are set: https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1480C5-L1480C27 and then _updateQudaMultigridPreconditioner
is called which decides what needs to be done: https://github.com/etmc/tmLQCD/blob/3d3c0b8fa509a90f4aa8a7d37642765564548d16/quda_interface.c#L1485C6-L1485C40
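The mu propagation asked about above can be sketched like this (a sketch under assumptions, not tmLQCD's actual code: the helper name is invented, but the fields and updateMultigridQuda() are QUDA's public API):

```c
/* Sketch only: when the twisted mass changes, mu must be propagated to
   every QudaInvertParam that references it, and the MG setup refreshed. */
#include <quda.h>

void sketch_set_mu(double mu, void *mg_preconditioner,
                   QudaInvertParam *inv_param, QudaMultigridParam *mg_param)
{
  /* outer solver parameters */
  inv_param->mu = mu;

  /* the internal parameters used by the MG must see the same mu,
     otherwise the preconditioner is built for the wrong operator */
  mg_param->invert_param->mu = mu;

  /* after changing mu the existing setup must be refreshed; QUDA
     provides updateMultigridQuda() for this */
  updateMultigridQuda(mg_preconditioner, mg_param);
}
```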
Here are the results of a test on a 96c192 lattice on 8x4 A100s with 80GB of device memory each.
The solves below use volume sources with |r|^2/|b|^2 set to 1e-20. I thus expect a final absolute r^2 of ~ 1e-20 * 96^3 * 192 * 24 ~ 4e-11.
UseEvenOdd = no
set in tmLQCD operator definition -> inv_param.solve_type = QUDA_DIRECT_SOLVE
67725MiB / 81920MiB
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.000149009948
# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Done: 91 iter / 16.4819 secs = 24164.3 Gflops
# Inversion done in 91 iterations, squared residue = 2.906976e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.000149009948
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.488882e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 92 iter / 16.1698 secs = 24945.3 Gflops
# Inversion done in 92 iterations, squared residue = 2.763540e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.471178e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 56 iter / 9.80139 secs = 24169.4 Gflops
# Inversion done in 56 iterations, squared residue = 2.779393e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.484471e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 56 iter / 9.48808 secs = 24963.7 Gflops
# Inversion done in 56 iterations, squared residue = 2.849896e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.483676e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.96745 secs = 27175.7 Gflops
# Inversion done in 46 iterations, squared residue = 2.544515e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 2.609029e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 6.06212 secs = 26721.7 Gflops
# Inversion done in 46 iterations, squared residue = 2.655446e-11!
So this is working fine even with mu = 100 mu_l.
In the next comment I'll have the UseEvenOdd = yes case -> inv_param.solve_type = QUDA_DIRECT_PC_SOLVE.
Do you have QUDA compiled with QUDA_DYNAMIC_CLOVER enabled? (ON is the default.) Disabling this explicitly might mess with things, but I've never tried.
In fact I've noticed this is off, I don't know why!
This was the original command given to cmake:
cmake -DCMAKE_INSTALL_PREFIX:PATH=$PWD/install -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_MULTIGRID=ON -DQUDA_BUILD_SHAREDLIB=ON -DQUDA_JITIFY=OFF ../
I have to say that the quda version is rather outdated, corresponding to release 1.1.x but lacking all updates since July 2021.
When you perform a change of the quark mass, do you propagate this change to all of the instances of mu between the outer inv_param and the one on the interior of the QudaMultigridParam?
I've spent some time looking at the code and I would be tempted to say yes. Also, it would be strange that things work fine if this were not the case, no matter the flag.
UseEvenOdd = yes
set in tmLQCD operator definition -> inv_param.solve_type = QUDA_DIRECT_PC_SOLVE
55865MiB / 81920MiB
(for this ensemble on this machine this makes the difference between being able to run on just 6 nodes or not)
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.000149009948
# TM_QUDA: Reusing MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Done: 91 iter / 13.9933 secs = 20065.7 Gflops
# Inversion done in 91 iterations, squared residue = 1.746967e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.000149009948
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.225770e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 92 iter / 13.7781 secs = 20617.9 Gflops
# Inversion done in 92 iterations, squared residue = 2.372779e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.241568e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 55 iter / 7.70625 secs = 22123 Gflops
# Inversion done in 55 iterations, squared residue = 1.154781e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.001490099479
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 4.379318e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 55 iter / 8.02102 secs = 21254.8 Gflops
# Inversion done in 55 iterations, squared residue = 1.129516e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = 0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.253719e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.02598 secs = 21326.3 Gflops
# Inversion done in 46 iterations, squared residue = 1.102640e-11!
# TM_QUDA: using MG solver to invert operator with 2kappamu = -0.014900994792
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.000000
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.196641e+00 s level: 1 proc_id: 0 /invert_eo_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Done: 46 iter / 5.23218 secs = 20485.8 Gflops
# Inversion done in 46 iterations, squared residue = 1.160401e-11!
So there we are. By setting inv_param.solve_type = QUDA_DIRECT_PC_SOLVE we save around 20% of device memory and get a slight speedup (with the same MG setup). The MG setup for QUDA_DIRECT_SOLVE could be tuned to perform similarly, but the tmLQCD autotuner always uses QUDA_DIRECT_PC_SOLVE as it is hacked on top of the force calculation for CLOVERDET. (The "dream" is to eventually be able to launch an HMC without specifying an MG setup, with a good MG setup being found automatically.)
So with your release and/or with DYNAMIC_CLOVER switched on, everything works fine. I'll try to switch on DYNAMIC_CLOVER.
Switching on DYNAMIC_CLOVER might help, yes. There are other good reasons for using it with twisted mass fermions (double-half CG is more stable, for example).
Then I can try with a newer release of quda; which one would be better to use?
I've been using 02391b12 for production a lot but this is quite old already. The tests above were with b930e9.
Switching on the flag, the issue is gone! B64.3355447.out.txt
So is this something expected? Maybe I should try with the new version, to see whether the flag is no longer needed?
switching on the flag, the issue is gone!
Very good!
You would need to explicitly reupload the clover field, or trigger its inverse to be regenerated, after changing parameters if you are not using DYNAMIC_CLOVER. This is because the twisted quark mass enters the inverse. When using DYNAMIC_CLOVER, the inverse is computed on the fly as it is being applied, and so the problem does not arise.
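A minimal sketch of that reupload path, assuming QUDA's public loadCloverQuda() entry point (the helper function is invented for illustration; the compute_clover fields are QUDA's, but this is not tmLQCD's actual code):

```c
/* Sketch only: without DYNAMIC_CLOVER, the clover inverse cached by QUDA
   still contains the old twisted mass, so it must be rebuilt after mu
   changes. */
#include <quda.h>

void sketch_refresh_clover(QudaInvertParam *inv_param, double new_mu)
{
  inv_param->mu = new_mu;

  /* ask QUDA to (re)compute the clover term and its inverse on the device
     from the resident gauge field, instead of uploading host fields */
  inv_param->compute_clover = 1;
  inv_param->compute_clover_inverse = 1;
  loadCloverQuda(NULL, NULL, inv_param);
}
```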
maybe I should try with the new version, to see if the flag is not needed any longer?
This would certainly nicely test whether the defaults are applied correctly. (The old default was OFF.)
That's in fact what I had in mind as a possible interpretation of the issue. I see that by default the flag is indeed activated in the new release that @Marcogarofalo is using on JB; I need to check whether the issue was arising in his recent inclusive work.
On the other hand, I would say that the update of the setup should trigger the update of the inverse if the clover has to be recomputed...
On the other hand, I would say that the update of the setup should trigger the update of the inverse if the clover has to be recomputed...
This would require much more logic in QUDA (in addition to the logic already present for keeping the various field copies in different precisions in synch). I'm pretty sure the devs would welcome a PR into that direction but it's nontrivial, I think. What if you want to keep multiple clover fields resident?
From the way I see it working, at the moment there is only one internal clover field, right?
Good! The new version of quda enables the flag by default, and the problem does not arise. We shall discuss whether there is a way to make the update more robust; for the time being I will find a workaround, like probing the quda version to give at least a warning, or catching the flag value to issue an error.
Ok, the flag can be caught, so the issue is "closed".
In case quda is not compiled with DYNAMIC_CLOVER, an error is now raised.
In dee6b93956ef94862dbfd1b083ab443b897d6880 I changed QUDA_DIRECT_SOLVE into QUDA_DIRECT_PC_SOLVE, following the discussion in https://github.com/sunpho84/nissa/pull/43 with @kostrzewa and @Marcogarofalo.
Today I noticed that if the solver is updated (for example after changing the mass), the second solve does not work correctly.
In particular, the nissa estimate of the residue deviates from the internal estimate, and the deviation grows with the mass difference.