lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda
Other
292 stars 99 forks source link

Fix crash in computeCloverForceQuda. #1339

Closed SaltyChiang closed 10 months ago

SaltyChiang commented 1 year ago

This might solve crashes in https://github.com/lattice/quda/pull/1330 and https://github.com/lattice/quda/pull/1338, @kostrzewa @Marcogarofalo could you please check this out?

The problem happens in cloverSigmaTraceCompute. It seems that arg.clover2.load(A,x,parity) is always called, and clover2(clover, 1) makes clover2 here initialized as the inverse of the input clover field. Not sure if I understand the meaning of parameters right.

More tests for correctness are needed.

mathiaswagner commented 1 year ago

Jenkins: Can one of the admins verify this patch?

SaltyChiang commented 1 year ago

I also make use of input args to use, return or update momResident.

Marcogarofalo commented 1 year ago

Hi, I still get an error at https://github.com/lattice/quda/blob/29b5dfcb6fc6ca91cd5c6ea9266159294146f885/lib/interface_quda.cpp#L4834

ERROR: Parity spinor volume 256 doesn't match clover checkboard volume 256 (rank 0, host lnode16.cluster.hiskp, dirac_twisted_clover.cpp:43 in virtual void quda::DiracTwistedClover::checkParitySpinor(const quda::ColorSpinorField&, const quda::ColorSpinorField&) const())

I think that the problem is similar to what was observed in #1338 : qParam.twist_flavor not being set when quarkX and quarkP are created.

SaltyChiang commented 1 year ago

I think this is a problem raised from twisted mass clover action, and setting qParam.twist_flavor should fix this. But this is actually another question, as QUDA doesn't provide computeTMCloverForceQuda for now.

This PR solves the unexpected cudaErrorIllegalAddress error in computeCloverForceQuda, which is something like https://github.com/lattice/quda/pull/1338#issuecomment-1320880012 and https://github.com/lattice/quda/pull/1338#issuecomment-1321203371. The hotfix makes computeCloverForceQuda runnable, and should be useful if you want to make a computeTMCloverForceQuda in the future.

kostrzewa commented 1 year ago

This PR solves the unexpected cudaErrorIllegalAddress error in computeCloverForceQuda, which is something like https://github.com/lattice/quda/pull/1338#issuecomment-1320880012 and https://github.com/lattice/quda/pull/1338#issuecomment-1321203371. The hotfix makes computeCloverForceQuda runnable, and should be useful if you want to make a computeTMCloverForceQuda in the future.

@SaltyChiang many thanks for taking a look at this. However, although I have to think about it more and may be wrong, I don't think that the changes are entirely correct. I believe that we have multiple unrelated problems, all of which result in cudaErrorIllegalAddress appearing in various places under certain conditions.

On the one hand, for asymmetric even-odd preconditioning, on one parity of the lattice there should be a contribution from the inverse of the clover term (which I think is what the parity-dependent choice between clover1 and clover2 was meant to implement in the kernel). For symmetric preconditioning, instead, this contribution should appear on both parities. The most detailed account of this, I believe, can be found in Appendix A of https://inspirehep.net/literature/568934 (eq A12).

Without having taken a real look I'd make the following guess: in cloverSigmaTraceCompute, if you do not have the inverse of the clover term available, this should of course lead to problems with accessing clover2.

The problem that I observed at the end of #1338, instead, is an issue with exchangeGhost in computeCloverForce which I still have to look at in more detail to understand fully. On a single GPU, computeTMCloverForceQudaas implemented in #1338 runs without errors (this is output from tmLQCD's https://github.com/etmc/tmLQCD/tree/quda_work_tm_force branch running a simple HMC with a twisted clover clover determinant):

# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: gamma5(tmp, x.Even())
# QUDA: M(p.Even(), tmp)
# QUDA: Dslash(p.Odd(), p.Even(), QUDA_ODD_PARITY)
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: computeCloverSigmaTrace(oprod, *cloverPrecise, k_csw_ov_8);
# QUDA: computeCloverSigmaOprod(oprod, quarkX, quarkP, ferm_epsilon)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_ODD_PARITY)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_EVEN_PARITY)
# QUDA: updateMomentum(gpuMom, -1.0, cudaForce, "tmclover")
# TM_QUDA: Time for computeTMCloverForceQuda 3.765968e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/computeTMCloverForceQuda
# TM_QUDA: Time for reorder_mom_fromQuda 2.001856e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/reorder_mom_fromQuda
# TM_QUDA: Time for add_mom_to_derivative 1.547076e-03 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/add_mom_to_derivative
# TM_QUDA: Time for compute_cloverdet_derivative_quda 7.191028e-02 s level: 2 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda
SaltyChiang commented 1 year ago

@kostrzewa Thank you for your comment!

I agree with you about the necessity of the inversed clover term, and only the inversed term should be used in the sigma trace process.

But this is not a problem about inversed clover term not being created. I set compute_clover_inverse=1 while calling loadCloverQuda and the inversed clover term should be created by then. The error is still raised.

I believe we need a mechanism to select whether we should compute the sigma trace on the full site or just the odd parity site. I didn't see such things now in QUDA, or maybe I missed something.

But this is not a problem about not inversed clover term not being created. I use compute_clover_inverse

SaltyChiang commented 1 year ago

DYNAMIC_CLOVER blocks the creation of inversed clover term, but computeCloverSigmaTrace didn't apply the dynamic inverse and finally makes the error.

maddyscientist commented 1 year ago

For now, perhaps it would be a good idea to disable dynamic clover, to allow for the construction of an explicit inverse clover field.

SaltyChiang commented 1 year ago

@maddyscientist I updated the branch to apply dynamic inverse in cloverSigmaTraceCompute, please check if I did that correctly.

maddyscientist commented 1 year ago

I think the changes for computing the dynamic clover inverse on the fly are correct from looking at them.

However, there are other issues with this. Specifically in the clover_outer_product.cu file, you have switched to using ColorSpinorField::exchangeGhost as opposed to the one defined in the file. While this exchange function will work, it doesn't quite do what is required here, since the outer product kernel is expecting the halo buffer to contain the spin-projected elements, not the raw ghost elements.

I don't want too many parallel debugging efforts going on here at once, as this will just lead to confusion. Perhaps for now I will concentrate on supporting @kostrzewa and @Marcogarofalo efforts to get the force working for their use case, and then we can consider your use case.

Does that sound reasonable?

SaltyChiang commented 1 year ago

No problem. I believe their case is not very far from mine, and solving their problem definitely helps in my work.

kostrzewa commented 1 year ago

@SaltyChiang thanks for all this effort. For now I've merged in the changes in how the clover term and its inverse are treated here into the `feature/tm_force' branch (which is only temporary, as discussed in #1338).

I hope that in the coming days @Marcogarofalo and I will be able to invest more time into making some progress on this.

maddyscientist commented 10 months ago

I think we can close this PR in favor of #1338. @SaltyChiang feel free to correct me if I'm missing something.