Closed SaltyChiang closed 10 months ago
Jenkins: Can one of the admins verify this patch?
I also make use of input args to use, return or update momResident
.
Hi, I still get an error at https://github.com/lattice/quda/blob/29b5dfcb6fc6ca91cd5c6ea9266159294146f885/lib/interface_quda.cpp#L4834
ERROR: Parity spinor volume 256 doesn't match clover checkboard volume 256 (rank 0, host lnode16.cluster.hiskp, dirac_twisted_clover.cpp:43 in virtual void quda::DiracTwistedClover::checkParitySpinor(const quda::ColorSpinorField&, const quda::ColorSpinorField&) const())
I think that the problem is similar to what was observed in #1338 :
qParam.twist_flavor
not being set when quarkX
and quarkP
are created.
I think this is a problem raised from twisted mass clover action, and setting qParam.twist_flavor
should fix this. But this is actually another question, as QUDA doesn't provide computeTMCloverForceQuda
for now.
This PR solves the unexpected cudaErrorIllegalAddress
error in computeCloverForceQuda
, which is something like https://github.com/lattice/quda/pull/1338#issuecomment-1320880012 and https://github.com/lattice/quda/pull/1338#issuecomment-1321203371. The hotfix makes computeCloverForceQuda
runnable, and should be useful if you want to make a computeTMCloverForceQuda
in the future.
This PR solves the unexpected cudaErrorIllegalAddress error in computeCloverForceQuda, which is something like https://github.com/lattice/quda/pull/1338#issuecomment-1320880012 and https://github.com/lattice/quda/pull/1338#issuecomment-1321203371. The hotfix makes computeCloverForceQuda runnable, and should be useful if you want to make a computeTMCloverForceQuda in the future.
@SaltyChiang many thanks for taking a look at this. However, although I have to think about it more and may be wrong, I don't think that the changes are entirely correct. I believe that we have multiple unrelated problems, all of which result in cudaErrorIllegalAddress
appearing in various places under certain conditions.
On the one hand, for asymmetric even-odd preconditioning, on one parity of the lattice there should be a contribution from the inverse of the clover term (which I think is what the parity-dependent choice between clover1
and clover2
was meant to implement in the kernel). For symmetric preconditioning, instead, this contribution should appear on both parities. The most detailed account of this, I believe, can be found in Appendix A of https://inspirehep.net/literature/568934 (eq A12).
Without having taken a real look I'd make the following guess: in cloverSigmaTraceCompute
, if you do not have the inverse of the clover term available, this should of course lead to problems with accessing clover2
.
The problem that I observed at the end of #1338, instead, is an issue with exchangeGhost
in computeCloverForce
which I still have to look at in more detail to understand fully. On a single GPU, computeTMCloverForceQuda
as implemented in #1338 runs without errors (this is output from tmLQCD's https://github.com/etmc/tmLQCD/tree/quda_work_tm_force branch running a simple HMC with a twisted clover clover determinant):
# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: gamma5(tmp, x.Even())
# QUDA: M(p.Even(), tmp)
# QUDA: Dslash(p.Odd(), p.Even(), QUDA_ODD_PARITY)
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: computeCloverSigmaTrace(oprod, *cloverPrecise, k_csw_ov_8);
# QUDA: computeCloverSigmaOprod(oprod, quarkX, quarkP, ferm_epsilon)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_ODD_PARITY)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_EVEN_PARITY)
# QUDA: updateMomentum(gpuMom, -1.0, cudaForce, "tmclover")
# TM_QUDA: Time for computeTMCloverForceQuda 3.765968e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/computeTMCloverForceQuda
# TM_QUDA: Time for reorder_mom_fromQuda 2.001856e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/reorder_mom_fromQuda
# TM_QUDA: Time for add_mom_to_derivative 1.547076e-03 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/add_mom_to_derivative
# TM_QUDA: Time for compute_cloverdet_derivative_quda 7.191028e-02 s level: 2 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda
@kostrzewa Thank you for your comment!
I agree with you about the necessity of the inversed clover term, and only the inversed term should be used in the sigma trace process.
But this is not a problem about inversed clover term not being created. I set compute_clover_inverse=1
while calling loadCloverQuda
and the inversed clover term should be created by then. The error is still raised.
I believe we need a mechanism to select whether we should compute the sigma trace on the full site or just the odd parity site. I didn't see such things now in QUDA, or maybe I missed something.
But this is not a problem about not inversed clover term not being created. I use compute_clover_inverse
DYNAMIC_CLOVER
blocks the creation of inversed clover term, but computeCloverSigmaTrace
didn't apply the dynamic inverse and finally makes the error.
For now, perhaps it would be a good idea to disable dynamic clover, to allow for the construction of an explicit inverse clover field.
@maddyscientist I updated the branch to apply dynamic inverse in cloverSigmaTraceCompute
, please check if I did that correctly.
I think the changes for computing the dynamic clover inverse on the fly are correct from looking at them.
However, there are other issues with this. Specifically in the clover_outer_product.cu file, you have switched to using ColorSpinorField::exchangeGhost
as opposed to the one defined in the file. While this exchange function will work, it doesn't quite do what is required here, since the outer product kernel is expecting the halo buffer to contain the spin-projected elements, not the raw ghost elements.
I don't want too many parallel debugging efforts going on here at once, as this will just lead to confusion. Perhaps for now I will concentrate on supporting @kostrzewa and @Marcogarofalo efforts to get the force working for their use case, and then we can consider your use case.
Does that sound reasonable?
No problem. I believe their case is not very far from mine, and solving their problem definitely helps in my work.
@SaltyChiang thanks for all this effort. For now I've merged in the changes in how the clover term and its inverse are treated here into the `feature/tm_force' branch (which is only temporary, as discussed in #1338).
I hope that in the coming days @Marcogarofalo and I will be able to invest more time into making some progress on this.
I think we can close this PR in favor of #1338. @SaltyChiang feel free to correct me if I'm missing something.
This might solve crashes in https://github.com/lattice/quda/pull/1330 and https://github.com/lattice/quda/pull/1338, @kostrzewa @Marcogarofalo could you please check this out?
The problem happens in
cloverSigmaTraceCompute
. It seems thatarg.clover2.load(A,x,parity)
is always called, andclover2(clover, 1)
makesclover2
here initialized as the inverse of the input clover field. Not sure if I understand the meaning of parameters right.More tests for correctness are needed.