Closed kostrzewa closed 8 months ago
@Marcogarofalo @simone-romiti fyi
This is similar to @Marcogarofalo's #1330 but I want to focus on getting this to work with tmLQCD first and then generalize the original function from there. As a result, the present PR will be temporary and only to share progress.
The problem https://github.com/lattice/quda/pull/1330#issuecomment-1285797127 is clearly fixed by removing the extra `qParam.x[0] /= 2;`.
However, the next problem crops up when the Dslash is supposed to be applied:
```
# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: ERROR: Parity spinor volume 16384 doesn't match clover checkboard volume 16384 (rank 0, host node-03.bender, dirac_twisted_clover.cpp:43 in virtual void quda::DiracTwistedClover::checkParitySpinor(const quda::ColorSpinorField&, const quda::ColorSpinorField&) const())
# QUDA: last kernel called was (name=N4quda15CopyColorSpinorILi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2ELb0ELb0EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEESt5tupleIJRNS_16ColorSpinorFieldERKS7_19QudaFieldLocation_sPdPKdEEEE,volume=8x16x16x8,aux=GPU-offline,vol=16384,parity=1,precision=8,order=9,Ns=4,Nc=3vol=16384,parity=1,precision=8,order=2,Ns=4,Nc=3,NonRelBasis)
```
and this I find quite strange, since 16384 == 16384 :smile:
The issue above occurs because of
```cpp
void DiracTwistedClover::checkParitySpinor(const ColorSpinorField &out, const ColorSpinorField &in) const
{
  Dirac::checkParitySpinor(out, in);

  if (out.TwistFlavor() == QUDA_TWIST_SINGLET) {
    if (out.Volume() != clover->VolumeCB())
      errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                clover->VolumeCB());
  } else {
    if (out.Volume() / 2 != clover->VolumeCB())
      errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                clover->VolumeCB());
  }
}
```
and `qParam.twist_flavor` not being set when `quarkX` and `quarkP` are created.

I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we generally should check more frequently if `twistFlavor` is set and compatible?
The next small step. Here I do not really understand where the problem comes from.
```
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS (timer.h:107 in peek()) (rank 0, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA: last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
--------------------------------------------------------------------------
```
> The next small step. Here I do not really understand where the problem comes from.
>
> ```
> # QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
> # QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS (timer.h:107 in peek())
> ```
I got a similar error while trying to use `computeCloverForceQuda`. I checked the CUDA documentation and found that `cudaEventSynchronize` is not documented to return `cudaErrorIllegalAddress`, which is pretty strange. When I try to disable autotuning by setting `QUDA_ENABLE_TUNING=0`, the same error code (`cudaErrorIllegalAddress`) occurs at another place.
I suspect that the problem is in the spinor comms of `computeCloverForce` and that the error message is misleading.

I just realised I've been running with a RELEASE build of QUDA and will investigate with a DEBUG build first of all :)
```
Thread 1 "hmc_tm" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4320, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
quda::Kernel1D<quda::Exterior, quda::CloverForceArg<double, 3, (QudaReconstructType_s)18, 3>, false><<<(64,1,1),(32,1,1)>>> ()
    at /home/bkostrze/code/quda-tm_force/lib/../include/color_spinor.h:1017 in _ZN4quda18outerProdSpinTraceIdLi3ELi4EEENS_6MatrixINS_7complexIT_EEXT0_EEERKNS_11ColorSpinorIS3_XT0_EXT1_EEES9_ inlined from color_spinor.h:115
```
It seems that the error does not quite occur where it is reported above, as far as I can tell from a run with `cuda-gdb`.
Thanks for the tip about disabling tuning @SaltyChiang, this makes it quite obvious where the issue is:
```
# QUDA: ERROR: qudaDeviceSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS (clover_outer_product.cu:100 in exchangeGhost()) (rank 2, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA: last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
```
I think I can confirm that the error in https://github.com/lattice/quda/pull/1338#issuecomment-1321203371 stems from the `exchangeGhost` of `computeCloverForce`, since it runs without erroring out on a single GPU (this is output from tmLQCD's https://github.com/etmc/tmLQCD/tree/quda_work_tm_force branch running a simple HMC with a twisted clover determinant):
```
# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: gamma5(tmp, x.Even())
# QUDA: M(p.Even(), tmp)
# QUDA: Dslash(p.Odd(), p.Even(), QUDA_ODD_PARITY)
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: computeCloverSigmaTrace(oprod, *cloverPrecise, k_csw_ov_8);
# QUDA: computeCloverSigmaOprod(oprod, quarkX, quarkP, ferm_epsilon)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_ODD_PARITY)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_EVEN_PARITY)
# QUDA: updateMomentum(gpuMom, -1.0, cudaForce, "tmclover")
# TM_QUDA: Time for computeTMCloverForceQuda 3.765968e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/computeTMCloverForceQuda
```
I will get more involved with this debug in the next day or so. One thought though: when debugging it's best to run with `CUDA_LAUNCH_BLOCKING=1`; this will ensure that all kernels are synchronized with the host, and if an illegal memory access, etc., occurs, you will be guaranteed to be told exactly which kernel is misbehaving by QUDA's error checking.
> One thought though: when debugging it's best to run with `CUDA_LAUNCH_BLOCKING=1`; this will ensure that all kernels are synchronized with the host, and if an illegal memory access, etc., occurs, you will be guaranteed to be told exactly which kernel is misbehaving by QUDA's error checking.
Thanks a lot, I keep forgetting about this...
On 2 GPUs:
```
# QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel()) (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
# QUDA: last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)
```
> The issue above occurs because of `DiracTwistedClover::checkParitySpinor` (quoted in full above) and `qParam.twist_flavor` not being set when `quarkX` and `quarkP` are created. I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we generally should check more frequently if `twistFlavor` is set and compatible?
To be honest I never really liked this design where a singlet fermion field is distinct from a non-twisted field. The dimensions are identical, and the field can be used as intended, so I don't see the need for the separate field type. I'm not sure what the original design motivation was for this (perhaps @alexstrel knows, as I think he was the originator, but this may be lost in the mists of time?)
A better approach here may be to dump the distinction between `QUDA_TWIST_NO` and `QUDA_TWIST_SINGLET` and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?
> On 2 GPUs:
>
> ```
> # QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel()) (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
> # QUDA: last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)
> ```
Ok, so this suggests the issue isn't the `exchangeGhost` and is the `CloverForce` kernel. Moreover, we can see from the signature that it's the halo kernel (that's what the `exterior` part means). So that would explain why it's a multi-GPU issue.
Maybe I missed it, but what happens when you run on a single GPU?
> Maybe I missed it, but what happens when you run on a single GPU?
It works fine (I haven't checked yet whether it's correct, but at least it doesn't crash). I will be able to work some more on this next week hopefully and make some more concrete progress.
I think the problem is in `exchangeGhost`. Please see this commit in #1339. After using the `exchangeGhost` in `ColorSpinorField` instead of the one in `clover_outer_product.cu`, this function doesn't raise `cudaErrorIllegalAddress` for 2 GPUs.

The `exchangeGhost` in `clover_outer_product.cu` seems to make `inB.Ghost()` invalid for device functions; I still don't know why. At least we can go to the next step.
> A better approach here may be to dump the distinction between `QUDA_TWIST_NO` and `QUDA_TWIST_SINGLET` and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?
I agree. The name is also a bit fundamentally confusing in some sense, since I would consider the "flavour" to correspond to either +mu or -mu or equivalently +r or -r in the "physical basis", rather than the distinction between single-flavour and doublet. That is a whole other story, however, and the present solution works well in practice.
On the other hand, having a flag indicating that one is dealing with a twisted mass operator (rather than having to check whether mu is non-zero, which is against best practice, even though this is also done here and there) is not a bad thing. This fact is useful in a couple of places:

1) the clover field and its inverse
2) implicit use in `transfer->Vectors().TwistFlavor()` (I think)

Before getting rid of it, perhaps it's worth checking that this would not introduce lots of local and equivalent variables, which would likely be `bool twisted = param.mu != 0.0` (or a safer variant thereof). One might conclude, of course, that having these defined locally is better for one reason or another.
All good points @kostrzewa regarding `TwistFlavour()`. I will have a think about this, do some hacking, and see what solution makes sense.
I've taken a look at the clover_outer_product.cu file a bit more to see if I could visually isolate what's going on here. One thing I do see is that the halo exchange code is rather nasty, and is badly in need of an update to clean it up. It wouldn't surprise me if this code has bit rotted, and that in turn could lead to the crashes you were seeing, e.g., if the ghost buffers weren't being sized correctly, or some such.

@SaltyChiang tried replacing it with `ColorSpinorField::exchangeGhost` in #1339, but that isn't quite the correct thing to do, at least in isolation, since kernel changes would be needed as well.
May I suggest the following (which I think is what you were trending towards anyway):

- we focus on getting single GPU to work correctly
- once we have that, it will make debugging the multi-GPU variant much easier and more self contained
- moreover, in order to remove the dynamic clover aspect as a source of error, we focus on non-dynamic clover

What's the best way for me to run this function myself? I assume you're calling this from tmLQCD?
> May I suggest the following (which I think is what you were trending towards anyway):
>
> - we focus on getting single GPU to work correctly
> - once we have that, it will make debugging the multi-GPU variant much easier and more self contained
> - moreover, in order to remove the dynamic clover aspect as a source of error, we focus on non-dynamic clover
I agree, this is exactly what I was trying to do to get properly familiar with this part of QUDA and to iron out crashes first so that actual step by step testing can be done.
> What's the best way for me to run this function myself? I assume you're calling this from tmLQCD?
If you'd like to test right away, I can send instructions on how to compile tmLQCD+QUDA and to provide an input file which will result in the function being called in the derivative of a twisted clover determinant monomial.
In the meantime perhaps we can get started on a test harness in QUDA in the hope that we can get this under test coverage eventually.
Disregard what was written in https://github.com/lattice/quda/pull/1338#issuecomment-1324701045 before about `inv_param` not being correct. We just checked with @Marcogarofalo and in fact all should be fine.
Getting started on a test harness in QUDA does sound like a good immediate step. That would then allow, for example, me to focus on crashes with multi-GPU.
@Marcogarofalo as discussed I merged in @SaltyChiang's additions from #1339 such that we can proceed step by step here.
`computeTMCloverForceQuda` now gives the same result as tmLQCD.
that sounds like a good thing 😄 🍾
Thanks, but it is still working only with one mpi process.
> Thanks, but it is still working only with one MPI process.
In that case, wondering if now is a good time to work on an internal QUDA test for this function? With that done, I could then focus on making the multi-GPU results consistent with the single-GPU. @kostrzewa you noted before you could add a numerical derivative for this.
> In that case, wondering if now is a good time to work on an internal QUDA test for this function? With that done, I could then focus on making the multi-GPU results consistent with the single-GPU. @kostrzewa you noted before you could add a numerical derivative for this.
@maddyscientist I would very much like to proceed on this but until the beginning of March I have a number of important deadlines that require full dedication unfortunately.
@Marcogarofalo since it seems you have a working test now, and the code is working against tmLQCD, does this mean this is ready for review now?
Hi, thanks for the reply. I am unsure which requirements need to be met before starting a review. Maybe I can do a first clean-up before the review starts.
@Marcogarofalo ideally we have:
I'm happy to help you get this over the line. Most of the hard work is done I think.
Ok, I've started to work on it, to address some of the issues. I've found a few limitations that would be good to fix:

- Add support for Wilson-clover, e.g., no twisted mass term
- Support for even-even / odd-odd preconditioning, and not just asymmetric versions
- Test code hard codes sloppy, refinement and preconditioner precisions (I can fix this)
Also, I am finding that if I run the test command line above:

```
mpirun -n 2 ./tests/TMCloverForce_test --dslash-type clover --compute-clover 1 --matpc even-even-asym --dim 4 4 4 4 --prec double --gridsize 2 1 1 1 --niter 1 --compute-clover-trlog 1 --verbosity verbose --kappa 1 --mu 1 --clover-csw 1
```

I get an error:

```
clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)
```

But I guess you don't get this problem?
> I get an error:
>
> ```
> clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
> ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
> last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)
> ```
@maddyscientist This often happens when I'm trying to build the clover term for a random gauge field generated by `gaussGaugeQuda`. No idea why this happens, but decreasing sigma works.
And I noticed @Marcogarofalo calls `computeCloverForce` with `... quarkP, quarkX, ...` instead of `... quarkX, quarkP` (the latter should be the correct order described by the docstring). I tried to do a similar thing (you can see the modification here, which is almost a swap between `A,B` and `C,D`), and get a reasonable deltaH in a clover HMC process. If this is really an issue, maybe we should change the implementation of `computeCloverForce` to match the docstring.
Also, the dynamic inverse of the clover term in clover_trace.cuh now differs from the one in clover_invert.cuh. And I'm wondering what will happen if we disable the dynamic clover term?
@maddyscientist I do not get this problem; maybe setting `--compute-clover-trlog 0` will avoid the problem for the moment.
@SaltyChiang I saw https://github.com/CLQCD/quda/commit/b47950dd3026e2356298e41199e235d68f593746 but then also the exterior kernel needs to be modified with the same swap.
@Marcogarofalo The modification I made in that commit is not a good one; I just used it to make HMC on a single GPU work. I think it's better to swap `x` and `p` here.
> Ok, I've started to work on it, to address some of the issues. I've found a few limitations, that would be good to fix:
>
> - Add support for Wilson-clover, e.g., no twisted mass term

Adding support for Wilson-clover should be possible; in principle, just setting mu=0 should work. A remaining question is whether to use the Hermitian operator or not.

> - Support for even-even / odd-odd preconditioning, and not just asymmetric versions

I can write the symmetric version, but I do not have any existing reference to test it against.

> - Test code hard codes sloppy, refinement and preconditioner precisions (I can fix this)
Regarding the error:

> Also, I am finding that if I run the test command line above:
>
> ```
> mpirun -n 2 ./tests/TMCloverForce_test --dslash-type clover --compute-clover 1 --matpc even-even-asym --dim 4 4 4 4 --prec double --gridsize 2 1 1 1 --niter 1 --compute-clover-trlog 1 --verbosity verbose --kappa 1 --mu 1 --clover-csw 1
> ```
>
> I get an error:
>
> ```
> clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
> ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
> last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)
> ```
>
> But I guess you don't get this problem?
I did not get the problem because I was using `--dslash-type twisted-clover`; with `--dslash-type clover` the value of kappa used is too big. With the default kappa value `--kappa 0.121951` the test runs, but it fails.
> Ok, I've started to work on it, to address some of the issues. I've found a few limitations, that would be good to fix:
>
> - Add support for Wilson-clover, e.g., no twisted mass term
> - Support for even-even / odd-odd preconditioning, and not just asymmetric versions
The problem with even-even / odd-odd symmetric preconditioning is that the force is not obtained from the asymmetric version with a change of the operator; more algebra is needed, at least from what I understand from Appendix A of https://arxiv.org/pdf/hep-lat/0112051.pdf.

I do not have a reference implementation in tmLQCD to test against. I would prefer to return an error if `computeTMCloverForceQuda` is called with even-even / odd-odd preconditioning, and in the future someone interested can implement it. Does this strategy seem reasonable to you?
> The problem with even-even / odd-odd symmetric preconditioning is that the force is not obtained from the asymmetric version with a change of the operator; more algebra is needed, at least from what I understand from Appendix A of https://arxiv.org/pdf/hep-lat/0112051.pdf. I do not have a reference implementation in tmLQCD to test against. I would prefer to return an error if `computeTMCloverForceQuda` is called with even-even / odd-odd preconditioning, and in the future someone interested can implement it. Does this strategy seem reasonable to you?
Understood @Marcogarofalo. Returning an error for the symmetric case is fine I think.
> Understood @Marcogarofalo. Returning an error for the symmetric case is fine I think.
I implemented it. Maybe it can help if I list the problems that I did not manage to solve:

The ghost exchange in computeCloverForce is not optimal: https://github.com/lattice/quda/blob/30e4f52b0f45b259c19ddb7367f04185be51d252/lib/clover_outer_product.cu#L157-L160. It works, but there are unnecessary communications done. Also, notice that the order of this exchange is important; for example, the following does not work:

```cpp
inB.exchangeGhost((QudaParity)(1 - parity), nFace, dag);
inD.exchangeGhost((QudaParity)(1 - parity), nFace, 1 - dag);
exchangeGhost(inB, parity, dag);
exchangeGhost(inD, parity, 1 - dag);
```

I think the cause is something like the two spinors B and D using the same ghost array.
In the host reference implementation I did not manage to create an extended `oprod` (a `cudaGaugeField` with `QUDA_TENSOR_GEOMETRY`), so I used the device routines to exchange the ghost and then downloaded to the host: https://github.com/lattice/quda/blob/30e4f52b0f45b259c19ddb7367f04185be51d252/tests/host_reference/TMCloverForce_reference.cpp#L197-L203
This is not really a problem, but I found that the `cloverInvert` function in https://github.com/lattice/quda/blob/feature/tm_force/include/kernels/clover_invert.cuh#L62-L64 does not really compute the clover inverse for twisted mass. It is missing this part that I added in `computeCloverSigmaTrace`: https://github.com/lattice/quda/blob/30e4f52b0f45b259c19ddb7367f04185be51d252/include/kernels/clover_trace.cuh#L59-L61. But I guess that changing this is tedious; one would need to change the Dirac operator accordingly.
@maddyscientist Do you think that we could discuss the first point that @Marcogarofalo raises above in https://github.com/lattice/quda/pull/1338#issuecomment-1790508776 about the ghost exchange? We just discussed and he suggested that perhaps splitting the computation to do one checkerboard first and then the other would provide an easy fix for the fact that currently communication is done twice. Our next "physics" target is to get the force working for our rational approximation for the "1+1" sector (which uses the two-flavour ND operator) as this is the single largest remaining bottleneck and it would be great if we could benefit there from doing the least amount of comms.
@kostrzewa yes this would be good to discuss. I'm planning to push a few updates shortly to this branch, namely to do some cleanup and merge in the latest develop branch. Let's discuss this after I've done this (which I'll hopefully get to in the next day or two).
@kostrzewa @Marcogarofalo I've done further testing and cleanup:
Regarding the redundant communications in exchangeGhost: can you explain what unnecessary comms are being done here?
I've found a few outstanding issues that need to be fixed:

- The test should test multiple quark fields, e.g., as in RHMC
- The test should test with a determinant ratio (ctest should automatically test both paths)
- There's a segmentation fault when enabling partitioning: the segmentation fault is in the test code itself (e.g., run with `--partition 12` on a single GPU)

@maddyscientist thank you for the help.
> Regarding the redundant communications in exchangeGhost: can you explain what unnecessary comms are being done here?
I think that it is here: https://github.com/lattice/quda/blob/901ddedfb572385c4f6c8b1a14df392268f70cf7/lib/clover_outer_product.cu#L153-L156. It looks like there is an exchange with `inB.exchangeGhost` and then with `exchangeGhost`; I do not fully understand this part, so maybe I am wrong. With only line 154 the kernel gives `CUDA_ERROR_ILLEGAL_ADDRESS`; I added line 153 in https://github.com/lattice/quda/pull/1338/commits/1e1af7f561c266efbda76e1a604218b13b30dc22. I also noticed that swapping any of the lines above gives a wrong result.
> I've found a few outstanding issues that need to be fixed:
>
> - The test should test multiple quark fields, e.g., as in RHMC

Sorry, I was ignoring this part because in tmLQCD the RHMC is different: it needs the non-degenerate operator. I can work on it.

> - The test should test with a determinant ratio (ctest should automatically test both paths)

ok

> - There's a segmentation fault when enabling partitioning: the segmentation fault is in the test code itself (e.g., run with `--partition 12` on a single GPU)

I will have a look.
@maddyscientist I hope I fixed the issues you listed. I see a failure in the CI, but looking at the logs I see:

```
slurmstepd: error: *** STEP 50331581.1 ON nid04394 CANCELLED AT 2023-12-05T20:00:47 DUE TO TIME LIMIT ***
```
@Marcogarofalo Thanks for the latest fixes. All looks good, and I wouldn't worry about the CI failure at CSCS; this target is a bit temperamental and sometimes takes longer than the maximum time, hence the failure.

I've identified the issue with the exchange, and am working on a fix now.
Ok, I have pushed the fix for the quark field exchange. The issue was that the ghost pointers in the accessor were not being set correctly. This was caused by there being two ghost pointer getter methods, `ColorSpinorField::Ghost()` and `ColorSpinorField::Ghost2()`, which do slightly different things (some long-term cruft that I need to address and unify).
I've also pushed some rudimentary OMP parallelization of the host test code which allows for easier testing of larger problem sizes.
@Marcogarofalo @kostrzewa what from your end needs to be done on this PR? I assume the non-degenerate force is for a later PR?

@maddyscientist I also think that it is better to keep the non-degenerate force for another PR.
My only remaining doubt here is in the reference implementation. I used the device routine to create the extended version of `oprod`: https://github.com/lattice/quda/blob/c1ee1f7d0df32ad4329d6ef4531d5bc37f7b69a4/tests/host_reference/TMCloverForce_reference.cpp#L236. I didn't manage to make the host version with `QUDA_TENSOR_GEOMETRY`, so I am not sure whether the current state is acceptable.
Thanks @Marcogarofalo for your thoughts. I'll take a look at the reference code, though I don't view this as critical.
I'm curious: compared to the reference tmLQCD implementation, how is QUDA performance looking? I'll push some more performance updates to this branch shortly.
From a quick test on a `64^3 x 128` lattice using 4 nodes of juwels-booster (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html), assuming that I am using the machine in a reasonable way, I get:

- tmLQCD native implementation: 3.60 s
- computeTMCloverForceQuda: 0.35 s

It is not a physical ensemble; however, that should not matter for this test.
> From a quick test on a `64^3 x 128` lattice using 4 nodes of juwels-booster (https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html), assuming that I am using the machine in a reasonable way, I get:
>
> - tmLQCD native implementation: 3.60 s
> - computeTMCloverForceQuda: 0.35 s
wow! excellent!
This is a temporary copy of `computeCloverForceQuda` with (almost) the minimum amount of functionality needed to implement the derivative of a twisted clover determinant monomial, as a starting point for debugging.