lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

minimal trial implementation for a twisted clover determinant derivative (pre-draft) #1338

Closed: kostrzewa closed 8 months ago

kostrzewa commented 1 year ago

This is a temporary copy of computeCloverForceQuda with (almost) the minimum amount of functionality to implement the derivative of a twisted clover determinant monomial, as a starting point for debugging.

kostrzewa commented 1 year ago

@Marcogarofalo @simone-romiti fyi

kostrzewa commented 1 year ago

This is similar to @Marcogarofalo's #1330 but I want to focus on getting this to work with tmLQCD first and then generalize the original function from there. As a result, the present PR will be temporary and only to share progress.

kostrzewa commented 1 year ago

The problem https://github.com/lattice/quda/pull/1330#issuecomment-1285797127 is clearly fixed by removing the extra qParam.x[0] /= 2;.

However, the next problem crops up when the Dslash is supposed to be applied:

# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: ERROR: Parity spinor volume 16384 doesn't match clover checkboard volume 16384 (rank 0, host node-03.bender, dirac_twisted_clover.cpp:43 in virtual void quda::DiracTwistedClover::checkParitySpinor(const quda::ColorSpinorField&, const quda::ColorSpinorField&) const())
# QUDA:        last kernel called was (name=N4quda15CopyColorSpinorILi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2ELb0ELb0EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEESt5tupleIJRNS_16ColorSpinorFieldERKS7_19QudaFieldLocation_sPdPKdEEEE,volume=8x16x16x8,aux=GPU-offline,vol=16384,parity=1,precision=8,order=9,Ns=4,Nc=3vol=16384,parity=1,precision=8,order=2,Ns=4,Nc=3,NonRelBasis)

and this I find quite strange since 16384 == 16384 :smile:

kostrzewa commented 1 year ago

The issue above occurs because of

  void DiracTwistedClover::checkParitySpinor(const ColorSpinorField &out, const ColorSpinorField &in) const
  {
    Dirac::checkParitySpinor(out, in);

    if (out.TwistFlavor() == QUDA_TWIST_SINGLET) {
      if (out.Volume() != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                  clover->VolumeCB());
    } else {
      //
      if (out.Volume() / 2 != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),                                     
                  clover->VolumeCB());
    }
  }

and qParam.twist_flavor not being set when quarkX and quarkP are created.

I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we should generally check more frequently whether twistFlavor is set and compatible?
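
For concreteness, the failure mode can be sketched with a self-contained stand-in for the check above (simplified names, not the actual QUDA types): when twist_flavor is left unset, the comparison falls into the doublet branch and halves the spinor volume before comparing, while the error message prints the raw volumes, which is why it reports that 16384 doesn't match 16384.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for QUDA's twist-flavor enum; TWIST_INVALID models the
// "unset" state of qParam.twist_flavor.
enum TwistFlavor { TWIST_INVALID, TWIST_SINGLET, TWIST_DOUBLET };

// Sketch of the volume comparison in DiracTwistedClover::checkParitySpinor:
// only an explicit singlet is compared one-to-one with the clover
// checkerboard volume; everything else (including an unset flavor) is
// treated as a doublet and halved first.
bool parity_volumes_match(TwistFlavor flavor, std::size_t spinor_volume, std::size_t clover_volume_cb)
{
  if (flavor == TWIST_SINGLET) return spinor_volume == clover_volume_cb;
  return spinor_volume / 2 == clover_volume_cb; // doublet branch, also hit when flavor is unset
}
```

So a singlet parity spinor with volume 16384 passes only if twist_flavor was actually set; with the flavor left invalid, 16384 / 2 != 16384 triggers the error even though the two printed volumes agree.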

kostrzewa commented 1 year ago

The next small step. Here I do not really understand where the problem comes from.

# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (timer.h:107 in peek())
 (rank 0, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
--------------------------------------------------------------------------
SaltyChiang commented 1 year ago

The next small step. Here I do not really understand where the problem comes from.

# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (timer.h:107 in peek())
 (rank 0, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
--------------------------------------------------------------------------

I got a similar error while trying to use computeCloverForceQuda. I checked the CUDA documentation and found that cudaEventSynchronize should not return cudaErrorIllegalAddress, which is pretty strange.

When I disable autotuning by setting QUDA_ENABLE_TUNING=0, the same error code (cudaErrorIllegalAddress) shows up in another place.

kostrzewa commented 1 year ago

I suspect that the problem is in the spinor comms of computeCloverForce and the error message is erroneous. I just realised I've been running with a RELEASE build of QUDA and will investigate with a DEBUG build first of all :)

kostrzewa commented 1 year ago

Thread 1 "hmc_tm" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4320, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
quda::Kernel1D<quda::Exterior, quda::CloverForceArg<double, 3, (QudaReconstructType_s)18, 3>, false><<<(64,1,1),(32,1,1)>>> ()
    at /home/bkostrze/code/quda-tm_force/lib/../include/color_spinor.h:1017 in _ZN4quda18outerProdSpinTraceIdLi3ELi4EEENS_6MatrixINS_7complexIT_EEXT0_EEERKNS_11ColorSpinorIS3_XT0_EXT1_EEES9_ inlined from color_spinor.h:115

it seems that the error does not quite occur where it is reported above, as far as I can tell from a run with cuda-gdb.

kostrzewa commented 1 year ago

Thanks for the tip about disabling tuning @SaltyChiang, this makes it quite obvious where the issue is:

# QUDA: ERROR: qudaDeviceSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (clover_outer_product.cu:100 in exchangeGhost())
 (rank 2, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)

kostrzewa commented 1 year ago

I think I can confirm that the error in https://github.com/lattice/quda/pull/1338#issuecomment-1321203371 stems from exchangeGhost of computeCloverForce since it runs without erroring out with a single GPU (this is output from tmLQCD's https://github.com/etmc/tmLQCD/tree/quda_work_tm_force branch running a simple HMC with a twisted clover determinant):

# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: gamma5(tmp, x.Even())
# QUDA: M(p.Even(), tmp)
# QUDA: Dslash(p.Odd(), p.Even(), QUDA_ODD_PARITY)
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: computeCloverSigmaTrace(oprod, *cloverPrecise, k_csw_ov_8);
# QUDA: computeCloverSigmaOprod(oprod, quarkX, quarkP, ferm_epsilon)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_ODD_PARITY)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_EVEN_PARITY)
# QUDA: updateMomentum(gpuMom, -1.0, cudaForce, "tmclover")
# TM_QUDA: Time for computeTMCloverForceQuda 3.765968e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/computeTMCloverForceQuda
maddyscientist commented 1 year ago

I will get more involved with this debug in the next day or so. One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to be told exactly which kernel is misbehaving by QUDA’s error checking.

kostrzewa commented 1 year ago

One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to be told exactly which kernel is misbehaving by QUDA’s error checking.

Thanks a lot, I keep forgetting about this...

On 2 GPUs:

# QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)

maddyscientist commented 1 year ago

The issue above occurs because of

  void DiracTwistedClover::checkParitySpinor(const ColorSpinorField &out, const ColorSpinorField &in) const
  {
    Dirac::checkParitySpinor(out, in);

    if (out.TwistFlavor() == QUDA_TWIST_SINGLET) {
      if (out.Volume() != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                  clover->VolumeCB());
    } else {
      //
      if (out.Volume() / 2 != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),                                     
                  clover->VolumeCB());
    }
  }

and qParam.twist_flavor not being set when quarkX and quarkP are created.

I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we should generally check more frequently whether twistFlavor is set and compatible?

To be honest I never really liked this design where a singlet fermion field is distinct from a non-twisted field. The dimensions are identical, and the field can be used as intended, so I don't see the need for the separate field type. I'm not sure what the original design motivation was for this (perhaps @alexstrel knows, as I think he was the originator, but this may be lost in the mists of time?)

A better approach here may be to dump the distinction between QUDA_TWIST_NO and QUDA_TWIST_SINGLET and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?
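
A sketch of what that relaxed check might look like (a hypothetical helper with a simplified enum, not a QUDA API): untwisted and singlet fields pass through identically, and only the two-flavor doublet needs special handling.

```cpp
#include <cassert>
#include <cstddef>

// Simplified twist-flavor enum for illustration only.
enum Flavor { FLAVOR_NO, FLAVOR_SINGLET, FLAVOR_DOUBLET };

// Proposed relaxation: FLAVOR_NO and FLAVOR_SINGLET are interchangeable, so
// the check only branches on whether the field is a doublet (which carries
// twice the sites per parity).
bool volumes_compatible(Flavor flavor, std::size_t spinor_volume, std::size_t clover_volume_cb)
{
  std::size_t sites_per_flavor = (flavor == FLAVOR_DOUBLET) ? spinor_volume / 2 : spinor_volume;
  return sites_per_flavor == clover_volume_cb;
}
```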

maddyscientist commented 1 year ago

One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to be told exactly which kernel is misbehaving by QUDA’s error checking.

Thanks a lot, I keep forgetting about this...

On 2 GPUs:

# QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)

Ok, so this suggests the issue isn't the exchangeGhost but rather the CloverForce kernel. Moreover, we can see from the signature that it's the halo kernel (that's what the exterior part means). So that would explain why it's a multi-GPU issue. Maybe I missed it, but what happens when you run on a single GPU?

kostrzewa commented 1 year ago

Maybe I missed it, but what happens when you run on a single GPU?

It works fine (I haven't checked yet whether it's correct, but at least it doesn't crash). I will be able to work some more on this next week hopefully and make some more concrete progress.

SaltyChiang commented 1 year ago

I think the problem is in exchangeGhost.

Please see this commit in #1339.

https://github.com/lattice/quda/blob/8e1504286db9e9aa48bebcb858176bf0dcf66ed6/lib/clover_outer_product.cu#L145-L158

After using the exchangeGhost in ColorSpinorField instead of the one in clover_outer_product.cu, this function no longer raises cudaErrorIllegalAddress on 2 GPUs.

The exchangeGhost in clover_outer_product.cu seems to make inB.Ghost() invalid for the device function, though I still don't know why. At least we can go on to the next step.

kostrzewa commented 1 year ago

A better approach here may be to dump the distinction between QUDA_TWIST_NO and QUDA_TWIST_SINGLET and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?

I agree. The name is also fundamentally a bit confusing, since I would consider the "flavour" to correspond to either +mu or -mu, or equivalently +r or -r in the "physical basis", rather than the distinction between single-flavour and doublet. That is a whole other story, however, and the present solution works well in practice.

On the other hand, having a flag indicating that one is dealing with a twisted mass operator (rather than having to check whether mu is non-zero, against best practice, even though this is also done here and there) is not a bad thing. This fact is useful in a couple of places:

  1. clover field and its inverse
  2. implicit use in transfer->Vectors().TwistFlavor() (I think)

Before getting rid of it, perhaps it's worth checking whether this would introduce lots of local and equivalent variables (likely bool twisted = param.mu != 0.0, or a safer variant thereof). One might conclude, of course, that having these defined locally is better for one reason or another.
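
One possible spelling of such a local flag and its "safer variant" (a hypothetical helper for illustration, not existing QUDA code) avoids the exact floating-point comparison against zero:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: the naive flag would be `bool twisted = param.mu != 0.0;`.
// A safer variant compares |mu| against a small tolerance instead of relying
// on exact floating-point equality.
inline bool is_twisted(double mu, double tol = 1e-15) { return std::fabs(mu) > tol; }
```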

maddyscientist commented 1 year ago

All good points @kostrzewa regarding TwistFlavour(). I will have a think about this and some hacking and see what solution makes sense.

I've taken a look at the clover_outer_product.cu file a bit more to see if I could visually isolate what's going on here. One thing I do see is that the halo exchange code is rather nasty, and is badly in need of an update to clean it up. It wouldn't surprise me if this code has bit rotted, and that in turn could lead to the crashes you were seeing, e.g., if the ghost buffers weren't being sized correctly, or some such. @SaltyChiang tried replacing it with ColorSpinorField::exchangeGhost in #1339 but that isn't quite the correct thing to do, at least in isolation, since kernel changes would be needed as well.

May I suggest the following (which I think is what you were trending towards anyway):

  • we focus on getting single GPU to work correctly
  • once we have that, it will make debugging the multi-GPU variant much easier and more self contained
  • moreover, in order to remove the dynamic clover aspect as a source of error, we focus on non-dynamic clover

What's the best way for me to run this function myself? I assume you're calling this from tmLQCD?

kostrzewa commented 1 year ago

May I suggest the following (which I think is what you were trending towards anyway):

  • we focus on getting single GPU to work correctly
  • once we have that, it will make debugging the multi-GPU variant much easier and more self contained
  • moreover, in order to remove the dynamic clover aspect as a source of error, we focus on non-dynamic clover

I agree, this is exactly what I was trying to do to get properly familiar with this part of QUDA and to iron out crashes first so that actual step by step testing can be done.

What's the best way for me to run this function myself? I assume you're calling this from tmLQCD?

If you'd like to test right away, I can send instructions on how to compile tmLQCD+QUDA and to provide an input file which will result in the function being called in the derivative of a twisted clover determinant monomial.

In the meantime perhaps we can get started on a test harness in QUDA in the hope that we can get this under test coverage eventually.

kostrzewa commented 1 year ago

Disregard what was written in https://github.com/lattice/quda/pull/1338#issuecomment-1324701045 before about inv_param not being correct. We just checked with @Marcogarofalo and in fact all should be fine.

maddyscientist commented 1 year ago

Getting started on a test harness in QUDA does sound like a good immediate step. That would then allow, for example, me to focus on crashes with multi-GPU.

kostrzewa commented 1 year ago

@Marcogarofalo as discussed I merged in @SaltyChiang's additions from #1339 such that we can proceed step by step here.

maddyscientist commented 1 year ago

computeTMCloverForceQuda now gives the same result as tmLQCD

that sounds like a good thing 😄 🍾

Marcogarofalo commented 1 year ago

Thanks, but it is still working only with one MPI process.

maddyscientist commented 1 year ago

Thanks, but it is still working only with one MPI process.

In that case, wondering if now is a good time to work on an internal QUDA test for this function? With that done, I could then focus on making the multi-GPU results consistent with the single-GPU. @kostrzewa you noted before you could add a numerical derivative for this.

kostrzewa commented 1 year ago

In that case, wondering if now is a good time to work on an internal QUDA test for this function? With that done, I could then focus on making the multi-GPU results consistent with the single-GPU. @kostrzewa you noted before you could add a numerical derivative for this.

@maddyscientist I would very much like to proceed on this but until the beginning of March I have a number of important deadlines that require full dedication unfortunately.

maddyscientist commented 12 months ago

@Marcogarofalo since it seems you have a working test now, and the code is working against tmLQCD, does this mean this is ready for review now?

Marcogarofalo commented 11 months ago

Hi, thanks for the reply. I am unsure what requirements to meet before starting a review. Maybe I can do a first clean-up before the review starts.

maddyscientist commented 11 months ago

@Marcogarofalo ideally we have:

I'm happy to help you get this over the line. Most of the hard work is done I think.

maddyscientist commented 11 months ago

Ok, I've started to work on it, to address some of the issues. I've found a few limitations that would be good to fix:

  • Add support for Wilson-clover, e.g., no twisted mass term
  • Support for even-even / odd-odd preconditioning, and not just asymmetric versions
  • Test code hard codes sloppy, refinement and precondition preconditions (I can fix this)

maddyscientist commented 11 months ago

Also, I am finding that if I run the test command line above:

mpirun -n 2  ./tests/TMCloverForce_test --dslash-type clover --compute-clover 1 --matpc even-even-asym --dim  4 4 4 4 --prec double --gridsize 2 1 1 1 --niter 1 --compute-clover-trlog 1 --verbosity verbose --kappa 1 --mu 1 --clover-csw 1

I get an error:

clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
       last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)

But I guess you don't get this problem?

SaltyChiang commented 11 months ago

I get an error:

clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
       last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)

@maddyscientist This often happens when I'm trying to build the clover term for a random gauge generated by gaussGaugeQuda. No idea why this happens, but decreasing sigma works.

And I noticed @Marcogarofalo calls computeCloverForce with ... quarkP, quarkX, ... instead of ... quarkX, quarkP (the latter should be the correct order as described by the docstring). I tried to do a similar thing (you can see the modification here, which is almost a swap between A,B and C,D), and got a reasonable deltaH in a clover HMC process. If this is really an issue, maybe we should change the implementation of computeCloverForce to match the docstring.

Also, the dynamic inverse of the clover term in clover_trace.cuh now differs from the one in clover_invert.cuh. I'm wondering what will happen if we disable the dynamic clover term?

Marcogarofalo commented 11 months ago

@maddyscientist I do not get this problem, maybe setting --compute-clover-trlog 0 will avoid the problem for the moment.

@SaltyChiang I saw https://github.com/CLQCD/quda/commit/b47950dd3026e2356298e41199e235d68f593746 but then also the exterior kernel needs to be modified with the same swap.

SaltyChiang commented 11 months ago

@Marcogarofalo The modification I made in that commit is not a good one; I just used it to make HMC work on a single GPU. I think it's better to swap x and p here.

Marcogarofalo commented 11 months ago

Ok, I've started to work on it, to address some of the issues. I've found a few limitations, that would be good to fix:

  • Add support for Wilson-clover, e.g., no twisted mass term

Adding support for Wilson-clover should be possible; in principle just setting mu=0 should work. A remaining question is whether to use the Hermitian operator or not.

  • Support for even-even / odd-odd preconditioning, and not just asymmetric versions

I can write up the symmetric version, but I do not have any existing reference implementation to test it against.

  • Test code hard codes sloppy, refinement and precondition preconditions (I can fix this)

Regarding the error

Also, I am finding that if I run the test command line above:

mpirun -n 2  ./tests/TMCloverForce_test --dslash-type clover --compute-clover 1 --matpc even-even-asym --dim  4 4 4 4 --prec double --gridsize 2 1 1 1 --niter 1 --compute-clover-trlog 1 --verbosity verbose --kappa 1 --mu 1 --clover-csw 1

I get an error:

clover.TrLog()[0]=-nan, clover.TrLog()[1]=-nan
ERROR: Clover trlog has returned -nan, likey due to the clover matrix being singular. (rank 0, host nvsocal2, clover_invert.cu:26 in quda::CloverInvert<store_t>::CloverInvert(quda::CloverField&, bool) [with store_t = double]())
       last kernel called was (name=N4quda12CloverInvertIdEE,volume=4x4x4x4,aux=GPU-offline,vol=256precision=8Nc=3,trlog=true,twist=false)

But I guess you don't get this problem?

I did not get the problem because I was using --dslash-type twisted-clover; with --dslash-type clover the value of kappa used is too big. With the default kappa value --kappa 0.121951 the test runs, but it fails.

Marcogarofalo commented 11 months ago

Ok, I've started to work on it, to address some of the issues. I've found a few limitations, that would be good to fix:

  • Add support for Wilson-clover, e.g., no twisted mass term
  • Support for even-even / odd-odd preconditioning, and not just asymmetric versions

The problem with even-even / odd-odd symmetric preconditioning is that the force is not obtained from the asymmetric version with a change of the operator, but more algebra is needed, at least from what I understand from Appendix A of https://arxiv.org/pdf/hep-lat/0112051.pdf. I do not have a reference implementation in tmLQCD to test against it. I would prefer to return an error if computeTMCloverForceQuda is called with even-even / odd-odd preconditioning, and in the future someone interested can implement it. Does this strategy seem reasonable to you?
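
The error-out strategy could be as simple as this guard sketch (a simplified enum for illustration, not the actual QudaMatPCType values):

```cpp
#include <cassert>

// Simplified preconditioning types for illustration only.
enum MatPC { MATPC_EVEN_EVEN, MATPC_ODD_ODD, MATPC_EVEN_EVEN_ASYM, MATPC_ODD_ODD_ASYM };

// Sketch of the proposed guard: the force derivation is only implemented for
// the asymmetric variants, so anything else should be rejected up front
// (in QUDA proper this would be an errorQuda call rather than a bool).
bool matpc_supported_by_force(MatPC matpc)
{
  return matpc == MATPC_EVEN_EVEN_ASYM || matpc == MATPC_ODD_ODD_ASYM;
}
```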

maddyscientist commented 10 months ago

The problem with even-even / odd-odd symmetric preconditioning is that the force is not obtained from the asymmetric version with a change of the operator, but more algebra is needed, at least from what I understand from Appendix A of https://arxiv.org/pdf/hep-lat/0112051.pdf. I do not have a reference implementation in tmLQCD to test against it. I would prefer to return an error if computeTMCloverForceQuda is called with even-even / odd-odd preconditioning, and in the future someone interested can implement it. Does this strategy seem reasonable to you?

Understood @Marcogarofalo. Returning an error for the symmetric case is fine I think.

Marcogarofalo commented 10 months ago

Understood @Marcogarofalo. Returning an error for the symmetric case is fine I think.

I implemented it. Maybe it can help if I list the problems that I did not manage to solve:

kostrzewa commented 9 months ago

@maddyscientist Do you think that we could discuss the first point that @Marcogarofalo raises above in https://github.com/lattice/quda/pull/1338#issuecomment-1790508776 about the ghost exchange? We just discussed and he suggested that perhaps splitting the computation to do one checkerboard first and then the other would provide an easy fix for the fact that currently communication is done twice. Our next "physics" target is to get the force working for our rational approximation for the "1+1" sector (which uses the two-flavour ND operator) as this is the single largest remaining bottleneck and it would be great if we could benefit there from doing the least amount of comms.

maddyscientist commented 9 months ago

@kostrzewa yes this would be good to discuss. I'm planning to push a few updates shortly to this branch, namely to do some cleanup and merge in the latest develop branch. Let's discuss this after I've done this (which I'll hopefully get to in the next day or two).

maddyscientist commented 9 months ago

@kostrzewa @Marcogarofalo I've done further testing and cleanup:

Regarding the redundant communications in exchangeGhost: can you explain what unnecessary comms are being done here?

I've found a few outstanding issues that need to be fixed:

  • The test should test multiple quark fields, e.g., as in RHMC
  • The test should test with a determinant ratio (ctest should automatically test both paths)
  • There's a segmentation fault when enabling partitioning: segmentation is in the test code itself (e.g., run with --partition 12 on a single GPU)

Marcogarofalo commented 9 months ago

@maddyscientist thank you for the help.

Regarding the redundant communications in exchangeGhost: can you explain what unnecessary comms are being done here?

I think that it is here https://github.com/lattice/quda/blob/901ddedfb572385c4f6c8b1a14df392268f70cf7/lib/clover_outer_product.cu#L153-L156

It looks like there is an exchange with inB.exchangeGhost and then another with exchangeGhost; I do not fully understand this part, so maybe I am wrong. With only line 154 the kernel gives CUDA_ERROR_ILLEGAL_ADDRESS. I added line 153 in https://github.com/lattice/quda/pull/1338/commits/1e1af7f561c266efbda76e1a604218b13b30dc22. I also noticed that swapping any of the lines above gives a wrong result.

I've found a few outstanding issues that it need to be fixed:

  • The test should test multiple quark fields, e.g., as in RHMC

Sorry, I was ignoring this part because in tmLQCD the RHMC is different: it needs the non-degenerate operator. I can work on it.

  • The test should test with a determinant ratio (ctest should automatically test both paths)

Ok.
  • There's a segmentation fault when enabling partitioning: segmentation is in the test code itself (e.g., run with --partition 12 on a single GPU)

I will have a look

Marcogarofalo commented 9 months ago

@maddyscientist I hope I fixed the issues you listed. I see a failure in the CI, but looking at the logs I see slurmstepd: error: *** STEP 50331581.1 ON nid04394 CANCELLED AT 2023-12-05T20:00:47 DUE TO TIME LIMIT ***

maddyscientist commented 9 months ago

@Marcogarofalo Thanks for the latest fixes. All looks good, and I wouldn't worry about the CI failure at CSCS - this target is a bit temperamental and sometimes takes longer than the maximum time, hence the failure.

I've identified the issue with the exchange, and am working on a fix now.

maddyscientist commented 9 months ago

Ok, I have pushed the fix for the quark field exchange. Issue was that the ghost pointers in the accessor were not being set correctly. This was caused by two ghost pointer getter methods: ColorSpinorField::Ghost() and ColorSpinorField::Ghost2(), which do slightly different things (some long-term cruft that I need to address, and unify these two).

I've also pushed some rudimentary OMP parallelization of the host test code which allows for easier testing of larger problem sizes.

maddyscientist commented 9 months ago

@Marcogarofalo @kostrzewa what from your end needs to be done on this PR? I assume the non-degenerate force is for a later PR?

Marcogarofalo commented 9 months ago

@maddyscientist I also think that it is better to keep the non-degenerate force for another PR. My only remaining doubt here is about the reference implementation: I used the device routine to create the extended version of oprod (https://github.com/lattice/quda/blob/c1ee1f7d0df32ad4329d6ef4531d5bc37f7b69a4/tests/host_reference/TMCloverForce_reference.cpp#L236) because I didn't manage to make the host version work with QUDA_TENSOR_GEOMETRY. I am not sure whether the current state is acceptable.

maddyscientist commented 9 months ago

Thanks @Marcogarofalo for your thoughts. I'll take a look at the reference code, though I don't view this as critical.

I'm curious: compared to the reference tmLQCD implementation, how is QUDA performance looking? I'll push some more performance updates to this branch shortly.

Marcogarofalo commented 9 months ago

From a quick test on a 64^3 x 128 lattice using 4 nodes of juwels-booster https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html, assuming that I am using the machine in a reasonable way, I get

  • tmLQCD native implementation 3.60 s
  • computeTMCloverForceQuda 0.35 s

It is not a physical ensemble; however, that should not matter for this test.

urbach commented 9 months ago

From a quick test on a 64^3 x 128 lattice using 4 nodes of juwels-booster https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html assuming that I am using the machine in a reasonable way I get

  • tmLQCD native implementation 3.60 s
  • computeTMCloverForceQuda 0.35 s

wow! excellent!