lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda
Other
287 stars 94 forks source link

Add HYP smearing. Also add 3D/4D version of STOUT and APE smearing. #1426

Closed SaltyChiang closed 6 months ago

SaltyChiang commented 7 months ago

Use QudaGaugeSmearParam::dir_ignore to tell QUDA not to perform smearing in this direction.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

We might should set different default values of dir_ignore for different smearing types to make the interface compatible with before.

maddyscientist commented 7 months ago

@cpviolator any chance you can review this?

maddyscientist commented 7 months ago

@Jenkins test this please

cpviolator commented 7 months ago

@maddyscientist Sure thing!

cpviolator commented 7 months ago

@SaltyChiang

This is excellent work! I spent some time revising the HYP algorithm and your implementation here in QUDA is very much in line with QUDA coding style. I very much like the implementation of arbitrary orthogonal dimension for smearing.

I tried to find ways to reduce code repetition by using the existing staple computation, but the differences between APE style and Stout style means there's very little overlap in the code. The only code change I would like to request for the moment is that you place the templates with name structure computeStaple(3D)Level(N) in `gauge_utils.cuh. In the future we will likely implement analytic smearing for HMC or other smearing algorithms that could make use of the general HYP templates.

Before merging with develop I'd like to do a performance test of the kernels to see if we're not missing any obvious speed-up we can take advantage of with minimal code changes. I'll post here with the results.

cpviolator commented 7 months ago

OK, using the command:

for SMEAR in ape stout ovrimp-stout hyp wilson symanzik ; do

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 5 > log_${SMEAR}_perf.log                                                            
    wait                                                                                                                                                                                                           

    ./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 500 > log_${SMEAR}_WC.log
    wait
done
I collected the following data: SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 1505.78 1564.45 4.729
Stout 1860.01 1932.48 4.402
OvrImp Stout 2956.79 1850.34 13.422
Wilson Flow Step 1 1421.67 1590.68 -
Wilson Flow Step 2 1271.89 1565.40 -
Wilson Flow Step 3 1573.99 1849.16 -
Wilson Flow time - - 13.137
Symanzik Flow Step 1 2625.85 1662.57 -
Symanzik Flow Step 2 2569.70 1670.99 -
Symanzik Flow Step 3 2720.21 1745.59 -
Symanzik Flow time - - 42.275
HYP Step 1 838.20 745.07 -
HYP Step 2 702.43 681.14 -
HYP Step 3 933.52 969.89 -
HYP time - - 33.840

So I think there may be some improvements for the future, but certainly nothing holding up a PR. It may require a slightly different approach to kernel launches and memory transfer, or something as simple as revising the flop/byte count in statistics.

Either way, this is a great addition to the QUDA library!

SaltyChiang commented 7 months ago

@cpviolator Thank you for the testing. I'll check the performance issue.

SaltyChiang commented 7 months ago

The performance regression comes from (using HYP<Arg::level=1> version as the example):

If I use staple[4] to get staples and then just save one staple (not real HYP, but for testing) in the end, I see the performance of HYP<Arg::level=1> ~2x that of stout. Maybe we could use 4 vector gauges instead of 2 tensor gauges to save these staples? (though we would waste 1/4 of the allocated memory)

Please let me know if you have any ideas!

cpviolator commented 7 months ago

I had the same I idea and I've flattened out the loops as best as I can, but now I'm chasing a numerical error.

@SaltyChiang may I ask that you join us on slack? @maddyscientist can arrange the invite if so.

maddyscientist commented 7 months ago

@SaltyChiang can you share the Chroma code that you are using to call this? It would be great to have an example of how to call the smearing routines in QUDA with Chroma. Thanks.

SaltyChiang commented 7 months ago

@maddyscientist Oh, I don't use Chroma to call the smearing routines in QUDA.

I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.

The checking follows the steps:

weinbe2 commented 6 months ago

@SaltyChiang there are a few convention changes I want to put into your PR, plus reconcile the merge conflicts and perform a clang-format. I think I see how to address some of the perf issues as well. I'll create a local copy of your branch, put this in, and then you can merge it into your branch and check it---if all's good, we can then merge it into develop. Does that sound like a good plan to you?

SaltyChiang commented 6 months ago

@weinbe2 Sure, this sounds good to me. Glad to hear some progress on the performance issue.

weinbe2 commented 6 months ago

Using the same command as Dean in my branch---on a different GPU (A100-80GB-PCIe), I got:

SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 2351.23 2442.84 2.162
Stout 3401.47 3534.00 1.700
OvrImp Stout 5344.44 3344.51 6.556
Wilson Flow Step 1 3349.50 3747.69 -
Wilson Flow Step 2 3172.13 3915.25 -
Wilson Flow Step 3 3290.95 3872.04 -
Wilson Flow time - - 5.559
Symanzik Flow Step 1 5285.13 3346.31 -
Symanzik Flow Step 2 5156.00 3355.53 -
Symanzik Flow Step 3 5184.04 3328.04 -
Symanzik Flow time - - 20.648
HYP Step 1 1744.10 1550.31 -
HYP Step 2 1580.78 1532.87 -
HYP Step 3 2205.11 2291.02 -
HYP time - - 13.763

HYP time is now ~66% of Symanzik time as opposed to ~75%, but it's hard to say if that's due to optimization or because A100 has a larger cache... in any case, it's relatively faster.

weinbe2 commented 6 months ago

@SaltyChiang I'm done with my updates---can you please merge my branch (lattice:feature/gauge-smear) into your local branch and give it a fresh comparison against Chroma? As far as my tests have gone (both single GPU and partitioned) I've preserved correctness, but an independent check is always good.

Assuming it passes your Chroma comparison and everything else looks good, I imagine we can get this merged in.

SaltyChiang commented 6 months ago

@weinbe2 I checked the performance again before/after merging your additions on Tesla P100.

SMEAR TYPE Kernel Perf (Gflop/s) Bandwidth (GB/s) Compute Wall Clock (secs)
APE 973.97 1011.91 7.85193
Stout 1343.91 1396.27 7.08628
Ovrimp-stout 1717.53 1074.82 23.5821
Wilson flow 1 1206.77 1350.23 -
Wilson flow 2 1084.00 1337.95 -
Wilson flow 3 1194.42 1405.32 -
Wilson flow time - - 15.9663
Symanzik flow 1 1601.43 1013.95 -
Symanzik flow 2 1582.12 1029.65 -
Symanzik flow 3 1619.81 1039.88 -
Symanzik flow time - - 68.0921
HYP 1 before 609.19 541.50 -
HYP 2 before 522.48 506.65 -
HYP 3 before 721.51 749.62 -
HYP time before - - 47.2621
HYP 1 after 703.21 625.07 -
HYP 2 after 559.32 542.37 -
HYP 3 after 750.97 780.22 -
HYP time after - - 38.4669

And yes, the new code returns the same result as Chroma. Again, they differ by 1e-6~1e-7, which comes from the different SU(3) projection algorithms. I plan to implement a similar projection as Chroma in quda just for checking, do you think this is acceptable to push such a thing upstream?

@cpviolator Do you have any further ideas about the performance issue?

weinbe2 commented 6 months ago

Thanks @SaltyChiang , I'm glad it passes the correctness check on your end. I'd like to get this merged in sooner as opposed to later, further optimizations can go into a future PR.

As for the Chroma algorithm---I like the idea of checking it locally, and if you see a deviation file after that please file an issue and we can go from there.

maddyscientist commented 6 months ago

@Jenkins ok to test

@SaltyChiang I expect once we fix #1436, this will give a boost to double-precision performance of all the smearing routines that use SU(3) projection. We'll handle this in a follow up PR though.

weinbe2 commented 6 months ago

@SaltyChiang I had to put in a few quick updates to address the CI issues---can you please merge my branch back into yours again?

cpviolator commented 6 months ago

@SaltyChiang @weinbe2 I think these improvements are great and an excellent contribution! As we can all see from the algorithm, there are some unavoidable aspects when it comes to memory that prohibit HYP from performing as well as STOUT based algorithms or vanilla APE.

SaltyChiang commented 6 months ago

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

weinbe2 commented 6 months ago

@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore leads to 4D smearing for all smear_type. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?

It might be best to preserve that behavior, yes. Thoughts @cpviolator ? You're the real-world smearing workflow expert here

cpviolator commented 6 months ago

@weinbe2 yeah, I think that most users of non analytical smearing in QUDA are using it for spectroscopic measurement so defaulting to excluding the temporal dimension would be best in my opinion. A wiki addition on this behaviour would be good.

weinbe2 commented 6 months ago

Thanks @cpviolator --- @SaltyChiang , you should change the default to 3 per that recommendation. Creating a wiki page can be done orthogonal to this PR.

SaltyChiang commented 6 months ago

@weinbe2 @cpviolator Change the default value of dir_ignore to -1, and any negative value will apply APE and STOUT smearing in spatial directions only. The corresponding comments are also added.