Closed SaltyChiang closed 6 months ago
@cpviolator any chance you can review this?
@Jenkins test this please
@maddyscientist Sure thing!
@SaltyChiang
This is excellent work! I spent some time revising the HYP algorithm and your implementation here in QUDA is very much in line with QUDA coding style. I very much like the implementation of arbitrary orthogonal dimension for smearing.
I tried to find ways to reduce code repetition by using the existing staple computation, but the differences between APE style and Stout style means there's very little overlap in the code. The only code change I would like to request for the moment is that you place the templates with name structure computeStaple(3D)Level(N)
in `gauge_utils.cuh. In the future we will likely implement analytic smearing for HMC or other smearing algorithms that could make use of the general HYP templates.
Before merging with develop I'd like to do a performance test of the kernels to see if we're not missing any obvious speed-up we can take advantage of with minimal code changes. I'll post here with the results.
OK, using the command:
for SMEAR in ape stout ovrimp-stout hyp wilson symanzik ; do
./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 5 > log_${SMEAR}_perf.log
wait
./su3_test --verbosity verbose --su3-smear-type ${SMEAR} --verify false --dim 32 32 32 32 --prec double --su3-smear-steps 500 > log_${SMEAR}_WC.log
wait
done
I collected the following data: | SMEAR TYPE | Kernel Perf (Gflop/s) | Bandwidth (GB/s) | Compute Wall Clock (secs) |
---|---|---|---|---|
APE | 1505.78 | 1564.45 | 4.729 | |
Stout | 1860.01 | 1932.48 | 4.402 | |
OvrImp Stout | 2956.79 | 1850.34 | 13.422 | |
Wilson Flow Step 1 | 1421.67 | 1590.68 | - | |
Wilson Flow Step 2 | 1271.89 | 1565.40 | - | |
Wilson Flow Step 3 | 1573.99 | 1849.16 | - | |
Wilson Flow time | - | - | 13.137 | |
Symanzik Flow Step 1 | 2625.85 | 1662.57 | - | |
Symanzik Flow Step 2 | 2569.70 | 1670.99 | - | |
Symanzik Flow Step 3 | 2720.21 | 1745.59 | - | |
Symanzik Flow time | - | - | 42.275 | |
HYP Step 1 | 838.20 | 745.07 | - | |
HYP Step 2 | 702.43 | 681.14 | - | |
HYP Step 3 | 933.52 | 969.89 | - | |
HYP time | - | - | 33.840 |
So I think there may be some improvements for the future, but certainly nothing holding up a PR. It may require a slightly different approach to kernel launches and memory transfer, or something as simple as revising the flop/byte count in statistics.
Either way, this is a great addition to the QUDA library!
@cpviolator Thank you for the testing. I'll check the performance issue.
The performance regression comes from (using HYP<Arg::level=1>
version as the example):
cnt
in computeStapleLevel(N)
cannot be unrolled by the compiler. Using staple[4] instead of staple[3] as the input argument (then we can use staple[nu]
instead of staple[cnt]
) can significantly improve the performance of the computeStapleLevel1
part.If I use staple[4]
to get staples and then just save one staple (not real HYP, but for testing) in the end, I see the performance of HYP<Arg::level=1>
~2x that of stout. Maybe we could use 4 vector gauges instead of 2 tensor gauges to save these staples? (though we would waste 1/4 of the allocated memory)
Please let me know if you have any ideas!
I had the same I idea and I've flattened out the loops as best as I can, but now I'm chasing a numerical error.
@SaltyChiang may I ask that you join us on slack? @maddyscientist can arrange the invite if so.
@SaltyChiang can you share the Chroma code that you are using to call this? It would be great to have an example of how to call the smearing routines in QUDA with Chroma. Thanks.
@maddyscientist Oh, I don't use Chroma to call the smearing routines in QUDA.
I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.
The checking follows the steps:
LINK_SMEAR
with <LinkSmearingType>HYP_SMEAR</LinkSmearingType>
hyp_smear.lime
and make diff of the two smeared field@SaltyChiang there are a few convention changes I want to put into your PR, plus reconcile the merge conflicts and perform a clang-format
. I think I see how to address some of the perf issues as well. I'll create a local copy of your branch, put this in, and then you can merge it into your branch and check it---if all's good, we can then merge it into develop
. Does that sound like a good plan to you?
@weinbe2 Sure, this sounds good to me. Glad to hear some progress on the performance issue.
Using the same command as Dean in my branch---on a different GPU (A100-80GB-PCIe), I got:
SMEAR TYPE | Kernel Perf (Gflop/s) | Bandwidth (GB/s) | Compute Wall Clock (secs) |
---|---|---|---|
APE | 2351.23 | 2442.84 | 2.162 |
Stout | 3401.47 | 3534.00 | 1.700 |
OvrImp Stout | 5344.44 | 3344.51 | 6.556 |
Wilson Flow Step 1 | 3349.50 | 3747.69 | - |
Wilson Flow Step 2 | 3172.13 | 3915.25 | - |
Wilson Flow Step 3 | 3290.95 | 3872.04 | - |
Wilson Flow time | - | - | 5.559 |
Symanzik Flow Step 1 | 5285.13 | 3346.31 | - |
Symanzik Flow Step 2 | 5156.00 | 3355.53 | - |
Symanzik Flow Step 3 | 5184.04 | 3328.04 | - |
Symanzik Flow time | - | - | 20.648 |
HYP Step 1 | 1744.10 | 1550.31 | - |
HYP Step 2 | 1580.78 | 1532.87 | - |
HYP Step 3 | 2205.11 | 2291.02 | - |
HYP time | - | - | 13.763 |
HYP time is now ~66% of Symanzik time as opposed to ~75%, but it's hard to say if that's due to optimization or because A100 has a larger cache... in any case, it's relatively faster.
@SaltyChiang I'm done with my updates---can you please merge my branch (lattice:feature/gauge-smear
) into your local branch and give it a fresh comparison against Chroma? As far as my tests have gone (both single GPU and partitioned) I've preserved correctness, but an independent check is always good.
Assuming it passes your Chroma comparison and everything else looks good, I imagine we can get this merged in.
@weinbe2 I checked the performance again before/after merging your additions on Tesla P100.
SMEAR TYPE | Kernel Perf (Gflop/s) | Bandwidth (GB/s) | Compute Wall Clock (secs) |
---|---|---|---|
APE | 973.97 | 1011.91 | 7.85193 |
Stout | 1343.91 | 1396.27 | 7.08628 |
Ovrimp-stout | 1717.53 | 1074.82 | 23.5821 |
Wilson flow 1 | 1206.77 | 1350.23 | - |
Wilson flow 2 | 1084.00 | 1337.95 | - |
Wilson flow 3 | 1194.42 | 1405.32 | - |
Wilson flow time | - | - | 15.9663 |
Symanzik flow 1 | 1601.43 | 1013.95 | - |
Symanzik flow 2 | 1582.12 | 1029.65 | - |
Symanzik flow 3 | 1619.81 | 1039.88 | - |
Symanzik flow time | - | - | 68.0921 |
HYP 1 before | 609.19 | 541.50 | - |
HYP 2 before | 522.48 | 506.65 | - |
HYP 3 before | 721.51 | 749.62 | - |
HYP time before | - | - | 47.2621 |
HYP 1 after | 703.21 | 625.07 | - |
HYP 2 after | 559.32 | 542.37 | - |
HYP 3 after | 750.97 | 780.22 | - |
HYP time after | - | - | 38.4669 |
And yes, the new code returns the same result as Chroma. Again, they differ by 1e-6~1e-7, which comes from the different SU(3) projection algorithms. I plan to implement a similar projection as Chroma in quda just for checking, do you think this is acceptable to push such a thing upstream?
@cpviolator Do you have any further ideas about the performance issue?
Thanks @SaltyChiang , I'm glad it passes the correctness check on your end. I'd like to get this merged in sooner as opposed to later, further optimizations can go into a future PR.
As for the Chroma algorithm---I like the idea of checking it locally, and if you see a deviation file after that please file an issue and we can go from there.
@Jenkins ok to test
@SaltyChiang I expect once we fix #1436, this will give a boost to double-precision performance of all the smearing routines that use SU(3) projection. We'll handle this in a follow up PR though.
@SaltyChiang I had to put in a few quick updates to address the CI issues---can you please merge my branch back into yours again?
@SaltyChiang @weinbe2 I think these improvements are great and an excellent contribution! As we can all see from the algorithm, there are some unavoidable aspects when it comes to memory that prohibit HYP from performing as well as STOUT based algorithms or vanilla APE.
@weinbe2 @maddyscientist Done with merge. The default value of dir_ignore
leads to 4D smearing for all smear_type
. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?
@weinbe2 @maddyscientist Done with merge. The default value of
dir_ignore
leads to 4D smearing for allsmear_type
. Shall we change the default value to 3 for the stout and APE smearing as they were applied space only before?
It might be best to preserve that behavior, yes. Thoughts @cpviolator ? You're the real-world smearing workflow expert here
@weinbe2 yeah, I think that most users of non analytical smearing in QUDA are using it for spectroscopic measurement so defaulting to excluding the temporal dimension would be best in my opinion. A wiki addition on this behaviour would be good.
Thanks @cpviolator --- @SaltyChiang , you should change the default to 3 per that recommendation. Creating a wiki page can be done orthogonal to this PR.
@weinbe2 @cpviolator Change the default value of dir_ignore
to -1, and any negative value will apply APE and STOUT smearing in spatial directions only. The corresponding comments are also added.
Use
QudaGaugeSmearParam::dir_ignore
to tell QUDA not to perform smearing in this direction.I've checked the result with Chroma, which looks good. There are some residuals caused by different SU(3) projection implementations.
We might should set different default values of
dir_ignore
for different smearing types to make the interface compatible with before.