lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

HISQ fat-link generation has massive performance degradation on Maxwell #324

Closed: maddyscientist closed this issue 5 years ago

maddyscientist commented 9 years ago

MILC reports a 32x performance reduction versus an equivalent Kepler. While Maxwell doesn't have strong DP performance, this regression is significantly higher than expected.
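For context, a rough peak-throughput comparison of the two architectures (all figures below are assumed from public spec sheets, not measurements from this issue): Kepler GK110/GK210 executes FP64 at 1/3 of its FP32 rate, while Maxwell GM204 executes FP64 at 1/32, so even a purely FP64-compute-bound kernel would be expected to lose roughly an order of magnitude, not 32x.

```cpp
// Back-of-the-envelope sketch only; the peak numbers below are assumed
// board-level figures for illustration, not measurements from this issue.
#include <cstdio>

int main()
{
  const double k80_fp32_tflops   = 5.6;        // K80 board (both GK210 dies), base clock
  const double gm204_fp32_tflops = 4.6;        // GTX 980-class GM204
  const double kepler_dp_ratio   = 1.0 / 3.0;  // GK110/GK210 FP64:FP32 rate
  const double maxwell_dp_ratio  = 1.0 / 32.0; // GM204 FP64:FP32 rate

  const double k80_dp   = k80_fp32_tflops * kepler_dp_ratio;
  const double gm204_dp = gm204_fp32_tflops * maxwell_dp_ratio;
  std::printf("peak FP64: K80 ~%.2f TFLOPS, GM204 ~%.2f TFLOPS, ratio ~%.0fx\n",
              k80_dp, gm204_dp, k80_dp / gm204_dp);
  return 0;
}
```

Under those assumptions the peak-FP64 gap is on the order of 10x; the measured 32x is well beyond it, which is what makes this look like a software regression rather than a pure hardware limit.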

mathiaswagner commented 9 years ago

If you have any further information I can try to look into that. Where was that seen? I will try to reproduce that with our internal tests on K40 / Titan X.

maddyscientist commented 9 years ago

It was seen on a MILC RHMC benchmark, comparing a K80 to dual GM204s. For the CG solver, gauge force, and fermion force, the Maxwells were 0.6-0.8x the performance of the K80, but fat-link was only 0.03x.

mathiaswagner commented 9 years ago

Ok, I assume all double precision. Do you happen to know the lattice size?

maddyscientist commented 9 years ago

Yes, all DP. I suspect this will be easy to reproduce, and it will occur for all lattice sizes.

mathiaswagner commented 9 years ago

My first quick impression was that fat-link is OK and unitarization is slower, but not by a factor of 0.03. I have to ask the 'Was tuning active?' question. Will continue the investigation tomorrow.

maddyscientist commented 9 years ago

Ah, it could be the SVD algorithm that is killing performance then. I don't know what the default is, but isn't this invoked depending on the parameters and on how singular the matrix is?
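For readers unfamiliar with this step: the reunitarization projects each fat link W to U = W (W†W)^{-1/2}, and the SVD path is a fallback for when W†W is too close to singular for the analytic inverse square root to be reliable. Below is a minimal single-link sketch of that structure, using Eigen purely for illustration; the conditioning test and the threshold name are made up, and this is not QUDA's actual code path.

```cpp
// Illustrative sketch (not QUDA's implementation): unitarize one 3x3 link
// W -> U = W * (W^dag W)^{-1/2}, falling back to a full SVD when W^dag W
// looks too close to singular for the analytic route to be trustworthy.
#include <Eigen/Dense>
#include <cmath>
#include <complex>

using Mat3 = Eigen::Matrix3cd;

Mat3 unitarize_link(const Mat3 &W, double svd_threshold = 1e-6 /* made up */)
{
  const Mat3 Q = W.adjoint() * W;  // Hermitian, positive semi-definite

  // Crude conditioning check: tiny det(Q) relative to its trace scale
  // signals a (near-)singular link, so take the SVD path instead.
  const double det   = std::abs(Q.determinant());
  const double scale = std::pow(Q.trace().real() / 3.0, 3);

  if (det > svd_threshold * scale) {
    // Fast path: build Q^{-1/2} from the eigen-decomposition of Q
    // (production code typically uses an analytic/Cayley-Hamilton form).
    Eigen::SelfAdjointEigenSolver<Mat3> es(Q);
    Eigen::Vector3cd inv_sqrt =
        es.eigenvalues().cwiseSqrt().cwiseInverse().cast<std::complex<double>>();
    return W * (es.eigenvectors() * inv_sqrt.asDiagonal() * es.eigenvectors().adjoint());
  } else {
    // Slow path: W = U S V^dag  =>  W (W^dag W)^{-1/2} = U V^dag.
    Eigen::JacobiSVD<Mat3> svd(W, Eigen::ComputeFullU | Eigen::ComputeFullV);
    return svd.matrixU() * svd.matrixV().adjoint();
  }
}
```

The SVD branch is far more expensive per link, so a parameter regime (or a bug) that forces it for every link would show up exactly as a large unitarization slowdown.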


mathiaswagner commented 9 years ago

One observation: my question about tuning was pointless, as the code for the HISQ link does not seem to use tuning at all. At least, no entries show up in the tunecache, and when starting without a tunecache none is created.
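For context on what a tunecache entry means here: the autotuner benchmarks candidate launch configurations for each kernel once and records the winner, so a kernel that never goes through the tuning wrapper leaves no entry behind, which matches what is seen above. A generic sketch of that idea (this is not QUDA's Tunable interface; all names below are made up):

```cpp
// Hedged sketch of the autotuning idea referenced above (not QUDA's actual
// tuning interface): time a kernel over a set of candidate launch
// configurations once, then cache the winner keyed by kernel name and
// problem size so later calls skip the search.
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct LaunchConfig { int block_size; };

static std::map<std::string, LaunchConfig> tune_cache;  // the "tunecache"

LaunchConfig tune_launch(const std::string &key,
                         const std::vector<int> &candidate_blocks,
                         const std::function<void(LaunchConfig)> &run)
{
  auto it = tune_cache.find(key);
  if (it != tune_cache.end()) return it->second;   // already tuned

  LaunchConfig best{candidate_blocks.front()};
  double best_time = 1e300;
  for (int b : candidate_blocks) {
    LaunchConfig cfg{b};
    auto t0 = std::chrono::steady_clock::now();
    run(cfg);                                       // launch + synchronize
    auto t1 = std::chrono::steady_clock::now();
    double dt = std::chrono::duration<double>(t1 - t0).count();
    if (dt < best_time) { best_time = dt; best = cfg; }
  }
  tune_cache[key] = best;                           // record in the cache
  return best;
}
```

With no tuned configuration cached, every launch falls back to whatever default was compiled in, which can hide architecture-specific performance problems like this one.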

mathiaswagner commented 9 years ago

Running the internal tests I get the following results (24^4, double precision):

runlink.sm35:link computation time =1081.98 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =1081.92 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =929.51 ms, flops= 20.82 Gflops
runlink.sm35:link computation time =1083.12 ms, flops= 17.87 Gflops
runlink.sm52:link computation time =971.66 ms, flops= 19.92 Gflops
runlink.sm52:link computation time =1110.86 ms, flops= 17.42 Gflops
runlink.sm52:link computation time =1120.10 ms, flops= 17.28 Gflops
runlink.sm52:link computation time =1125.62 ms, flops= 17.20 Gflops

and

runlink.sm35:Unitarization time: 6.12 ms
runlink.sm35:Unitarization time: 6.125 ms
runlink.sm35:Unitarization time: 6.134 ms
runlink.sm35:Unitarization time: 6.128 ms
runlink.sm52:Unitarization time: 14.913 ms
runlink.sm52:Unitarization time: 14.933 ms
runlink.sm52:Unitarization time: 14.916 ms
runlink.sm52:Unitarization time: 14.988 ms

So I see the Titan X a bit faster than the K40 for the fat-link computation and something like a factor of 2.5 slower for the unitarization. Note that the K40 is running with ECC enabled and at 745 MHz.

More troubling is that the unitarization on the K40 reports 1 error, while it completes without error on the Titan X :-(

One more thing that might affect the unitarization time: I did not load a gauge configuration for this first test.

mathiaswagner commented 5 years ago

With a lot of the code having been rewritten since, and Maxwell no longer a common architecture for QUDA, I'm closing this.