Closed maddyscientist closed 5 years ago
If you have any further information I can try to look into that. Where was that seen? I will try to reproduce that with our internal tests on K40 / Titan X.
It was seen on a MILC RHMC benchmark, comparing a K80 to dual GM204s. For the CG solver, gauge force and fermion force, the Maxwells were 0.6-0.8x the performance of the K80, but fat-link was 0.03x the performance.
Ok, I assume all double precision. Do you happen to know the lattice size?
Yes, all DP. I suspect this will be easy to reproduce, and it will occur for all lattice sizes.
My first quick impression is that fatlink is OK and unitarization is slower, but not by a factor of 0.03. I have to ask the 'Was tuning active?' question. Will continue the investigation tomorrow.
One observation: my question about tuning was pointless, as the code for the hisqlink does not seem to use tuning at all. At least no entries show up in the tunecache, and when starting without a tunecache none is created.
Running the internal tests I get these results (24^4, double):
runlink.sm35:link computation time =1081.98 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =1081.92 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =929.51 ms, flops= 20.82 Gflops
runlink.sm35:link computation time =1083.12 ms, flops= 17.87 Gflops
runlink.sm52:link computation time =971.66 ms, flops= 19.92 Gflops
runlink.sm52:link computation time =1110.86 ms, flops= 17.42 Gflops
runlink.sm52:link computation time =1120.10 ms, flops= 17.28 Gflops
runlink.sm52:link computation time =1125.62 ms, flops= 17.20 Gflops
and
runlink.sm35:Unitarization time: 6.12 ms
runlink.sm35:Unitarization time: 6.125 ms
runlink.sm35:Unitarization time: 6.134 ms
runlink.sm35:Unitarization time: 6.128 ms
runlink.sm52:Unitarization time: 14.913 ms
runlink.sm52:Unitarization time: 14.933 ms
runlink.sm52:Unitarization time: 14.916 ms
runlink.sm52:Unitarization time: 14.988 ms
So I see the Titan X roughly on par with the K40 for the fat-link computation and something like a factor of 2.5 slower for the unitarization. Note that the K40 is running with ECC enabled and at 745 MHz.
More troubling is that the unitarization on the K40 shows 1 error, while it completes without error on the Titan X :-(
One more thing that might affect the unitarization time: I did not load a gauge configuration for this first test.
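For anyone wanting to redo the comparison from the numbers above, here is a minimal sketch that averages the quoted timings per architecture and prints the sm52/sm35 ratio. It assumes sm35 is the K40 and sm52 the Titan X, and that the log lines have exactly the format pasted above. The names `LOG`, `LINE` and `mean_times` are made up for this example and are not part of QUDA.

```python
import re
from collections import defaultdict
from statistics import mean

# Timings copied verbatim from the runs quoted above (24^4, double precision).
LOG = """\
runlink.sm35:link computation time =1081.98 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =1081.92 ms, flops= 17.89 Gflops
runlink.sm35:link computation time =929.51 ms, flops= 20.82 Gflops
runlink.sm35:link computation time =1083.12 ms, flops= 17.87 Gflops
runlink.sm52:link computation time =971.66 ms, flops= 19.92 Gflops
runlink.sm52:link computation time =1110.86 ms, flops= 17.42 Gflops
runlink.sm52:link computation time =1120.10 ms, flops= 17.28 Gflops
runlink.sm52:link computation time =1125.62 ms, flops= 17.20 Gflops
runlink.sm35:Unitarization time: 6.12 ms
runlink.sm35:Unitarization time: 6.125 ms
runlink.sm35:Unitarization time: 6.134 ms
runlink.sm35:Unitarization time: 6.128 ms
runlink.sm52:Unitarization time: 14.913 ms
runlink.sm52:Unitarization time: 14.933 ms
runlink.sm52:Unitarization time: 14.916 ms
runlink.sm52:Unitarization time: 14.988 ms
"""

# Matches both "link computation time =X ms" and "Unitarization time: X ms".
LINE = re.compile(
    r"^runlink\.(sm\d+):(link computation time|Unitarization time)"
    r"\s*[=:]\s*([\d.]+) ms",
    re.MULTILINE,
)


def mean_times(log):
    """Return {(arch, metric): mean time in ms} over all matching log lines."""
    samples = defaultdict(list)
    for arch, metric, ms in LINE.findall(log):
        samples[(arch, metric)].append(float(ms))
    return {key: mean(values) for key, values in samples.items()}


means = mean_times(LOG)
for metric in ("link computation time", "Unitarization time"):
    # A ratio above 1 means sm52 took longer than sm35 on average.
    ratio = means[("sm52", metric)] / means[("sm35", metric)]
    print(f"{metric}: sm52/sm35 = {ratio:.2f}")
```

With these four runs per architecture, the mean link computation times come out within a few percent of each other, and the mean unitarization time on sm52 is roughly 2.4 times that on sm35. Neither is anywhere near the 0.03x from the MILC report.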
After a lot of rewriting, and with Maxwell not a commonly used architecture for QUDA, closing.
MILC reports a 32x performance reduction versus an equivalent Kepler. While Maxwell doesn't have strong DP performance, this regression is significantly higher than expected.