Closed mathiaswagner closed 8 years ago
I started work on that in the branch _hotfix/timertotal
Output currently looks like:
initQuda Total time = 2.10581 secs
loadGaugeQuda Total time = 0.346129 secs
download = 0.314558 secs ( 90.9%), with 1 calls at 3.145580e+05 us per call
init = 0.025151 secs ( 7.27%), with 1 calls at 2.515100e+04 us per call
compute = 0.006406 secs ( 1.85%), with 1 calls at 6.406000e+03 us per call
free = 0.000008 secs (0.00231%), with 1 calls at 8.000000e+00 us per call
total accounted = 0.346123 secs ( 100%)
total missing = 0.000006 secs (0.00173%)
invertQuda Total time = 1.68672 secs
download = 0.012077 secs ( 0.716%), with 1 calls at 1.207700e+04 us per call
upload = 0.002828 secs ( 0.168%), with 1 calls at 2.828000e+03 us per call
init = 0.006256 secs ( 0.371%), with 1 calls at 6.256000e+03 us per call
preamble = 0.000003 secs (0.000178%), with 1 calls at 3.000000e+00 us per call
compute = 1.644207 secs ( 97.5%), with 1 calls at 1.644207e+06 us per call
epilogue = 0.002261 secs ( 0.134%), with 1 calls at 2.261000e+03 us per call
free = 0.000000 secs ( 0%), with 1 calls at 0.000000e+00 us per call
total accounted = 1.667632 secs ( 98.9%)
total missing = 0.019085 secs ( 1.13%)
endQuda Total time = 0.385656 secs
QUDA Total time = 4.52432 secs
download = 0.326635 secs ( 7.22%), with 2 calls at 1.633175e+05 us per call
upload = 0.002829 secs (0.0625%), with 1 calls at 2.829000e+03 us per call
init = 0.031409 secs ( 0.694%), with 2 calls at 1.570450e+04 us per call
preamble = 0.000002 secs (4.42e-05%), with 1 calls at 2.000000e+00 us per call
compute = 1.650614 secs ( 36.5%), with 2 calls at 8.253070e+05 us per call
epilogue = 0.002261 secs ( 0.05%), with 1 calls at 2.261000e+03 us per call
free = 0.000008 secs (0.000177%), with 2 calls at 4.000000e+00 us per call
pack kernel = 0.015555 secs ( 0.344%), with 1444 calls at 1.077216e+01 us per call
dslash kernel = 0.032304 secs ( 0.714%), with 2888 calls at 1.118560e+01 us per call
gather = 0.019014 secs ( 0.42%), with 2888 calls at 6.583795e+00 us per call
scatter = 0.019664 secs ( 0.435%), with 2888 calls at 6.808864e+00 us per call
event record = 0.006780 secs ( 0.15%), with 5776 calls at 1.173823e+00 us per call
event query = 0.578113 secs ( 12.8%), with 523853 calls at 1.103579e+00 us per call
stream wait event = 0.004256 secs (0.0941%), with 4332 calls at 9.824561e-01 us per call
comms start = 0.013333 secs ( 0.295%), with 5776 calls at 2.308345e+00 us per call
comms query = 0.100898 secs ( 2.23%), with 158435 calls at 6.368416e-01 us per call
constant = 0.002172 secs ( 0.048%), with 9 calls at 2.413333e+02 us per call
total accounted = 2.805847 secs ( 62%)
total missing = 1.718470 secs ( 38%)
Device memory used = 195.3 MB
Page-locked host memory used = 132.0 MB
Total host memory used >= 207.9 MB
It might need some further test and I am not sure whether the inclusion of the lower-level timers is desirable.
This looks good. I tend to agree that we don't need any of the low-level data here: I'd suggest removing everything from pack kernel
to comms query
here.
It would be good to be able to report the total time spent since initQuda
to endQuda
, so we can see how much time is spent outside of QUDA.
Actually looking at your breakdown above, it looks like total time missing might include this already. If that's the case, I think we need to differentiate between total time in QUDA but not accounted for, and total time spent outside of QUDA.
The output is a simple copy of the routine already used. I will clean that up. Otherwise these ate probably the lost timings that Justin wanted to get printed in an older issue.
Adding a time between init and end should be easy. I had that already on my list. Only shortcoming is that it depends on when the calling application calls these. E.g. Some MILC applications do a lot of output after the calculation but before endQuda.
Anyway, still good to have the timer. One only must be aware of what is measured.
I added a Timer from initQuda to endQuda.
initQuda Total time = 9.45775 secs
loadGaugeQuda Total time = 0.505286 secs
download = 0.362774 secs ( 71.8%), with 1 calls at 3.627740e+05 us per call
init = 0.140969 secs ( 27.9%), with 1 calls at 1.409690e+05 us per call
compute = 0.001215 secs ( 0.24%), with 1 calls at 1.215000e+03 us per call
free = 0.000321 secs (0.0635%), with 1 calls at 3.210000e+02 us per call
total accounted = 0.505279 secs ( 100%)
total missing = 0.000007 secs (0.00139%)
invertQuda Total time = 1.07117 secs
download = 0.014515 secs ( 1.36%), with 1 calls at 1.451500e+04 us per call
upload = 0.004383 secs ( 0.409%), with 1 calls at 4.383000e+03 us per call
init = 0.005207 secs ( 0.486%), with 1 calls at 5.207000e+03 us per call
preamble = 0.000000 secs ( 0%), with 1 calls at 0.000000e+00 us per call
compute = 1.037057 secs ( 96.8%), with 1 calls at 1.037057e+06 us per call
epilogue = 0.001595 secs ( 0.149%), with 1 calls at 1.595000e+03 us per call
free = 0.000001 secs (9.34e-05%), with 1 calls at 1.000000e+00 us per call
total accounted = 1.062758 secs ( 99.2%)
total missing = 0.008417 secs ( 0.786%)
endQuda Total time = 0.417433 secs
init2endQuda Total time = 15.3865 secs
QUDA Total time = 11.4516 secs
download = 0.377289 secs ( 3.29%), with 2 calls at 1.886445e+05 us per call
upload = 0.004383 secs (0.0383%), with 1 calls at 4.383000e+03 us per call
init = 0.146178 secs ( 1.28%), with 2 calls at 7.308900e+04 us per call
preamble = 0.000001 secs (8.73e-06%), with 1 calls at 1.000000e+00 us per call
compute = 1.038272 secs ( 9.07%), with 2 calls at 5.191360e+05 us per call
epilogue = 0.001595 secs (0.0139%), with 1 calls at 1.595000e+03 us per call
free = 0.000322 secs (0.00281%), with 2 calls at 1.610000e+02 us per call
dslash kernel = 0.010537 secs ( 0.092%), with 1314 calls at 8.019026e+00 us per call
constant = 0.002414 secs (0.0211%), with 9 calls at 2.682222e+02 us per call
total accounted = 1.580991 secs ( 13.8%)
total missing = 9.870652 secs ( 86.2%)
That time is different from the QUDA Total time. I might have to do a more careful check which timers are use at what time.
Just some more sample output ( from MILC RHMC ):
initQuda Total time = 9.40036 secs
loadGaugeQuda Total time = 30.4038 secs
download = 28.088332 secs ( 92.4%), with 50 calls at 5.617666e+05 us per call
init = 2.314063 secs ( 7.61%), with 50 calls at 4.628126e+04 us per call
compute = 0.000041 secs (0.000135%), with 50 calls at 8.200000e-01 us per call
free = 0.001211 secs (0.00398%), with 50 calls at 2.422000e+01 us per call
total accounted = 30.403647 secs ( 100%)
total missing = 0.000198 secs (0.000651%)
invertQuda Total time = 6.69018 secs
download = 0.131989 secs ( 1.97%), with 4 calls at 3.299725e+04 us per call
upload = 0.051616 secs ( 0.772%), with 4 calls at 1.290400e+04 us per call
init = 0.043231 secs ( 0.646%), with 4 calls at 1.080775e+04 us per call
preamble = 0.000000 secs ( 0%), with 4 calls at 0.000000e+00 us per call
compute = 6.363281 secs ( 95.1%), with 4 calls at 1.590820e+06 us per call
epilogue = 0.035293 secs ( 0.528%), with 4 calls at 8.823250e+03 us per call
free = 0.000002 secs (2.99e-05%), with 4 calls at 5.000000e-01 us per call
total accounted = 6.625412 secs ( 99%)
total missing = 0.064770 secs ( 0.968%)
invertMultiShiftQuda Total time = 30.2083 secs
download = 0.510359 secs ( 1.69%), with 20 calls at 2.551795e+04 us per call
upload = 3.030169 secs ( 10%), with 20 calls at 1.515084e+05 us per call
init = 0.196616 secs ( 0.651%), with 20 calls at 9.830800e+03 us per call
preamble = 0.000010 secs (3.31e-05%), with 20 calls at 5.000000e-01 us per call
compute = 22.451351 secs ( 74.3%), with 20 calls at 1.122568e+06 us per call
epilogue = 3.460743 secs ( 11.5%), with 20 calls at 1.730371e+05 us per call
free = 0.120916 secs ( 0.4%), with 20 calls at 6.045800e+03 us per call
total accounted = 29.770164 secs ( 98.5%)
total missing = 0.438116 secs ( 1.45%)
computeKSLinkQuda Total time = 13.9739 secs
download = 3.870691 secs ( 27.7%), with 8 calls at 4.838364e+05 us per call
upload = 7.871224 secs ( 56.3%), with 16 calls at 4.919515e+05 us per call
init = 0.271004 secs ( 1.94%), with 28 calls at 9.678714e+03 us per call
compute = 1.812989 secs ( 13%), with 12 calls at 1.510824e+05 us per call
free = 0.143273 secs ( 1.03%), with 16 calls at 8.954563e+03 us per call
total = 0.003951 secs (0.0283%), with 8 calls at 4.938750e+02 us per call
total accounted = 13.973132 secs ( 100%)
total missing = 0.000803 secs (0.00575%)
computeGaugeForceQuda Total time = 6.68004 secs
download = 3.424195 secs ( 51.3%), with 12 calls at 2.853496e+05 us per call
upload = 1.207801 secs ( 18.1%), with 6 calls at 2.013002e+05 us per call
init = 0.074818 secs ( 1.12%), with 12 calls at 6.234833e+03 us per call
compute = 1.933443 secs ( 28.9%), with 6 calls at 3.222405e+05 us per call
free = 0.036860 secs ( 0.552%), with 6 calls at 6.143333e+03 us per call
total = 0.002880 secs (0.0431%), with 12 calls at 2.400000e+02 us per call
total accounted = 6.679997 secs ( 100%)
total missing = 0.000039 secs (0.000584%)
updateGaugeFieldQuda Total time = 12.6246 secs
download = 7.324691 secs ( 58%), with 9 calls at 8.138546e+05 us per call
upload = 5.076005 secs ( 40.2%), with 9 calls at 5.640006e+05 us per call
init = 0.223411 secs ( 1.77%), with 9 calls at 2.482344e+04 us per call
compute = 0.000439 secs (0.00348%), with 9 calls at 4.877778e+01 us per call
total accounted = 12.624546 secs ( 100%)
total missing = 0.000036 secs (0.000285%)
computeStaggeredOprodQuda Total time = 10.7973 secs
download = 6.426532 secs ( 59.5%), with 64 calls at 1.004146e+05 us per call
upload = 4.281884 secs ( 39.7%), with 4 calls at 1.070471e+06 us per call
init = 0.085276 secs ( 0.79%), with 68 calls at 1.254059e+03 us per call
compute = 0.003381 secs (0.0313%), with 60 calls at 5.635000e+01 us per call
total accounted = 10.797073 secs ( 100%)
total missing = 0.000203 secs (0.00188%)
computeHISQForceQuda Total time = 23.6115 secs
download = 5.793608 secs ( 24.5%), with 10 calls at 5.793608e+05 us per call
upload = 0.750801 secs ( 3.18%), with 4 calls at 1.877002e+05 us per call
init = 0.146680 secs ( 0.621%), with 4 calls at 3.667000e+04 us per call
compute = 16.885342 secs ( 71.5%), with 10 calls at 1.688534e+06 us per call
free = 0.034562 secs ( 0.146%), with 2 calls at 1.728100e+04 us per call
total accounted = 23.610993 secs ( 100%)
total missing = 0.000464 secs (0.00197%)
plaqQuda Total time = 0.031472 secs
init = 0.000000 secs ( 0%), with 2 calls at 0.000000e+00 us per call
compute = 0.031471 secs ( 100%), with 2 calls at 1.573550e+04 us per call
total accounted = 0.031471 secs ( 100%)
total missing = 0.000001 secs (0.00318%)
endQuda Total time = 0.770814 secs
init2endQuda Total time = 244.321 secs
QUDA Total time = 145.192 secs
download = 55.570453 secs ( 38.3%), with 177 calls at 3.139574e+05 us per call
upload = 22.269524 secs ( 15.3%), with 63 calls at 3.534845e+05 us per call
init = 3.355103 secs ( 2.31%), with 197 calls at 1.703098e+04 us per call
preamble = 0.000009 secs (6.2e-06%), with 24 calls at 3.750000e-01 us per call
compute = 49.481740 secs ( 34.1%), with 173 calls at 2.860216e+05 us per call
epilogue = 3.496039 secs ( 2.41%), with 24 calls at 1.456683e+05 us per call
free = 0.336824 secs ( 0.232%), with 98 calls at 3.436980e+03 us per call
total accounted = 134.607613 secs ( 92.7%)
total missing = 10.584622 secs ( 7.29%)
Total time reported by MILC:
Time = 2.290183e+02 seconds
Time measured by time
on command line:
real 4m10.063s
user 3m27.071s
sys 0m41.158s
Merged w/ #330.
Not just for the components as right now to get a total time the application spend in QUDA and the total time spend in compute, H2D, ...