lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda
Other
279 stars 94 forks source link

Add a TimeProfile for total time spend in QUDA #314

Closed mathiaswagner closed 8 years ago

mathiaswagner commented 9 years ago

Not just for the components as right now to get a total time the application spend in QUDA and the total time spend in compute, H2D, ...

mathiaswagner commented 9 years ago

I started work on that in the branch _hotfix/timertotal

Output currently looks like:


               initQuda Total time = 2.10581 secs

          loadGaugeQuda Total time = 0.346129 secs
              download     = 0.314558 secs (  90.9%), with        1 calls at 3.145580e+05 us per call
                  init     = 0.025151 secs (  7.27%), with        1 calls at 2.515100e+04 us per call
               compute     = 0.006406 secs (  1.85%), with        1 calls at 6.406000e+03 us per call
                  free     = 0.000008 secs (0.00231%), with        1 calls at 8.000000e+00 us per call
     total accounted       = 0.346123 secs (   100%)
     total missing         = 0.000006 secs (0.00173%)

             invertQuda Total time = 1.68672 secs
              download     = 0.012077 secs ( 0.716%), with        1 calls at 1.207700e+04 us per call
                upload     = 0.002828 secs ( 0.168%), with        1 calls at 2.828000e+03 us per call
                  init     = 0.006256 secs ( 0.371%), with        1 calls at 6.256000e+03 us per call
              preamble     = 0.000003 secs (0.000178%), with        1 calls at 3.000000e+00 us per call
               compute     = 1.644207 secs (  97.5%), with        1 calls at 1.644207e+06 us per call
              epilogue     = 0.002261 secs ( 0.134%), with        1 calls at 2.261000e+03 us per call
                  free     = 0.000000 secs (     0%), with        1 calls at 0.000000e+00 us per call
     total accounted       = 1.667632 secs (  98.9%)
     total missing         = 0.019085 secs (  1.13%)

                endQuda Total time = 0.385656 secs

                   QUDA Total time = 4.52432 secs
              download     = 0.326635 secs (  7.22%), with        2 calls at 1.633175e+05 us per call
                upload     = 0.002829 secs (0.0625%), with        1 calls at 2.829000e+03 us per call
                  init     = 0.031409 secs ( 0.694%), with        2 calls at 1.570450e+04 us per call
              preamble     = 0.000002 secs (4.42e-05%), with        1 calls at 2.000000e+00 us per call
               compute     = 1.650614 secs (  36.5%), with        2 calls at 8.253070e+05 us per call
              epilogue     = 0.002261 secs (  0.05%), with        1 calls at 2.261000e+03 us per call
                  free     = 0.000008 secs (0.000177%), with        2 calls at 4.000000e+00 us per call
           pack kernel     = 0.015555 secs ( 0.344%), with     1444 calls at 1.077216e+01 us per call
         dslash kernel     = 0.032304 secs ( 0.714%), with     2888 calls at 1.118560e+01 us per call
                gather     = 0.019014 secs (  0.42%), with     2888 calls at 6.583795e+00 us per call
               scatter     = 0.019664 secs ( 0.435%), with     2888 calls at 6.808864e+00 us per call
          event record     = 0.006780 secs (  0.15%), with     5776 calls at 1.173823e+00 us per call
           event query     = 0.578113 secs (  12.8%), with   523853 calls at 1.103579e+00 us per call
     stream wait event     = 0.004256 secs (0.0941%), with     4332 calls at 9.824561e-01 us per call
           comms start     = 0.013333 secs ( 0.295%), with     5776 calls at 2.308345e+00 us per call
           comms query     = 0.100898 secs (  2.23%), with   158435 calls at 6.368416e-01 us per call
              constant     = 0.002172 secs ( 0.048%), with        9 calls at 2.413333e+02 us per call
     total accounted       = 2.805847 secs (    62%)
     total missing         = 1.718470 secs (    38%)

Device memory used = 195.3 MB
Page-locked host memory used = 132.0 MB
Total host memory used >= 207.9 MB

It might need some further test and I am not sure whether the inclusion of the lower-level timers is desirable.

maddyscientist commented 8 years ago

This looks good. I tend to agree that we don't need any of the low-level data here: I'd suggest removing everything from pack kernel to comms query here.

It would be good to be able to report the total time spent since initQuda to endQuda, so we can see how much time is spent outside of QUDA.

maddyscientist commented 8 years ago

Actually looking at your breakdown above, it looks like total time missing might include this already. If that's the case, I think we need to differentiate between total time in QUDA but not accounted for, and total time spent outside of QUDA.

mathiaswagner commented 8 years ago

The output is a simple copy of the routine already used. I will clean that up. Otherwise these ate probably the lost timings that Justin wanted to get printed in an older issue.

Adding a time between init and end should be easy. I had that already on my list. Only shortcoming is that it depends on when the calling application calls these. E.g. Some MILC applications do a lot of output after the calculation but before endQuda.

Anyway, still good to have the timer. One only must be aware of what is measured.

mathiaswagner commented 8 years ago

I added a Timer from initQuda to endQuda.


               initQuda Total time = 9.45775 secs

          loadGaugeQuda Total time = 0.505286 secs
              download     = 0.362774 secs (  71.8%), with        1 calls at 3.627740e+05 us per call
                  init     = 0.140969 secs (  27.9%), with        1 calls at 1.409690e+05 us per call
               compute     = 0.001215 secs (  0.24%), with        1 calls at 1.215000e+03 us per call
                  free     = 0.000321 secs (0.0635%), with        1 calls at 3.210000e+02 us per call
     total accounted       = 0.505279 secs (   100%)
     total missing         = 0.000007 secs (0.00139%)

             invertQuda Total time = 1.07117 secs
              download     = 0.014515 secs (  1.36%), with        1 calls at 1.451500e+04 us per call
                upload     = 0.004383 secs ( 0.409%), with        1 calls at 4.383000e+03 us per call
                  init     = 0.005207 secs ( 0.486%), with        1 calls at 5.207000e+03 us per call
              preamble     = 0.000000 secs (     0%), with        1 calls at 0.000000e+00 us per call
               compute     = 1.037057 secs (  96.8%), with        1 calls at 1.037057e+06 us per call
              epilogue     = 0.001595 secs ( 0.149%), with        1 calls at 1.595000e+03 us per call
                  free     = 0.000001 secs (9.34e-05%), with        1 calls at 1.000000e+00 us per call
     total accounted       = 1.062758 secs (  99.2%)
     total missing         = 0.008417 secs ( 0.786%)

                endQuda Total time = 0.417433 secs

           init2endQuda Total time = 15.3865 secs

                   QUDA Total time = 11.4516 secs
              download     = 0.377289 secs (  3.29%), with        2 calls at 1.886445e+05 us per call
                upload     = 0.004383 secs (0.0383%), with        1 calls at 4.383000e+03 us per call
                  init     = 0.146178 secs (  1.28%), with        2 calls at 7.308900e+04 us per call
              preamble     = 0.000001 secs (8.73e-06%), with        1 calls at 1.000000e+00 us per call
               compute     = 1.038272 secs (  9.07%), with        2 calls at 5.191360e+05 us per call
              epilogue     = 0.001595 secs (0.0139%), with        1 calls at 1.595000e+03 us per call
                  free     = 0.000322 secs (0.00281%), with        2 calls at 1.610000e+02 us per call
         dslash kernel     = 0.010537 secs ( 0.092%), with     1314 calls at 8.019026e+00 us per call
              constant     = 0.002414 secs (0.0211%), with        9 calls at 2.682222e+02 us per call
     total accounted       = 1.580991 secs (  13.8%)
     total missing         = 9.870652 secs (  86.2%)

That time is different from the QUDA Total time. I might have to do a more careful check which timers are use at what time.

mathiaswagner commented 8 years ago

Just some more sample output ( from MILC RHMC ):

               initQuda Total time = 9.40036 secs

          loadGaugeQuda Total time = 30.4038 secs
              download     = 28.088332 secs (  92.4%), with       50 calls at 5.617666e+05 us per call
                  init     = 2.314063 secs (  7.61%), with       50 calls at 4.628126e+04 us per call
               compute     = 0.000041 secs (0.000135%), with       50 calls at 8.200000e-01 us per call
                  free     = 0.001211 secs (0.00398%), with       50 calls at 2.422000e+01 us per call
     total accounted       = 30.403647 secs (   100%)
     total missing         = 0.000198 secs (0.000651%)

             invertQuda Total time = 6.69018 secs
              download     = 0.131989 secs (  1.97%), with        4 calls at 3.299725e+04 us per call
                upload     = 0.051616 secs ( 0.772%), with        4 calls at 1.290400e+04 us per call
                  init     = 0.043231 secs ( 0.646%), with        4 calls at 1.080775e+04 us per call
              preamble     = 0.000000 secs (     0%), with        4 calls at 0.000000e+00 us per call
               compute     = 6.363281 secs (  95.1%), with        4 calls at 1.590820e+06 us per call
              epilogue     = 0.035293 secs ( 0.528%), with        4 calls at 8.823250e+03 us per call
                  free     = 0.000002 secs (2.99e-05%), with        4 calls at 5.000000e-01 us per call
     total accounted       = 6.625412 secs (    99%)
     total missing         = 0.064770 secs ( 0.968%)

   invertMultiShiftQuda Total time = 30.2083 secs
              download     = 0.510359 secs (  1.69%), with       20 calls at 2.551795e+04 us per call
                upload     = 3.030169 secs (    10%), with       20 calls at 1.515084e+05 us per call
                  init     = 0.196616 secs ( 0.651%), with       20 calls at 9.830800e+03 us per call
              preamble     = 0.000010 secs (3.31e-05%), with       20 calls at 5.000000e-01 us per call
               compute     = 22.451351 secs (  74.3%), with       20 calls at 1.122568e+06 us per call
              epilogue     = 3.460743 secs (  11.5%), with       20 calls at 1.730371e+05 us per call
                  free     = 0.120916 secs (   0.4%), with       20 calls at 6.045800e+03 us per call
     total accounted       = 29.770164 secs (  98.5%)
     total missing         = 0.438116 secs (  1.45%)

      computeKSLinkQuda Total time = 13.9739 secs
              download     = 3.870691 secs (  27.7%), with        8 calls at 4.838364e+05 us per call
                upload     = 7.871224 secs (  56.3%), with       16 calls at 4.919515e+05 us per call
                  init     = 0.271004 secs (  1.94%), with       28 calls at 9.678714e+03 us per call
               compute     = 1.812989 secs (    13%), with       12 calls at 1.510824e+05 us per call
                  free     = 0.143273 secs (  1.03%), with       16 calls at 8.954563e+03 us per call
                 total     = 0.003951 secs (0.0283%), with        8 calls at 4.938750e+02 us per call
     total accounted       = 13.973132 secs (   100%)
     total missing         = 0.000803 secs (0.00575%)

   computeGaugeForceQuda Total time = 6.68004 secs
              download     = 3.424195 secs (  51.3%), with       12 calls at 2.853496e+05 us per call
                upload     = 1.207801 secs (  18.1%), with        6 calls at 2.013002e+05 us per call
                  init     = 0.074818 secs (  1.12%), with       12 calls at 6.234833e+03 us per call
               compute     = 1.933443 secs (  28.9%), with        6 calls at 3.222405e+05 us per call
                  free     = 0.036860 secs ( 0.552%), with        6 calls at 6.143333e+03 us per call
                 total     = 0.002880 secs (0.0431%), with       12 calls at 2.400000e+02 us per call
     total accounted       = 6.679997 secs (   100%)
     total missing         = 0.000039 secs (0.000584%)

   updateGaugeFieldQuda Total time = 12.6246 secs
              download     = 7.324691 secs (    58%), with        9 calls at 8.138546e+05 us per call
                upload     = 5.076005 secs (  40.2%), with        9 calls at 5.640006e+05 us per call
                  init     = 0.223411 secs (  1.77%), with        9 calls at 2.482344e+04 us per call
               compute     = 0.000439 secs (0.00348%), with        9 calls at 4.877778e+01 us per call
     total accounted       = 12.624546 secs (   100%)
     total missing         = 0.000036 secs (0.000285%)

   computeStaggeredOprodQuda Total time = 10.7973 secs
              download     = 6.426532 secs (  59.5%), with       64 calls at 1.004146e+05 us per call
                upload     = 4.281884 secs (  39.7%), with        4 calls at 1.070471e+06 us per call
                  init     = 0.085276 secs (  0.79%), with       68 calls at 1.254059e+03 us per call
               compute     = 0.003381 secs (0.0313%), with       60 calls at 5.635000e+01 us per call
     total accounted       = 10.797073 secs (   100%)
     total missing         = 0.000203 secs (0.00188%)

   computeHISQForceQuda Total time = 23.6115 secs
              download     = 5.793608 secs (  24.5%), with       10 calls at 5.793608e+05 us per call
                upload     = 0.750801 secs (  3.18%), with        4 calls at 1.877002e+05 us per call
                  init     = 0.146680 secs ( 0.621%), with        4 calls at 3.667000e+04 us per call
               compute     = 16.885342 secs (  71.5%), with       10 calls at 1.688534e+06 us per call
                  free     = 0.034562 secs ( 0.146%), with        2 calls at 1.728100e+04 us per call
     total accounted       = 23.610993 secs (   100%)
     total missing         = 0.000464 secs (0.00197%)

               plaqQuda Total time = 0.031472 secs
                  init     = 0.000000 secs (     0%), with        2 calls at 0.000000e+00 us per call
               compute     = 0.031471 secs (   100%), with        2 calls at 1.573550e+04 us per call
     total accounted       = 0.031471 secs (   100%)
     total missing         = 0.000001 secs (0.00318%)

                endQuda Total time = 0.770814 secs

           init2endQuda Total time = 244.321 secs

                   QUDA Total time = 145.192 secs
              download     = 55.570453 secs (  38.3%), with      177 calls at 3.139574e+05 us per call
                upload     = 22.269524 secs (  15.3%), with       63 calls at 3.534845e+05 us per call
                  init     = 3.355103 secs (  2.31%), with      197 calls at 1.703098e+04 us per call
              preamble     = 0.000009 secs (6.2e-06%), with       24 calls at 3.750000e-01 us per call
               compute     = 49.481740 secs (  34.1%), with      173 calls at 2.860216e+05 us per call
              epilogue     = 3.496039 secs (  2.41%), with       24 calls at 1.456683e+05 us per call
                  free     = 0.336824 secs ( 0.232%), with       98 calls at 3.436980e+03 us per call
     total accounted       = 134.607613 secs (  92.7%)
     total missing         = 10.584622 secs (  7.29%)

Total time reported by MILC:

Time = 2.290183e+02 seconds

Time measured by time on command line:

real    4m10.063s
user    3m27.071s
sys 0m41.158s
mathiaswagner commented 8 years ago

Merged w/ #330.