rouson commented 6 years ago

Lay the groundwork for parallel performance analysis by profiling with the Tuning and Analysis Utilities (TAU). Comparing the profiling information that TAU produces from the existing serial code with the previous profiles will validate the initial TAU output but, more importantly, will provide a straightforward path to parallel performance analysis.
Identify bottlenecks and estimate the ideal speedup that can be achieved though one or more performance-enhancement strategies, including coarray parallelization.

rouson commented 6 years ago

Attached are 4 profile plots of zgoubi procedures in serial and embarrassingly parallel execution:

The serial execution profile corresponds precisely to the zgoubi-integration test that Travis-CI that now executes automatically using ctest every time commits are pushed to the develop branch of the Sourcery Institute fork of zgoubi. And because Travis-CI builds zgoubi with cmake, item 1 above leverages the work in issues #1, #2, and #10.

The embarrassingly profiles in items 1-3 above lay a foundation for work on issue #5 in a future SOW. The parallel profiles were created as follows:

Compiling zgoubi with the OpenCoarrays compiler wrapper (caf) built atop MPICH 3.2.1 and GCC 8.1.0.
Executing zgoubi by using TAU Commander to invoke the OpenCoarrays program launcher (cafrun).

With this configuration, each coarray Fortran image is an MPI rank and each image executes the exact same program statements simultaneously as might be employed to conduct a parametric study.

The conclusions from this one example demonstrate that percentage of runtime occupied by the most computationally intensive zgoubi procedures in serial execution are

chamk1: 33.3%
spntrk: 5.2%
devtra: 3.1%
maplim: 2.9%
pvect: 1.4%
fitmm: 1.3% with each remaining procedure occupying less than 1% of the runtime. Clearly chamk1 is the chief bottleneck and initial performance enhancement efforts should be focused on speeding up that procedure.

All procedures shown in the serial profile (plot 1) have a total of approximately 50% of the execution time with the bulk of the remaining execution time going to calls to the Fortran runtime library to perform input/output (I/O), which is not shown in the provided plots but was determined through additional analysis using TAU. Assuming a fixed problem size (strong scaling) and assuming that performance enhancement efforts focus on the physics code, not including I/O, one can conclude from Amdahl's law that the maximum speedup attainable in serial execution is likely to be approximately a factor of 2.

Plot 2 is a 3D representation of the runtime profile representative of an embarrassingly parallel scenario with a problem size that expands linearly with the available compute resources (weak scaling). The 3D profile demonstrates that the same procedures dominate as dominated the serial run and showing relatively even load balancing across the four images executed -- all as would be expected for this simple form of parallelization. This form of parallelism effectively offers unlimited speedup that is linearly proportional to the number of images employed.

Plots 3 for the parallel execution shows a profile similar to the serial case obtains from averaging each procedure's execution time averaged across the four executing images. Plot 4 is similar to Plot 3, but also includes the underlying MPI procedures' execution time in the profile.

rouson commented 6 years ago

The above comment constitutes the deliverable for this issue and is ready for review.

robnagler commented 6 years ago

Thanks for providing this summary. This is great progress, which of course opens up questions:

I didn't see the caf/cafrun support in the develop branch. Is that checked in?

I would like to understand why I/O comprises 50% of the run-time.

I'm a little confused about the parallelism. What did you modify to get the embarrassingly parallel execution?

robnagler commented 6 years ago

I don't think I got answers to my questions. It got autoclosed by referencing a commit.

zbeekman commented 6 years ago

RE: caf/cafrun

I didn't see the caf/cafrun support in the develop branch. Is that checked in?

No, this is trivial and is essentially a system/compiler property, not something that should be tracked in the repository. For compilation, the quick and dirty hack is just to say

export FC=caf
export CC=mpicc # gcc would probably be fine here too, since the Fortran "compiler" is used to invoke the linker

after installing OpenCoarrays. There is recent CMake support for compiler wrappers/tools, that we can look into implementing, if so desired.

To run: cafrun -np 1 ./zgoubi etc.

RE: IO & profiling

I would like to understand why I/O comprises 50% of the run-time.

The answer to this question is a bit involved. I try to be concise below and provide more or less an executive summary. I highly doubt that this profile is representative of production zgoubi usage (but this is speculation).

TL;DR;

This is not intended to be a comprehensive performance analysis. Rather, the purpose is to verify that TAU/TAU Commander are compatible with zgoubi, and demonstrate the utility and potential for applying TAU to zgoubi EDIT: and to get some estimates of where bottle necks are and estimates for speedups.

The reason why so much time is spent in IO was not investigated in detail, however, the most likely reasons are:

The test case we ran here was not computationally intensive enough. Per word, the number of clock cycles wasted on IO is at least an order of magnitude larger than moving data from RAM to cache/register, which is at least an order of magnitude larger than the clock cycles per operation on a word in CPU register. Reading files (especially ascii files) is very time consuming and dependent on the system hardware.
The tests were conducted in a virtual machine, so there may have been additional IO slow-down/overhead associated with that

Longer answer

Saying ~50% of runtime is in IO, is not 100% accurate. We observed that the only other significant time spent was in the GFortran runtime, most of which appeared to have to do with IO. We pruned sample and summary statistics from the profiles for code regions outside of those in zgoubi itself, because, they are somewhat confusing and not necessarily indicative of zgoubi's runtime performance. The generated profiles were using event based sampling with TAU & TAU Commander. The time not accounted for directly in zgoubi was either:

In libgfortran (Fortran runtime) performing mostly what appear to be IO operations
In system calls doing IO
Unresolved

Unlike other tools, like gprof, TAU tries to avoid time dilation artifacts caused by instrumentation/profiling. After all, what good is a profile if you're just measuring how long it takes to start and stop timers? There are different ways that TAU can be configured to collect different types of data, but by far the easiest data to collect is using event-based sampling. zgoubi looks like it has many, many very short, near-trivial procedures which may not contain a lot of computation. With event based sampling on a short, small test case, it is possible that too few, or no samples were collected in these very short procedures.

In addition, the test case run when these profiles were collected was a short one. It's more than possible that there simply is not enough computational work in the test case for the computations to take significantly longer than the IO performed in reading input files, input data and writing results. The tests were performed inside of a linux VM, which may also introduce bottlenecks associated with IO, etc. that might not be observed in a real system.

Furthermore, these tests were run using OpenCoarrays with GFortran, so it is likely that there is some additional runtime overhead in libgfortran when setting up the parallel runtime. This may also account for some of the observed time in libgfortran. (Most of it appeared to be IO, however.)

The purpose of this, was not to develop a detailed performance analysis of zgoubi, but rather to ensure that profiling data could be collected with TAU/TAU Commander and show the utility and potential of these tools when working to modernize, parallelize and improve zgoubi EDIT: and to estimate margins for performance gains.

To obtain a more useful and accurate profile, a test case representative of real workflows should be used, on production hardware or representative hardware. The next step is typically to use event based sampling to get a decent idea of where time is being spent. Some parameters can be tuned if the defaults are not adequate, including the sampling rate, etc. If event based sampling still appears insufficient in terms of detail and accuracy, the next step is source instrumentation. This is achieved with automatic or compiler based source instrumentation. This process has TAU & PDT go in and insert timing calls in the source code (automatically), or uses compiler options/interfaces to trigger calls to TAU's timers. This is more accurate than sampling, with the caveat that, you can cause artificial time dilation if instrumentation is indiscriminately added everywhere. TAU has throttling capabilities that help (it turns off instrumentation for procedures called too frequently, with too little time spent per call) but using an event based sampling profile, it can also generate a selective instrumentation file that will cause the automatic or compiler based instrumentation to be restricted to the important areas.

Moving forwards, the next steps are:

Compare this profile to previously obtained profiles
Ensure that a representative test case is being profiled on representative hardware
Conduct a more detailed performance analysis which, after 1 & 2 above requires:
- Source based instrumentation (inserts calls to timers)
- Collection of hardware counter data via TAU & PAPI (Allows for data on cache misses & miss rates, stalled cycles, FLOPs, arithmetic intensity, etc.)
- Turning on other measurement features like IO instrumentation to determine IO bandwidth.

Having only briefly perused the source code myself, I would not be surprised if we can get more than a 2X speedup. Typically IO optimizations can yield 10x-100x speedups for the subroutines doing the IO, and beyond parallelizing, other tuning can be performed to better take advantage of today's NUMA architectures.

RE: Parallelism

I'm a little confused about the parallelism. What did you modify to get the embarrassingly parallel execution?

We simply compiled the code with OpenCoarray's caf compiler wrapper script and then ran it with cafrun -np 4 ./zgoubi .... (No algorithms in zgoubi were parallelized, it is still a serial code AFAIK.) This has the effect of the GFortran runtime calling MPI_Init and MPI_Finalize to setup the behind the scenes MPI transport layer for coarray fortran. The zgoubi code itself required no modifications, it was just built with OpenCoarrays. At runtime the only real difference to spawning 4 identical zgoubi processes at the same time is the initial coordination to setup the parallel runtime and then the cleanup when each process finishes. So, essentially, the same problem is simultaneously being solved by N instances of zgoubi.

zbeekman commented 6 years ago

OK, I need to amend my comments from above RE profiling. I didn't read the initial comments carefully enough. I will edit my previous comment, and try to make clear what has changed.

robnagler commented 6 years ago

I was confused by the statement "one can conclude from Amdahl's law that the maximum speedup attainable in serial execution is likely to be approximately a factor of 2." I don't think we can draw this conclusion with the data we have.

In general, a unit test that runs in a few seconds is not going to give very much information about scaling or performance issues.

I would appreciate it in the future if you say what you do in plainer language. It helps avoid confusion. For example:

The zgoubi binary was run unmodified under MPI
tests/centeredHelix_FIT_save_nofinal_150226.res was used as input
TAU seems to have worked

rouson commented 6 years ago

Thanks for the feedback and suggestion, @robnagler. Thanks for the excellent explanation, @zbeekman!

Rob, does Zaak's comment fully answer your questions? Is there anything else required to close this issue?

robnagler commented 6 years ago

I don't think this task has been done.

TAU should be added to the build process somehow so we know how to run it. It could be a separate script.
We haven't identified any bottlenecks. We need to profile with a large enough example to "bottlenecks and estimate the ideal speedup". Here is an example performance report on Zgoubi that gave some information about bottlenecks.

zbeekman commented 6 years ago

@robnagler @dtabell Using the test cases from https://github.com/radiasoft/rszgoubi/tree/master/results/run-2017-03-17T22_17_00%2B00_00 is highly non-obvious and non-trivial. I can't follow through the myriad of docker containers and different repositories where the originating source code for zgoubi is and what input files, if any, are required beyond the zgoubi.dat file. I have generated the corresponding zgoubi.dat files, but I get an error at runtime:

$ ../../zgoubi

  Zgoubi, author's dvlpmnt version.
  Job  started  on  07-08-2018,  at  10:34:47
   Copying  zgoubi.dat  into  zgoubi.res,
   numbering  and  labeling  elements...

 Title : Natural emittance (from zpop/7/7[2,3]), i.e., for skew=0, epsy=0) :

    7/ 1905 SRLOSS    /                    /
+
Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x1227781ac
#1  0x122777553
#2  0x7fff51423f59
Bus error: 10

This is with zgoubi.dat copied from coupling.1100.11.dat in the other project: zgoubi.dat.txt

Furthermore, copying and pasting the instructions from the README.md at https://github.com/radiasoft/rszgoubi results in some errors, like missing setup.py and failure to execute the command within the docker image.

zbeekman commented 6 years ago

Here is the updated profile information and status. I hope this is sufficient detail, along with the profiling script on the profile-TAU-cmdr branch, to consider this issue resolved. Please let is know what, if anything, else is required to close this out for SOW 1.

Methods

A larger test case was run to better match the results from the performance report on Zgoubi. We chose to run the largest test case of 11 turns and 1100 particles. A copy of the input file can be found here: coupling.1100.11.dat.txt The testing was done using the unstable branch of TAU Commander at f1cf73ac44006569cae31e88aeaa76d84d51cae2. While an RPM was made available and could be installed within the docker image via

yum install -y https://depot.radiasoft.org/foss/rscode-taucmdr-dev.rpm

(This installs into /home/vagrant/.local/taucmdr, so you need to add /home/vagrant/.local/taucmdr/bin to the $PATH.) However, despite a connection speed of ~200Mb/sec (download) the docker image was pulling the RPM at only 750 kB/sec and it was faster to just perform a minimal install directly in the docker image.

Special notes

Here are notes on settings and changes to default measurements to ensure decent sampling of Zgoubi using TAU & TAU Commander.

It was found that compiling with -fbacktrace caused problems for TAU Commander so it was disabled
The code should be compiled with debug symbols enabled to resolve the samples (-g)
The many, many very small procedures appear to be hard to capture when sampling. The original results don't show an obvious un-accounted for chunk of time, but it's likely that Kcachegrind was using a similar sampling technique.
To capture an adequate amount of data, some adjustments were made to the default sampling configuration:
- --unwind-depth 12 was set for additional call stack unwinding/resolution
- Throttling was turned completely off with --throttle False.
- Because TAU either uses LD_PRELOAD or hooks into the MPI Tools interface, Zgoubi was compiled against MPI, and MPI_init() and MPI_finalize() were wrapped around zgoubi_main.f to ensure TAU was initialized and started before Zgoubi started in earnest. (Otherwise, zgoubi is usually statically linked, so there is no obvious hook into TAU. We experimented with dynamically linking zgoubi, but calling MPI is the safest way to ensure proper initialization.)

This is the sequence of commands used to generate the profile data:

# Clone zgoubi & cd into the top level of the source directory
mkdir cmake-build
# initialized the tau project
export FC=gfortran
export CC=gcc
export CXX=g++
tau initialize --mpi T --target=beamsim-p1 --application zgoubi --project radiasoft --cc $CC --cxx $CXX --fc $FC
tau measurement edit sample --throttle False --unwind-depth 12
tau select sample
cd cmake-build
cmake -DCMAKE_Fortran_COMPILER_LAUNCHER=tau -DCMAKE_INSTALL_PREFIX=$(pyenv prefix) ..
make VERBOSE=1 -j $(nproc) # VERBOSE=1 just to see what CMake is doing
cp <path-to>/coupling.1100.11.dat ./zgoubi.dat
tau mpirun -n 1 ./zgoubi
tau dashboard
tau trial list
tau trial export <trial-num> # requires java
tau trial show <trial-num> --profile-tools pprof | head -n 50
tau trial show <trial-num> # only do this if you have java, will launch paraprof gui

Findings

The baseline (no sampling, no instrumentation) execution time for this test case running inside the radisasoft/beamsim-part1 docker container on a Late 2014, 3.5 GHz Intel Core i5 iMac with 24 GB 1600 MHz DDR3 was 297 seconds. With sampling turned on the total runtime was 304 seconds. This corresponds to a total time dilation of only 2.35%. It is possible that we could try increasing the sampling rate in the future. Here are the profile results for inclusive time:

And the same graph with times shown as a percent of total runtime:

Besides zgoubi_main.f the most time (inclusive) is spent within

trans.f
quasex.f
integr.f
cofin.f
event.f
zgoubi.f
raysyn.f

However, a more useful metric is exclusive time. This shows the time spent in the actual procedure, rather than in the procedure and within child calls. Below are profiles showing inclusive times in seconds and as a percentage of the total program runtime for the top most time expensive procedures. Sampling will also resolve some entities from the GFortran compiler runtime, system libraries, etc.

A full 10.5% of the runtime is devoted to computing sin() and cos() values. This __sincos() procedure is not resolving to a particular file, and is likely in a system library, the gfortran runtime or an intrinsic instruction. The next most time expensive procedures are:

razdrv.f 10.3%
devtra.f 8.66%
debstr.f 6.54%
depla.f 6.28
finstr.f 3.82%

Estimating Speedups

Without PAPI hardware counter metrics, which were unavailable running under docker on the iMac these experiments were conducted on, it is hard to get a very concrete sense of what speedups are attainable. My initial impression of the application is that there are a number of significant opportunities for optimization:

Elimination of loop or procedure invariant recalculation: Some code regions appear to be performing duplicate or unnecessary work
Improve data access patterns
- Some loops may benefit from loop tuning optimizations.
  - A number of places look like they are traversing large arrays multiple times without loop carried data dependencies, or data dependencies that can be accounted for with an index offset. Loop fusion may improve the cache locality in these circumstances
Excessive branching
- Invariant if statements can be hoisted
Other opportunities for scalar promotions
Usage of optimized linear algebra libraries may reduce code complexity and increase performance
Parallelization
- Opportunities exist for parallelization across particles
  - non-interacting particles are trivial
  - interacting particles are harder, however a vast body of work for N-Body simulations can be borrowed from. For example, astro-dynamics simulations often use an octree representation to allow the influence of groups of particles to be felt as a single force, when they are sufficiently far from the particle in question. See Barnes-Hut simulations

As of now, estimates of speedups are more art than science, without additional work on platforms that allow the collection of hardware performance counter data or additional analysis with other tools. These estimates are my professional opinion, so take them with a grain of salt. I would estimate that at least an order of magnitude speedup is obtainable before parallelization if issues like

inefficient loop nesting
poor data locality
redundant computation
insufficient local work in procedures and loops

can be resolved. From my visual inspection of code, and procedures like devtra, the chief techniques to achieving this will include:

Function inlining
Loop Fusion
Scalar promotion (i.e. invariant hoisting)

Future performance studies

We could setup the docker image to KCacheGrind. From the previous report, it looks like the instrumentation it is performing under the hood is similar to TAU Commander's event based sampling. In addition it may be counting load-store instructions or performing some sort of simulation.

Alternatively, I have access to a Intel Xeon (Broadwell) system that I know has access to CPU performance counter registers. Profiling zgoubi on this platform can get us:

L2 cache miss rate & total misses
Stalled cycle fraction & total stalled cycles
Fraction of Load/store instructions issued vs total instructions retired

In addition we can try installing ParaTools ThreadSpotter in the docker container (or on bare metal on an appropriate platform); ParaTools ThreadSpotter uses sampling, and then emulation and modeling to generate detailed reports. These reports rank memory bandwidth and latency problems, provide suggestions for potential fixes, and quantify total fetches, write backs, fetch utilization, WB utilization etc. Installation requires GCC 4.9. The reports are browseable with a web browser like firefox and the issues are highlighted in the source code along with their recommended fix. After sampling data is collected the performance on other architectures can be simulated in post processing.

In addition, once zgoubi has been further modernized, TAU has automatic source instrumentation capabilities, where the source files are parsed and calls to instrumentation procedures are injected. These tend to be more accurate than any sampling method, however, currently, the parsers cannot handle zgoubi source.

robnagler commented 6 years ago

Great stuff, thanks @zbeekman!

I need to digest, but yes, we're done with SOW1.

zbeekman commented 6 years ago

Not to beat a dead horse, but depending on the level of parallelism available (i.e. if particles can be treated as non-interacting) then you can likely scale to many particles in close to constant time with parallelization, as @rouson had pointed out earlier. Now I don't know if that is an attractive use case or not, and whether you can accurately treat the particles as non-interactive, but that strikes me as some of the lowest hanging fruit.

radiasoft / zgoubi

Profile with TAU #4

RE: caf/cafrun

RE: IO & profiling

TL;DR;

Longer answer

RE: Parallelism

Methods

Special notes

Findings

Estimating Speedups

Future performance studies