radiasoft / zgoubi

Git repo for zgoubi source code
https://sourceforge.net/projects/zgoubi/
GNU General Public License v2.0

Identify Computational Bottlenecks in Zgoubi #37

Open dtabell opened 6 years ago

zbeekman commented 6 years ago

This has been at least partially done in SOW1. The takeaway is:

screenshot 2018-08-20 18 58 29

A full 10.5% of the runtime is devoted to computing sin() and cos() values. This __sincos() procedure does not resolve to a particular file and is likely in a system library, the gfortran runtime, or an intrinsic instruction. The next most expensive procedures are:

  1. razdrv.f 10.3%
  2. devtra.f 8.66%
  3. debstr.f 6.54%
  4. depla.f 6.28%
  5. finstr.f 3.82%

razdrv.f is just zeroing and initializing variables. __sincos() might be optimizable with compiler options, if GNU has an option similar to Intel's (something like -ffast-transcendentals). devtra.f does some real work and possibly offers a number of opportunities to simplify.

What additional information are we looking for to consider this issue resolved?

Some work can be done on a different platform to look at metrics like the L2 cache miss rate, the percentage of cycles stalled waiting on resources, or memory bandwidth/latency problems identified with ParaTools ThreadSpotter. However, these profiling results agree fairly well with the first study.

robnagler commented 6 years ago

There are many things we need to look at here. This is for SOW2, which has yet to begin.

zbeekman commented 6 years ago

I agree, I’m just looking to better define the problem.

rouson commented 6 years ago

@robnagler Are we authorized to start work on SOW 2? If so, @dtabell shall we plan to talk at the usual time next Tuesday?

robnagler commented 6 years ago

SOW2 conversation is happening offline.

The acceptance requirements for this Issue are still undecided.

Here are my thoughts about what was already provided in #4.

When I look at the inclusive graph, I see that transf.f consumes 88 seconds to __sincos's 32 seconds. Maybe I'm not reading it right.

I can't "scroll right" on the png, but there are many unknown and direct libc calls from transf.f directly.

One thing I would like is more raw data and fewer graphs. I like being able to sort the results. I don't know what TAU can output, but a simple CSV or other text output would make analysis much easier. This would also allow us to measure improvements programmatically.

What I would expect is less discussion about cache utilization, and more understanding of what algorithms are being used that possibly could be replaced by more modern approaches. For example, in the __sincos case, would a table lookup or a memoized solution be more efficient? What's doing all those sin/cos calls and why? Is there a better algorithm at the outer scope?
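
As one hedged illustration of the memoization idea (hypothetical code, not taken from zgoubi): the most recent argument and its sine/cosine could be cached so that repeated calls with an unchanged angle skip the transcendental evaluations. Whether this pays off depends entirely on how often the angle actually repeats.

```fortran
! Hypothetical sketch (not zgoubi code) of a memoized sin/cos: cache the last
! angle and its sine/cosine so repeated calls with the same angle skip the
! transcendental evaluations.
module sincos_cache
  implicit none
  private
  public :: cached_sincos
  real(8), save :: last_x = huge(1.0d0)
  real(8), save :: last_s = 0.0d0, last_c = 1.0d0
contains
  subroutine cached_sincos(x, s, c)
    real(8), intent(in)  :: x
    real(8), intent(out) :: s, c
    if (x /= last_x) then        ! only recompute when the angle changes
      last_x = x
      last_s = sin(x)
      last_c = cos(x)
    end if
    s = last_s
    c = last_c
  end subroutine cached_sincos
end module sincos_cache

program demo
  use sincos_cache
  implicit none
  real(8) :: s, c
  call cached_sincos(0.25d0, s, c)   ! computes sin/cos
  call cached_sincos(0.25d0, s, c)   ! served from the cache
  print *, s, c
end program demo
```

A table lookup would trade memory and accuracy for speed in a similar way; either approach only helps if the call pattern justifies it.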

Once we understand the algorithms, we'll have a better idea of how parallelization can proceed.

rouson commented 6 years ago

@robnagler and @dtabell

@zbeekman is ready to start working on this issue. This is a good time for a teleconference on this to better define the objectives, an important outcome of which would be checkboxes that specify what accomplishments would represent the completion of this task.

zbeekman commented 5 years ago

NOTE: This post is currently a WIP

Performance Study of Zgoubi

NOTE: To make images larger, please click on them.

Abstract

TAU Commander was used to gather performance data for zgoubi. TAU Commander provides an abstraction of the performance engineering workflow and organizes instances of performance experiments, i.e., trials, into a database. Different types of measurements were set up, with varying degrees of fidelity and precision. Trials from experiments using these measurements were performed to characterize the performance of both the development branch of zgoubi and the new particle update branch, for the four test cases that are regularly run during development. It was found that the new particle update procedure resulted in a slow-down of the code, although this seems to be due to the data motion and copying required for the proof-of-concept implementation, as well as insufficient work in the innermost kernels. In general, TAU Commander observed that the innermost kernels have relatively little work. It is likely that additional compiler optimization flags, strategies such as function inlining, and loop-level optimizations (manual or compiler-based) could produce a substantial speedup. Furthermore, manual inlining or refactoring of portions of the code may improve performance thanks to better data locality, fewer compiler-generated temporary arrays, less reliance on programmer-introduced temporary variables, and improved code-motion opportunities for the compiler. Besides very high call/trip counts in inner kernels with little work, some of the test cases were found to spend a lot of time (in some cases nearly all of it) performing I/O. These two factors, I/O and high trip counts with little work, were found to be the biggest performance bottlenecks in the cases examined.

Methodology and Background

TAU and TAU Commander Basics and Background

TAU Commander is a front end for the TAU Performance System ®️ developed at the University of Oregon. TAU Commander simplifies the TAU Performance System ®️ by using a structured workflow approach that gives context to a TAU user’s actions. This eliminates the troubleshooting step inherent in the traditional TAU workflow and avoids invalid TAU configurations. TAU Commander reduces the unique steps in the workflow by ~50% and the commands a user must know from around eight to exactly one.

![TAU-vs-cmdr-flow](https://user-images.githubusercontent.com/279612/57377576-c9f40880-7170-11e9-940a-ea8ad3b7f483.png) ![image](https://user-images.githubusercontent.com/279612/57377625-e5f7aa00-7170-11e9-9bc0-ce3e9e932c9e.png)

TAU Commander activities are organized around three basic components: Target, Application and Measurement (TAM). This is illustrated in the figure below. The first, the target, describes the environment where data is collected. This includes the platform the code runs on, its operating system, CPU architecture, interconnect fabric, compilers, and installed software. The second is the application. The application captures the properties of the code being studied - whether it uses MPI, OpenMP, threads, CUDA, OpenCL and so on. The measurements define what data will be collected and in what format. Even though an application uses OpenMP or MPI, the measurements may or may not measure those items. In addition to that basic structure there are a couple more components that complete the TAU Commander interface. The first is the project. A project is the container for the developer's grouping of defined activities, settings and system environments. Last is the experiment. An experiment consists of one target, one application and one measurement. One experiment is active at a time, and that is what will be executed when developers collect data. When an experiment is run, the completed data set that is collected is a trial.

The project configuration, including any applications, targets, measurements, experiments and their associated trials, is stored in a database managed by TAU Commander. A high-level snapshot of this database and its associated entities is provided by the tau dashboard command. This output can be seen below.

Screenshot 2019-05-08 09 43 45

Project Specific TAU Commander Usage and Setup

TAU Commander was installed into the user's home directory from the unstable development branch using the following commands:

```sh
cd ~/src
git clone --branch=unstable https://github.com/ParaToolsInc/taucmdr
cd taucmdr
make INSTALLDIR=${HOME}/taucmdr USE_MINICONDA=false install
export PATH="${HOME}/taucmdr/bin:$PATH"
```

Once installed, a new project can be initialized for zgoubi:

```sh
cd ~/src/zgoubi
tau init --linkage dynamic --compilers GNU
```

This will set the target name to v, the hostname of the RadiaSoft development virtual machine, the project name to zgoubi and the application name to zgoubi by default. (The name of the parent directory is used for these.) An additional application can be created to organize the performance experiments and compare the new particle update algorithm/implementation with the development branch. To do this I ran:

```sh
tau application copy zgoubi z-ptcl
```

where z-ptcl is the name given to the zgoubi application generated from the refactor-new-ptcl-update branch.

By default, TAU Commander will create 5 measurements:

The three measurements that are anticipated to be of value to the daily development of zgoubi are: baseline, sample and compiler-inst. It is also important to note that, for compiler-inst not to incur excessive runtime dilation from the profiling overhead, it should be used in conjunction with a "selective instrumentation file". Two versions of selective instrumentation files have been generated for both the development and the new particle update branches. One version of the selective instrumentation file eliminates the instrumentation of all I/O-heavy procedures in addition to those eliminated by the other selective instrumentation file. ParaProf, the Java-based GUI profile explorer that is part of TAU and TAU Commander, can help the user create selective instrumentation files.

In general, whether using sampling or instrumentation, TAU will throttle instrumentation of procedures which are called more than 100,000 times and take less than 10 ms to execute. Calls to TAU's API still occur, but TAU returns immediately without pushing or popping a timer on the stack. Even so, this can incur significant overhead, which is why adding a selective instrumentation file improves the accuracy of the generated profiles and minimizes artificial runtime dilation. The time (or hardware counter metrics) associated with a procedure that is throttled or skipped ends up being attributed to the first parent call that is not excluded or throttled.

For the purposes of the present study additional measurements were created and used; however, they are either unavailable for daily use or not anticipated to be of much additional benefit to the zgoubi developer. Some are unavailable because VirtualBox does not allow direct access to CPU hardware performance counters, and, even if it did, these are CPU specific and often only available on server-grade CPUs. The others are not anticipated to be useful on a daily basis because they may incur additional overhead and produce results that are difficult to interpret.

Results

Quantification of Measurement Overhead

First, the most general results are presented, and the measurement overhead is quantified for sampling and compiler-based instrumentation for each of the 4 test cases, across the development version and the new particle update version of the code. The table below lists the total runtime for each case and the measurement-induced runtime dilation, t_d. These results are not averaged over multiple runs, partly due to a lack of time and partly to give a sense of the potential noise in the measurements from one run to the next. As will be shown later, a significant amount of time for some of these tests is spent performing I/O, often to the terminal. Unless otherwise stated, these tests have been run on a RadiaSoft-provided virtual machine running under VMware on an iMac with an Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz with 4 physical cores, 1 NUMA node, 32k of L1 data cache, 32k of L1 instruction cache, 256k of L2 cache and 6144k of L3 cache. As can be seen in the table below, the I/O-heavy workload, in combination with running in a virtualized environment on a machine that is running numerous other processes, causes some noise in the measurements.

| case | t (s) | sampled (s) | t_d (%) | comp-inst (s) | t_d (%) | no I/O comp-inst (s) | t_d (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| warmSnake-dev | 2.22 | 2.11 | -5.0 | 2.22 | 0.0 | 2.06 | -7.2 |
| spin18GeV-dev | 7.02 | 6.54 | -6.8 | 7.55 | 7.5 | 5.41 | -22.9 |
| spintSatrn-dev | 18.0 | 17.5 | -2.8 | 18.3 | 1.7 | 14.9 | -17.2 |
| solenoid-dev | 19.4 | 18.9 | -2.6 | 19.7 | 1.5 | 19.3 | 0.5 |
| warmSnake-new | 2.47 | 2.51 | 1.6 | 3.11 | 25.9 | 2.39 | 3.2 |
| spin18GeV-new | 12.6 | 12.8 | 1.6 | 20.9 | 65.9 | 11.5 | -8.7 |
| spintSatrn-new | 28.7 | 27.7 | -3.5 | 45.9 | 59.9 | 25.0 | 12.9 |
| solenoid-new | 19.1 | 20.6 | 7.9 | 21.9 | 14.7 | 20.7 | 8.3 |

The first four rows of the table represent the 4 test cases running the un-updated development branch of the code. The second 4 rows are those same test cases running the version of zgoubi with the new particle update algorithm. The first column of data gives the total program runtime as measured by the baseline measurement. The data was only collected for one run, but the error bars on this data are approximately ± 10%. The next column shows the total runtime of each case when event-based sampling has been turned on. Event-based sampling essentially polls the program at a near-constant rate to determine how much time has passed and which procedure the program is currently executing. These data are accumulated and analyzed; from this, TAU can attribute time to individual procedures ([SUMMARY] statistics) as well as to lines or regions that take a particularly long time ([SAMPLE] statistics). From this table it is evident that the total program runtime under sampling is well within the noise. Most cases appear to execute slightly faster while being sampled than in the original baseline timings. There are potentially two causes for this: 1) the executable, libraries, and input/data files needed were likely already cached in memory, or 2) a portion of the GFortran runtime library initialization and program startup was not measured because TAU uses dynamic library preloading to intercept system calls and hook into the program. Despite this, these numbers are in very good agreement, giving confidence that the baseline timings are approximately accurate, and that no excessive overhead is being incurred during sampling by the sample measurement.

The "comp-inst" (compiler instrumentation) data and the "no I/O comp-inst" is less straightforward. A selective instrumentation file was used to limit instrumentation to the most time consuming procedures, but different test cases spend time in different locations. The "comp-inst" measurement excludes all procedures that are throttled by TAU (procedures that are called more than 100,000 times and return within 10 ms), except for some IO procedures that account for a significant portion of time in some of the tests. The total runtimes reported for the development branch version of zgoubi seem relatively respectable: very good agreement with the corresponding uninstrumented runtimes, with the exception of the spin18GeV case, in which computation outweighs I/O, but a significant, non-trivial quantity of I/O still occurs. Here the instrumentation overhead appears to be about 7.5%.

For the updated particle tracking algorithm, instrumentation appears to come at a bigger cost: 3 out of 4 cases took over 20% longer. To combat this apparent inflation of the measured runtime, an additional measurement was created using a different selective instrumentation file which also excludes procedures that are dominated by I/O. The result was that the total runtime of the new particle update cases appeared to be within the measurement noise; however, the tests using the old particle update appeared to run faster than the uninstrumented program.

Despite these noisy findings, relative percentages of time spent in different program regions may be examined. One should keep in mind the following points when analyzing results:

  1. Timing errors may be as large as 5 - 10% due to system jitter; further studies may be performed to remove this through ensemble averaging
  2. While the sampling measurement appears to agree best with the baseline measurement in terms of overall runtime, the interpolation and extrapolation used to attribute time from sampling to specific code regions is typically less precise than well placed instrumentation
  3. Computing the total program runtime (especially for non-MPI programs) is the least accurate part of compiler-based instrumentation (the compiler-inst measurement), assuming overhead is kept in check. This is due to the way that TAU starts up and instruments the entire program.
  4. I/O is especially susceptible to run-to-run variation, since the OS or virtualization provider may cache files in memory for a period of time after they have been read from disk. Whether or not a file is cached can and will influence the time it takes to load the program, libraries and input files.

Sampling Profiles

Sampling is usually the fastest path to performance data and can give line numbers for expensive lines as well as call-path information. However, it will also pick up system libraries and can be a bit difficult to interpret. The plot below shows exclusive time for the four test cases, warmSnake, spinESRF_18GeV, spinSaturne and solenoid, in that order from top to bottom. These profiles use the development branch. The times are listed in seconds. Time spent in .TAU application is the total program runtime. Any entry containing [SUMMARY] indicates TAU's best estimate of the total time spent in that procedure. Items containing [SAMPLE] indicate a sampling region where significant time was spent.

Screenshot 2019-05-08 15 52 31

From inspecting the compiler-instrumentation profiles, presented further below, it is clear that the first case (window 0, corresponding to warmSnake) and the last case (window 3, corresponding to solenoid) are completely dominated by I/O-heavy routines. Looking at the sampling-based profiles above, we see that these two cases appear to spend a lot of time in libgfortran, where the samples are unresolved because GCC/GFortran was built without debug symbols enabled, and in other system-level or compiler-level functions which are mostly I/O related. It is likely that the unresolved samples in libgfortran are also I/O related.

The updated tracking algorithm's sampling-based profile for the top six most sampled procedures closely matches the findings above for the two I/O-dominated cases shown in windows 0 and 3, warmSnake and solenoid. However, for the two computationally intensive spin calculations, the new particle update procedures make their way into the top six most sampled procedures. The sampling profiles for these cases are shown below.

Screenshot 2019-05-08 16 13 41

In particular, line 70 of particle_impl.f90 is flagged as being especially expensive:

```fortran
derivB(:,k) = matmul( dnB(:, 1:orderEnd(k)), monomsU(1:orderEnd(k)) )
```

There are three potential issues with this line:

  1. Array slices as arguments to the Fortran intrinsic function matmul may trigger the creation of array temporaries at runtime, causing unneeded copying (see the sketch after this list).
  2. The orderEnd integer array introduces a level of indirection that inhibits compiler optimizations; since the integer array is known at compile time, it should be replaced with a compile-time constant, or better yet, eliminated entirely through another means. (Specialized procedures or the use of a strategy pattern are two ideas that spring to mind.)
  3. The intrinsic matmul can be replaced with calls to linear algebra libraries, often automatically. GFortran has certain default size thresholds that trigger this, and compiler flags that can tweak the threshold and force or disable the use of BLAS/LAPACK matrix-multiply procedures. The matrices in question are relatively small, so it should be investigated whether GFortran is calling into specialized linear algebra libraries, and experiments should be run with this turned off and on.
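
To make items 1 and 2 concrete, here is a minimal, self-contained sketch (illustrative array shapes and names only; nend stands in for a compile-time-known orderEnd(k)) of how the sliced matmul could be replaced by a fixed-bound loop that creates no temporaries:

```fortran
! Hypothetical sketch: replace the sliced matmul with an explicit, fixed-bound
! loop so no array temporaries are created and the orderEnd indirection is gone.
program matmul_sketch
  implicit none
  integer, parameter :: nmono = 12, nk = 3, nend = 8  ! nend stands in for orderEnd(k)
  real(8) :: derivB(3, nk), dnB(3, nmono), monomsU(nmono)
  integer :: i, j, k

  call random_number(dnB)
  call random_number(monomsU)
  k = 1

  ! Original style (may create temporaries for the slices):
  derivB(:, k) = matmul(dnB(:, 1:nend), monomsU(1:nend))

  ! Explicit fixed-bound loops: no temporaries, easily inlined and vectorized.
  do i = 1, 3
    derivB(i, k) = 0.0d0
    do j = 1, nend
      derivB(i, k) = derivB(i, k) + dnB(i, j) * monomsU(j)
    end do
  end do

  print *, derivB(:, k)
end program matmul_sketch
```

Whether the explicit loop actually beats matmul here should be measured; the point is that it removes both the temporaries and the run-time indirection.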

Instrumentation Based Profiles

While automatic source-based instrumentation with TAU often yields the most accurate results with the least overhead, the parsers used by PDT and TAU have difficulty both with very old, deprecated constructs such as ENTRY used by legacy portions of zgoubi, and with some very modern features such as submodules in the refactored portions. As a result, compiler-based instrumentation has been utilized to achieve robustness, at perhaps a slight decrease in flexibility and slightly higher runtime overhead.

Direct instrumentation, whether compiler-based, manual, or automatic using PDT and parsers, results in calls to TAU's API being inserted at the entrance to, and exit from, every procedure within zgoubi's source code. As a consequence, it is the most accurate form of profiling, but it also poses the greatest risk of artificially inflating runtimes if calls to the TAU instrumentation API are made too frequently and/or by procedures which do not contain any measurable work. As such, the performance engineer will typically start by instrumenting everything, and then narrow the set of instrumented files and procedures through a selective instrumentation file. This file consists of a list of procedures and/or files to exclude (blacklist) or include (whitelist). Using this approach, only the innermost kernels responsible for the majority of program runtime are instrumented, while procedures called with great frequency but containing little work can be excluded. (Any time spent in such a procedure will end up attributed to the first instrumented parent caller on the current call-stack.) Because testing was conducted across multiple cases with different paths taken through the code, two selective instrumentation files were generated for the development version of zgoubi and two for the new particle update version. The first excludes all but the most expensive computational and I/O procedures for both versions of the app. The second excludes everything except the most computationally expensive procedures, excluding even the I/O procedures that are important for the two most I/O-heavy test cases. Below, profiles are presented where cases 0 and 3 (warmSnake and solenoid) use the selective instrumentation file that includes I/O. The two middle cases, 1 and 2 (spin_ESRF18GeV and spinSaturne), are not dominated by I/O; therefore, in order to accurately profile the computationally intensive procedures, the more restrictive instrumentation file that excludes I/O was used to minimize inaccuracy and scrutinize the particle update procedures.

Screenshot 2019-05-08 17 05 13

The profile above shows the compiler-based instrumentation profile for zgoubi from the main development branch. The profile shows exclusive time in each procedure as a percentage of total program runtime. Unlike the sampling based profiles shown earlier, here the .TAU application entry is time that cannot be attributed to other parts of the program, likely associated with initializing the Fortran runtime library and environment, or other forms of overhead.

fmapw_, impplt_ and fitwda_ are all I/O-intensive procedures. This is known both from inspection of the source and from sampling-based profiles with full call paths turned on. (Call paths were hidden in the sampling-based profiles presented here, to reduce confusion and increase clarity.) These three procedures account for over 70% of the total runtime of the warmSnake case (window 0), and impplt_ alone accounts for over 95% of the runtime of the solenoid case. A fair amount of debugging and diagnostic output is enabled for these two cases, so using a different set of input file parameters should go a long way toward speeding up these computations. Additional steps, such as reading and writing files from a RAM disk (/dev/shm on many clusters), may also yield improved performance. GFortran buffers I/O by default in 8k blocks, although it is unclear whether output to stdout and stderr is ever buffered. The Intel compiler has options to enable and disable I/O buffering and the ability to tune the block size. Beyond this, writing to files instead of stdout/stderr may yield performance improvements, and switching to stream-access, binary file I/O may also increase performance. Explicit aggregation of data between writes may be another strategy worth investigating to tune I/O, or double-buffered I/O paired with Fortran's asynchronous I/O capabilities could allow some of the cost of I/O to be amortized by overlapping it with computation. (See Clerman and Spector, Modern Fortran, Listing 8.4.)
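
As a rough sketch of the double-buffering idea (hypothetical file name and workload, not zgoubi code), Fortran's asynchronous I/O lets a write be started on one buffer while the next chunk of data is computed into the other; whether the transfer truly overlaps computation depends on the compiler and operating system:

```fortran
! Hypothetical sketch of double-buffered, asynchronous, unformatted stream output.
program async_io_sketch
  implicit none
  integer, parameter :: n = 100000
  real(8), asynchronous :: buf(n, 2)
  integer :: u, step, cur

  open(newunit=u, file='sketch_output.bin', form='unformatted', &
       access='stream', asynchronous='yes', status='replace')

  cur = 1
  call random_number(buf(:, cur))              ! fill the first buffer (stand-in for real work)
  do step = 1, 10
    write(u, asynchronous='yes') buf(:, cur)   ! start writing the current buffer
    cur = 3 - cur                              ! switch buffers
    call random_number(buf(:, cur))            ! compute the next chunk while the write proceeds
    wait(u)                                    ! ensure pending writes finish before buffers are reused
  end do
  write(u, asynchronous='yes') buf(:, cur)     ! write the final chunk
  wait(u)
  close(u)
end program async_io_sketch
```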

Examining the profiles for the two computationally intensive spin cases (cases 1 and 2), it becomes evident that the majority of time is attributed to the devtra_ procedure, while other expensive procedures include quasex_, pckup_, and cavite_. Some of these procedures have one or more child calls that end up getting throttled (they take less than 10 ms and are called more than 100,000 times) when run without a selective instrumentation file, and those children have not been instrumented here in order to prevent inaccurate profiles with excessive instrumentation-induced runtime dilation. Of particular interest is integr_, which has child calls to devtra_, which in turn has child calls to the particle update procedures. Any time in these child calls is attributed to integr_.

If we compare the profiles for these two computationally intensive cases to the profiles produced using the same methodology for the new particle update code, shown below, a larger fraction of the total runtime is spent in devtra_. Since the only difference between the codes is the code path through the new particle update code, it is reasonable to attribute this additional time to the new procedures.

Screenshot 2019-05-08 17 34 48

Performing a direct comparison between the old development branch code and the new particle update algorithm shows that integr_ and its child calls are certainly responsible for the slow-down exhibited by the application. A direct comparison between the two versions of the code is shown below for both the spinESRF18GeV case and the spinSaturne case. In both cases there is negligible change in the amount of time spent in other procedures, including quasex_, pckup_ and the main program MAIN__. Case 1, spinESRF_18GeV, shows a greater than 3x slow-down in integr_ and its child calls, and case 2 exhibits just shy of a 5x slow-down.

Screenshot 2019-05-08 17 44 53

Despite this, the new particle update algorithm still shows promise. In addition to the 3 potential optimizations identified above concerning line 70 of particle_impl.f90, a number of opportunities exist to improve the implementation of the particle update algorithm, which, by design, requires fewer operations than the code it is replacing. The first potential issue is that particle data is being repacked into a different array in the particle update kernels. Updating the entire code to rely on the new algorithm would eliminate some or all of this repacking. This data motion and repacking may be causing cache misses, each of which typically costs on the order of tens of CPU clock cycles.

While instrumenting procedures that would normally be throttled by TAU and that have very little work in them is a bad idea if one cares about accurate timings, such an approach is somewhat viable for comparing relative timings and is completely plausible for comparing hardware counter metrics. In this vein, additional experiments were performed on an HPE SGI 8600 system equipped with Intel Xeon Platinum 8168 (Skylake) CPUs. First, TAU throttling was turned off completely, and instrumentation was created for devtra_ and its child calls into the new particle update code. This resulted in massive runtime dilation, roughly a 10x slow-down in the instrumented code, but the relative time in each of the new subroutines, as well as an indication of the number of calls and the time per call, is informative. Profiles for the two spin cases with this excessive instrumentation are shown in the two figures below.

Screenshot 2019-05-08 20 12 35 Screenshot 2019-05-08 20 30 24

The statistics table view for the spinSaturne case is shown below. The portions of the call graph which have been instrumented are shown. Times are given in microseconds. Note the number of calls for each procedure: the cross procedure is called over 250 million times, spending less than three tenths of a microsecond in this function per call. Functions like this are good candidates for inlining, since even the slight overhead of making a function call will add up over 250 million calls and be large relative to the time spent actually performing work.

Screenshot 2019-05-08 20 17 08
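
To illustrate why such a routine is an inlining candidate, the sketch below assumes, purely for illustration, that cross computes a 3-vector cross product; the arithmetic is a handful of multiplies and subtracts, so the call overhead itself becomes significant after hundreds of millions of invocations:

```fortran
! Hypothetical sketch of why a tiny routine called hundreds of millions of
! times benefits from inlining: the call overhead rivals the arithmetic.
program inline_sketch
  implicit none
  real(8) :: a(3), b(3), c(3)
  a = [1.0d0, 0.0d0, 0.0d0]
  b = [0.0d0, 1.0d0, 0.0d0]

  ! As a call (what the profile shows for helpers of this kind):
  c = cross(a, b)

  ! Manually inlined (or left to the compiler via link-time optimization and
  ! aggressive inlining flags): the same arithmetic, with no call overhead.
  c(1) = a(2)*b(3) - a(3)*b(2)
  c(2) = a(3)*b(1) - a(1)*b(3)
  c(3) = a(1)*b(2) - a(2)*b(1)
  print *, c
contains
  pure function cross(u, v) result(w)
    real(8), intent(in) :: u(3), v(3)
    real(8) :: w(3)
    w(1) = u(2)*v(3) - u(3)*v(2)
    w(2) = u(3)*v(1) - u(1)*v(3)
    w(3) = u(1)*v(2) - u(2)*v(1)
  end function cross
end program inline_sketch
```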

Next Steps

An obvious first step is to increase the aggressiveness of compiler optimizations. Enabling link-time optimization and more aggressive inlining may reduce some of the overhead associated with making many procedure calls, each having very little work. In addition, some if statements are present in the new particle update code. Since this code represents an inner kernel and is called many times, these if statements have the potential to create conditional branch instructions which may hinder efficient execution.
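
As a hypothetical example of branch elimination in an inner loop (the names and the condition are invented, not taken from the particle update code), the merge intrinsic can often replace a simple if/else when both alternatives are cheap to evaluate:

```fortran
! Hypothetical sketch: replacing a per-iteration if/else with merge() lets the
! compiler emit conditional-select instructions instead of branches.
program branchless_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: x(n), y(n)
  integer :: i

  call random_number(x)

  ! Branchy version:
  do i = 1, n
    if (x(i) > 0.5d0) then
      y(i) = 2.0d0 * x(i)
    else
      y(i) = x(i) * x(i)
    end if
  end do

  ! Branch-free equivalent (both alternatives are computed, one is selected):
  do i = 1, n
    y(i) = merge(2.0d0 * x(i), x(i) * x(i), x(i) > 0.5d0)
  end do

  print *, sum(y)
end program branchless_sketch
```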

Beyond testing additional cases and continuing to work with the zgoubi development team, an additional study using ThreadSpotter may yield further insight and indicate potential fixes. ThreadSpotter uses runtime sampling (often accompanied by extreme runtime dilation, which does not matter since the code is not being timed) paired with parsing of the x86 opcodes issued by the program to analyze memory and threading issues. It produces a report annotating the original program source code, ranking problems in order of importance, characterizing them, and suggesting fixes explicitly (e.g., fuse the loop on line 27 with the loop on line 43, or try implementing loop tiling/blocking on the outermost loop at line 76). This can typically be done relatively quickly and easily, and the report offers the programmer good, actionable suggestions to improve performance.

In addition, iterative performance experiments guided by profiles produced with TAU Commander can help focus optimization efforts. The suggestions offered for improving the line calling matmul should be tested. Additional efforts to consolidate loops, eliminate branches and inline the innermost kernels should be considered, attempted and evaluated. Finally, GFortran provides the ability to perform runtime checks, including checking for the creation of array temporaries. (The compiler flag to enable this is -fcheck-array-temporaries.) Enabling this will help catch other areas where memory copies may be occurring unintentionally.
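
For illustration, a small self-contained example (not zgoubi code) of the kind of copy this check reports: passing a non-contiguous array section to an explicit-shape dummy argument generally forces the compiler to create a copy-in/copy-out temporary:

```fortran
! Hypothetical sketch: a strided section passed to an explicit-shape dummy
! argument generally forces a temporary; compiling with the flag named above
! reports the creation at run time.
program temp_array_sketch
  implicit none
  real(8) :: a(10, 10)
  call random_number(a)
  call work(a(1, :))        ! non-contiguous row section: temporary likely
  call work(a(:, 1))        ! contiguous column section: no temporary needed
contains
  subroutine work(v)
    real(8), intent(inout) :: v(10)   ! explicit-shape dummy
    v = 2.0d0 * v
  end subroutine work
end program temp_array_sketch
```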

Fundamentally, the algorithms, both the original zgoubi particle update procedure and the new particle update, seem to be viable; however, their implementations should be examined further. Further work collecting hardware performance counters may yield additional insight into the impact of conditional branch instructions and into how efficiently the caches are being utilized. Transforming the code so that the innermost loop runs over the first index of an array of particles (or multiple arrays for different properties of the particles) might lead to improved utilization of SIMD instructions on modern processors, better prefetching accuracy and better cache utilization. This is analogous to transforming the code to use a Structure of Arrays rather than an Array of Structures (or scalar loops over structures) and will fundamentally change how data is packed in memory.
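
A minimal sketch of the two layouts (the names and the drift-like update are illustrative only, not zgoubi code) shows what the Structure-of-Arrays transformation looks like in Fortran:

```fortran
! Hypothetical sketch of Array-of-Structures vs Structure-of-Arrays layouts.
program soa_sketch
  implicit none
  integer, parameter :: np = 100000

  type particle            ! AoS: one derived type per particle
    real(8) :: x, px, y, py
  end type particle
  type(particle) :: aos(np)

  type particles_soa       ! SoA: one contiguous array per property
    real(8) :: x(np), px(np), y(np), py(np)
  end type particles_soa
  type(particles_soa) :: soa

  real(8), parameter :: ds = 0.01d0
  integer :: i

  do i = 1, np
    aos(i) = particle(0.0d0, 1.0d-3, 0.0d0, -1.0d-3)
  end do
  call random_number(soa%x); call random_number(soa%px)
  call random_number(soa%y); call random_number(soa%py)

  ! AoS update: the properties of each particle are interleaved in memory,
  ! so the loop strides over whole structures.
  do i = 1, np
    aos(i)%x = aos(i)%x + ds * aos(i)%px
    aos(i)%y = aos(i)%y + ds * aos(i)%py
  end do

  ! SoA update: each property is contiguous, which favors SIMD and prefetching.
  do i = 1, np
    soa%x(i) = soa%x(i) + ds * soa%px(i)
    soa%y(i) = soa%y(i) + ds * soa%py(i)
  end do

  print *, aos(1)%x, soa%x(1)
end program soa_sketch
```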

I have attached the TAU Commander project directory here, which contains the TAU Commander project database, as well as the TAU profile files. The profile files are stored in .tau/zgoubi/<experiment-name>/<trial-number>. You can use this python script to parse profiles and import them as pandas data frames. An example jupyter notebook showing this in action is available here.