scalability of simulations

We have run performance tests on a sample of our executables and found two salient features:

When scaling from 1 core to a full node, our parallel efficiency (defined as the ratio ((1-core wall time) / N) / (N-core wall time)) diminishes substantially, often to around 40-60% at a full node.
When scaling to several nodes, we see parallel efficiency that preserves, or occasionally slightly improves upon, the efficiency at one full node, provided:
- We make good use of the best available communication utilities on the HPC system.
- Our element distribution is well clustered, either with a Z-curve (Morton curve) or with a graph-based, communication-aware Charm++ load balancer (e.g. RecBipartLB or ScotchLB).
- The ratio of the number of elements to the number of cores remains sufficiently high. What counts as 'sufficiently high' depends on the amount of work that the simulation does per element per time step, and can be as little as ~10 for a GH simulation with a large number of grid points, or as high as 100 for cheaper simulations with few grid points.
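To make the efficiency definition above concrete, here is a minimal sketch of how it could be computed from measured wall times. The function name and the timings are hypothetical and are not taken from any of the runs discussed here.

```python
def parallel_efficiency(t_one_core: float, t_n_cores: float, n_cores: int) -> float:
    """Return ((1-core wall time) / N) / (N-core wall time)."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if t_one_core <= 0 or t_n_cores <= 0:
        raise ValueError("wall times must be positive")
    # Ideal N-core time is the 1-core time divided by N; efficiency is the
    # ratio of that ideal time to the time actually measured on N cores.
    return (t_one_core / n_cores) / t_n_cores


# Hypothetical timings: 1000 s on 1 core and 62.5 s on a 32-core node.
example_efficiency = parallel_efficiency(1000.0, 62.5, 32)
print(f"parallel efficiency: {example_efficiency:.0%}")
```

With these made-up numbers the ideal 32-core time would be 31.25 s, so a measured 62.5 s corresponds to 50% efficiency, which sits inside the 40-60% range quoted above.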

An example of the necessity of well-chosen element distribution is demonstrated by the plot I included in the Morton curve pull request:

The combined effect of ScotchLB and the number of elements for a Generalized Harmonic evolution is displayed here: gh_scaling (Note that in this plot the larger run, shown by the blue curve, has an added anchoring point at 1 core. That point does not represent an actual run, as the Caltech HPC system did not have enough RAM to support the run on a single node. Instead, I have applied the 1-core-to-4-node scaling of the smaller simulation so that the two curves can be plotted on a comparable scale, which assumes that the two runs have equal efficiency at 4 nodes.)
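The anchoring point described above can be reproduced with a short calculation. The sketch below assumes, as stated, that the larger run has the same parallel efficiency as the smaller run at the shared core count. The function name and all timings are hypothetical placeholders, not the measured values behind the plot.

```python
def anchored_one_core_time(t1_small: float, tn_small: float, tn_large: float) -> float:
    """Estimate the 1-core wall time of the larger run.

    Equal efficiency at N cores means
        (t1_large / N) / tn_large == (t1_small / N) / tn_small,
    so N cancels and t1_large = tn_large * (t1_small / tn_small).
    """
    return tn_large * (t1_small / tn_small)


# Hypothetical wall times (seconds); N is the core count of 4 nodes.
t1_small = 6400.0   # smaller run, 1 core
tn_small = 100.0    # smaller run, N cores
tn_large = 400.0    # larger run, N cores (its 1-core run was not possible)

t1_large_estimate = anchored_one_core_time(t1_small, tn_small, tn_large)
print(f"estimated 1-core time of larger run: {t1_large_estimate} s")
```

The estimate is only as good as the equal-efficiency assumption, so efficiencies of the larger run derived from it are relative to that anchor rather than absolute measurements.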

The steady loss of parallel efficiency is apparent in all attempted cases, and depends only weakly on the element distribution, load, etc. Profiling with VTune has indicated that, at least for Generalized Harmonic, much of the time is spent in the heavy computation routines that calculate right-hand sides and fluxes. Our current best guess for the cause of the lost parallel efficiency is saturated memory buses, e.g. from cache misses.

There could be other resource competition at work, but the bus-saturation hypothesis is partially supported by tests using a simple 'load balancing test array' that can be used to model loads and distributions across chares in SpECTRE. In that case we would expect to probe the maximum parallel efficiency of the system, as the test array does virtually no allocation and, in the limit of a high load parameter, should spend almost all of its time in computation. Those tests found no better than 80% efficiency at 16 cores with extremely high load, which indicates some underlying resource competition at a fairly low level that has little to do with communication or allocation during the computation.

Therefore, there are two main parts to SpECTRE scalability, each with some targets:

Multi-node:

Ensure good element distribution by consistently using communication-based balancers or the Morton curve. The Morton curve is currently the default, but it assumes homogeneous load.
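As an illustration of why a Morton curve clusters elements well, here is a minimal sketch that orders elements of a regular 3D grid by interleaving the bits of their indices and then hands contiguous, equal-sized chunks to each core. This is not SpECTRE's implementation; the function names, the grid, and the core count are hypothetical, and the equal-sized chunks reflect the homogeneous-load assumption mentioned above.

```python
from itertools import product


def morton_index(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-curve) key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def distribute(elements, n_cores: int) -> dict:
    """Assign elements to cores in contiguous chunks along the Morton curve.

    Chunk sizes differ by at most one element, i.e. every element is
    assumed to cost the same amount of work.
    """
    ordered = sorted(elements, key=lambda e: morton_index(*e))
    base, extra = divmod(len(ordered), n_cores)
    assignment = {}
    start = 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        for element in ordered[start:start + size]:
            assignment[element] = core
        start += size
    return assignment


# Hypothetical example: a 4x4x4 grid of elements spread over 8 cores.
grid = list(product(range(4), repeat=3))
assignment = distribute(grid, 8)
core0 = sorted(e for e, c in assignment.items() if c == 0)
print(core0)
```

In this example each core receives a compact 2x2x2 block of neighbouring elements, so most neighbour communication stays on the same core. A load-aware or communication-aware balancer would be needed once elements differ in cost.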

Single-node:

Reduce memory allocations. This will make more of a difference in some executables than others.
- Non-owning variables
- Use of Charm++ messages rather than parameter marshalling, to reduce the cost of allocating and copying messages between chares. This may also help multi-node performance, depending on whether the resources involved in message passing are contended during high node-count runs.
