Open moxcodes opened 3 years ago
@moxcodes FYI, here's a commit where I have implemented basic support for fixed-size messages, essentially as a drop-in replacement for the internals of simple_action: https://github.com/nilsleiffischer/spectre/commit/7806655ff56b798039d920c29f1d1c19406dea62. Perhaps this is useful for you or others looking into Charm++ messages.
We have performed a number of performance tests of a sampling of our executables, and we have found a couple of salient features:
The need for a well-chosen element distribution is demonstrated by the plot I included in the Morton curve pull request:
The combined effect of ScotchLB and the number of elements for a Generalized Harmonic evolution is displayed here: (Note that in this plot, the blue curve for the larger run has an added anchoring point at 1 core. That point does not represent an actual run, as the Caltech HPC system did not have enough RAM to support the run on a single node. Instead, I have applied the 1-core-to-4-node scaling of the smaller simulation to the larger run so that the two curves can be plotted on a comparable scale, which assumes that the two runs are equally efficient at 4 nodes.)
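The anchoring assumption described above can be written out as a short calculation. This is only a sketch: the function names and all timing numbers below are hypothetical placeholders, not measurements from the runs in the plot. If both runs are assumed to have the same parallel efficiency at 4 nodes, the core count cancels and the inferred single-core time of the larger run is its 4-node time multiplied by the smaller run's 1-core-to-4-node time ratio.

```python
def parallel_efficiency(t_single_core, t_parallel, cores):
    """Parallel efficiency E = T_1 / (p * T_p)."""
    return t_single_core / (cores * t_parallel)


def anchored_single_core_time(small_t1, small_t4n, large_t4n):
    """Infer a single-core time for the larger run.

    Assumes the larger run has the same parallel efficiency at 4 nodes
    as the smaller run, so the small run's 1-core -> 4-node time ratio
    is applied to the larger run's 4-node time.
    """
    return large_t4n * (small_t1 / small_t4n)


# Hypothetical wall times in seconds (illustration only, not real data).
small_t1 = 6400.0      # smaller run on 1 core
small_t4n = 100.0      # smaller run on 4 nodes
large_t4n = 400.0      # larger run on 4 nodes
cores_at_4_nodes = 128  # hypothetical core count for 4 nodes

large_t1 = anchored_single_core_time(small_t1, small_t4n, large_t4n)

eff_small = parallel_efficiency(small_t1, small_t4n, cores_at_4_nodes)
eff_large = parallel_efficiency(large_t1, large_t4n, cores_at_4_nodes)
print(large_t1, eff_small, eff_large)
```

By construction the two efficiencies agree at 4 nodes, so any difference between the curves at higher node counts is measured relative to that shared starting point.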
The steady loss of parallel efficiency is apparent in all attempted cases and depends only weakly on the element distribution, load, etc. Profiling with VTune has indicated that, at least for generalized harmonic, much of the time is spent in the heavy computation routines that calculate right-hand sides and fluxes. Our current best guess for the cause of the efficiency loss is saturated memory buses, e.g. from cache misses.
There could be other resource competition at work, but the bus-saturation hypothesis is partially supported by tests using a simple 'load balancing test array' that can be used to model loads and distributions across chares in SpECTRE. We would expect that test to probe the maximum parallel efficiency of the system, as it does virtually no allocation and, in the limit of high load parameter, should spend almost all of its time in computation. Those tests found no better than 80% efficiency at 16 cores with extremely high load, which indicates some underlying resource competition at a fairly low level that has little to do with communication or allocation during the computation.
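To illustrate how a shared-resource limit alone can cap efficiency without any communication cost, here is a toy strong-scaling model. It is not derived from the measurements above: the function, the memory fraction, and the saturation point are all hypothetical parameters chosen for illustration. The model normalizes the single-core run time to 1, lets the compute part scale perfectly with core count, and stops speeding up the memory-bound part once a fixed number of cores saturate the bus.

```python
def toy_efficiency(cores, memory_fraction, saturation_cores):
    """Parallel efficiency in a toy bus-saturation model.

    The single-core run time is normalized to 1. A fraction
    `memory_fraction` of it is memory-bound and only speeds up until
    `saturation_cores` cores are in use; the rest scales perfectly.
    All parameters are hypothetical, not fitted to any real run.
    """
    compute_time = (1.0 - memory_fraction) / cores
    memory_time = memory_fraction / min(cores, saturation_cores)
    return 1.0 / (cores * (compute_time + memory_time))


# Print efficiency versus core count for one hypothetical parameter choice.
for cores in (1, 2, 4, 8, 16):
    eff = toy_efficiency(cores, memory_fraction=0.1, saturation_cores=4)
    print(cores, round(eff, 3))
```

In this model efficiency stays at 1 until the saturation point and then declines steadily, even though no time is spent on communication or allocation.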
Therefore, there are two main parts of SpECTRE scalability, each with some targets:
Multi-node:
Single-node: