researcherben / structural-simulation-toolkit

Sandia SST
GNU General Public License v3.0

brainstorm methods for graphs that span MPI nodes #26

Open bhpayne opened 3 years ago

bhpayne commented 3 years ago

Currently (as of SST v11), SST builds the graph on a single MPI node and then distributes parts of the graph to each MPI node. That limits the graph size to whatever can be constructed in the memory of a single MPI node.

What are the different ways to get a distributed graph in SST?

bhpayne commented 3 years ago

TCL said

  1. If there were some notional "template" inputs and a top-level configuration manifest, each node could read in and initialize the appropriate nodes and links. This would reduce the disk storage requirements (and subsequently the I/O requirements) to load the graph from disk. However, this would likely increase the latency to initialize the graph in memory as we would require more processing to initialize each template to create the appropriate objects with the correct parameters. One additional benefit to this would be that the templates could be utilized to drive a configurable number of MPI ranks at runtime (as opposed to static configurations).

  2. Alternatively, we could use the approach with larger input files that are fully annotated and a top-level manifest file that describes the links spanning nodes. All MPI ranks would be required to load a minimum of two files: the top-level manifest and its respective node configuration. One downside to this approach is that the configuration manifest(s) would need to be tailored to the correct number of MPI ranks (and possibly threads per rank), in which case the external tools (and users) would need to determine the correct scale of the simulation a priori. (A sketch of such a manifest follows below.)
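To make the manifest idea in item 2 concrete, here is a minimal sketch assuming 4 MPI ranks. Every key and filename below is invented for illustration; none of this is an SST-defined format.

    # Hypothetical top-level manifest (invented layout, not SST API): each rank
    # loads this manifest plus its own per-rank graph file, and the manifest
    # carries only the links that span ranks.
    manifest = {
        "ranks": 4,
        "node_files": ["graph.rank0.json", "graph.rank1.json",
                       "graph.rank2.json", "graph.rank3.json"],
        "spanning_links": [
            {"link_id": "l_8842",
             "src": {"rank": 0, "component": 17, "port": "out"},
             "dst": {"rank": 2, "component": 901, "port": "in"}},
        ],
    }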

bhpayne commented 3 years ago

@bhpayne replied

I'm going to first try to restate your suggestions.

Assuming N nodes in an MPI cluster,

  1. Provide SST the desired graph pattern and have SST construct each of the N per-node graphs and the list of node-spanning links.

     Pros:

     • Avoids large files on disk
     • Dynamic rescaling

     Cons:

     • Requires specifying a template input syntax

  2. Build the graph outside of SST, partition outside of SST, then provide SST each of the N per-node graphs and the list of node-spanning links (see the sketch after this list).

     Pros:

     • Decouples graph creation from SST

     Cons:

     • Incurs large files on disk per node
     • Altering the MPI cluster size requires regenerating the static per-node graphs and the list of node-spanning links
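As a concrete illustration of option 2, here is a minimal sketch of building and partitioning a graph entirely outside of SST, assuming just 2 MPI ranks and using networkx's bisection as a stand-in for a real partitioner; the model graph and output layout are assumptions, not anything SST defines.

    # Sketch of approach 2, outside SST entirely: build, partition, and split
    # the edge list into per-rank links and node-spanning links (2 ranks assumed).
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    g = nx.random_regular_graph(4, 1000, seed=0)  # stand-in for the real model
    part_a, part_b = kernighan_lin_bisection(g, seed=0)
    rank_of = {v: (0 if v in part_a else 1) for v in g.nodes()}

    per_rank_edges = {0: [], 1: []}
    spanning_links = []
    for u, v in g.edges():
        if rank_of[u] == rank_of[v]:
            per_rank_edges[rank_of[u]].append((u, v))
        else:
            spanning_links.append((u, v))

    # Each per-rank edge list plus `spanning_links` would then be written to
    # disk (one file per rank) and handed to SST.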

Are there alternative approaches to those two?

I don't have a good grasp of how threads play into mapping the graphs to nodes.

For scenario 2, and assuming a single simulation instance, I'm not worried about the time to generate the graph or the size of the files on disk. Those aren't concerns because 1) I haven't measured them yet and 2) the time and disk costs are probably small compared to simulation time.

I'll try to measure the expected size-on-disk of a naive input file to get a better sense.
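As a rough back-of-envelope of what that measurement might show (every number below is an assumption, not a measurement):

    # Back-of-envelope size-on-disk estimate for a naive, flat JSON graph file.
    # The graph scale and per-record byte counts are assumptions.
    n_components = 10**8       # assumed number of components
    n_links = 10**8            # assumed number of links
    bytes_per_component = 300  # name, type, params as JSON text (assumed)
    bytes_per_link = 150       # link ID, two (component, port) pairs, latency (assumed)
    total = n_components * bytes_per_component + n_links * bytes_per_link
    print(f"{total / 1e9:.0f} GB")  # -> 45 GB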

Where the "giant input file" approach could become problematic is trade-space analysis or a parameter sweep, since each point in the sweep would need its own large set of input files.

My other questions include

bhpayne commented 3 years ago

TCL said

We can come up with some additional tools/scripts externally that would allow us to use paradigm [2] and reconfigure the graph input files for a different number of MPI ranks + threads. e.g.,

    sst-reconfig --input-ranks=64 --output-ranks=many -o /path/to/new.json /path/to/input.json

Once we have the necessary algorithms and I/O tools in place, it should be fairly straightforward to re-shard the graphs appropriately.
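A sketch of what such a re-sharding tool might do internally; the shard layout (a "components" list, a "links" list, globally unique integer component IDs, and each link stored only on the rank owning its src component) is an assumption carried over from the sketches above, not the actual sst-reconfig design.

    import json

    def reshard(input_paths, output_ranks, out_prefix):
        # Merge the existing per-rank shards back into one in-memory graph.
        # Assumes each link appears in exactly one input shard, so no dedup needed.
        components, links = [], []
        for path in input_paths:
            with open(path) as f:
                shard = json.load(f)
            components.extend(shard["components"])
            links.extend(shard["links"])

        # Trivial re-partition: round-robin by component ID. A real tool would
        # use a graph partitioner to minimize the number of spanning links.
        rank_of = {c["id"]: c["id"] % output_ranks for c in components}

        shards = [{"components": [], "links": [], "spanning_links": []}
                  for _ in range(output_ranks)]
        for c in components:
            shards[rank_of[c["id"]]]["components"].append(c)
        for l in links:
            ra, rb = rank_of[l["src"]], rank_of[l["dst"]]
            if ra == rb:
                shards[ra]["links"].append(l)
            else:  # the link spans ranks, so both sides need it in their halo
                shards[ra]["spanning_links"].append(l)
                shards[rb]["spanning_links"].append(l)

        for rank, shard in enumerate(shards):
            with open(f"{out_prefix}.rank{rank}.json", "w") as f:
                json.dump(shard, f)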

bhpayne commented 3 years ago

Suggestion for a third approach (as a modification of suggestion 1): graph input should leverage SST's hierarchical knowledge. That way the initial graph construction phase doesn't need to instantiate every component instance, since SST knows how composite models are constructed. With this knowledge, the memory issue could be bypassed by a two-stage graph instantiation: the first stage distributes partial graph construction to each rank, so the full graph is never constructed on any one node.
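A toy sketch of the two-stage idea, with every name invented for illustration: stage one gives every rank a compact hierarchical description, and stage two expands only the composite instances assigned to that rank, so no rank ever materializes the full graph.

    # Hypothetical two-stage instantiation (nothing here is SST API).
    # Stage 1: a compact description small enough to replicate on every rank.
    templates = {
        "router_block": 64,     # each instance expands to 64 concrete components
        "compute_block": 512,
    }
    instances = [("router_block", i) for i in range(1024)] \
              + [("compute_block", i) for i in range(4096)]

    # Stage 2: each rank expands only its share of the composite instances.
    def expand_on_rank(rank, nranks):
        mine = [inst for idx, inst in enumerate(instances) if idx % nranks == rank]
        # Stand-in for actually instantiating components and links on this rank:
        return sum(templates[name] for name, _ in mine)

    print(expand_on_rank(0, 16))  # components materialized on rank 0 of 16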

Support for approach 2: probably much faster to implement, and it suffices for what's needed.

In the short term, suggestion 2 would be beneficial for getting something working quickly, especially since it shouldn't require significant modification to SST, whereas using the hierarchical graph features would.

There is interest in pursuing the hierarchical idea (suggestion 3) further, but it would be preferable to start that work after we have something working.

bhpayne commented 3 years ago

Recap:

The first decision is the scope of what is done inside SST versus what is done outside SST. I think the four options are

Independent of which of those four options is selected, there is another choice: How SST consumes the user's definition of the graph. I think the three choices are

The first choice presented is associated with files. How many input files does the user provide SST?

If the input file is static, there are a few options for specifying the graph

There are trade-offs: a static (JSON) brute-force graph specification, with partitioning and graph distribution handled outside SST, might minimize the changes needed in SST by offloading the burden to the user, at the cost of lots of disk space and the need for fast I/O.

A long-term and robust solution with a well-documented API for large graphs might be distinct from or evolve from a short-term hack that provides a proof-of-concept capability.

bhpayne commented 3 years ago

As for the SST graph format, I would recommend

I like the idea of hierarchical graphs or a similar construct. Our application structure is such that I am concerned it will not parallelize well. Imagine a graph structure similar to a tree; in our case, the bottom tree nodes are expanded into tendrils that dangle down from the bottom of the tree. Our parallelism comes from sending messages from the top of the tree to the bottom of the tendrils, with the number of messages growing significantly as we move down the tree.

However, given the way the conservative DES scheduler works (as I understand it), there is a global barrier across all the tendrils as messages propagate down. This means we incur global barriers for very small time steps. In reality, though, since the tendrils are independent and do not talk to each other, we could propagate the messages down each tendril independently. One way to express that is a hierarchical representation. Alternatively, maybe the conservative time-stepping scheme could be changed to identify these independent structures (one-input, one-output chains)?
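A toy illustration of the concern, with all numbers assumed: under a global conservative window, every tendril can only advance by the global lookahead per barrier, even though no messages ever cross between tendrils.

    # Toy model of conservative time-stepping (not SST code; all values assumed).
    n_tendrils = 8
    lookahead = 1                      # global minimum link latency (assumed)
    local_time = [0] * n_tendrils

    for barrier in range(1000):        # each iteration is one global barrier
        for t in range(n_tendrils):    # every tendril advances in lockstep,
            local_time[t] += lookahead # even though the tendrils never interact

    # With per-tendril scheduling, each independent tendril could instead run
    # ahead to completion without waiting for the others at every barrier.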

bhpayne commented 3 years ago

Meeting notes from video call

Currently the graph read is implemented as "if rank == 0, read the graph," so changing to a parallel read-in should be straightforward.

The issue isn't the time-to-generate; the constraint is the memory used during graph construction, which grows with the number of graph nodes.

Q: what information is needed when constructing the graph? Answer: {component ID (unique across nodes), list of links and ports, subcomponents, parameters}. We would need to distinguish what is specific to the node from what is in the halo; the halo needs the link ID.
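Putting those fields together, a per-rank record might look like the following sketch; the key names are invented, but the fields are the ones listed above.

    # Hypothetical per-rank graph record (invented keys; fields from the list above).
    rank_shard = {
        "components": [
            {"id": 42,                       # unique across all nodes
             "type": "lib.element",
             "params": {"freq": "1GHz"},
             "subcomponents": [],
             "links": [{"link_id": "l_7", "port": "out"}]},
        ],
        "halo": [
            # Only the link ID (plus where the other end lives) is needed here.
            {"link_id": "l_7", "remote_rank": 3},
        ],
    }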

SST could use a C++ generator interface.

Another method of reducing the scale of the graph is hierarchical specification, e.g., supernodes.

SST generators existed two versions ago, but they were removed due to lack of use. Serial generators would be easy to recover; parallel generators would take work to support. The C++ API needs to be defined; we want to specify it such that a generator and Python could use the same API.
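A sketch of what a rank-aware generator callback could look like if generators and Python shared one API; every name and signature here is invented for illustration, not recovered SST API.

    # Hypothetical rank-aware generator interface (all names invented).
    class GraphBuilder:
        """Stand-in for whatever handle SST would pass to a generator."""
        def add_component(self, comp_id, comp_type, params): ...
        def add_link(self, link_id, src_id, src_port, dst_id, dst_port, latency): ...

    def generate(builder, rank, nranks):
        """Called once per rank; emits only this rank's share of the graph,
        so no single node ever holds the full graph in memory."""
        n_total = 1_000_000                      # assumed total component count
        for comp_id in range(rank, n_total, nranks):
            builder.add_component(comp_id, "lib.element", {"freq": "1GHz"})
            if comp_id + 1 < n_total:            # simple chain topology example
                builder.add_link(f"l_{comp_id}", comp_id, "out",
                                 comp_id + 1, "in", "1ns")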

Q: when is the freeze for SST v11.1 planned? Answer: second week of Oct 2021