researcherben / structural-simulation-toolkit

Sandia SST
GNU General Public License v3.0

brainstorm methods for graphs that span MPI nodes #26

Open bhpayne opened 3 years ago

bhpayne commented 3 years ago

Currently (as of SST v11), SST builds the graph on a single MPI node and then distributes parts of the graph to each MPI node. That limits the graph size to whatever can be constructed in the memory of a single MPI node.

What are the different ways to get a distributed graph in SST?

bhpayne commented 3 years ago

TCL said

  1. If there were some notional "template" inputs and a top-level configuration manifest, each node could read in and initialize the appropriate nodes and links. This would reduce the disk storage requirements (and subsequently the I/O requirements) to load the graph from disk. However, this would likely increase the latency to initialize the graph in memory as we would require more processing to initialize each template to create the appropriate objects with the correct parameters. One additional benefit to this would be that the templates could be utilized to drive a configurable number of MPI ranks at runtime (as opposed to static configurations).

  2. Alternatively, we could use the approach with larger input files that are fully annotated and a top-level manifest file that describes the links spanning nodes. All MPI ranks would be required to load a minimum of two files: the top-level manifest and its respective node configuration. One downside to this approach is that the configuration manifest(s) would need to be tailored to the correct number of MPI ranks (and possibly threads per rank), in which case the external tools (and users) would need to determine the correct scale of the simulation a priori. (A sketch of such a manifest follows below.)
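To make the manifest idea in item 2 concrete, here is a minimal sketch assuming 4 MPI ranks. Every key and filename below is invented for illustration; none of this is an SST-defined format.

    # Hypothetical top-level manifest (invented layout, not SST API): each rank
    # loads this manifest plus its own per-rank graph file, and the manifest
    # carries only the links that span ranks.
    manifest = {
        "ranks": 4,
        "node_files": ["graph.rank0.json", "graph.rank1.json",
                       "graph.rank2.json", "graph.rank3.json"],
        "spanning_links": [
            {"link_id": "l_8842",
             "src": {"rank": 0, "component": 17, "port": "out"},
             "dst": {"rank": 2, "component": 901, "port": "in"}},
        ],
    }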

bhpayne commented 3 years ago

@bhpayne replied

I'm going to first try to restate your suggestions.

Assuming N nodes in an MPI cluster,

  1. Provide SST the desired graph pattern and have SST construct each of the N per-node graphs and the list of node-spanning links.

     Pros:

     • Avoids large files on disk
     • Dynamic rescaling

     Cons:

     • Requires specifying a template input syntax

  2. Build the graph outside of SST, partition outside of SST, then provide SST each of the N per-node graphs and the list of node-spanning links (see the sketch after this list).

     Pros:

     • Decouples graph creation from SST

     Cons:

     • Incurs large files on disk per node
     • Altering the MPI cluster size requires regenerating the static per-node graphs and the list of node-spanning links
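As a concrete illustration of option 2, here is a minimal sketch of building and partitioning a graph entirely outside of SST, assuming just 2 MPI ranks and using networkx's bisection as a stand-in for a real partitioner; the model graph and output layout are assumptions, not anything SST defines.

    # Sketch of approach 2, outside SST entirely: build, partition, and split
    # the edge list into per-rank links and node-spanning links (2 ranks assumed).
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    g = nx.random_regular_graph(4, 1000, seed=0)  # stand-in for the real model
    part_a, part_b = kernighan_lin_bisection(g, seed=0)
    rank_of = {v: (0 if v in part_a else 1) for v in g.nodes()}

    per_rank_edges = {0: [], 1: []}
    spanning_links = []
    for u, v in g.edges():
        if rank_of[u] == rank_of[v]:
            per_rank_edges[rank_of[u]].append((u, v))
        else:
            spanning_links.append((u, v))

    # Each per-rank edge list plus `spanning_links` would then be written to
    # disk (one file per rank) and handed to SST.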

Are there alternative approaches to those two?

I don't have a good grasp of how threads play into mapping the graphs to nodes.

For scenario 2, and assuming a single simulation instance, I'm not worried about the time to generate the graph or the size of the files on disk. Those aren't concerns because 1) I haven't measured them yet and 2) the time and disk costs are probably small compared to simulation time.

I'll try to measure the expected size-on-disk of a naive input file to get a better sense.
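As a rough back-of-envelope of what that measurement might show (every number below is an assumption, not a measurement):

    # Back-of-envelope size-on-disk estimate for a naive, flat JSON graph file.
    # The graph scale and per-record byte counts are assumptions.
    n_components = 10**8       # assumed number of components
    n_links = 10**8            # assumed number of links
    bytes_per_component = 300  # name, type, params as JSON text (assumed)
    bytes_per_link = 150       # link ID, two (component, port) pairs, latency (assumed)
    total = n_components * bytes_per_component + n_links * bytes_per_link
    print(f"{total / 1e9:.0f} GB")  # -> 45 GB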

Where the "giant input file" approach could become problematic is trade-space analysis or a parameter sweep, since each point in the sweep would need its own large set of input files.

My other questions include

bhpayne commented 3 years ago

TCL said

We can come up with some additional tools/scripts externally that would allow us to use paradigm [2] and reconfigure the graph input files for a different number of MPI ranks + threads. e.g.,

    sst-reconfig --input-ranks=64 --output-ranks=many -o /path/to/new.json /path/to/input.json

Once we have the necessary algorithms and I/O tools in place, it should be fairly straightforward to re-shard the graphs appropriately.
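A sketch of what such a re-sharding tool might do internally; the shard layout (a "components" list, a "links" list, globally unique integer component IDs, and each link stored only on the rank owning its src component) is an assumption carried over from the sketches above, not the actual sst-reconfig design.

    import json

    def reshard(input_paths, output_ranks, out_prefix):
        # Merge the existing per-rank shards back into one in-memory graph.
        # Assumes each link appears in exactly one input shard, so no dedup needed.
        components, links = [], []
        for path in input_paths:
            with open(path) as f:
                shard = json.load(f)
            components.extend(shard["components"])
            links.extend(shard["links"])

        # Trivial re-partition: round-robin by component ID. A real tool would
        # use a graph partitioner to minimize the number of spanning links.
        rank_of = {c["id"]: c["id"] % output_ranks for c in components}

        shards = [{"components": [], "links": [], "spanning_links": []}
                  for _ in range(output_ranks)]
        for c in components:
            shards[rank_of[c["id"]]]["components"].append(c)
        for l in links:
            ra, rb = rank_of[l["src"]], rank_of[l["dst"]]
            if ra == rb:
                shards[ra]["links"].append(l)
            else:  # the link spans ranks, so both sides need it in their halo
                shards[ra]["spanning_links"].append(l)
                shards[rb]["spanning_links"].append(l)

        for rank, shard in enumerate(shards):
            with open(f"{out_prefix}.rank{rank}.json", "w") as f:
                json.dump(shard, f)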

bhpayne commented 3 years ago

Suggestion for a third approach (as a modification of suggestion 1): graph input should leverage SST's hierarchical knowledge. That way the initial graph construction phase doesn't need to instantiate every component instance, since SST knows how composite models are constructed. With this knowledge, the memory issue could be bypassed by a two-stage graph instantiation: the first stage distributes partial graph construction to each rank, so the full graph is never constructed on any one node.
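A toy sketch of the two-stage idea, with every name invented for illustration: stage one gives every rank a compact hierarchical description, and stage two expands only the composite instances assigned to that rank, so no rank ever materializes the full graph.

    # Hypothetical two-stage instantiation (nothing here is SST API).
    # Stage 1: a compact description small enough to replicate on every rank.
    templates = {
        "router_block": 64,     # each instance expands to 64 concrete components
        "compute_block": 512,
    }
    instances = [("router_block", i) for i in range(1024)] \
              + [("compute_block", i) for i in range(4096)]

    # Stage 2: each rank expands only its share of the composite instances.
    def expand_on_rank(rank, nranks):
        mine = [inst for idx, inst in enumerate(instances) if idx % nranks == rank]
        # Stand-in for actually instantiating components and links on this rank:
        return sum(templates[name] for name, _ in mine)

    print(expand_on_rank(0, 16))  # components materialized on rank 0 of 16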

Support for approach 2: probably much faster to implement, and it suffices for what's needed.

In the short term, suggestion 2 would be beneficial for getting something working quickly, especially since it shouldn't require significant modification to SST, whereas using the hierarchical graph features would.

There is interest in pursuing the hierarchical idea (suggestion 3) further, but it would be preferable to start that work after we have something working.

bhpayne commented 3 years ago

Recap:

The first decision is the scope of what is done inside SST versus what is done outside SST. I think the four options are

Independent of which of those four options is selected, there is another choice: How SST consumes the user's definition of the graph. I think the three choices are

The first choice presented is associated with files. How many input files does the user provide SST?

If the input file is static, there are a few options for specifying the graph

There are trade-offs: a static (JSON) brute-force graph specification, with partitioning and graph distribution handled outside SST, might minimize the changes needed in SST by offloading the burden to the user, at the cost of lots of disk space and the need for fast I/O.

A long-term and robust solution with a well-documented API for large graphs might be distinct from or evolve from a short-term hack that provides a proof-of-concept capability.

bhpayne commented 3 years ago

As for the SST graph format, I would recommend

I like the idea of hierarchical graphs or a similar construct. Our application structure is such that I am concerned it will not parallelize well. Imagine a graph structure similar to a tree; in our case, the bottom tree nodes are expanded into tendrils that dangle down from the bottom of the tree. Our parallelism comes from sending messages from the top of the tree to the bottom of the tendrils, with the number of messages growing significantly as we move down the tree.

However, given the way the conservative DES scheduler works (as I understand it), there is a global barrier across all the tendrils as messages propagate down. This means we incur global barriers for very small time steps. In reality, though, since the tendrils are independent and do not talk to each other, we could propagate the messages down each tendril independently. One way to express that is a hierarchical representation. Alternatively, maybe the conservative time-stepping scheme could be changed to identify these independent structures (one-input, one-output chains)?
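A toy illustration of the concern, with all numbers assumed: under a global conservative window, every tendril can only advance by the global lookahead per barrier, even though no messages ever cross between tendrils.

    # Toy model of conservative time-stepping (not SST code; all values assumed).
    n_tendrils = 8
    lookahead = 1                      # global minimum link latency (assumed)
    local_time = [0] * n_tendrils

    for barrier in range(1000):        # each iteration is one global barrier
        for t in range(n_tendrils):    # every tendril advances in lockstep,
            local_time[t] += lookahead # even though the tendrils never interact

    # With per-tendril scheduling, each independent tendril could instead run
    # ahead to completion without waiting for the others at every barrier.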

bhpayne commented 3 years ago

Meeting notes from video call

Currently the graph read is implemented as "if rank == 0, read the graph," so changing to a parallel read-in should be straightforward.

The issue isn't the time-to-generate; the constraint is the memory used during graph construction, which grows with the number of graph nodes.

Q: what information is needed when constructing the graph? Answer: {component ID (unique across nodes), list of links and ports, subcomponents, parameters}. We would need to distinguish what is specific to the node from what is in the halo; the halo needs the link ID.
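Putting those fields together, a per-rank record might look like the following sketch; the key names are invented, but the fields are the ones listed above.

    # Hypothetical per-rank graph record (invented keys; fields from the list above).
    rank_shard = {
        "components": [
            {"id": 42,                       # unique across all nodes
             "type": "lib.element",
             "params": {"freq": "1GHz"},
             "subcomponents": [],
             "links": [{"link_id": "l_7", "port": "out"}]},
        ],
        "halo": [
            # Only the link ID (plus where the other end lives) is needed here.
            {"link_id": "l_7", "remote_rank": 3},
        ],
    }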

SST could use a C++ generator interface.

Another method of reducing the scale of the graph is hierarchical specification, e.g., supernodes.

SST generators existed two versions ago, but they were removed due to lack of use. Serial generators would be easy to recover; parallel generators would take work to support. The C++ API needs to be defined; we want to specify it such that a generator and Python could use the same API.
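A sketch of what a rank-aware generator callback could look like if generators and Python shared one API; every name and signature here is invented for illustration, not recovered SST API.

    # Hypothetical rank-aware generator interface (all names invented).
    class GraphBuilder:
        """Stand-in for whatever handle SST would pass to a generator."""
        def add_component(self, comp_id, comp_type, params): ...
        def add_link(self, link_id, src_id, src_port, dst_id, dst_port, latency): ...

    def generate(builder, rank, nranks):
        """Called once per rank; emits only this rank's share of the graph,
        so no single node ever holds the full graph in memory."""
        n_total = 1_000_000                      # assumed total component count
        for comp_id in range(rank, n_total, nranks):
            builder.add_component(comp_id, "lib.element", {"freq": "1GHz"})
            if comp_id + 1 < n_total:            # simple chain topology example
                builder.add_link(f"l_{comp_id}", comp_id, "out",
                                 comp_id + 1, "in", "1ns")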

Q: when is the freeze for SST v11.1 planned? Answer: second week of Oct 2021