sstsimulator / sst-macro

SST Macro Element Library
http://sst-simulator.org/

Are there any examples of MPI+Pthreads in SST-Macro? #614

Closed afranques closed 3 years ago

afranques commented 3 years ago

Dear SST Community,

Last year I came asking for help setting up a mesh topology with Merlin. Thanks to @jjwilke and @gvoskuilen I was able to implement and evaluate what I needed in SST, and our work got published at HPCA.

We are now exploring new ideas, and I come today because I am trying to simulate MPI+OpenMP (or MPI+Pthreads) on a relatively simple supercomputer (see the attached diagram, supercomputer_diagram): a distributed-memory system connecting 16 Compute Nodes with a 2D torus. Each Compute Node is single-socket and shared-memory within the node, and consists of 4 cores, each with its own private L1 cache, with all 4 cores connected to a shared L2 cache through a network-on-chip (such as a ring interconnect). For this configuration, I would like to use MPI to exchange data across nodes, and OpenMP or Pthreads within a node.

I originally thought about implementing this with Ariel+memHierarchy+Merlin, but I quickly realized this might not be the best setup, since it's probably not designed to run hybrid MPI+threads applications (actual benchmarks, not synthetic traffic). I then discovered SST-Macro, which seems more suitable, and I read the SST/macro 11.0 User's Manual and the SST/macro 11.0 Developer's Reference. While this documentation is very good, I could not find any examples of hybrid MPI+threads resembling what I would like to implement, so I have the following questions:

  1. Am I right in assuming that SST-Macro (as opposed to Ariel+memHierarchy+Merlin) is the right environment for implementing the supercomputer I described above?
  2. With SST-Macro, can I have multiple cores within each node (all sharing a last-level cache and memory through a ring or mesh interconnect), similar to what I showed in the zoomed-in view of the Compute Node above (red background)?
  3. If so, can I specify the size of the private and shared caches within each node, similarly to what I would do with SST's memHierarchy Python configuration file?
  4. If I want to simulate up to 64 nodes with 4, 8, or 16 cores each (up to 1024 cores in total), will I be better off using the standalone SST-Macro core, or the unified SST core?
  5. Do you think I could easily run some of the CORAL or CORAL-2 benchmarks on such a simulated system?
  6. Do you know which of the four levels of thread safety defined by MPI (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE) is implemented or recommended in SST-Macro?

Thank you very much in advance for your time!

Best, Antonio

jpkenny commented 3 years ago

Hi Antonio,

Sorry for the slow reply; I needed some time to poke around in the code to start addressing these questions. Unfortunately Jeremy Wilke, who would be the best equipped to answer many of them, is no longer at Sandia. But I'll do my best to move the discussion forward on as many as possible, and hopefully other team members can chime in where appropriate.

1. Am I right in assuming that SST-Macro (as opposed to Ariel+memHierarchy+Merlin) is the right environment for implementing the supercomputer I described above?

Macro would be pretty ideal for simulating something like the CORAL benchmarks on a 2D torus, but it doesn't currently support node-level simulation at the level of fidelity your diagram shows. For a reasonably accurate simulation of the on-chip network, Ariel+memHierarchy+Merlin is more suitable. Unfortunately we can't really integrate all of these pieces at this point (to my knowledge).

2. With SST-Macro, can I have multiple cores within each node (all sharing a last-level cache and memory through a ring or mesh interconnect), similar to what I showed in the zoomed-in view of the Compute Node above (red background)?

See answer to question #1.

3. If so, can I specify the size of the private and shared caches within each node, similarly to what I would do with SST's memHierarchy Python configuration file?

See answer to question #1.

4. If I want to simulate up to 64 nodes with 4, 8, or 16 cores each (up to 1024 cores in total), will I be better off using the standalone SST-Macro core, or the unified SST core?

It really depends on what question you want to answer. Macro has pretty flexible support for complex workloads, but at the cost of node-level fidelity.

5. Do you think I could easily run some of the CORAL or CORAL-2 benchmarks on such a simulated system?

For skeleton apps, we have a LULESH skeleton available for macro, and there are Ember motifs that could be useful. As far as more detailed simulations go, I think other team members are better positioned to outline the options.

6. Do you know which of the four levels of thread safety defined by MPI (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE) is implemented or recommended in SST-Macro?

The MPI implementation in macro will accept any requested level of threading support, and I think the way it's implemented on top of user-space threads should allow all of them to work, but I could be wrong.
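
If it helps as a test, the handshake below is the standard one a hybrid code does; nothing here is macro-specific, it's just plain MPI+Pthreads. Building something like this as a skeleton app and trying each level would be a quick way to verify which ones macro's MPI actually honors:

```c
/* Minimal hybrid MPI+Pthreads sketch (plain MPI, nothing macro-specific).
 * Request MPI_THREAD_MULTIPLE and verify what the library provides. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread exchanges one message with the paired rank using its own
 * tag; concurrent calls like this are legal under MPI_THREAD_MULTIPLE. */
static void* worker(void* arg) {
  int tag = (int)(long)arg;
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  int peer = rank ^ 1;            /* pair ranks 0-1, 2-3, ... */
  if (peer >= size) return NULL;  /* odd rank count: last rank sits out */
  int out = rank, in = -1;
  MPI_Sendrecv(&out, 1, MPI_INT, peer, tag,
               &in,  1, MPI_INT, peer, tag,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  return NULL;
}

int main(int argc, char** argv) {
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE) {
    /* Below MPI_THREAD_MULTIPLE you must restrict who calls MPI:
     * FUNNELED = main thread only, SERIALIZED = one thread at a time. */
    fprintf(stderr, "wanted MPI_THREAD_MULTIPLE, got level %d\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  pthread_t threads[NTHREADS];
  for (long i = 0; i < NTHREADS; ++i)
    pthread_create(&threads[i], NULL, worker, (void*)i);
  for (int i = 0; i < NTHREADS; ++i)
    pthread_join(threads[i], NULL);
  MPI_Finalize();
  return 0;
}
```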

Sorry I don't have more authoritative answers; hopefully this helps.

Joe

afranques commented 3 years ago

Hello Joe,

Thank you very much for your reply! So unfortunate to hear Jeremy Wilke has moved on! Best wishes to him in his next career move :-)

Your answers were great. I would actually love to use Ariel+memHierarchy, since I'm already familiar with both; however, the reason I discarded the idea at first is that I thought Ariel wasn't able to model MPI (or, even better, MPI+threads). Did I get this wrong? Also, I was skeptical that Ariel+memHierarchy would be able to scale to 1024 cores, but please correct me if you think it would!

Thanks, Antonio

gvoskuilen commented 3 years ago

@afranques I'm glad the simulation worked for you!

I don't think there's an out-of-the-box solution, but there are some pieces that might help, depending on what you want to study.

  1. You can run MPI+threads applications on Ariel. I don't think I ever fully merged the necessary code changes into the mainline, but I can dig them up. Most of the changes are in how the app is launched. However, this is limited to a single node: either the app runs completely on one node, or the app runs at scale with SST simulating one of the nodes while the rest run natively. That works for node-level studies but not for network studies. The missing piece is getting the Ariel pintool to intercept MPI activity and feed it through the simulated network.

  2. There has been some work on linking detailed node models built with the Miranda core model into a larger network simulation. I haven't tried this, and I don't know what it's capable of or what the caveats are. I'm fairly certain it would take some work, for example, to swap Miranda out for Ariel, but I don't know how much. If this is interesting, we can get more details on what's possible.

  3. I've run Ariel/memH out to ~300 cores. I have not tried a larger simulation, but 1024 is not a huge leap from there; I think it'd be worth a try. A rough single-node sketch of the kind of configuration I mean follows below.
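
To make points 1 and 3 concrete, here is a rough single-node sketch along the lines of your Compute Node diagram: one Ariel core model with four private L1s feeding a shared L2 over a memHierarchy Bus (standing in for the ring NoC). I have not run this exact file; component, port, and parameter names follow the memHierarchy/Ariel examples shipped with SST 11 (verify against your install), and the executable path is a placeholder:

```python
import sst

CORES = 4

# Ariel core model running the pinned binary (path is a placeholder).
ariel = sst.Component("ariel", "ariel.ariel")
ariel.addParams({
    "verbose": 0,
    "corecount": CORES,
    "executable": "./my_hybrid_app",
})

# A memHierarchy Bus standing in for the on-chip ring between the
# private L1s and the shared L2.
bus = sst.Component("membus", "memHierarchy.Bus")
bus.addParams({"bus_frequency": "2GHz"})

for i in range(CORES):
    l1 = sst.Component("l1_%d" % i, "memHierarchy.Cache")
    l1.addParams({
        "cache_frequency": "2GHz",
        "cache_size": "32KiB",
        "associativity": 8,
        "access_latency_cycles": 2,
        "L1": 1,
        "cache_line_size": 64,
        "coherence_protocol": "MESI",
        "replacement_policy": "lru",
    })
    sst.Link("core%d_l1" % i).connect(
        (ariel, "cache_link_%d" % i, "50ps"), (l1, "high_network_0", "50ps"))
    sst.Link("l1_%d_bus" % i).connect(
        (l1, "low_network_0", "50ps"), (bus, "high_network_%d" % i, "50ps"))

# Shared L2 and a simple memory backend.
l2 = sst.Component("l2", "memHierarchy.Cache")
l2.addParams({
    "cache_frequency": "2GHz",
    "cache_size": "1MiB",
    "associativity": 16,
    "access_latency_cycles": 10,
    "L1": 0,
    "cache_line_size": 64,
    "coherence_protocol": "MESI",
    "replacement_policy": "lru",
})
sst.Link("bus_l2").connect(
    (bus, "low_network_0", "50ps"), (l2, "high_network_0", "50ps"))

memctrl = sst.Component("memory", "memHierarchy.MemController")
memctrl.addParams({"clock": "1GHz"})
backend = memctrl.setSubComponent("backend", "memHierarchy.simpleMem")
backend.addParams({"access_time": "100ns", "mem_size": "512MiB"})
sst.Link("l2_mem").connect(
    (l2, "low_network_0", "50ps"), (memctrl, "direct_link", "50ps"))
```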

afranques commented 3 years ago

Got it. Thank you for your reply, @gvoskuilen!

  1. In my last project I managed to have the Ariel pintool intercept custom functions in toy benchmarks (mostly for instrumentation), so maybe I could extend this to intercept MPI calls as well and then feed them through the simulated network, as you suggest. I've sketched the Pin side of what I have in mind after this list.
  2. Understood. I think this could be an interesting backup plan if the Ariel/memH and SST-Macro options fail.
  3. Fantastic, I will give it a try then and see how it scales!
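
For the record, the Pin side of the interception in point 1 could start out roughly like this. This is plain Pin routine instrumentation, independent of Ariel's shipped pintool, and the analysis routine just prints: marshaling the call into the simulated network (the part the print stands in for) is exactly what would still need real work:

```cpp
// Sketch: intercept MPI_Send by symbol name with a Pin routine hook.
#include "pin.H"
#include <iostream>

// Runs before every MPI_Send in the traced app. Argument layout follows
// the MPI C binding: (buf, count, datatype, dest, tag, comm).
VOID BeforeMpiSend(ADDRINT buf, ADDRINT count, ADDRINT datatype,
                   ADDRINT dest, ADDRINT tag) {
    std::cerr << "MPI_Send(count=" << count << ", dest=" << dest
              << ", tag=" << tag << ")" << std::endl;
}

VOID InstrumentRoutine(RTN rtn, VOID* v) {
    // C bindings only; profiling (PMPI_Send) or Fortran symbols would
    // need their own matches.
    if (RTN_Name(rtn) == "MPI_Send") {
        RTN_Open(rtn);
        RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)BeforeMpiSend,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 0,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 1,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 2,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 3,
                       IARG_FUNCARG_ENTRYPOINT_VALUE, 4,
                       IARG_END);
        RTN_Close(rtn);
    }
}

int main(int argc, char* argv[]) {
    PIN_InitSymbols();                 // enables RTN lookups by name
    if (PIN_Init(argc, argv)) return 1;
    RTN_AddInstrumentFunction(InstrumentRoutine, 0);
    PIN_StartProgram();                // never returns
    return 0;
}
```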