sstsimulator / sst-macro

SST Macro Element Library
http://sst-simulator.org/

Multiprocessor simulations fail with sst-core using .ini files. #623

Closed jpkenny closed 3 years ago

jpkenny commented 3 years ago

First reported by IBM as:

FATAL: [0:0] SST Core: ERROR: Parsing SDL file: Link logPejection0->0 referenced more than two times

But there are multiple bugs in play here. Will comment on them individually.

jpkenny commented 3 years ago

Bug #1: sst-macro uses a "short-circuit" logP network for small messages. Construction of this network is failing here because multiple switches in the logP network each make a connection to the same port on each node, the one specified by sst.macro.NICLogPInjectionPort (a constant). I have a fix for this.

jpkenny commented 3 years ago

Bug #2: configure does not run the MPI check when sst-core is detected, but topology.cc currently uses MPI_Comm_size to detect the number of ranks. The number of ranks is then used to distribute logP switches across ranks. All of this breaks if SSTMAC_HAVE_VALID_MPI isn't defined in sstmac_config.h. The fix is to always run the MPI check.

We are also told that calling MPI is bad. But we're not REALLY calling it from a component, since it happens during setup. So I'm not sure it's really that bad.
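A minimal sketch of the failure mode in Bug #2, assuming the rank detection is guarded by SSTMAC_HAVE_VALID_MPI roughly as below. The function name detectNumRanks is hypothetical and this is not the actual code in topology.cc; it only illustrates why a missing define silently yields a single rank.

```cpp
#include <cassert>

#ifdef SSTMAC_HAVE_VALID_MPI
#include <mpi.h>
#endif

// Hypothetical helper (not from sst-macro): returns the number of ranks
// that logP switches would be distributed across.
int detectNumRanks()
{
#ifdef SSTMAC_HAVE_VALID_MPI
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (initialized){
    int nproc = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    assert(nproc >= 1);
    return nproc;
  }
#endif
  // Without the define (or before MPI is initialized) the code cannot
  // see the other ranks and falls back to a single rank.
  return 1;
}
```

If configure skips the MPI check, the guarded branch is compiled out, so every rank would believe it is the only one and the switch distribution would no longer match the real layout.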

jpkenny commented 3 years ago

Bug #3 (probably): I don't think this is right (though it probably works OK sometimes):

SwitchId
Topology::nodeToLogpSwitch(NodeId nid) const
{
  int n_nodes = numNodes();
  int nodes_per_switch = n_nodes / nproc;
  int epPlusOne = nodes_per_switch + 1;
  int num_procs_with_extra_node = n_nodes % nproc;

  int div_cutoff = num_procs_with_extra_node * epPlusOne;
  if (nid >= div_cutoff){
    int offset = nid - div_cutoff;
    return offset / nodes_per_switch;
  } else {
    return nid / epPlusOne;
  }
}

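Both returns will have values starting at 0, so whenever num_procs_with_extra_node is nonzero, nodes at or above div_cutoff get the same switch IDs as the nodes below it.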
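One possible way to keep the two ranges from overlapping is to offset the second branch by the number of switches already used by the first. The sketch below is a free-function version written for illustration: the name nodeToLogpSwitchSketch and the explicit n_nodes and nproc arguments are assumptions, and this is not necessarily the fix adopted in sst-macro.

```cpp
#include <cassert>

// Hypothetical stand-alone variant of the mapping above.
// The first (n_nodes % nproc) switches each take one extra node;
// the remaining switches take n_nodes / nproc nodes each.
int nodeToLogpSwitchSketch(int nid, int n_nodes, int nproc)
{
  assert(nproc > 0 && nid >= 0 && nid < n_nodes);
  int nodes_per_switch = n_nodes / nproc;
  int epPlusOne = nodes_per_switch + 1;
  int num_procs_with_extra_node = n_nodes % nproc;

  int div_cutoff = num_procs_with_extra_node * epPlusOne;
  if (nid >= div_cutoff){
    int offset = nid - div_cutoff;
    // Continue numbering after the switches that hold an extra node.
    return num_procs_with_extra_node + offset / nodes_per_switch;
  } else {
    return nid / epPlusOne;
  }
}
```

For example, with 10 nodes and 4 ranks, nodes 0-2 map to switch 0, nodes 3-5 to switch 1, nodes 6-7 to switch 2, and nodes 8-9 to switch 3. When n_nodes is an exact multiple of nproc, div_cutoff is 0 and the result matches the original code, which is consistent with it working some of the time.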

jpkenny commented 3 years ago

If all the previous bugs are fixed you run into:

error: no topology.name parameter in namespace

procs > 1 are not getting the correct parameters. Fix for this currently eludes @calewis and me.

jpkenny commented 3 years ago

To the extent that I have tested it (make check/installcheck pass with several node counts), this is fixed with PR #627. The shortcut network topology, as implemented for sst-core, is expected to have somewhat lower performance. I have not investigated how large the performance impact is.