sstsimulator / sst-elements

SST Architectural Simulation Components and Libraries
http://www.sst-simulator.org

Ember: FFT3D failing #539

Closed h4u5 closed 7 years ago

h4u5 commented 7 years ago

New Issue for sst-elements

1 - Detailed description of problem or enhancement

Ember FFT3D is failing on the latest devel branch with the following assertion:

sst: mpi/motifs/emberfft3d.cc:273: void SST::Ember::EmberFFT3DGenerator::initTimes(int, int, int, int, float, std::vector&): Assertion `cost > 0.0' failed.

2 - Describe how to reproduce

EMBER: network: topology=dragonfly2 shape=24:48:12:96 arbitration=xbar_arb_lru, routing=minimal
EMBER: network: topology=dragonfly2 shape=24:48:12:96 arbitration=xbar_arb_lru
EMBER: numNodes=110224 numNics=110592
EMBER: network: BW=12.5GB/s pktSize=2048B flitSize=32B
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['Init']
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['FFT3D', 'nx=1992', 'ny=1992', 'nz=1992', 'npRow=332', 'nsPerElement=0.025', 'fwd_fft1=2.4', 'fwd_fft2=2.5', 'fwd_fft3=1.5', 'bwd_fft1=2.0', 'bwd_fft2=3.17', 'bwd_fft3=1.6']
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['Fini']

Merlin parameters: (key, value)
dragonfly:algorithm, minimal
dragonfly:intergroup_per_router, 24
num_peers, 110592
output_buf_size, 28KB
router_radix, 95
num_vns, 1
dragonfly:routers_per_group, 48
num_ports, 95
link_lat, 150ns
link_bw, 12.5GB/s
dragonfly:global_route_mode, absolute
dragonfly:intergroup_links, 12
flit_size, 32B
topology, merlin.dragonfly2
dragonfly:hosts_per_router, 24
output_latency, 150ns
xbar_bw, 12.5GB/s
dragonfly:num_groups, 96
xbar_arb, merlin.xbar_arb_lru
input_buf_size, 28KB
input_latency, 150ns
dragonfly:shape,
debug, 0

Round robin partitioning
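As a quick sanity check on the numbers in this output, the sketch below recomputes the NIC count from the dragonfly shape and checks that the rank count and grid size divide evenly by npRow. The mapping of the four shape fields to hosts_per_router, routers_per_group, intergroup_links and num_groups is read off the Merlin parameter dump above. The idea that ranks form an npRow x npCol grid is an assumption made for illustration, not something confirmed from the Ember source, and the function names are hypothetical.

```python
def nics_from_shape(shape):
    """Return hosts_per_router * routers_per_group * num_groups for a
    'hosts:routers:links:groups' shape string (field order taken from
    the Merlin parameter dump above)."""
    hosts_per_router, routers_per_group, _intergroup_links, num_groups = (
        int(field) for field in shape.split(":")
    )
    return hosts_per_router * routers_per_group * num_groups


def grid_check(num_ranks, np_row, nx):
    """Assumption: ranks are laid out as an npRow x npCol grid.
    Returns (npCol, leftover ranks, nx elements per row, leftover elements)."""
    np_col, rank_remainder = divmod(num_ranks, np_row)
    per_row, elem_remainder = divmod(nx, np_row)
    return np_col, rank_remainder, per_row, elem_remainder


# Values copied from the EMBER output above.
print(nics_from_shape("24:48:12:96"))   # 110592, matches numNics
print(grid_check(110224, 332, 1992))    # (332, 0, 6, 0)
```

Under that assumption the job is a square 332 x 332 rank grid with 6 elements of each 1992-wide dimension per row, so neither the rank count nor the grid size leaves a remainder.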

3 - What Operating system(s) and versions

Skybridge

4 - What version of external libraries (Boost, MPI)

module load openmpi-gnu/1.8 boost_1_56_0

5 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)

sst-elements: 4108adb51241b77ec6f8b47a7f6c20253288e049

6 - Fill out Labels, Milestones, and Assignee fields as best possible

h4u5 commented 7 years ago

Just wanted to update that I've tested this at small scale (64 simulated nodes) and didn't have a problem.

h4u5 commented 7 years ago

I did more debugging and testing on this today.
The problem only occurs when the following parameters are set:

PARAMETER 6 = arg.fwd_fft1 () []
PARAMETER 7 = arg.fwd_fft2 () []
PARAMETER 8 = arg.fwd_fft3 () []
PARAMETER 9 = arg.bwd_fft1 () []
PARAMETER 10 = arg.bwd_fft2 () []
PARAMETER 11 = arg.bwd_fft3 () []

For example, things work when defaultSim.py contains:

motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"

Things break during setup when defaultSim.py contains:

motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025 fwd_fft1=2.4 fwd_fft2=2.5 fwd_fft3=1.5 bwd_fft1=2.0 bwd_fft2=3.17 bwd_fft3=1.6"

I don't know what Ember is really doing but maybe the extra parameters are throwing off the parser so that ranks never gets populated?
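One way to rule out a malformed command string before suspecting Ember's own parser is to split both commands into key=value pairs and compare them. The sketch below does that with a hypothetical helper, parse_motif; it is not Ember's parser and makes no claim about how Ember handles these arguments. It only shows that the two strings quoted above are well formed and differ in exactly the six fwd_fft/bwd_fft keys.

```python
def parse_motif(cmd):
    """Split 'NAME key=value key=value ...' into (name, {key: value}).

    Hypothetical helper for checking the command string only;
    Ember's real argument handling may differ.
    """
    name, *tokens = cmd.split()
    params = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed token: {token!r}")
        params[key] = value
    return name, params


# The two command strings quoted above.
short_cmd = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"
long_cmd = (short_cmd + " fwd_fft1=2.4 fwd_fft2=2.5 fwd_fft3=1.5"
            " bwd_fft1=2.0 bwd_fft2=3.17 bwd_fft3=1.6")

_, short_params = parse_motif(short_cmd)
_, long_params = parse_motif(long_cmd)

# Keys present only in the longer command.
extra_keys = sorted(set(long_params) - set(short_params))
print(extra_keys)
print(long_params["ranks"])  # ranks is still present alongside the extra keys
```

If both strings parse cleanly and ranks survives in each, the command text itself is not dropping the rank count, which narrows the search to how the values are used.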

Here are some other notes from the debugging session:

EmberInitGenerator::generate() is called twice per rank in a normal run (stack traces below):

Path where m_size = 0

0 EmberInitGenerator::generate
1 refillQueue at emberengine.h:64
2 EmberEngine::issueNextEvent
3 EmberEngine::Setup
4 Simulation::setup
5 start_simulation

Path where m_size = #ofranks

0 EmberInitGenerator::generate
1 refillQueue at emberengine.h:64
2 EmberEngine::issueNextEvent
3 EmberEngine::completeFunctor
4 Firefly::FunctionSM::handleToDriver
5 Simulation::run

When the extra parameters are added, the run then hits this assert:

mpi/motifs/emberfft3d.cc:279: void SST::Ember::EmberFFT3DGenerator::initTimes(int, int, int, int, float, std::vector&): Assertion `cost > 0.0' failed.

When this fails, the second call to EmberInitGenerator::generate() is never reached.

h4u5 commented 7 years ago

I was trying more things out this morning, and the test that was passing last night is now failing:

motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"

h4u5 commented 7 years ago

I'm closing this issue. One of the problems was having nsPerElement=0.025. I'm pretty sure I was just going down the rabbit hole thinking that there was a problem passing in the number of ember ranks.

If it crops up again I'll reopen.