Closed h4u5 closed 7 years ago
Just wanted to update that I've tested this at small scale (64 simulated nodes) and didn't have a problem.
I did more debugging and testing on this today.
The problem only occurs when the following parameters are set:
PARAMETER 6 = arg.fwd_fft1 () []
PARAMETER 7 = arg.fwd_fft2 () []
PARAMETER 8 = arg.fwd_fft3 () []
PARAMETER 9 = arg.bwd_fft1 () []
PARAMETER 10 = arg.bwd_fft2 () []
PARAMETER 11 = arg.bwd_fft3 () []
E.g. when defaultSim.py is:
motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"
things work. When defaultSim.py is set to:
motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025 fwd_fft1=2.4 fwd_fft2=2.5 fwd_fft3=1.5 bwd_fft1=2.0 bwd_fft2=3.17 bwd_fft3=1.6"
things break during setup.
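To narrow down which of the extra parameters matters, the two command strings above can be generated from one base string plus a dictionary of optional arguments. This is only a sketch: `build_cmd` and `single_param_variants` are hypothetical helpers, not part of Ember or defaultSim.py, and the values are the ones quoted above.

```python
# Hypothetical helpers for building motif['cmd'] variants; not part of Ember.
BASE = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"

# Optional timing parameters, in the same order as the failing command.
OPTIONAL = {
    "fwd_fft1": "2.4",
    "fwd_fft2": "2.5",
    "fwd_fft3": "1.5",
    "bwd_fft1": "2.0",
    "bwd_fft2": "3.17",
    "bwd_fft3": "1.6",
}


def build_cmd(base, extras):
    """Append key=value pairs to the base command, preserving dict order."""
    return " ".join([base] + [f"{key}={value}" for key, value in extras.items()])


def single_param_variants(base, optional):
    """Return one command per optional parameter, each adding only that parameter."""
    return {key: build_cmd(base, {key: value}) for key, value in optional.items()}


if __name__ == "__main__":
    # Print each single-parameter variant so it can be pasted into defaultSim.py.
    for key, cmd in single_param_variants(BASE, OPTIONAL).items():
        print(f"{key}: motif['cmd'] = \"{cmd}\"")
```

Running each variant in turn would show whether a single parameter is enough to trigger the assert or whether it takes a combination.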
I don't know what Ember is really doing but maybe the extra parameters are throwing off the parser so that ranks never gets populated?
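One way to rule out a malformed command string before suspecting Ember's parser is to split it into key=value pairs on the Python side and confirm that ranks is present. The sketch below is not Ember's parser; `parse_motif_cmd` is a hypothetical helper that only assumes the motif name comes first, followed by space-separated key=value tokens.

```python
def parse_motif_cmd(cmd):
    """Split a motif command into (name, {key: value}).

    Assumes the first token is the motif name and every other token has
    the form key=value. This mirrors the string format only; it says
    nothing about how Ember itself parses the arguments.
    """
    tokens = cmd.split()
    if not tokens:
        raise ValueError("empty motif command")
    name, args = tokens[0], {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed motif argument: {token!r}")
        args[key] = value
    return name, args


if __name__ == "__main__":
    cmd = ("FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025 "
           "fwd_fft1=2.4 fwd_fft2=2.5 fwd_fft3=1.5 "
           "bwd_fft1=2.0 bwd_fft2=3.17 bwd_fft3=1.6")
    name, args = parse_motif_cmd(cmd)
    print(name, "ranks" in args, args)
```

If the string parses cleanly here and ranks is present, the string itself is well-formed, which points the search toward how the values are consumed rather than how they are written.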
Here are some other notes from debugging:
EmberInitGenerator::generate() is called twice for each rank in a normal run (stack traces below):
Path where m_size = 0
Path where m_size = #ofranks
When the extra parameters are added, the run then hits the assert in mpi/motifs/emberfft3d.cc:279: void SST::Ember::EmberFFT3DGenerator::initTimes(int, int, int, int, float, std::vector
When this fails, the second call to EmberInitGenerator::generate() is never reached.
I tried more things this morning, and the test that was passing last night is now failing:
motif['cmd'] = "FFT3D nx=256 ny=256 nz=256 npRow=4 ranks=16 nsPerElement=0.025"
I'm closing this issue. One of the problems was having nsPerElement=0.025. I'm pretty sure I was just going down the rabbit hole thinking that there was a problem passing in the number of ember ranks.
If it crops up again I'll reopen.
New Issue for sst-elements
1 - Detailed description of problem or enhancement
Ember FFT3D is failing on the latest devel branch.
sst: mpi/motifs/emberfft3d.cc:273: void SST::Ember::EmberFFT3DGenerator::initTimes(int, int, int, int, float, std::vector&): Assertion `cost > 0.0' failed.
2 - Describe how to reproduce
EMBER: network: topology=dragonfly2 shape=24:48:12:96 arbitration=xbar_arb_lru, routing=minimal
EMBER: network: topology=dragonfly2 shape=24:48:12:96 arbitration=xbar_arb_lru
EMBER: numNodes=110224 numNics=110592
EMBER: network: BW=12.5GB/s pktSize=2048B flitSize=32B
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['Init']
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['FFT3D', 'nx=1992', 'ny=1992', 'nz=1992', 'npRow=332', 'nsPerElement=0.025', 'fwd_fft1=2.4', 'fwd_fft2=2.5', 'fwd_fft3=1.5', 'bwd_fft1=2.0', 'bwd_fft2=3.17', 'bwd_fft3=1.6']
EMBER: Job: -nidList=0-110223 -ranksPerNode=1 ['Fini']
Merlin parameters: (key, value)
dragonfly:algorithm, minimal
dragonfly:intergroup_per_router, 24
num_peers, 110592
output_buf_size, 28KB
router_radix, 95
num_vns, 1
dragonfly:routers_per_group, 48
num_ports, 95
link_lat, 150ns
link_bw, 12.5GB/s
dragonfly:global_route_mode, absolute
dragonfly:intergroup_links, 12
flit_size, 32B
topology, merlin.dragonfly2
dragonfly:hosts_per_router, 24
output_latency, 150ns
xbar_bw, 12.5GB/s
dragonfly:num_groups, 96
xbar_arb, merlin.xbar_arb_lru
input_buf_size, 28KB
input_latency, 150ns
dragonfly:shape,
debug, 0
Round robin partitioning
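As a quick sanity check on the job sizes in this log, the rank grid and grid divisibility can be worked out by hand. The sketch below assumes the ranks are arranged as an npRow x npCol process grid with npCol = ranks / npRow; that layout is an assumption about FFT3D, not something confirmed in this issue, and `check_decomposition` is a hypothetical helper.

```python
def check_decomposition(nx, ny, nz, np_row, ranks):
    """Return (np_col, evenly_divisible) for an assumed np_row x np_col grid.

    Assumption: ranks = np_row * np_col. evenly_divisible is True when
    nx, ny and nz are each divisible by both np_row and np_col.
    """
    if np_row <= 0 or ranks <= 0:
        raise ValueError("np_row and ranks must be positive")
    if ranks % np_row != 0:
        raise ValueError(f"ranks={ranks} is not a multiple of npRow={np_row}")
    np_col = ranks // np_row
    evenly_divisible = all(
        n % p == 0 for n in (nx, ny, nz) for p in (np_row, np_col)
    )
    return np_col, evenly_divisible


if __name__ == "__main__":
    # Large run from the log: nidList=0-110223 with ranksPerNode=1 gives 110224 ranks.
    print(check_decomposition(1992, 1992, 1992, 332, 110224))
    # Small test case from defaultSim.py.
    print(check_decomposition(256, 256, 256, 4, 16))
```

Under that assumption both configurations give a square process grid with evenly divisible dimensions, so the decomposition itself does not look like an obvious source of a non-positive cost.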
3 - What Operating system(s) and versions
Skybridge
4 - What version of external libraries (Boost, MPI)
module load openmpi-gnu/1.8 boost_1_56_0
5 - Provide sha1 of all relevant sst repositories (sst-core, sst-elements, etc)
sst-elements: 4108adb51241b77ec6f8b47a7f6c20253288e049
6 - Fill out Labels, Milestones, and Assignee fields as best possible