sstsimulator / sst-elements

SST Architectural Simulation Components and Libraries
http://www.sst-simulator.org

Event queue empty when running motifs with large message size #1775

Open hanskasan opened 2 years ago

hanskasan commented 2 years ago

Hi,

I am trying to run the Ember ring motif with a modification. In the original ring motif, node i sends packets to node (i+1) after receiving packets from node (i-1). For my application, I need all nodes to send their packets at the same time, so instead of this (the original ring motif):

if ( 0 == rank() ) {
    enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
                                                GroupWorld );
    enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
                                                GroupWorld, &m_resp );
} else {
    enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
                                                GroupWorld, &m_resp );
    enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
                                                GroupWorld );
}

I made a slight modification:

enQ_send( evQ, m_sendBuf, m_messageSize, DATA_TYPE, to, TAG,
                                                GroupWorld );
enQ_recv( evQ, m_recvBuf, m_messageSize, DATA_TYPE, from, TAG,
                                                GroupWorld, &m_resp );

The motif works with message sizes up to 8192B, but the simulation exits early when a larger message size is used, as shown below:

EMBER: using param directory: paramFiles
EMBER: platform: default
EMBER: network: topology=dragonfly shape=4:8:4:33
EMBER: numNodes=1056 numNics=1056
EMBER: network: BW=4GB/s pktSize=32B flitSize=32B
EMBER: Job=0, nidList='0-1055'
EMBER: Motif='Init'
EMBER: Motif='Ring iterations=1 compute=0 messagesize=16384'
EMBER: Motif='Fini'
*** Event queue empty, exiting simulation... ***
Simulation is complete, simulated time: 18.4467 Ms
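
A note on the reported simulated time: 18.4467 Ms is the value an unsigned 64-bit tick counter reaches at its maximum if each tick is 1 ps. The sketch below only checks that arithmetic. The 1 ps tick and the 64-bit counter are assumptions, not something stated in this thread. If they hold, it would suggest the clock ran to its limit rather than stopping at the time of the last real event.

```python
# Assumption (not from the thread): simulated time is counted in 1 ps ticks
# stored in an unsigned 64-bit integer.
TICK_SECONDS = 1e-12
MAX_TICKS = 2**64


def max_simulated_megaseconds(max_ticks=MAX_TICKS, tick_seconds=TICK_SECONDS):
    """Return the largest representable simulated time, in megaseconds."""
    return max_ticks * tick_seconds / 1e6


# Compare this value with the "simulated time" line in the log above.
print(f"{max_simulated_megaseconds():.4f} Ms")
```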

For the parameters, I made the following changes:

networkParams = {
    "packetSize" : "32B",
    "flitSize" : "32B"
}

I use a much smaller packet size for my simulation (32B instead of the default 2048B), so many more packets are injected. Can you please tell me if there is anything I need to be careful about when simulating a huge number of packets?
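
To put a number on "many more packets", here is a rough sketch of the packet count for one ring iteration. It assumes each node sends exactly one message per iteration and that a message is split into payload-only packets with no protocol headers. Both are simplifying assumptions. The helper name `packets_per_message` is made up for this example.

```python
def packets_per_message(message_bytes, packet_bytes):
    """Number of packets needed to carry one message (ceiling division)."""
    return -(-message_bytes // packet_bytes)


NUM_NODES = 1056        # numNodes from the log above
MESSAGE_BYTES = 16384   # messagesize from the failing run

for packet_bytes in (2048, 32):  # default packet size vs. the one used here
    per_message = packets_per_message(MESSAGE_BYTES, packet_bytes)
    total = per_message * NUM_NODES
    print(f"pktSize={packet_bytes}B: {per_message} packets/message, "
          f"{total} packets per iteration")
```

Under these assumptions the 32B setting injects 64 times as many packets as the 2048B default for the same message size.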

The problem occurs on both Ubuntu 18.04.5 LTS and CentOS 7.5.1804.

I built SST from the distributed SST Core 11.0.0 and SST Elements 11.0.0 tarfiles (2021-May-03 release).

Thank you!

feldergast commented 2 years ago

Nothing immediately jumps out for the Ember parameters. The total number of packets should not be a problem. The one issue I do see is that the dragonfly shape is not "typical", so there may be issues in the dragonfly models. The shape builds a 32-node group, which is quite small, but that is probably not the issue. The fact that you have 33 groups with 4 links to each group means that you have 4x as much global bandwidth as injection bandwidth. While this should theoretically work, the routing algorithms haven't been tested with configurations that resemble that in any way. Try changing the number of intergroup links to 1 and see if you still see the same issue.
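
The bandwidth ratio mentioned above can be checked with a short sketch. It assumes the shape fields mean hosts per router, routers per group, links between each pair of groups, and number of groups. That reading is inferred from the numbers in this reply (32-node groups, 33 groups, 4 links to each group) and is not taken from the documentation. It also assumes every link has the same bandwidth. The function name `dragonfly_summary` is made up for this example.

```python
def dragonfly_summary(shape):
    """Derive node counts and the global/injection link ratio from a shape string.

    Assumed field order (inferred, not confirmed):
    hosts_per_router:routers_per_group:links_per_group_pair:num_groups
    """
    hosts, routers, links, groups = (int(field) for field in shape.split(":"))
    nodes_per_group = hosts * routers              # injection links per group
    global_links_per_group = links * (groups - 1)  # links leaving each group
    return {
        "nodes_per_group": nodes_per_group,
        "total_nodes": nodes_per_group * groups,
        "global_links_per_group": global_links_per_group,
        "global_to_injection_ratio": global_links_per_group / nodes_per_group,
    }


# Shape from the failing run, then the same shape with 1 intergroup link.
for shape in ("4:8:4:33", "4:8:1:33"):
    print(shape, dragonfly_summary(shape))
```

With the original shape the ratio comes out at 4, and with the suggested single intergroup link it drops to 1, while the node count stays at 1056 in both cases.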