sstsimulator / sst-elements

SST Architectural Simulation Components and Libraries
http://www.sst-simulator.org

Merlin simulations too slow and weird behavior from the buffer size param #1838

Open · tommasobo opened this issue 2 years ago

tommasobo commented 2 years ago

In our group we are using merlin to simulate a new topology. We have successfully tested the topology and it works correctly, but we have run into a couple of issues that we can't seem to track down:

1) We have found that increasing the buffer size parameters (input_buf_size and output_buf_size) actually decreases our performance significantly (measured as the bandwidth of an AllToAll or AllReduce, for example). We believe that shouldn't be the case and are quite confused about what is happening. Are we missing something?

2) Do you have any suggestions for speeding up the simulations? Currently we find large simulations (even just a 1024-node network) involving AllToAll to be quite slow, even when running SST on 128 nodes. We have tried increasing the flit size and packet size, but that didn't help as much as we had hoped. Is there any parameter we are missing from emberLoad that could help speed up the simulations?
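For context, this is roughly how we set the parameters in question in our merlin config (a sketch with illustrative values, not our exact setup):

```python
# Sketch of the relevant merlin router parameters (values are
# illustrative, not our real config). input_buf_size/output_buf_size
# are the parameters from point 1; flit/packet size from point 2.
router_params = {
    "input_buf_size":  "64kB",    # per-port input buffer
    "output_buf_size": "64kB",    # per-port output buffer
    "flit_size":       "256B",    # increasing this didn't help much
    "packet_size":     "1kB",
    "link_bw":         "400Gb/s",
}

# Enlarging the buffers is the change that, surprisingly, hurts throughput:
bigger_bufs = dict(router_params, input_buf_size="2MB", output_buf_size="2MB")
```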

Thanks a lot once again for the help!

feldergast commented 2 years ago
  1. We have found that increasing the buffer size parameters (input_buf_size and output_buf_size) actually decreases our performance significantly (measured as the bandwidth of an AllToAll or AllReduce, for example). We believe that shouldn't be the case and are quite confused about what is happening. Are we missing something?

That is odd. I usually see the opposite effect, at least up to a point: for traffic patterns that cause heavy congestion, more buffer space can actually increase the "offered load" in the system, thus reducing throughput. Essentially, you get a case where there is always data waiting to be injected the moment other data moves, so the congestion lasts longer. I imagine a large all-to-all would be one of those patterns. I would not expect it for AllReduce unless you are doing it on very large arrays.

  2. Do you have any suggestions for speeding up the simulations? Currently we find large simulations (even just a 1024-node network) involving AllToAll to be quite slow, even when running SST on 128 nodes. We have tried increasing the flit size and packet size, but that didn't help as much as we had hoped. Is there any parameter we are missing from emberLoad that could help speed up the simulations?

Unfortunately, large all-to-all patterns just take a long time to simulate since there are so many events injected into the system. I've seen this when using that pattern for research.

The biggest lever you may have control over is the shortest latency on links cut by a partition, as this determines the synchronization interval. There are some tricks you can play here if you have added latency on your inputs and outputs. In hr_router, the input latency is just "pushed" onto the input link and the output latency is "pushed" onto the output link (via the Link::addSendLatency() and Link::addReceiveLatency() calls in the model). Since this happens after partitioning, there is no way for the core to account for the added latency across the links. However, in the input file, you can change the parameters and do this manually. With link_latency=20ns, input_latency=40ns and output_latency=40ns, you can equivalently set the input and output latencies to zero and the link latency to 100ns. This will dramatically reduce the synchronization overhead and should provide some speedup. I've been tempted to add an option to the merlin python module that does this for you automatically, but I haven't had a chance to do it yet.
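The latency-folding trick above can be written as a small helper (a sketch, not part of merlin; `fold_latencies` is a hypothetical name):

```python
def fold_latencies(link_latency_ns, input_latency_ns, output_latency_ns):
    """Fold hr_router's per-port input/output latencies into the link
    latency so the core's partitioner can see the full hop delay.

    The end-to-end delay per hop is unchanged:
        input + link + output == 0 + (input + link + output) + 0
    but the minimum cross-partition link latency, which bounds the
    synchronization interval, grows from link_latency_ns to the sum.
    """
    folded_link = input_latency_ns + link_latency_ns + output_latency_ns
    return folded_link, 0, 0  # (link_latency, input_latency, output_latency)

# The example from the text: 20ns link, 40ns input, 40ns output -> 100ns link.
print(fold_latencies(20, 40, 40))  # (100, 0, 0)
```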

I suspect there is also a point of diminishing returns on scaling. If you are only simulating a 1024-node network, then you probably won't scale much past 32 or 64 ranks (MPI, threads, or a combination thereof). For larger networks you will, of course, be able to scale further. Also, running MPI everywhere will give you better results than threads (we're still working on that one).

tommasobo commented 2 years ago

Thanks for the detailed answers!

1) Yeah, sadly it seems that if we set a very large buffer (say 2MB), we start getting far worse results, and the throughput also grows more slowly with increasing message sizes.

2) Another confusing thing we found is that for the AllToAll example we need very large messages to reach full network bandwidth, compared to the equivalent real system we have. For example, the real system reaches the theoretical max all-to-all bandwidth with just 16KiB messages. With SST we get only half of the max bandwidth even with 4MiB messages. I don't know if there is some time overhead somewhere that we are missing.
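One way we've been reasoning about this gap (a back-of-the-envelope model, not a claim about where SST actually spends time): if each message pays a fixed overhead T0 on top of its m/B wire time, the message size needed to hit half of peak bandwidth is n_half = T0 * B, so a large hidden per-message overhead pushes the half-bandwidth point from kB into the MiB range, which is exactly our symptom.

```python
def effective_bandwidth(msg_bytes, peak_bw, overhead_s):
    """Achieved bandwidth for one message under a fixed-overhead model:
    transfer time = overhead + msg/peak_bw."""
    return msg_bytes / (overhead_s + msg_bytes / peak_bw)

def half_bandwidth_point(peak_bw, overhead_s):
    """Message size at which effective bandwidth equals peak/2."""
    return peak_bw * overhead_s

# Illustrative numbers only: 50 GB/s peak and 1 us per-message overhead
# put the half-bandwidth point at 50 kB; a 100 us overhead would push
# it to 5 MB, matching the "half bandwidth at 4MiB" behavior we see.
peak = 50e9       # bytes/s
overhead = 1e-6   # seconds
print(half_bandwidth_point(peak, overhead))  # 50000.0
```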

feldergast commented 2 years ago

I assume you're using the ember AllToAll motif? I discovered recently that it was written with a simple, naive approach: it just sends a single message to every other rank (and may use a blocking send, or a send/recv pair, before moving on; I can't remember exactly, but there was something else as well that affected performance). No one has had time to sit down and implement one of the more optimized algorithms from the literature, and I suspect that's what you're seeing. Not all of the motifs have been optimized, mostly just the ones we've used for research projects in the past. It's a large lift to produce optimized versions of each.
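To illustrate the difference (a sketch, not the motif's actual code): the naive approach posts one exchange at a time per rank, while a classic alternative from the literature, pairwise exchange via XOR for power-of-two rank counts, runs in n-1 rounds in which every rank is exchanging every round.

```python
def naive_alltoall(n, rank):
    """Naive schedule: `rank` exchanges with its peers one at a time,
    in order, completing each before starting the next (roughly what
    the current motif does)."""
    return [(rank, peer) for peer in range(n) if peer != rank]

def pairwise_xor_alltoall(n):
    """Pairwise-exchange schedule for power-of-two n: in round r every
    rank exchanges with rank ^ r, so each round is a perfect matching
    and all links stay busy."""
    assert n & (n - 1) == 0, "power-of-two rank counts only"
    return [[(rank, rank ^ r) for rank in range(n)] for r in range(1, n)]

rounds = pairwise_xor_alltoall(4)
print(len(rounds))  # 3 rounds for 4 ranks
for rnd in rounds:
    # every rank appears exactly once as a partner in each round
    assert sorted(peer for _, peer in rnd) == list(range(4))
```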