s5z / zsim

A fast and scalable x86-64 multicore simulator
GNU General Public License v2.0

Dynamic uncore model in ZSim #117

Open dzhang50 opened 8 years ago

dzhang50 commented 8 years ago

I'm still trying to solve my issues with ZSim scaling far too optimistically on irregular graph algorithms (e.g., BFS on Galois 2.2.1 with the rmat4 input). Unfortunately, tweaking sim.phaseLength didn't help much. I've identified the uncore network as a very likely cause: the statically-timed Network::getRTT() function assumes zero network contention. I had been testing with 3-cycle total hop latencies (2 cycles for the router, 1 cycle for the actual hop, as in your ZSim ISCA 2013 paper). Changing the network latencies produces drastically different scaling results: sweeping the router hop latency from 1 to 16 gives a 2x+ difference in cycles at 64 cores, and so does my preliminary mesh network model, which reuses the CycleQueue from the OOO core model to model back-pressuring virtual channels (since getRTT() results are calculated once and cached, I also modified coherence_ctrls.cpp to call getRTT() for each network access). Obviously, actual 64-core Westmere-class machines don't exist, but this correlates much better with real machines even at lower core counts (tested on a 10-core/20-thread E5-2680 v2).
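To make the idea concrete, the kind of per-access contention model described above boils down to something like the sketch below. This is illustrative only: the class and names are made up (it is not the actual CycleQueue-based code), and as a bound-phase approximation it still ignores the access interleaving that a proper weave model would capture.

```cpp
#include <algorithm>
#include <cstdint>

// Minimal per-link back-pressure sketch (hypothetical, not zsim code): each link
// accepts one message per cycle, so later arrivals queue behind earlier ones.
class LinkModel {
    uint64_t nextFreeCycle = 0;   // earliest cycle at which the link is free again
    const uint32_t hopLatency;    // static router + wire traversal latency per hop
  public:
    explicit LinkModel(uint32_t lat) : hopLatency(lat) {}

    // Returns the cycle at which a message arriving at arrivalCycle leaves this hop.
    uint64_t traverse(uint64_t arrivalCycle) {
        uint64_t startCycle = std::max(arrivalCycle, nextFreeCycle);
        nextFreeCycle = startCycle + 1;   // serialize: one message per cycle per link
        return startCycle + hopLatency;
    }
};
```

A mesh route is then just a chain of traverse() calls over the links on the X-Y path, with the RTT recomputed per access instead of cached.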

Unfortunately, it seems like in order for me to build a truly accurate mesh network model, I'll need to design a weave-based model. The weave code is very complex and is probably the part of ZSim that I understand the least. Does your research group already have an internal uncore model that's more detailed than the current getRTT() function? The ISCA 2013 ZSim paper mentions a mesh network timing model, and the MICRO 2015 SWARM paper mentions a more detailed mesh network with X-Y routing. If so, would it be possible to release it? If not, do you have any suggestions/tips on the proper way I can build one? I can probably emulate a mesh network with contention by just increasing the static latency per hop (clearly the easiest thing to do), but I suspect my advisor wouldn't like it if I told him I was doing this.

dzhang50 commented 8 years ago

I gathered some data and created a graph to show what I mean. This is very preliminary and rough: the dynamic mesh network implementation is not good, and it models an extremely basic (read: low-performance) mesh network. Also, I was wrong about scaling under fixed latency. It turns out that changing the static hop latency does greatly change absolute performance in cycles, but it affects performance by roughly the same factor regardless of the number of cores, so scaling relative to 1 thread isn't affected by the static hop latency.

With a dynamic uncore model, however, the system behaves like a low-fixed-latency model at small thread counts (low network contention) and more like a high-fixed-latency model at higher thread counts (high network contention). My preliminary implementation of a dynamic uncore model is probably not good and needs much more testing, but it serves to demonstrate the point.

The graph below shows speedup relative to an infinitely fast static network model running BFS on 1 thread. The simulated microarchitecture has 64 OOO weave-timed cores @ 2.5 GHz, 2 MB of L3 per core, and a 6-channel DDR4 model that I added (which doesn't really change the results vs. the stock weave-based DDR3-1333-CL10 model). Each line represents a specific network implementation: N-staticNet is an N-cycle-per-hop static network using the released ZSim network code, and 3-dynNet is a 3-cycle-per-hop dynamic network using my initial custom dynamic mesh network implementation.

[Figure: BFS speedup vs. thread count for the N-staticNet and 3-dynNet configurations]

gaomy3832 commented 8 years ago

You can model the contention in the network using M/D/1 queuing theory. Take a look at the MD1Memory class in mem_ctrls.h/cpp. This is obviously not a perfect solution, but it's much easier than implementing a weave model.
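For reference, M/D/1 gives a mean queueing delay of ρS / (2(1 − ρ)) for deterministic service time S and utilization ρ = λS. A minimal sketch of applying that per link, in the spirit of MD1Memory but with hypothetical names rather than zsim's actual API, could look like:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical M/D/1 link model (illustrative, not the MD1Memory code itself).
// Assumes Poisson arrivals and a deterministic per-hop service time.
class MD1Link {
    const double serviceCycles;   // deterministic service time per message (cycles)
    const double maxRho;          // clamp utilization so latency stays finite
  public:
    explicit MD1Link(double svc, double clamp = 0.95)
        : serviceCycles(svc), maxRho(clamp) {}

    // accessesInPhase: link traversals observed over the last phase
    // phaseCycles: length of that phase in cycles
    // Returns the contention-adjusted per-hop latency to use for the next phase.
    uint64_t hopLatency(uint64_t accessesInPhase, uint64_t phaseCycles) const {
        double lambda = (double)accessesInPhase / (double)phaseCycles;  // arrival rate
        double rho = std::min(lambda * serviceCycles, maxRho);          // utilization
        double wait = rho * serviceCycles / (2.0 * (1.0 - rho));        // M/D/1 queueing delay
        return (uint64_t)(serviceCycles + wait + 0.5);
    }
};
```

Like MD1Memory, a model of this kind recomputes the latency once per phase from the observed access rate, rather than simulating individual queue occupancy.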

Actually, I have a similar concern. zsim currently only supports one timing record per access in the core event recorder (or two if there is a writeback). This effectively limits the memory hierarchy to a single weave model (cache/prefetcher, network, or memory). I assume the core recorder class needs to be modified to overcome this. Any suggestions on the right way to do so?

s5z commented 8 years ago

Internally, we have a network timing model that simulates contention. It's more expensive but more accurate, and should solve your problem. We'll push this out soon---stay tuned.

Mingyu, we have also fixed the 1-record-per-access limitation (and this is in fact already released; you should update to the latest master): now we enforce that each object returns a single timing record. This means that the (normal, non-weave) Cache has some logic to glue timing records from a writeback and a demand access. So now you can have arbitrary combinations of weave-phase models throughout the hierarchy.
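Roughly speaking, the gluing amounts to forking both event chains off a common zero-delay start node so that a single record covers the writeback and the demand access. The sketch below uses simplified stand-in types rather than zsim's actual TimingRecord/TimingEvent API, purely to illustrate the shape of that logic.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-ins for a weave-phase event DAG node and a timing record
// (hypothetical types, not zsim's API).
struct Event {
    uint64_t delay;
    std::vector<Event*> children;
    explicit Event(uint64_t d) : delay(d) {}
    Event* addChild(Event* c) { children.push_back(c); return c; }
};

struct Record {
    uint64_t reqCycle;
    Event* startEvent;
    Event* endEvent;
};

// Merge a writeback record (wb) and a demand-access record (acc) into one record
// that starts at the demand request cycle and covers both sub-DAGs.
Record glue(const Record& wb, const Record& acc, uint64_t reqCycle) {
    Event* start = new Event(0);                         // common zero-delay fork point
    start->addChild(new Event(wb.reqCycle - reqCycle))   // delay until the writeback begins
         ->addChild(wb.startEvent);
    start->addChild(new Event(acc.reqCycle - reqCycle))  // delay until the demand access begins
         ->addChild(acc.startEvent);
    Record merged = acc;                                 // the response path follows the demand access
    merged.reqCycle = reqCycle;
    merged.startEvent = start;
    return merged;
}
```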

gaomy3832 commented 8 years ago

Thanks, Daniel. Yes, I had noticed that; it is a really helpful update and makes it much more convenient to implement accurate weave models.

dzhang50 commented 8 years ago

Daniel, that sounds fantastic! I will be eagerly waiting for your network timing model to be released :)

dzhang50 commented 8 years ago

Hi Daniel, any updates on when the network timing model will be released? We're targeting a publication, so I need to decide whether to wait and run my experiments with the new network timing model or, if there isn't enough time, to come up with a kludge (like the one I showed above). Thanks!

dzhang50 commented 8 years ago

Any updates on when the network timing model will be pushed out?

shani-re commented 7 years ago

Hi, is the network model released? If not, when will it be? Thanks!

nirdavid commented 6 years ago

any updates?

dzhang50 commented 6 years ago

There were no updates. I ended up writing my own MD1 model based on @gaomy3832's suggestion. I'm trying to push this change, along with many other accuracy improvements, to the main ZSim repo here: https://github.com/s5z/zsim/issues/213. However, I have yet to receive a response from the main ZSim folks (including via email). Since then, I have received further emails about these changes, this time from industry researchers. I might just release the changes anyway under a ZSim fork.

nirdavid commented 6 years ago

Hi Dan, any updates? I would be happy to see your changes. Thanks!

dzhang50 commented 6 years ago

@nirdavid,

Since Daniel and the other authors never responded to my email or PR, I decided to create my own fork of ZSim called ZSim++:

https://github.com/dzhang50/zsim-plusplus

It's a separate repository rather than an official GitHub fork, for the reasons documented here:

https://www.niels-ole.com/ownership/2018/03/16/github-forks.html (see also yigit/android-priority-jobqueue#58)

I have been (very slowly) cleaning up my research code and committing one feature at a time to my new repo. So far, I have added the simple config file, DDR4 support, floating-point stats (useful for things like IPC), and the simple MD1 network model. There are still plenty of changes to come, including a ton of changes to the OOO core model (with serious bug fixes). I hope that other people who have been using ZSim for a long time (such as @gaomy3832) will also consider contributing their own code to ZSim++, provided the changes are sufficiently general.

Dan