Per Thread Fast Forwarding

vijay4454 commented 8 years ago

Hello

I am wondering if there is an easy way to fast-forward a specific thread (thread 0) in a single-process, multi-threaded simulation. I have a pthread program that I need to simulate on a system with a large core count. One of the cores is OoO while the others are simple in-order cores. I need ZSim to ignore (not count) the cycles that the main thread, which runs on the OoO core, spends in specific functions.

I tried implementing this feature in ZSim by instrumenting the binary and placing handlers before and after those functions (whose names I pass to the Pin tool through pin_cmd.cpp). Inside the handler code (which takes the thread ID as an argument), I invoke EnterFastForward() or ExitFastForward() as appropriate. However, I realized that this fast-forwards the entire process, so the other threads are fast-forwarded along with thread 0.

Is there an easy way to get around this problem and fast-forward just thread 0? If not, what would you recommend as the least intrusive way to do this?

Thanks

gaomy3832 commented 8 years ago

The easiest way is to change your code. Normally I would imagine the main thread spawns a bunch of worker threads and then waits idle until all workers finish. You can reorganize your code to move the main thread's work before or after the parallel section, so that fast-forwarding this part has no effect on the worker threads. If the work in the main thread has to happen in parallel with the workers (due to communication, synchronization, etc.), then you probably should not fast-forward it, since doing so will affect the performance of your region of interest.

hlitz commented 8 years ago

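Define a new magic op that, when called (by each thread individually), reads out the per-thread cycle count. Call it on entry to and exit from each function you want to exclude, and subtract the difference from the thread's total later.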
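A minimal sketch of the bookkeeping this implies, assuming the magic op handler can obtain the calling thread's id and its current cycle count. The `CycleFilter` class and all of its member names are hypothetical and are not part of zsim; nested excluded regions are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread record (not zsim code).
struct ExcludedCycles {
    uint64_t enterCycle = 0;  // cycle count snapshot taken at region entry
    uint64_t excluded = 0;    // total cycles spent inside excluded regions
    bool inRegion = false;    // true between enter() and exit()
};

class CycleFilter {
public:
    explicit CycleFilter(size_t numThreads) : perThread_(numThreads) {}

    // Would be called by a hypothetical "region begin" magic op.
    void enter(size_t tid, uint64_t curCycle) {
        ExcludedCycles& t = perThread_.at(tid);
        assert(!t.inRegion);  // nesting is not supported in this sketch
        t.enterCycle = curCycle;
        t.inRegion = true;
    }

    // Would be called by a hypothetical "region end" magic op.
    void exit(size_t tid, uint64_t curCycle) {
        ExcludedCycles& t = perThread_.at(tid);
        assert(t.inRegion && curCycle >= t.enterCycle);
        t.excluded += curCycle - t.enterCycle;
        t.inRegion = false;
    }

    // Total cycles with the excluded regions subtracted.
    uint64_t adjusted(size_t tid, uint64_t totalCycles) const {
        return totalCycles - perThread_.at(tid).excluded;
    }

private:
    std::vector<ExcludedCycles> perThread_;
};
```

Only the reported cycle count changes with this approach. The thread still executes the excluded code on its core, so any cache and memory traffic it generates is still simulated.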


vijay4454 commented 8 years ago

I am developing a simulator for a new architecture with a new programming model. First, I don't want to impose too many constraints on how programs should be written. Second, there is a specific reason why I want to ignore the cycles spent inside specific functions: the cycles spent in the current implementation of these functions have no significance for real-world deployment.

gaomy3832 commented 8 years ago

As Heiner suggested above, you can define a new magic op. The current fast-forward is deferred until the end of the phase so that it can be synchronized. Take a look at the logic in Join() and TakeBarrier() in zsim.cpp to see how to make threads join and leave immediately (set cids, call sched->join()/leave(), set fPtrs, etc.).

vijay4454 commented 8 years ago

Thanks hlitz and gaomy. I implemented this by summing the cycles each thread spends inside the API calls and then writing that count out as a per-core statistic, like the regular cycle count. It required changes to simple_core.cpp, ooo_core.cpp, etc., in addition to zsim.cpp and pin_cmd.cpp. I did not add a new magic call. The simulator seems to be working properly with this change.

It was a bit of an intrusive change to the simulator, but I guess that is ok as long as it works without simulation slowdown.

benpatsai commented 8 years ago

If you do that, remember that the whole memory hierarchy does see what happened during your magic functions. If those functions are very short and/or non-memory intensive then I think it's fine.

If that's not the case, I would suggest you leverage the NullCore, a perfect IPC=1 core, to better model what you want. Basically, add one NullCore in your system. When encountering magic functions, schedule that thread to the NullCore. Once it finishes those functions, schedule it back to the OOO core.

vijay4454 commented 8 years ago

@benpatsai. Thanks, that's a very good point. I ignored that.

I had tried something similar to your suggestion. I fast-forwarded the thread (the main thread in the code) on entry to each magic function and exited fast-forward on exit from it. But the problem I faced was that a different thread gets scheduled on the core on which thread 0 was initially running. I want thread 0 to ALWAYS run on the first configured core (which I configure as OoO) and the rest of the threads to run on the other cores, which are configured as simple cores. This happens because thread 1 is created (using pthread_create) after thread 0 encounters the magic call and goes into fast-forward, which leaves core 0 free for thread 1 to take.

I think I will face the same issue if I follow your NullCore suggestion, won't I?

benpatsai commented 8 years ago

So that boils down to how to schedule a particular thread from a particular core to another core. One example implementation can be:

1. Implement a magic op to distinguish the main thread from the other threads (like register thread).
2. Make the process have multiple core masks: one for the OOO core plus the Null core, and one for the in-order cores.
3. For non-main threads, use the in-order core mask. For the main thread, use the other mask. You can achieve this by setting the mask vector in the ThreadInfo for a thread.
4. When running into the magic functions, schedule the main thread between the two cores within its mask.
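The steps above can be illustrated with a toy model. This is only a sketch of the idea, assuming a mask is a per-core boolean vector; `ToyScheduler`, `ToyThread`, and their methods are hypothetical names and do not correspond to zsim's actual Scheduler or ThreadInfo interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using CoreMask = std::vector<bool>;  // mask[c] == true means core c is allowed

// Hypothetical thread record (not zsim's ThreadInfo).
struct ToyThread {
    CoreMask mask;
    int core = -1;  // core the thread currently occupies, -1 if none
};

class ToyScheduler {
public:
    explicit ToyScheduler(size_t numCores) : busy_(numCores, false) {}

    // Place the thread on the first free core its mask allows.
    // Returns the chosen core, or -1 if none is available.
    int join(ToyThread& t) {
        assert(t.mask.size() == busy_.size());
        for (size_t c = 0; c < busy_.size(); c++) {
            if (t.mask[c] && !busy_[c]) {
                busy_[c] = true;
                t.core = static_cast<int>(c);
                return t.core;
            }
        }
        return -1;
    }

    // Release the core the thread occupies, if any.
    void leave(ToyThread& t) {
        if (t.core >= 0) {
            busy_[static_cast<size_t>(t.core)] = false;
            t.core = -1;
        }
    }

    // Move the thread to a specific core. Fails if the core is outside
    // the thread's mask or is occupied (including by the thread itself).
    bool migrate(ToyThread& t, size_t target) {
        assert(t.mask.size() == busy_.size());
        if (target >= busy_.size() || !t.mask[target] || busy_[target]) return false;
        leave(t);
        busy_[target] = true;
        t.core = static_cast<int>(target);
        return true;
    }

private:
    std::vector<bool> busy_;  // busy_[c] == true means core c is occupied
};
```

With core 0 as the OOO core, core 1 as the Null core, and the remaining cores in-order, the main thread's mask would allow only cores 0 and 1 and the workers' mask only the in-order cores. A worker created while the main thread sits on core 1 then cannot take core 0, because core 0 is not in its mask.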

vijay4454 commented 8 years ago

@benpatsai. Thanks a lot for your suggestion. However, I think the approach of just measuring the cycles spent in certain functions and subtracting them should work fine for my use case. The code in those functions should not change cache behavior by much.

vijay4454 commented 8 years ago

@benpatsai. I am trying to implement your suggestion of having two core masks and scheduling the main thread between the two cores in its mask. I am not quite sure how to implement point 4 in your answer above. Can you direct me to the relevant code in the simulator that will help me figure it out?

benpatsai commented 7 years ago

@vijay4454, you can look at process_tree.{cpp,h} to see how the mask is parsed from the config file. And by tracing down ProcessTreeNode.mask, you should be able to learn how/when the scheduler uses it to schedule threads to a set of cores.

gaomy3832 commented 7 years ago

@vijay4454, you may want to take a look at my pending pull request #114, which implements such affinity scheduling. I did it through the standard sched_get/setaffinity syscalls. You can reuse the internal logic with whatever interface you want to use.
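For reference, this is roughly what the application side of that interface looks like on Linux with glibc. It is a sketch of the standard calls only. The helper names `pinCallingThreadTo` and `firstAllowedCpu` are made up for illustration, and how the pull request handles these syscalls inside the simulator is not shown here.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)
#include <cassert>

// Pin the calling thread to a single CPU. Returns true on success.
bool pinCallingThreadTo(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the lowest-numbered CPU the calling thread may run on, or -1 on error.
int firstAllowedCpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &set)) return c;
    }
    return -1;
}
```

In a pthread program, the main thread could call `pinCallingThreadTo(0)` before creating workers, and each worker could pin itself to one of the remaining CPUs at the start of its thread function.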

vijay4454 commented 7 years ago

Thanks benpatsai & gaomy3832! I have been able to implement this and get it working.

s5z / zsim

Per Thread Fast Forwarding #138