Closed: borja-perez closed this 2 years ago
We have experimented with an infrastructure like this, with apparently higher performance than the standard Sparta scheduler.
Caveats - it's in C. It uses different data structures. It may not be comparable. Apologies for TL;DR risks...
But basic approach is:
If an Engine has no inputs at phi = 0, it has nothing to do; it is removed from the schedule queue and marked dormant. If another Engine outputs data to a dormant Engine, the dormant Engine is added to the schedule queue after the queue has been traversed, before phi = 0 is run.
If an Engine knows it has nothing to do for many, many clocks (a rarely-accessed memory with a long latency, for example), it deschedules itself and schedules a Trigger on the time-ordered Trigger queue. A Trigger simply says 'reschedule this Engine at time T', and the head-of-queue trigger time is evaluated before the schedule queue is traversed for phi = 0.
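To make the dormant/Trigger mechanism concrete, here is a minimal C++ sketch of the scheme described above. The original is in C with different data structures, so every name here is illustrative, not the actual implementation: an idle Engine deschedules itself onto a time-ordered Trigger queue, and head-of-queue triggers are checked before each phi = 0 traversal.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <queue>
#include <vector>

// Illustrative sketch only -- the real implementation is in C and uses
// different data structures. An Engine runs every phi = 0 while it has
// work; when it knows it is idle for many clocks it deschedules itself
// and posts a Trigger ("reschedule me at time T") on a time-ordered queue.
struct Engine {
    const char* name;
    bool dormant = false;
    // Returns an absolute re-trigger time, or 0 to stay scheduled.
    std::function<uint64_t(uint64_t now)> step;
};

struct Trigger {
    uint64_t when;
    Engine*  engine;
    bool operator>(const Trigger& o) const { return when > o.when; }
};

struct Scheduler {
    std::deque<Engine*> runq;  // engines traversed at phi = 0
    std::priority_queue<Trigger, std::vector<Trigger>,
                        std::greater<Trigger>> triggers;  // time-ordered

    // Another Engine produced data for a dormant Engine: wake it so it
    // runs after the current traversal, before the next phi = 0.
    void wake(Engine* e) {
        if (e->dormant) { e->dormant = false; runq.push_back(e); }
    }

    void tick(uint64_t now) {
        // Head-of-queue trigger times are evaluated before the schedule
        // queue is traversed for phi = 0.
        while (!triggers.empty() && triggers.top().when <= now) {
            Engine* e = triggers.top().engine;
            triggers.pop();
            e->dormant = false;
            runq.push_back(e);
        }
        for (std::size_t n = runq.size(); n-- > 0; ) {
            Engine* e = runq.front();
            runq.pop_front();
            uint64_t retrigger = e->step(now);
            if (retrigger > now) {              // nothing to do for a while
                e->dormant = true;
                triggers.push({retrigger, e});  // self-deschedule
            } else {
                runq.push_back(e);              // stays scheduled
            }
        }
    }
};
```

The payoff is that dormant Engines cost nothing per tick; only the cheap head-of-queue comparison is paid until a Trigger fires or data arrives.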
We constructed a system with 1 to many RISC-V scalar cores running verification scalar code (with no store instructions), with a simulated memory having a latency of around a hundred cycles. The model was built around an ISS (like Spike) with extra calls automatically inserted to manage register scoreboarding, cache misses, passage of time, etc. The simulated RISC-V cores modeled icache, dcache, and function-unit latencies by scoreboarding registers with a timetag at which the value would be valid. On a miss, the memory responded with a cacheline of data 100+ clocks after it received the request. A core which sees a miss continues until the destination register is used; if the data has not returned, the timetag says 'infinity' and the core's Engine is descheduled. When the data turns up, the timetag is updated and, in general, the core's Engine is put back into the schedule queue. An icache miss deschedules the core; again, it's re-scheduled when the miss returns and the cache is updated.
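The timetag scoreboard could be sketched like this. All names are hypothetical; the point is just that each register carries the time at which its value becomes valid, with an 'infinity' sentinel marking an outstanding miss whose return time is unknown.

```cpp
#include <array>
#include <cstdint>
#include <limits>

// Hypothetical sketch of the timetag scoreboard described above, not the
// actual model's code. Each register records when its value is valid.
constexpr uint64_t kNever = std::numeric_limits<uint64_t>::max();  // 'infinity'

struct Scoreboard {
    std::array<uint64_t, 32> ready_at{};  // per-register valid-from timetag

    // An instruction (or a returning miss) produces rd at time `when`.
    void write(unsigned rd, uint64_t when) { ready_at[rd] = when; }

    // An instruction uses rs at time `now`. Returns the (possibly
    // advanced) time at which execution can proceed, or kNever, meaning
    // the core's Engine must be descheduled until the miss data returns
    // and the timetag is updated.
    uint64_t use(unsigned rs, uint64_t now) const {
        uint64_t t = ready_at[rs];
        if (t == kNever) return kNever;  // outstanding miss: core sleeps
        return t > now ? t : now;        // stall until the value is valid
    }
};
```

A dcache miss would call `write(rd, kNever)` at issue and `write(rd, return_time)` when the cacheline arrives, matching the continue-until-used behavior described above.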
We generally see overall aggregate simulation performance of around 20-30 MIPS on a 2018 Mac Mini (~3GHz Core i7 with 16 GB of RAM; single threaded). The performance dropped a bit with 64 or 128 cores all running the same silly program, interleaved at the ~100+ instruction per timeslice level.
We also did some simple time-ordered queue timing experiments, which showed that time-ordered queue insertions can be expensive.
Caveats again - we believe the simulation described here mirrored in functionality what Borja had at the time in Spike/Sparta as he's described. Both models evaluated cache lookups inline as a function call from the ISS. Both models modeled core+caches, and memory as separate Engines/Units. We were able to determine that the ISS speeds were not the bottleneck (both ISSs run 100-200 MIPS without timing and cache extras). If the situations are comparable, then it suggests that the scheduler can have a substantial effect on performance; this performance impact could well be very pronounced on more complex event-dense models such as microarchitecture models.
If this looks promising, we're happy to share more info as desired, and to prototype within Sparta and report results for putative upstreaming, if it works as hoped.
Thanks, Borja, for the feedback!
To give a little background on our Scheduler, it has (at minimum) the following requirements to adhere to:

- Ticks, which by default are picoseconds

These requirements make up the bulk of the Scheduler's functionality. What makes our Scheduler so unique is the guaranteed ordering of events within simulation.
With these requirements stated, making a change to the Scheduler is always tricky. There are many simulators using Sparta that can experience unacceptable changes in timing that we just can't allow.
But I have an idea that might allow you and your team to move forward. From what I understand, you are using Sparta as the main driver of simulation, driving both Spike and your cache/system model. Since the Spike model is orthogonal (w.r.t. timing) to the rest of the system, ordering is not required for Spike. What I suggest is that you use Sparta as a slave in your simulation flow, allowing the special scheduler you've mentioned to be the driver. The Sparta Scheduler has APIs that allow a modeler to determine the next scheduled item in time between calls to `run()`: `sparta::Scheduler::nextEventTick` and `sparta::Scheduler::isFinished`. You can "schedule" the Sparta Scheduler's `run` method to advance 1 tick or run indefinitely until the Scheduler determines `isFinished`. Then you can schedule a run of Spike and advance it again. What you get from this design is:
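That slave-mode interleaving might look roughly like the following. Both classes here are stand-in stubs I've invented so the loop is self-contained: `TimedModel` mirrors the `nextEventTick()`/`isFinished()`/`run()` calls named above (it is not the real `sparta::Scheduler`), and `IssCore` stands in for Spike.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for sparta::Scheduler: just enough surface to show the loop.
// The method names mirror the real APIs; the bodies are stubs.
struct TimedModel {
    std::vector<uint64_t> events;  // pending event ticks, sorted ascending
    uint64_t now = 0;
    bool isFinished() const { return events.empty(); }
    uint64_t nextEventTick() const { return events.front(); }
    void run(uint64_t upto) {      // advance, retiring events <= upto
        now = upto;
        while (!events.empty() && events.front() <= upto)
            events.erase(events.begin());
    }
};

// Stand-in for a Spike core: pretend it retires one instruction per tick.
struct IssCore {
    uint64_t retired = 0;
    void runUntil(uint64_t tick) { retired = tick; }
};

// Outer driver: the special scheduler owns the loop; the timed Sparta
// model is advanced only as far as its next scheduled event requires.
inline uint64_t cosimulate(TimedModel& model, IssCore& core) {
    while (!model.isFinished()) {
        uint64_t next = model.nextEventTick();
        core.runUntil(next);  // let the untimed ISS catch up
        model.run(next);      // then advance the timed model
    }
    return core.retired;
}
```

Because the ISS only stops at ticks where the timed model actually has work, the event-queue overhead is paid per event rather than per simulated cycle.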
Bonus: Because the timing between the cores and the system isn't required to be strict, you could place the cores in threads, have them run simultaneously, then sync, then advance Sparta to handle any outstanding transactions (if any are registered). Note that if you put the Sparta Scheduler in a thread running in parallel with these cores, not only will you get nondeterministic behavior, but you will also have to convert any events scheduled by a Spike-generated transaction on the timed Sparta components to a sparta::AsyncEvent. I can provide more details on how to do this if this is a direction you'd like to explore.
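A rough sketch of that run-in-threads-then-sync pattern, with hypothetical stub types: the cores advance a timeslice in parallel, the driver joins them, and only then advances the scheduler on the main thread. In this variant the scheduler never runs concurrently with the cores, so no `sparta::AsyncEvent` conversion is needed.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stub: a core that can advance a timeslice independently.
struct CoreStub {
    std::atomic<uint64_t> executed{0};
    void runSlice(uint64_t insts) { executed += insts; }  // pretend work
};

// Run all cores simultaneously for one timeslice, sync (join), then
// advance the timed scheduler on the main thread to drain any
// outstanding transactions the cores registered.
inline void advanceTimeslice(std::vector<CoreStub>& cores,
                             uint64_t slice,
                             const std::function<void()>& advance_sched) {
    std::vector<std::thread> workers;
    for (auto& c : cores)  // cores run simultaneously...
        workers.emplace_back([&c, slice] { c.runSlice(slice); });
    for (auto& t : workers) t.join();  // ...then sync...
    advance_sched();                   // ...then advance the scheduler
}
```

Determinism is preserved here because the scheduler advance is serialized behind the join; only the intra-slice interleaving of core work is unordered, which is acceptable when core timing is orthogonal to the system.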
Moving this to discussion https://github.com/sparcians/map/discussions/329
As part of our work with Sparta, we have identified a potential opportunity to improve the performance of the event scheduler in certain cases. We model cores that have to do some work almost every cycle (unless identified as inactive), so we find ourselves continuously scheduling an event per core for the next cycle. It would be more efficient to have a special scheduler that knows about certain work that has to be done in (almost) every cycle. This new scheduler would be very similar to the one already in Sparta, but would have a slightly different run function and a method to set a core as active/inactive. This should significantly reduce the overhead of handling the data structures holding events all the time.
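One possible shape for such a scheduler, sketched with invented names (this is not Sparta code): per-cycle core work is driven from a plain active list, so nothing is inserted into or removed from the event queues on every cycle, and `setActive` replaces the per-cycle re-scheduling.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch of the proposed special scheduler. Cores that do
// work (almost) every cycle are registered once; the run loop walks the
// active ones directly instead of popping/pushing one event per core
// per cycle. Irregular work would still go through a normal event queue.
struct CycleScheduler {
    std::vector<std::function<void(uint64_t)>> cores;  // per-cycle callbacks
    std::vector<bool> active;                          // toggled by setActive
    uint64_t cycle = 0;

    void addCore(std::function<void(uint64_t)> fn) {
        cores.push_back(std::move(fn));
        active.push_back(true);
    }
    void setActive(std::size_t core, bool on) { active[core] = on; }

    void run(uint64_t cycles) {
        for (uint64_t c = 0; c < cycles; ++c, ++cycle)
            for (std::size_t i = 0; i < cores.size(); ++i)
                if (active[i]) cores[i](cycle);  // no queue churn per cycle
    }
};
```

The cost per cycle becomes a flag check plus a call per active core, instead of an ordered-queue insert and remove per core, which is where the savings described above would come from.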