Closed: borja-perez closed this 2 years ago
We have experimented with an infrastructure like this, with apparently higher performance than the standard Sparta scheduler.
Caveats - it's in C. It uses different data structures. It may not be comparable. Apologies for TL;DR risks...
But basic approach is:
If an Engine has no inputs at phi = 0, it has nothing to do; it is removed from the schedule queue and marked dormant. If another Engine outputs data to a dormant Engine, the dormant Engine is added to the schedule queue after the queue has been traversed, before phi = 0 is run.
If an Engine knows it has nothing to do for many, many clocks (a rarely-accessed memory with a long latency, for example), it deschedules itself and schedules a Trigger on the time-ordered Trigger queue. A Trigger simply says 'reschedule this Engine at time T', and the head-of-queue trigger time is evaluated before the schedule queue is traversed for phi = 0.
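To make the dormant/Trigger mechanism concrete, here is a minimal C++ sketch of the scheme described above. The original is in C with different data structures, so every name here is illustrative, not the actual implementation: an idle Engine deschedules itself onto a time-ordered Trigger queue, and head-of-queue triggers are checked before each phi = 0 traversal.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <queue>
#include <vector>

// Illustrative sketch only -- the real implementation is in C and uses
// different data structures. An Engine runs every phi = 0 while it has
// work; when it knows it is idle for many clocks it deschedules itself
// and posts a Trigger ("reschedule me at time T") on a time-ordered queue.
struct Engine {
    const char* name;
    bool dormant = false;
    // Returns an absolute re-trigger time, or 0 to stay scheduled.
    std::function<uint64_t(uint64_t now)> step;
};

struct Trigger {
    uint64_t when;
    Engine*  engine;
    bool operator>(const Trigger& o) const { return when > o.when; }
};

struct Scheduler {
    std::deque<Engine*> runq;  // engines traversed at phi = 0
    std::priority_queue<Trigger, std::vector<Trigger>,
                        std::greater<Trigger>> triggers;  // time-ordered

    // Another Engine produced data for a dormant Engine: wake it so it
    // runs after the current traversal, before the next phi = 0.
    void wake(Engine* e) {
        if (e->dormant) { e->dormant = false; runq.push_back(e); }
    }

    void tick(uint64_t now) {
        // Head-of-queue trigger times are evaluated before the schedule
        // queue is traversed for phi = 0.
        while (!triggers.empty() && triggers.top().when <= now) {
            Engine* e = triggers.top().engine;
            triggers.pop();
            e->dormant = false;
            runq.push_back(e);
        }
        for (std::size_t n = runq.size(); n-- > 0; ) {
            Engine* e = runq.front();
            runq.pop_front();
            uint64_t retrigger = e->step(now);
            if (retrigger > now) {              // nothing to do for a while
                e->dormant = true;
                triggers.push({retrigger, e});  // self-deschedule
            } else {
                runq.push_back(e);              // stays scheduled
            }
        }
    }
};
```

The payoff is that dormant Engines cost nothing per tick; only the cheap head-of-queue comparison is paid until a Trigger fires or data arrives.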
We constructed a system with 1 to many RISC-V scalar cores running verification scalar code (with no store instructions), with a simulated memory having a latency of around a hundred cycles. The model was built around an ISS (like Spike) with extra calls automatically inserted to manage register scoreboarding, cache misses, passage of time, etc. The simulated RISC-V cores modeled icache, dcache, and function-unit latencies by scoreboarding registers with a timetag at which the value would be valid. On a miss, the memory responded with a cacheline of data 100+ clocks after it received the request. A core which sees a miss continues until the destination register is used; if the data has not returned, the timetag says 'infinity' and the core's Engine is descheduled. When the data turns up, the timetag is updated and, in general, the core's Engine is put back into the schedule queue. An icache miss deschedules the core; again, it's re-scheduled when the miss returns and the cache is updated.
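The timetag scoreboard could be sketched like this. All names are hypothetical; the point is just that each register carries the time at which its value becomes valid, with an 'infinity' sentinel marking an outstanding miss whose return time is unknown.

```cpp
#include <array>
#include <cstdint>
#include <limits>

// Hypothetical sketch of the timetag scoreboard described above, not the
// actual model's code. Each register records when its value is valid.
constexpr uint64_t kNever = std::numeric_limits<uint64_t>::max();  // 'infinity'

struct Scoreboard {
    std::array<uint64_t, 32> ready_at{};  // per-register valid-from timetag

    // An instruction (or a returning miss) produces rd at time `when`.
    void write(unsigned rd, uint64_t when) { ready_at[rd] = when; }

    // An instruction uses rs at time `now`. Returns the (possibly
    // advanced) time at which execution can proceed, or kNever, meaning
    // the core's Engine must be descheduled until the miss data returns
    // and the timetag is updated.
    uint64_t use(unsigned rs, uint64_t now) const {
        uint64_t t = ready_at[rs];
        if (t == kNever) return kNever;  // outstanding miss: core sleeps
        return t > now ? t : now;        // stall until the value is valid
    }
};
```

A dcache miss would call `write(rd, kNever)` at issue and `write(rd, return_time)` when the cacheline arrives, matching the continue-until-used behavior described above.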
We generally see overall aggregate simulation performance of around 20-30 MIPS on a 2018 Mac Mini (~3GHz Core i7 with 16 GB of RAM; single threaded). The performance dropped a bit with 64 or 128 cores all running the same silly program, interleaved at the ~100+ instruction per timeslice level.
We also did some simple time-ordered queue timing experiments, which showed that time-ordered queue insertions can be expensive.
Caveats again - we believe the simulation described here mirrored in functionality what Borja had at the time in Spike/Sparta as he's described. Both models evaluated cache lookups inline as a function call from the ISS. Both models modeled core+caches, and memory as separate Engines/Units. We were able to determine that the ISS speeds were not the bottleneck (both ISSs run 100-200 MIPS without timing and cache extras). If the situations are comparable, then it suggests that the scheduler can have a substantial effect on performance; this performance impact could well be very pronounced on more complex event-dense models such as microarchitecture models.
If this looks promising, we're happy to share more info as desired, and to prototype within Sparta and report results for putative upstreaming, if it works as hoped.
Thanks, Borja, for the feedback!
To give a little background on our Scheduler, it has (at minimum) the following requirements to adhere to:

- Ticks, which by default are picoseconds

These requirements make up the bulk of the Scheduler's functionality. What makes our Scheduler so unique is the guaranteed ordering of events within simulation.
With these requirements stated, making a change to the Scheduler is always tricky. There are many simulators using Sparta that can experience unacceptable changes in timing that we just can't allow.
But I have an idea that might allow you and your team to move forward. From what I understand, you are using Sparta as the main driver of simulation, driving both Spike and your cache/system model. Since the Spike model is orthogonal (w.r.t. timing) to the rest of the system, ordering is not required for Spike. What I suggest is that you use Sparta as a slave in your simulation flow, allowing the special scheduler you've mentioned to be the driver. The Sparta Scheduler has APIs that allow a modeler to determine the next scheduled item in time between calls to `run()`: `sparta::Scheduler::nextEventTick` and `sparta::Scheduler::isFinished`. You can "schedule" the Sparta Scheduler's `run` method to advance 1 tick or run indefinitely until the Scheduler determines `isFinished`. Then you can schedule a run of Spike and advance it again. What you get from this design is:
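That slave-mode interleaving might look roughly like the following. Both classes here are stand-in stubs I've invented so the loop is self-contained: `TimedModel` mirrors the `nextEventTick()`/`isFinished()`/`run()` calls named above (it is not the real `sparta::Scheduler`), and `IssCore` stands in for Spike.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for sparta::Scheduler: just enough surface to show the loop.
// The method names mirror the real APIs; the bodies are stubs.
struct TimedModel {
    std::vector<uint64_t> events;  // pending event ticks, sorted ascending
    uint64_t now = 0;
    bool isFinished() const { return events.empty(); }
    uint64_t nextEventTick() const { return events.front(); }
    void run(uint64_t upto) {      // advance, retiring events <= upto
        now = upto;
        while (!events.empty() && events.front() <= upto)
            events.erase(events.begin());
    }
};

// Stand-in for a Spike core: pretend it retires one instruction per tick.
struct IssCore {
    uint64_t retired = 0;
    void runUntil(uint64_t tick) { retired = tick; }
};

// Outer driver: the special scheduler owns the loop; the timed Sparta
// model is advanced only as far as its next scheduled event requires.
inline uint64_t cosimulate(TimedModel& model, IssCore& core) {
    while (!model.isFinished()) {
        uint64_t next = model.nextEventTick();
        core.runUntil(next);  // let the untimed ISS catch up
        model.run(next);      // then advance the timed model
    }
    return core.retired;
}
```

Because the ISS only stops at ticks where the timed model actually has work, the event-queue overhead is paid per event rather than per simulated cycle.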
Bonus: Because the timing between the cores and the system isn't required to be strict, you could place the cores in threads, have them run simultaneously, then sync, then advance Sparta to handle any outstanding transactions (if any are registered). Note that if you put the Sparta Scheduler in a thread running in parallel with these cores, not only will you get nondeterministic behavior, but you will also have to convert any events scheduled by a Spike-generated transaction on the timed Sparta components to a sparta::AsyncEvent. I can provide more details on how to do this if this is a direction you'd like to explore.
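A rough sketch of that run-in-threads-then-sync pattern, with hypothetical stub types: the cores advance a timeslice in parallel, the driver joins them, and only then advances the scheduler on the main thread. In this variant the scheduler never runs concurrently with the cores, so no `sparta::AsyncEvent` conversion is needed.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stub: a core that can advance a timeslice independently.
struct CoreStub {
    std::atomic<uint64_t> executed{0};
    void runSlice(uint64_t insts) { executed += insts; }  // pretend work
};

// Run all cores simultaneously for one timeslice, sync (join), then
// advance the timed scheduler on the main thread to drain any
// outstanding transactions the cores registered.
inline void advanceTimeslice(std::vector<CoreStub>& cores,
                             uint64_t slice,
                             const std::function<void()>& advance_sched) {
    std::vector<std::thread> workers;
    for (auto& c : cores)  // cores run simultaneously...
        workers.emplace_back([&c, slice] { c.runSlice(slice); });
    for (auto& t : workers) t.join();  // ...then sync...
    advance_sched();                   // ...then advance the scheduler
}
```

Determinism is preserved here because the scheduler advance is serialized behind the join; only the intra-slice interleaving of core work is unordered, which is acceptable when core timing is orthogonal to the system.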
Moving this to discussion https://github.com/sparcians/map/discussions/329
As part of our work with Sparta, we have identified a potential opportunity to improve the performance of the event scheduler in certain cases. We model cores that have to do some work almost every cycle (unless identified as inactive), so we find ourselves continuously scheduling an event per core for the next cycle. It would be more efficient to have a special scheduler that knows about certain work that has to be done in (almost) every cycle. This new scheduler would be very similar to the one already in Sparta, but would have a slightly different run function and a method to set a core as active/inactive. This should significantly reduce the overhead of handling the data structures holding events all the time.
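One possible shape for such a scheduler, sketched with invented names (this is not Sparta code): per-cycle core work is driven from a plain active list, so nothing is inserted into or removed from the event queues on every cycle, and `setActive` replaces the per-cycle re-scheduling.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch of the proposed special scheduler. Cores that do
// work (almost) every cycle are registered once; the run loop walks the
// active ones directly instead of popping/pushing one event per core
// per cycle. Irregular work would still go through a normal event queue.
struct CycleScheduler {
    std::vector<std::function<void(uint64_t)>> cores;  // per-cycle callbacks
    std::vector<bool> active;                          // toggled by setActive
    uint64_t cycle = 0;

    void addCore(std::function<void(uint64_t)> fn) {
        cores.push_back(std::move(fn));
        active.push_back(true);
    }
    void setActive(std::size_t core, bool on) { active[core] = on; }

    void run(uint64_t cycles) {
        for (uint64_t c = 0; c < cycles; ++c, ++cycle)
            for (std::size_t i = 0; i < cores.size(); ++i)
                if (active[i]) cores[i](cycle);  // no queue churn per cycle
    }
};
```

The cost per cycle becomes a flag check plus a call per active core, instead of an ordered-queue insert and remove per core, which is where the savings described above would come from.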