stanford-ppl / spatial

Spatial: "Specify Parameterized Accelerators Through Inordinately Abstract Language"
https://spatial.stanford.edu
MIT License

Add/Scrub Instrumentation for Understanding DRAM Bottlenecks #163

Closed · mattfel1 closed this issue 5 years ago

mattfel1 commented 5 years ago

I think we already have all the components for this lying around, between the MAGCore counters, the instrumentation counters, and the backpressure/forwardpressure helpers. What would be useful is if --instrument helped answer the following questions:

1) Is Fringe spending too much time waiting for the Accel to drain/fill the data FIFOs? (i.e., parallelize loads/stores more; one way to do this is shown in the sketch after this list)
2) Is Fringe having trouble keeping up with the requests generated by the Accel? (i.e., distribute your loads/stores better in the app, or use the "Par of Pipes vs. Pipe of Pars" or "decentralized controllers" flags if/when those options exist)
3) Where are the hotspots in the Accel that cause issues 1 and/or 2?
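On point 1, the usual knob is the par factor on the transfer itself. Below is a minimal sketch of what that looks like in an app, assuming roughly the Spatial 1.0 app skeleton; the object name and the specific par factors are illustrative, not prescriptive:

```scala
import spatial.dsl._

@spatial object ParLoadSketch extends SpatialApp {
  def main(args: Array[String]): Unit = {
    val N = 1024
    val dram = DRAM[Int](N)
    setMem(dram, Array.tabulate(N){ i => i.to[Int] })

    val out = ArgOut[Int]
    Accel {
      val sram = SRAM[Int](N)
      // Widening the par factor on the load moves more elements per cycle
      // between Fringe and the Accel, so the data FIFOs drain/fill faster.
      sram load dram(0 :: N par 16)
      out := Reduce(Reg[Int](0))(N by 1 par 4){ i => sram(i) }{ _ + _ }
    }
    println(r"sum = ${getArg(out)}")
  }
}
```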

mattfel1 commented 5 years ago

I added cycles stalled / iter (outbound not ready) and cycles idle / iter (inbound not valid). These still need to be user-tested to confirm they are the right metrics to track, but I think they are a good start, and they help show where the bottlenecks are. In the best case, cycles idle / iter should equal the pipe latency (data was valid the whole time, except when it was fully exhausted and the pipe drained). Ideally, cycles stalled / iter would be 0.
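For context, here is a minimal software model of what these two counters measure, assuming the semantics described above (stall when the outbound interface is not ready, idle when the inbound interface is not valid). This is only a sketch with illustrative names, not the actual instrumentation RTL:

```scala
// Hypothetical per-stage counter model (names are illustrative, not Spatial's).
case class StageCounters(var stalled: Long = 0, var idle: Long = 0, var iters: Long = 0) {
  // Sampled once per clock cycle with the stage's interface status.
  def step(inValid: Boolean, outReady: Boolean, iterDone: Boolean): Unit = {
    if (!outReady)     stalled += 1 // downstream backpressure: cycles stalled
    else if (!inValid) idle    += 1 // upstream starvation: cycles idle
    if (iterDone)      iters   += 1
  }
  // Per-iteration metrics; ideally stalledPerIter is 0 and idlePerIter
  // converges to the stage's pipe latency.
  def stalledPerIter: Double = stalled.toDouble / iters.max(1)
  def idlePerIter: Double    = idle.toDouble / iters.max(1)
}
```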
For DRAM it's a little tricky, because there seems to be an effective lower bound on these numbers (in a simple dot-product test, I see about 200 cycles idle / iter for the stage that catches data from DRAM). Not sure how to incorporate this info.