Open SeanNijjar opened 3 weeks ago
@davorchap - gave to you to reassign. Since @aliuTT is out for now.
For now I've decided to deprioritize this: after a bit of investigation I found a bigger potential culprit for the perf swings. I've also been unable to reproduce the massive gains from removing this the other day (though I still see some noticeable improvement).
Currently, the `erisc.cc` erisc kernel wrapper code (rightfully) performs context switches to the lower-level routing FW while waiting for work. However, the current implementation will always context switch as soon as it sees there is no run message for the core.
This is massively detrimental to erisc user perf (and CCL perf more generally), since it makes it very likely that multiple cores context switch across the op at startup. Aggregated across the entire op, this effect compounds and can waste many thousands of cycles at startup. It also introduces large startup skew between eriscs on the same core.
Overall, after prototyping a non-mergeable change that introduces an idle timer before context switching in this code block, I was able to massively reduce CCL cycle times, especially for smaller invocations: 2-3 thousand cycles on the low end (for minimum times) and many thousands on the high end (max cycle count).
e.g., an example of the experimental change:
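The original snippet isn't shown here, but the idea can be sketched as follows. This is a minimal, host-testable model of the wait loop, not the real `erisc.cc` code: `risc_context_switch()`, `run_message_present()`, and `IDLE_TIMEOUT` are hypothetical stand-ins for the actual FW hooks and mailbox check. The change is to only yield to the routing FW after a run of consecutive empty polls, rather than on the very first one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real erisc FW hooks; the identifiers are
// illustrative, not the actual tt-metal names.
static int context_switch_count = 0;
static void risc_context_switch() { ++context_switch_count; }  // would yield to routing FW

// Simulated run-message mailbox: work "arrives" after work_arrival_iter polls.
static uint32_t poll_iter = 0;
static uint32_t work_arrival_iter = 0;
static bool run_message_present() { return ++poll_iter > work_arrival_iter; }

// Illustrative threshold; the real value would need tuning against FW latency needs.
constexpr uint32_t IDLE_TIMEOUT = 1000;

void wait_for_work() {
    uint32_t idle_count = 0;
    while (!run_message_present()) {
        // Only context switch after IDLE_TIMEOUT consecutive empty polls,
        // instead of immediately on the first empty poll.
        if (++idle_count >= IDLE_TIMEOUT) {
            risc_context_switch();
            idle_count = 0;
        }
    }
}
```

With this shape, a core whose run message lands within the idle window never context switches at startup, which is where the cycle savings above come from.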
Note that you can't naively add an always-active idle counter, because that would block the ethernet FW routing from working correctly. Instead, we should only enable this after we are absolutely sure the full FD path has been brought up. Additionally, we may wish to expose some low-level FW signals (e.g. work available) to this level of FW so it can know it should context switch.
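One way to sketch that gating, again as a host-testable model rather than real firmware: assume a hypothetical `fd_path_initialized` flag that the dispatch path sets once bring-up completes. Before the flag is set, the loop keeps today's behavior (yield immediately, so the routing FW can service the ethernet traffic needed for initialization); after it is set, the idle timer kicks in. All identifiers below are illustrative, not actual tt-metal names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical FW hooks, as in the previous sketch.
static int context_switch_count = 0;
static void risc_context_switch() { ++context_switch_count; }

static uint32_t poll_iter = 0;
static uint32_t work_arrival_iter = 0;
static bool run_message_present() { return ++poll_iter > work_arrival_iter; }

constexpr uint32_t IDLE_TIMEOUT = 1000;

// Hypothetical flag: set (e.g. by host or dispatch FW) once the full FD path is up.
static volatile uint32_t fd_path_initialized = 0;

void wait_for_work() {
    uint32_t idle_count = 0;
    while (!run_message_present()) {
        if (!fd_path_initialized) {
            // Before FD bring-up: keep the current behavior and yield on every
            // empty poll, so routing FW is never starved during initialization.
            risc_context_switch();
        } else if (++idle_count >= IDLE_TIMEOUT) {
            // After bring-up: only yield once a real idle period has elapsed.
            risc_context_switch();
            idle_count = 0;
        }
    }
}
```

A "work available" signal exposed from the lower-level FW could slot into the same spot, replacing the timeout with an explicit hint about when a context switch is actually useful.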