jtuyls closed 3 days ago
Does the core repeat the fill/matmul indefinitely, but the data is delivered from L3 only once? How is this intended to be used? It's not clear from the comments how an infinite loop can be used to "average over a certain number of runs". Please address this comment by improving the description in-code.
Done
Is this definitely useful? Have you seen subsequent runs being significantly faster? If so, is it noticeably different from using the n_runs/n_repeats flags already exposed in run.py?
Yes.
If so, is the reason that the interval between runs is reduced with this new approach, and this reduced delay prevents the AIE going to sleep?
Yes, very likely.
Is your reason for not including 1 that it will be too 'invasive'? Could it not be incorporated into the existing precompiler #ifdef logic using
and ?
Yes, this can be done in a follow-up. For now, manually adjusting works well for me, and I would still have to do that after putting it inside the #ifdef, as I would have to adjust the number of runs anyway. So I prefer to do this properly later.
Adds a new pass that inserts infinite looping around the amdaie.core blocks. This results in the cores running the same program over and over, which is useful for measuring performance statistics like latency/throughput, averaged over a certain number of runs, while excluding core reconfiguration overhead.

Concretely, with this flag, it becomes possible to put a loop around the kernel command execution at this location: https://github.com/nod-ai/iree-amd-aie/blob/076ea13e599d5c4283b989d0e6587b698220ef53/runtime/src/iree-amd-aie/driver/xrt-lite/direct_command_buffer.cc#L201
This way, every computation and data movement inside the NPU/AIE will be repeated N times.
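As a rough illustration of the host-side loop described above, here is a minimal sketch that repeats a kernel execution N times and averages the wall-clock time per run. The helper `average_latency_us` and the `execute_kernel` callable are hypothetical names introduced only for this example; they are not part of iree-amd-aie or XRT, and the real change would wrap the existing command execution call in direct_command_buffer.cc instead.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not an iree-amd-aie or XRT API): runs `execute_kernel`
// `n_repeats` times back to back and returns the mean wall-clock time per run
// in microseconds.
double average_latency_us(const std::function<void()>& execute_kernel,
                          int n_repeats) {
  assert(n_repeats > 0 && "need at least one run to average over");
  using clock = std::chrono::steady_clock;

  const auto start = clock::now();
  for (int i = 0; i < n_repeats; ++i) {
    // Stand-in for submitting the kernel command and waiting for completion.
    // With the infinite-loop pass enabled, the cores are assumed to already
    // hold the program, so each iteration only repeats compute and data
    // movement without core reconfiguration.
    execute_kernel();
  }
  const auto stop = clock::now();

  const std::chrono::duration<double, std::micro> total = stop - start;
  return total.count() / n_repeats;
}
```

Timing the whole loop and dividing by N, rather than timing each iteration separately, keeps the interval between consecutive runs as short as possible, which matches the motivation discussed in the review thread.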