nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

Add pass that inserts infinite loops around core blocks #908

Closed jtuyls closed 3 days ago

jtuyls commented 3 days ago

Adds a new pass that inserts an infinite loop around each amdaie.core block. As a result, the cores run the same program over and over, which is useful for measuring performance statistics such as latency and throughput, averaged over a number of runs, while excluding core reconfiguration overhead.

Concretely, with this flag, it becomes possible to put a loop around the kernel command execution:

auto time0 = std::chrono::high_resolution_clock::now();
int nb_runs = 1;
for (int i = 0; i < nb_runs; i++) {
  ebuf.m_cmd_pkt->state = ERT_CMD_STATE_NEW;
  hwq->issue_command(ebuf.get_exec_buf_bo());
  hwq->wait_command(ebuf.get_exec_buf_bo(), 0);
}
auto time1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_s = (time1 - time0) / nb_runs;

at this location: https://github.com/nod-ai/iree-amd-aie/blob/076ea13e599d5c4283b989d0e6587b698220ef53/runtime/src/iree-amd-aie/driver/xrt-lite/direct_command_buffer.cc#L201

This way, every computation and data movement inside the NPU/AIE is repeated nb_runs times.
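For illustration only, the averaging in the snippet above can be factored into a small standalone helper. This is a minimal sketch, not code from this PR: `average_seconds_per_run` and `run_once` are hypothetical names, and `run_once` stands in for the issue_command/wait_command pair. It uses `std::chrono::steady_clock` on the assumption that a monotonic clock is preferable for interval timing.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not part of iree-amd-aie): calls `run_once` nb_runs
// times back to back and returns the mean wall-clock time per run, in seconds.
// `run_once` is a stand-in for issuing one kernel command and waiting for it.
inline double average_seconds_per_run(const std::function<void()>& run_once,
                                      int nb_runs) {
  assert(nb_runs > 0 && "nb_runs must be positive");
  const auto time0 = std::chrono::steady_clock::now();
  for (int i = 0; i < nb_runs; ++i) {
    run_once();
  }
  const auto time1 = std::chrono::steady_clock::now();
  // Convert the total interval to seconds as a double, then average.
  const std::chrono::duration<double> total_s = time1 - time0;
  return total_s.count() / nb_runs;
}
```

With the infinite-loop pass enabled, the callable passed as `run_once` would wrap the two hwq calls shown earlier, and nb_runs would be raised above 1 so that the one-time setup cost is amortized across runs.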

jtuyls commented 3 days ago

Does the core repeat the fill/matmul indefinitely, but the data is delivered from L3 only once? How is this intended to be used? It's not clear from the comments how an infinite loop can be used to "average over a certain number of runs". Please address this comment by improving the description in-code.

Done

jtuyls commented 3 days ago

Is this definitely useful? Have you seen subsequent runs being significantly faster? If so, is it noticeably different from using the n_runs/n_repeats flags already exposed in run.py?

Yes.

If so, is the reason that the interval between runs is reduced with this new approach, and this reduced delay prevents the AIE going to sleep?

Yes, very likely.

Is your reason for not including 1 that it will be too 'invasive'? Could it not be incorporated into the existing precompiler #ifdef logic using and ?

Yes, this can be done in a follow-up. For now, adjusting it manually works well for me, and I would still have to do that after putting it inside the ifdef, because I would have to adjust the number of runs anyway. So I'd prefer to do this properly later.