nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

[XRT-LITE] add ability to configure NPU power mode #851

Closed makslevental closed 1 month ago

makslevental commented 1 month ago

Notes

  1. Changing the power mode needs sudo;
    • if you want to run the whole run_matmul_test.sh script under sudo and it relies on environment variables, use sudo -E to preserve them;
  2. I remembered that this can also be done with xrt-smi, using something like
    sudo /opt/xilinx/xrt/bin/xrt-smi configure -d 0000:c5:00.1 --pmode turbo

    Still, it may be useful to expose this directly here so that xrt-smi isn't required in the environment.
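
For scripting the xrt-smi route from note 2, a minimal Python sketch could assemble the command like this. The helper name `xrt_smi_pmode_cmd` and its parameters are hypothetical; only the xrt-smi path, the `configure -d ... --pmode` arguments, and the `turbo` value come from the command above, and `sudo -E` is the environment-preserving form from note 1. The sketch only builds the argument list and prints it; it does not run anything.

```python
import shlex

XRT_SMI = "/opt/xilinx/xrt/bin/xrt-smi"  # path taken from the note above


def xrt_smi_pmode_cmd(device_bdf, pmode, preserve_env=False):
    """Build (but do not run) the sudo xrt-smi command as an argv list.

    Hypothetical helper: the name and parameters are for illustration only.
    """
    sudo = ["sudo", "-E"] if preserve_env else ["sudo"]
    return sudo + [XRT_SMI, "configure", "-d", device_bdf, "--pmode", pmode]


cmd = xrt_smi_pmode_cmd("0000:c5:00.1", "turbo")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where xrt-smi is installed.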

Using the added test I got these results:

Default

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
BM_matmul_64x64_64xbf16_/process_time/real_time              1.61 ms        0.671 ms          456 items_per_second=620.987/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.56 ms        0.641 ms          456 items_per_second=643.073/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.59 ms        0.648 ms          456 items_per_second=630.323/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.62 ms        0.653 ms          456 items_per_second=616.069/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.59 ms        0.646 ms          456 items_per_second=629.755/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.644 ms          456 items_per_second=635.695/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.58 ms        0.641 ms          456 items_per_second=633.842/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.639 ms          456 items_per_second=636.084/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.59 ms        0.642 ms          456 items_per_second=630.571/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.58 ms        0.648 ms          456 items_per_second=633/s
BM_matmul_64x64_64xbf16_/process_time/real_time_mean         1.59 ms        0.648 ms           10 items_per_second=630.94/s
BM_matmul_64x64_64xbf16_/process_time/real_time_median       1.58 ms        0.645 ms           10 items_per_second=631.786/s
BM_matmul_64x64_64xbf16_/process_time/real_time_stddev      0.019 ms        0.009 ms           10 items_per_second=7.68176/s
BM_matmul_64x64_64xbf16_/process_time/real_time_cv           1.23 %          1.42 %            10 items_per_second=1.22%

Turbo

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.652 ms          433 items_per_second=638.857/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.55 ms        0.651 ms          433 items_per_second=644.931/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.650 ms          433 items_per_second=638.939/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.644 ms          433 items_per_second=638.037/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.57 ms        0.664 ms          433 items_per_second=635.318/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.58 ms        0.663 ms          433 items_per_second=631.421/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.54 ms        0.648 ms          433 items_per_second=650.474/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.54 ms        0.646 ms          433 items_per_second=649.22/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.56 ms        0.669 ms          433 items_per_second=642.177/s
BM_matmul_64x64_64xbf16_/process_time/real_time              1.60 ms        0.660 ms          433 items_per_second=623.584/s
BM_matmul_64x64_64xbf16_/process_time/real_time_mean         1.56 ms        0.655 ms           10 items_per_second=639.296/s
BM_matmul_64x64_64xbf16_/process_time/real_time_median       1.57 ms        0.652 ms           10 items_per_second=638.898/s
BM_matmul_64x64_64xbf16_/process_time/real_time_stddev      0.020 ms        0.009 ms           10 items_per_second=8.09723/s
BM_matmul_64x64_64xbf16_/process_time/real_time_cv           1.27 %          1.31 %            10 items_per_second=1.27%

Higher items_per_second is better (it's a throughput counter, so I'm pretty sure).

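So for BM_matmul_64x64_64xbf16_/process_time/real_time_mean we get 630.94/s under default vs. 639.296/s under turbo, a gap of about 8.4/s (roughly 1.3%). That is about one run-to-run standard deviation (7.68176/s for default, 8.09723/s for turbo), so it's basically the same. So I'm not sure what the effect should be :shrug:.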
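
One rough way to put a number on "basically the same" is to compare the gap between the two means with the spread of the repetitions. The sketch below is not part of the PR; it copies the `_mean` and `_stddev` rows from the two tables and computes the absolute gap, the relative gap, and Welch's t-statistic. It assumes the 10 repetitions per mode are independent and roughly normally distributed, which may not hold for a debug build on a shared machine.

```python
import math

# items_per_second summary rows copied from the two tables above.
default = {"mean": 630.94, "stddev": 7.68176, "n": 10}
turbo = {"mean": 639.296, "stddev": 8.09723, "n": 10}


def compare(a, b):
    """Return (gap, relative gap, Welch's t) between two benchmark summaries."""
    diff = b["mean"] - a["mean"]
    rel = diff / a["mean"]
    # Standard error of the difference of two independent sample means.
    se = math.sqrt(a["stddev"] ** 2 / a["n"] + b["stddev"] ** 2 / b["n"])
    return diff, rel, diff / se


diff, rel, t = compare(default, turbo)
print(f"diff = {diff:.3f}/s ({rel:.2%}), Welch t = {t:.2f}")
```

With only 10 repetitions per mode, the t-value is best read as a rough guide rather than a verdict; more repetitions or a release build would give a cleaner comparison.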

Note that at least one of the things the mode change does is toggle clock gating: going from mode 0 to mode 4 disables it, and going back to mode 0 re-enables it:

[13486.742867] amdxdna:aie2_pm_set_mode:90: amdxdna 0000:c5:00.1: Changing power mode from 0 to 4
[13486.742869] amdxdna:aie2_pm_clock_gating:27: amdxdna 0000:c5:00.1: Disable clock gating, 1 type(s)
...
[13493.313651] amdxdna:aie2_pm_set_mode:90: amdxdna 0000:c5:00.1: Changing power mode from 4 to 0
[13493.313653] amdxdna:aie2_pm_clock_gating:27: amdxdna 0000:c5:00.1: Enable clock gating, 1 type(s)

(via dmesg).
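
To check for these transitions without reading dmesg by eye, a small Python sketch can pull out the power mode changes. The regex is written against the four log lines quoted above only, so it is an assumption that other driver versions format the message the same way; the function name `power_mode_changes` is hypothetical, and the numeric modes are reported as-is without interpreting them.

```python
import re

# Matches the aie2_pm_set_mode lines in the format quoted above.
MODE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdxdna:aie2_pm_set_mode:\d+: "
    r"amdxdna (?P<dev>\S+): Changing power mode from (?P<old>\d+) to (?P<new>\d+)"
)


def power_mode_changes(dmesg_text):
    """Return (timestamp, device, old_mode, new_mode) for each mode change."""
    return [
        (float(m["ts"]), m["dev"], int(m["old"]), int(m["new"]))
        for m in MODE_RE.finditer(dmesg_text)
    ]


# Sample input: the log lines from this PR description.
sample = """\
[13486.742867] amdxdna:aie2_pm_set_mode:90: amdxdna 0000:c5:00.1: Changing power mode from 0 to 4
[13486.742869] amdxdna:aie2_pm_clock_gating:27: amdxdna 0000:c5:00.1: Disable clock gating, 1 type(s)
[13493.313651] amdxdna:aie2_pm_set_mode:90: amdxdna 0000:c5:00.1: Changing power mode from 4 to 0
[13493.313653] amdxdna:aie2_pm_clock_gating:27: amdxdna 0000:c5:00.1: Enable clock gating, 1 type(s)
"""
for change in power_mode_changes(sample):
    print(change)
```

In practice the text could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may itself need elevated privileges depending on the system.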

EDIT:

I did this test with a debug build - maybe in a release build there's a difference 🤷‍♂️