Closed lialan closed 3 months ago
Good catch! Marie knows these much better, but I agree with conclusion
Ooh sorry I had missed that part! So my above comment was incorrect: this code was not causing falling off of the ukernel fast paths, since it was not taken when ukernels are enabled.
But good point that we don't need separate ukernel vs non-ukernel paths here anymore. Since ukernels are default, we don't care much anymore about the non-ukernel fine-tuning (and if we did, the best thing to do would be to work on improving codegen on the same tile shapes as are being selected.
So: resolve this by inlining all instances of hasUkernel
in this function as true
, i.e. always select tile sizes as if having ukernels enabled. (So when propagating that, that will result in dropping one of the two code paths as suggested by Hanhan).
FYI this appears to have regressed performance on at least one continuous benchmark configuration: https://perf.iree.dev/serie?IREE?a57eb6fb2cf6fbf61f9141e63784cc356153766042d0890e6caee565eadcbf21 (~1200ms -> ~2000ms)
The compile flags for that benchmark are here: https://github.com/iree-org/iree/blob/024c48b23c0e8721d94e544fa78bcfd291afa439/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake#L1673-L1686
What is the latest status for this flag combination? Is it useful to keep the experimental-flags
benchmarks at this point?
"--iree-opt-data-tiling=true"
"--iree-llvmcpu-enable-ukernels=none"
This similar benchmark https://perf.iree.dev/serie?IREE?3263426782173c417a4205ee460ccf4acb939c53397da8ae06f8ebf3f7228f87 using these flags is stable at 317ms: https://github.com/iree-org/iree/blob/024c48b23c0e8721d94e544fa78bcfd291afa439/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake#L1583-L1596
Thanks @ScottTodd for flagging this. In this instance, we want to live with this regression because (1) it only affects a non-default path and (2) it only exposes an existing flaw in that non-default path (which was being papered over by the thing that this PR removed).
DT-only is now a non-default configuration on x86-64 and arm-64 CPU, since the mmt4d ukernel was enabled by default there, making DT_UK the new default.
The fact that DT_UK does 317 ms where DT_only used to do 1200 ms with the tweaked tiled size and now does 2000 ms without the tweaked tile size, just means that codegen is doing a very poor job. There is no fundamental reason why codegen would need a different tile size than the ukernel case, so the currently 2000 ms codegen path could be optimized to become as fast as the 317 ms ukernel path, if anyone cares --- so there'll never be a need for separate tile sizes for codegen.
I don't know though that anyone still cares about pure codegen for mmt4d ops. It would be neat to not need ukernels, but having ukernels on by default makes it a non-priority for some of us.
This patch is to make sure the matmul tiles selected for ARM64 to have same dimensions for arm64 ukernels.