iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.58k stars 577 forks source link

Drop tile sizes specific to the ukernels-disabled case. #17631

Closed lialan closed 3 months ago

lialan commented 3 months ago

This patch is to make sure the matmul tiles selected for ARM64 to have same dimensions for arm64 ukernels.

jpienaar commented 3 months ago

Good catch! Marie knows these much better, but I agree with conclusion

bjacob commented 3 months ago

Ooh sorry I had missed that part! So my above comment was incorrect: this code was not causing falling off of the ukernel fast paths, since it was not taken when ukernels are enabled.

But good point that we don't need separate ukernel vs non-ukernel paths here anymore. Since ukernels are default, we don't care much anymore about the non-ukernel fine-tuning (and if we did, the best thing to do would be to work on improving codegen on the same tile shapes as are being selected.

So: resolve this by inlining all instances of hasUkernel in this function as true, i.e. always select tile sizes as if having ukernels enabled. (So when propagating that, that will result in dropping one of the two code paths as suggested by Hanhan).

ScottTodd commented 3 months ago

FYI this appears to have regressed performance on at least one continuous benchmark configuration: https://perf.iree.dev/serie?IREE?a57eb6fb2cf6fbf61f9141e63784cc356153766042d0890e6caee565eadcbf21 (~1200ms -> ~2000ms)

The compile flags for that benchmark are here: https://github.com/iree-org/iree/blob/024c48b23c0e8721d94e544fa78bcfd291afa439/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake#L1673-L1686

What is the latest status for this flag combination? Is it useful to keep the experimental-flags benchmarks at this point?

     "--iree-opt-data-tiling=true" 
     "--iree-llvmcpu-enable-ukernels=none" 

This similar benchmark https://perf.iree.dev/serie?IREE?3263426782173c417a4205ee460ccf4acb939c53397da8ae06f8ebf3f7228f87 using these flags is stable at 317ms: https://github.com/iree-org/iree/blob/024c48b23c0e8721d94e544fa78bcfd291afa439/tests/e2e/test_artifacts/generated_e2e_test_iree_artifacts.cmake#L1583-L1596

bjacob commented 3 months ago

Thanks @ScottTodd for flagging this. In this instance, we want to live with this regression because (1) it only affects a non-default path and (2) it only exposes an existing flaw in that non-default path (which was being papered over by the thing that this PR removed).

DT-only is now a non-default configuration on x86-64 and arm-64 CPU, since the mmt4d ukernel was enabled by default there, making DT_UK the new default.

The fact that DT_UK does 317 ms where DT_only used to do 1200 ms with the tweaked tiled size and now does 2000 ms without the tweaked tile size, just means that codegen is doing a very poor job. There is no fundamental reason why codegen would need a different tile size than the ukernel case, so the currently 2000 ms codegen path could be optimized to become as fast as the 317 ms ukernel path, if anyone cares --- so there'll never be a need for separate tile sizes for codegen.

I don't know though that anyone still cares about pure codegen for mmt4d ops. It would be neat to not need ukernels, but having ukernels on by default makes it a non-priority for some of us.