Compiling for llvm-cpu without targeting a specific CPU is a bad experience

stellaraccident commented 1 day ago

I've seen multiple people falling down this hole: they run iree-compile on their model, targeting CPU. Then they get performance that is 10x-100x off of any reasonable expectation. Then they either go away silently or report back about poor experiences (not always reporting flags and such).

There are good reasons why a compiler like IREE shouldn't make assumptions about what the CPU target is, but on the other hand, it will almost always produce a a grossly subpar experience to not specify a target CPU, since the generic target (at least on X86) lacks so many features as to be basically useless for any high performance numerics.

I've even fallen down this hole recently and had to go remember the incantation to select a specific CPU. In the case I was working on (an f16 CPU LLM), performance was 100x different between not specifying a target CPU and specifying "host". We need to guide people better than this.

Proposal

As mentioned, there are good reasons for a compiler to not make too many assumptions without being told what to do. But I think we can/should actively warn, possibly with a link to the documentation site when the compiler is invoked without the user specifying a target CPU. Whatever the warning is should be very explicit that the user should pass --iree-llvmcpu-target-cpu=host to target the precise CPU they are running on. We should possibly also accept "generic" or something like that for if the user really wants to target the default and not get the warning. I basically want to guard against the case where the user has not specified anything and the compiler just silently generates 100x too slow of code. In almost all cases, it will be better for the user to say something and we should guide them on a proper choice.

ScottTodd commented 1 day ago

We need to guide people better than this.

Yes please! https://github.com/iree-org/iree/issues/15487

stellaraccident commented 1 day ago

Let's stop over thinking this and do something simple like I suggest. Open to other options but would like to see this improved.

ScottTodd commented 12 hours ago

The suggested proposal SGTM. I might even want to default to having LLVM use the current host for the target CPU and available features (if those are different), then have users explicitly pass "generic" for the lowest common denominator.

We could apply similar logic to the GPU backends - try to detect devices on the system (shell out to vulkaninfo / rocm-smi / nvidia-smi?) and default to what is available, but still support cross compilation with explicit device info and a "generic" target where possible.

benvanik commented 12 hours ago

Yuck - that is not a cheap thing to do and has a high risk of flakes - I am still not sure why proper documentation is insufficient? You must specify your target device (--iree-hal-target-device=) when compiling so also specifying a "use my hardware info" is fine. Just change the documentation to include both flags and then a user has a choice and knows what to do if they want to change things. Anything automatic is going to have issues. Users aren't coming in to iree-compile command line invocations blind - if all the docs specify the flag and they choose not to copy/paste it that's on them.

benvanik commented 12 hours ago

Note that this is also what clang does - https://clang.llvm.org/docs/HIPSupport.html - you must pass --offload-arch= to compile HIP code and if you want the native host target you must pass --offload-arch=native. nvcc does the same thing - you pass -arch=[some gpu] or -arch=native.

benvanik commented 12 hours ago

(if there's such a big concern about documentation not fixing this issue then I'd be ok with making compilation fail if the user doesn't specify an arch for a backend - whether a particular arg, generic, native, etc - but guessing is bad)

ScottTodd commented 12 hours ago

We can certainly update the docs (https://iree.dev/guides/deployment-configurations/cpu/#compile-a-program) and start with a warning from the compiler if information is omitted and generic is used as the default.

I'm seeing a proliferation of flags (mainly in rocm usage, but also cpu) and the documentation can't keep up. I want more of that to be captured somewhere - docs, samples, the compiler itself, etc.

See one example here: https://github.com/iree-org/iree/blob/914858fb89c028a94564b590626aab519d611e54/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py#L90-L110

benvanik commented 12 hours ago

That's insanity - besides the debug flags (dumping statistics/etc) if any of those are required that's a bug. I think Mahesh has said it before: a feature is not done until it's on by default and if all of those flags are needed to make the model compile or perform then the engineering was never completed. The only two flags required there should be --iree-hal-target-device=hip (that's using the old deprecated flag) and --iree-hip-target= (which could be native if we wanted to do what clang does and invoke amdgpu-arch if it's present).

stellaraccident commented 12 hours ago

I'm fine making --iree-llvmcpu-target-cpu=<something> required. Given the proliferation of things out there, I think that the way to get there may be to first do what I am suggesting: make it issue a warning if not specified and incorrectly/implicitly defaulting to a generic CPU (with a note that this flag will soon be required).

Agreed on all of the other points. Need to burn down all of the other flags. I'm just starting with this one.

ScottTodd commented 10 hours ago

I can take a pass at this, unless someone else wants to.

Plan:

Emit a warning if --iree-llvmcpu-target-cpu is omitted
(Maybe?) emit a warning if iree-llvmcpu-target-cpu-features is omitted
Update docs at https://iree.dev/guides/deployment-configurations/cpu/
Audit usage of --iree-hal-target-backends=llvm-cpu in-tree and set those flags explicitly
- Also start switching the repo over to --iree-hal-target-device?
(Later) make one or both of those flags required

ScottTodd commented 9 hours ago

Can someone clarify why we have all three of these flags?

--iree-llvmcpu-target-triple
--iree-llvmcpu-target-cpu
--iree-llvmcpu-target-cpu-features

It seems like the triple could be a superset of the cpu? Is there some redundancy there? I see some riscv sample code setting both: https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/runtime/src/iree/hal/local/elf/testdata/generate.sh#L68-L73

but even our microkernels blog post (highlighting cpu performance work) only includes a few of the flags: https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/docs/website/docs/community/blog/posts/microkernels.md?plain=1#L25-L32

Oh, the linalg tutorial from @bjacob explains that target-cpu is for x86, but target-cpu-features is for other architectures? https://github.com/iree-org/iree/blob/d834aa7357179e0d806f3634d2efe3af2fa45171/docs/website/docs/community/blog/posts/linalg-tutorial.md?plain=1#L183-L193

There is some complicated logic and then a few calls into LLVM itself in https://github.com/iree-org/iree/blob/main/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.cpp.

(This is why I filed https://github.com/iree-org/iree/issues/15487 - I've wanted someone directly familiar with LLVM CPU to be driving this)

ScottTodd commented 8 hours ago

More context on cpu vs cpu-features:

stellaraccident commented 8 hours ago

This stuff always grows into a bit of a hairball. The condition we are trying to guard is that iree-llvmcpu-target-cpu being empty should never drive a decision (should be a warning now and an error eventually). Will need to peel back the decision tree to that.

benvanik commented 7 hours ago

The CPU flags mirror LLVM - we can't remove them, but we could more intelligently populate them - maybe - triple is often not enough. I think we do ask for defaults from LLVM today. I'm hesitant to suggest we diverge from clang behavior as then we have to support that (and if the issue here is that our documentation sucks adding bespoke stuff only hurts that).

ScottTodd commented 5 hours ago

Made some progress stepping through the details:

I tried this resnet50 ONNX model with and without --iree-llvmcpu-target-cpu=host on my system. I see about 140ms with the flag and 170ms without:

See logs and commands used here
``` # (Download the file and upgrade it to ONNX version >= 17) $ iree-compile.exe \ resnet50-v2-7_version17.mlir \ --iree-hal-target-backends=llvm-cpu \ -o resnet_noflags.vmfb $ iree-compile.exe \ resnet50-v2-7_version17.mlir \ --iree-hal-target-backends=llvm-cpu \ --iree-llvmcpu-target-cpu=host \ -o resnet_targetcpu_host.vmfb $ iree-benchmark-module.exe \ --module=resnet_noflags.vmfb \ --device=local-task \ --function=main \ --input=1x3x224x224xf32 2024-09-20T14:05:40-07:00 Running iree-benchmark-module.exe Run on (64 X 3693 MHz CPU s) CPU Caches: L1 Data 32 KiB (x32) L1 Instruction 32 KiB (x32) L2 Unified 512 KiB (x32) L3 Unified 16384 KiB (x8) ***WARNING*** Library was built as DEBUG. Timings may be affected. ----------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ----------------------------------------------------------------------------------------- BM_main/process_time/real_time 171 ms 2898 ms 4 items_per_second=5.86105/s $ iree-benchmark-module.exe \ --module=resnet_targetcpu_host.vmfb \ --device=local-task \ --function=main \ --input=1x3x224x224xf32 2024-09-20T14:06:24-07:00 Running iree-benchmark-module.exe Run on (64 X 3693 MHz CPU s) CPU Caches: L1 Data 32 KiB (x32) L1 Instruction 32 KiB (x32) L2 Unified 512 KiB (x32) L3 Unified 16384 KiB (x8) ***WARNING*** Library was built as DEBUG. Timings may be affected. ----------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ----------------------------------------------------------------------------------------- BM_main/process_time/real_time 138 ms 2134 ms 5 items_per_second=7.24555/s ```
JitGlobals runs the CPU compilation pipeline for the host, regardless of flags and requested devices. That's fine, it's an implementation detail of the compiler. We shouldn't emit any warnings on this path. Relevant code: https://github.com/iree-org/iree/blob/891f438c27f201ae0dde3e8408eefc6fca80bc5f/compiler/src/iree/compiler/ConstEval/JitGlobals.cpp#L658-L668 https://github.com/iree-org/iree/blob/891f438c27f201ae0dde3e8408eefc6fca80bc5f/compiler/src/iree/compiler/ConstEval/JitGlobals.cpp#L804-L814
The targetCPU flag default is set here: https://github.com/iree-org/iree/blob/891f438c27f201ae0dde3e8408eefc6fca80bc5f/compiler/plugins/target/LLVMCPU/LLVMTargetOptions.h#L178-L181 I'm thinking I'll replace that with empty string then add some logic that warns and sets it back to that default later. Need to still watch for how JitGlobals, default target construction from CLI flags, and explicit target construction from --iree-hal-target-device or program IR interact with all the code paths.

iree-org / iree

Compiling for llvm-cpu without targeting a specific CPU is a bad experience #18561

Proposal