iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.6k stars 584 forks source link

initCpuLaunchConfig doesn't propagate error status #5756

Closed asaadaldien closed 2 years ago

asaadaldien commented 3 years ago

To repro, you can undo https://github.com/google/iree/pull/5755

iree-translate will fail which should reported as an (empty) slot in the perf graph.

asaadaldien commented 3 years ago

This is a buildkite issue, (comment from @GMNGeoffrey this should create a failed exist status), see logs: https://buildkite.com/iree/iree-android-benchmark/builds/2460#4286ae2e-7b78-46ba-9df7-48089b6c0767

GMNGeoffrey commented 3 years ago

So I'm very confused by these logs. The error shows up in the step where it's "preparing benchmark files" https://buildkite.com/iree/iree-android-benchmark/builds/2460#4286ae2e-7b78-46ba-9df7-48089b6c0767/45-4718, which is just copying some files around: https://github.com/google/iree/blob/51f8385c68b7e61f06c4a59971b48e72dc097427/build_tools/mako/prepare_benchmark_files.py#L30-L42

It shouldn't even be running iree-translate there. It's not called till later (not even the same script): https://github.com/google/iree/blob/51f8385c68b7e61f06c4a59971b48e72dc097427/build_tools/mako/compile_android_modules.py#L37-L43. This makes me suspicious that the logs are getting mangled somehow, but I'm not sure how. Everything looks like synchronous loops to me

GMNGeoffrey commented 3 years ago

I'm reproing locally.

$ docker run --user=$(id -u):$(id -g) --volume=$PWD:$IREE_DOCKER_WORKDIR --workdir=$IREE_DOCKER_WORKDIR --rm gcr.io/iree-oss/cmake-android@sha256:15d3266ae4865f7642a4ef4d76e5181f0dc3482a7cfba9021b6b55be524208ec python3 build_tools/mako/compile_android_modules.py

spews out the errors but has exit status 0 :confused:

GMNGeoffrey commented 3 years ago

And running that python script within the docker container interactively also has exit code 0. And so does running the iree-translate command it's failing on

build-host/iree/tools/iree-translate mobilenet-v2/iree_input.mlir --iree-mlir-to-vm-bytecode-module --iree-hal-target-backends=dylib-llvm-aot -o mobilenet-v2_Pixel4_cpu.vmfb --iree-llvm-target-triple=aarch64-none-linux-android29 --iree-flow-inline-constants-max-byte-length=2048 --iree-flow-dispatch-formation-enable-operand-fusion -iree-llvm-loop-unrolling=true

I'm guessing something is doing an emitOpError without returning it :-/

GMNGeoffrey commented 3 years ago

Yup: https://github.com/google/iree/blob/51f8385c68b7e61f06c4a59971b48e72dc097427/iree/compiler/Conversion/LinalgToLLVM/MaterializeCPULaunchConfigurationPass.cpp#L69-L74

When initCpuLaunchConfig fails and returns None, we just proceed. Something needs to propagate the failure

asaadaldien commented 3 years ago

Yup:

https://github.com/google/iree/blob/51f8385c68b7e61f06c4a59971b48e72dc097427/iree/compiler/Conversion/LinalgToLLVM/MaterializeCPULaunchConfigurationPass.cpp#L69-L74

When initCpuLaunchConfig fails and returns None, we just proceed. Something needs to propagate the failure

Good find, Thanks Geffory! I will file a separate issue to propagate that error and fix it. However this is a separate issue a bug like that can show up in the future and we shouldn’t end up with incorrect numbers reported in mako. Lets please fix that one (e.g pipeline checks vmfb module file is actually generated)

GMNGeoffrey commented 3 years ago

But it was created! It's 14 MB in fact. I'm not sure there's much the pipeline can do here

asaadaldien commented 3 years ago

Yup:

https://github.com/google/iree/blob/51f8385c68b7e61f06c4a59971b48e72dc097427/iree/compiler/Conversion/LinalgToLLVM/MaterializeCPULaunchConfigurationPass.cpp#L69-L74

When initCpuLaunchConfig fails and returns None, we just proceed. Something needs to propagate the failure

Good find, Thanks Geffory! I will file a separate issue to propagate that error and fix it. However this is a separate issue a bug like that can show up in the future and we shouldn’t end up with incorrect numbers reported in mako. Lets please keep that one

But it was created! It's 14 MB in fact. I'm not sure there's much the pipeline can do here

You are absolutely right! This is generated (incorrect module) and we should fail and mako actually was getting the right perf number for that module.

MaheshRavishankar commented 2 years ago

This is stale now. Closing