ONNX "resize" op test failures

ScottTodd commented 6 months ago

What happened?

https://github.com/iree-org/iree/pull/17330 updates our LLVM and torch-mlir commits, pulling in https://github.com/llvm/torch-mlir/pull/3013. Some tests are newly passing, many tests are still failing somewhere (compiler, runtime numerics), and a few tests are hanging on certain platforms.

At least CUDA is hanging on test_resize_downsample_scales_linear: https://github.com/iree-org/iree/actions/runs/9034897378/job/24828864270?pr=17330#step:9:1813. I can't reproduce that on Windows, though.

Steps to reproduce your issue

Generally follow the instructions at https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests and pull the config files from this repo.

For example, to run on CUDA:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_cuda.json \
  --ignore-xfails

or Vulkan:

pytest onnx/ -k test_resize -rA \
  --config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_vulkan.json \
  --ignore-xfails
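Since some of these tests hang rather than fail, it can help to run the commands above under a timeout so that a hang shows up as its own outcome. The following is a minimal sketch, not part of the test suite: the helper name `classify_run` and the 600-second limit are assumptions made for illustration.

```python
import subprocess
import sys


def classify_run(cmd, timeout_s):
    """Run cmd and report 'passed', 'failed', or 'hung'.

    'hung' means the process did not exit within timeout_s seconds;
    subprocess.run kills the child process in that case.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "passed" if result.returncode == 0 else "failed"


# Hypothetical usage with the CUDA command from this issue
# (the 600 second limit is an arbitrary choice):
#
# outcome = classify_run(
#     [sys.executable, "-m", "pytest", "onnx/", "-k", "test_resize", "-rA",
#      r"--config-files=D:\dev\projects\iree\build_tools\pkgci\external_test_suite\onnx_gpu_cuda.json",
#      "--ignore-xfails"],
#     timeout_s=600,
# )
# print(outcome)
```

Running one test at a time this way (by narrowing the `-k` filter) makes it easier to tell which test is the one that hangs.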

Config	Logs
CPU	https://gist.github.com/ScottTodd/0778165b2d31a54bfefbb9fa2b2662d6
CUDA	https://gist.github.com/ScottTodd/dd34be6577da489f3d5b6b0a0a65ed0d
Vulkan	https://gist.github.com/ScottTodd/b2f509585bee804ebd900e2144258241

Note that the Vulkan logs show a compile error: model.mlir:4:10: error: failed to legalize operation 'arith.fptosi' that was explicitly marked illegal

What component(s) does this issue relate to?

Frontends, Compiler, Runtime

Version information

No response

Additional context

No response

bjacob commented 6 months ago

FYI @AmosLewis this is the reason why https://github.com/llvm/torch-mlir/pull/3013 was ultimately dropped from the integrate #17330.

AmosLewis commented 6 months ago

> FYI @AmosLewis this is the reason why llvm/torch-mlir#3013 was ultimately dropped from the integrate #17330.

Will you start a new PR to bump it next? Do you have any idea whether it is a torch-mlir bug or an IREE bug?

ScottTodd commented 6 months ago

I suspect the Vulkan "failed to legalize operation 'arith.fptosi'" error comes from upstream MLIR SPIR-V (missing lowering).
Numerical errors in tests could be issues in the torch-mlir lowerings.
CUDA hang ... no idea, couldn't get much from CI logs and couldn't reproduce on Windows. Maybe a miscompile (torch-mlir lowering) or, if compilation succeeded and the hang happened at runtime, a runtime issue (IREE CUDA HAL).
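For checking numerics by hand, a small reference can help. This is a rough sketch in pure Python, assuming the ONNX Resize defaults apply to test_resize_downsample_scales_linear: the `half_pixel` coordinate transform and an output size of floor(input_size * scale). The function name `resize_linear_2d` and the edge clamping are assumptions of this sketch, not code from torch-mlir or IREE.

```python
import math


def resize_linear_2d(image, scale_h, scale_w):
    """Bilinear resize of a 2D list of floats using the half_pixel transform."""
    in_h, in_w = len(image), len(image[0])
    out_h = math.floor(in_h * scale_h)
    out_w = math.floor(in_w * scale_w)

    def source_coord(dst, scale, size):
        # half_pixel: x_original = (x_resized + 0.5) / scale - 0.5
        src = (dst + 0.5) / scale - 0.5
        # Assumption of this sketch: clamp to the valid index range.
        return min(max(src, 0.0), float(size - 1))

    out = []
    for oy in range(out_h):
        sy = source_coord(oy, scale_h, in_h)
        y0 = int(math.floor(sy))
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for ox in range(out_w):
            sx = source_coord(ox, scale_w, in_w)
            x0 = int(math.floor(sx))
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A 2x4 input downsampled with scales 0.6 in both spatial dimensions.
data = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(resize_linear_2d(data, 0.6, 0.6))  # approximately [[2.6667, 4.3333]]
```

Comparing a compiled module's output against a reference like this would show whether a mismatch is a rounding-level difference or a wrong coordinate transform.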

AmosLewis commented 6 months ago

https://github.com/nod-ai/SHARK-Turbine/issues/616: the model and the failing resize MLIR are listed in the description.

bjacob commented 6 months ago

> Will you start a new PR to bump it next?

I don't plan to do it myself. We have an integration rotation schedule, and this week's integrates were already done out of schedule :-)

ScottTodd commented 6 months ago

We have a separate rotation for updating torch-mlir (in fact, @AmosLewis is up for next week 🤔). They are usually updated separately but needed to be updated together in this case.

AmosLewis commented 6 months ago

https://github.com/iree-org/iree/pull/17358
