iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

CUDA backend Conv2d wrong result for specific parameter constellations #16476

Open lucas-camp opened 4 months ago

lucas-camp commented 4 months ago

What happened?

I have a PyTorch Conv2d module that is compiled with SHARK Turbine. Running the generated MLIR file through IREE's CUDA backend computes wrong results for specific combinations of input shapes, padding values, and strides. It seems that the wrong results only appear if both input spatial dimensions are a multiple of 16, the ~~strides~~ paddings are a multiple of 16 plus 1 (i.e. 1, 17, 33, ...), and the kernel size is 3.
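To make the failing constellation concrete, here is the condition restated as a small predicate. This is purely my reading of the observations above; the exact boundary of the bug is an assumption, not confirmed:

```python
def is_affected(height, width, padding, kernel_size):
    """Heuristic predicate for the failing parameter constellation,
    based only on my observations (an assumption, not a confirmed rule):
    both spatial dims are multiples of 16, the padding is of the form
    16*k + 1, and the kernel size is 3."""
    return (
        height % 16 == 0
        and width % 16 == 0
        and padding % 16 == 1
        and kernel_size == 3
    )

# The reported 1x1x16x16 input with padding 1 and a 3x3 kernel matches:
print(is_affected(16, 16, 1, 3))   # → True
# Padding 2 (not of the form 16*k + 1) does not:
print(is_affected(16, 16, 2, 3))   # → False
```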

Using Torch-MLIR with output type LINALG_ON_TENSORS results in the same wrong result. However using Torch-MLIR with output type TOSA produces correct results.

I tested the CUDA code on a Tesla V100 (driver 450, CUDA 12.3) and an MX150 (driver 535, CUDA 12.2). It's worth noting that the CPU backend computes the correct result in all cases.

Steps to reproduce your issue

Compile the different MLIR inputs from https://gist.github.com/lucas-camp/da680f922ea958fbcbdf0eee79ebf523#file-conv2d_turbine-mlir with the commands `iree-compile --iree-hal-target-backends=llvm-cpu INPUT.mlir -o OUTPUT.vmfb` for CPU and `iree-compile --iree-hal-target-backends=cuda --iree-hal-cuda-llvm-target-arch=sm_70 INPUT.mlir -o OUTPUT.vmfb` for CUDA. Run both modules and compare the outputs for a random input of size 1x1x16x16.
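When comparing the two module outputs, a plain NumPy reference of the NCHW conv can serve as a third opinion. This is a sketch I wrote for illustration (no bias, symmetric padding); the actual reproduction uses the compiled modules above:

```python
import numpy as np

def conv2d_nchw(x, w, stride=1, padding=1):
    """Direct NCHW Conv2d reference (no bias), for cross-checking the
    CPU and CUDA module outputs. Shapes: x is (N, C, H, W), w is
    (O, C, KH, KW)."""
    n, c, h, wd = x.shape
    o, _, kh, kw = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (wd + 2 * padding - kw) // stride + 1
    out = np.zeros((n, o, oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = xp[:, :, i*stride:i*stride+kh, j*stride:j*stride+kw]
            # Contract over input channels and the kernel window.
            out[:, :, i, j] = np.einsum("nchw,ochw->no", patch, w)
    return out

x = np.random.rand(1, 1, 16, 16).astype(np.float32)
w = np.random.rand(1, 1, 3, 3).astype(np.float32)
ref = conv2d_nchw(x, w, stride=1, padding=1)
print(ref.shape)  # → (1, 1, 16, 16)
```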

What component(s) does this issue relate to?

Compiler

Version information

IREE version 20240218.805

Additional context

FYI @marbre

stellaraccident commented 4 months ago

Based on the description, I'm going to start with the assumption that this is an issue with the torch to linalg conversions. Asking Rob to weigh in.

rsuderman commented 4 months ago

> Based on the description, I'm going to start with the assumption that this is an issue with the torch to linalg conversions. Asking Rob to weigh in.

So the big problem is that the TOSA path and the linalg path generate different linalg named ops. Specifically, TOSA inserts transposes to support an NHWC ordering, while torch uses the NCHW case. Looking at the generated dispatches, they appear to be fine?
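For reference, the two lowerings should agree numerically: a conv computed directly in NHWC is the same as the NCHW conv wrapped in layout transposes. A NumPy sketch of that equivalence (not the actual linalg named ops, just the layout math):

```python
import numpy as np

def conv_nchw(x, w):
    """Minimal stride-1, no-padding NCHW conv. x: (N, C, H, W), w: (O, C, KH, KW)."""
    kh, kw = w.shape[2], w.shape[3]
    oh, ow = x.shape[2] - kh + 1, x.shape[3] - kw + 1
    out = np.zeros((x.shape[0], w.shape[0], oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[:, :, i, j] = np.einsum(
                "nchw,ochw->no", x[:, :, i:i+kh, j:j+kw], w)
    return out

def conv_nhwc(x, w):
    """Same conv computed directly in NHWC. x: (N, H, W, C), w: (KH, KW, C, F)."""
    kh, kw = w.shape[0], w.shape[1]
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((x.shape[0], oh, ow, w.shape[3]), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[:, i, j, :] = np.einsum(
                "nhwc,hwcf->nf", x[:, i:i+kh, j:j+kw, :], w)
    return out

# TOSA-style path: transpose NHWC -> NCHW, run the NCHW conv, transpose back.
rng = np.random.default_rng(0)
x_nhwc = rng.random((1, 16, 16, 1), dtype=np.float32)
w_hwcf = rng.random((3, 3, 1, 1), dtype=np.float32)

via_transposes = np.transpose(
    conv_nchw(np.transpose(x_nhwc, (0, 3, 1, 2)),    # NHWC -> NCHW
              np.transpose(w_hwcf, (3, 2, 0, 1))),   # HWCF -> OCHW
    (0, 2, 3, 1))                                    # NCHW -> NHWC
direct = conv_nhwc(x_nhwc, w_hwcf)
print(np.allclose(via_transposes, direct))  # → True
```

So if the dispatches themselves look fine, the divergence would have to come from something the NCHW-specialized path does differently at a lower level.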

lucas-camp commented 4 months ago

Thanks for your response. Since the llvm-cpu path produces correct results, the GPU path only fails for specific parameter constellations, and the failure occurs when going from torch to linalg directly, I could imagine that compilation goes wrong for some parameter specialization (for example, 3x3 kernels) at a lower level. If you need more inputs to look at, let me know.