iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

ResNet18 Wrong Results in FP16 Mode on GPU RTX 4000 #13785

Open ShuaiShao93 opened 1 year ago

ShuaiShao93 commented 1 year ago

What happened?

The PyTorch FP16 ResNet18 model returns wrong results on RTX 4000.

Steps to reproduce your issue

  1. Create a file with the following content:
    
```python
import torch

import os
import io
import numpy as np
import time

import torch_mlir
import iree.compiler as ireec
import iree.runtime as ireert

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.eval()

# Download an example image from the pytorch website
import urllib
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
try:
    urllib.URLopener().retrieve(url, filename)
except:
    urllib.request.urlretrieve(url, filename)

# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

# move the input and model to GPU for speed
device = "cuda"
input_batch = input_batch.to(device)
model.to(device)

model.half()
input_batch = input_batch.half()

# warmup
with torch.no_grad():
    output = model(input_batch)

start = time.time()
output = model(input_batch)
print("torch result", output[0, 0])
print("torch time", time.time() - start)

# IREE
mlir = torch_mlir.compile(
    model, input_batch, output_type="linalg-on-tensors", use_tracing=True)

iree_input_type = "tm_tensor"
bytecode_stream = io.BytesIO()
mlir.operation.write_bytecode(bytecode_stream)
flatbuffer = ireec.compile_str(
    bytecode_stream.getvalue(),
    target_backends=[device],
    input_type=iree_input_type,
    extra_args=[
        "--iree-hal-cuda-llvm-target-arch=sm_75",
        "--iree-flow-dump-dispatch-graph",
        "--iree-flow-dump-dispatch-graph-output-file=foo.dot"])

iree_device = ireert.get_device(device)
config = ireert.Config(device=iree_device)
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.from_flatbuffer(ctx.instance, flatbuffer)
ctx.add_vm_module(vm_module)
invoker = ctx.modules.module

# warmup
iree_input_batch = ireert.asdevicearray(iree_device, input_batch.cpu().numpy())
result = invoker.forward(iree_input_batch)

start = time.time()
result = invoker.forward(iree_input_batch)
numpy_result = np.asarray(result)
print("iree result", numpy_result[0][0])
print("iree time", time.time() - start)
```

  2. Execute it and check the log:

```
torch result tensor(0.0119, device='cuda:0', dtype=torch.float16)
torch time 0.029006481170654297
iree result 0.0005054
iree time 0.016841650009155273
```



### What component(s) does this issue relate to?

_No response_

### Version information

6a46afd4f82715979b54c57fb41a8505466cd68f

### Additional context

_No response_
allieculp commented 1 year ago

@qcolombet Can you help give this a quick triage?

qcolombet commented 1 year ago

@rsuderman is this something someone in your team can investigate?

allieculp commented 1 year ago

Adding @jpienaar @aaron-schneider to help triage!

allieculp commented 1 year ago

@rsuderman @jpienaar @aaron-schneider Any update here?

jpienaar commented 1 year ago

No work has been scheduled here. FP16 triage here would require a GPU to even produce the IR.

allieculp commented 1 year ago

Should we close as 'not planned'?

jpienaar commented 1 year ago

No, we first need to identify a team to do first-order triage here. This is a GPU issue; whether it is GPU-specific or codegen-specific is TBD. But you need at least someone with a GPU to even produce the IR here, given how torch-mlir works. We are currently generating what seem to be wrong results (with ML frameworks this is hard to pin down, as the numerics are very flexible ...). There is the other issue where @silvasean was going to look at verifying numerics, which may overlap with this one, but he has some other tasks in flight (feel free to correct me, @silvasean, if you have time to look at this as a prelude to the other numeric-verification issue assigned to you). Otherwise this seems like it should be dispatched GPU-side, so @mattwalsh or @julianwa to schedule.

allieculp commented 1 year ago

Adding @qcolombet if we need codegen team to look at this.

qcolombet commented 1 year ago

It's too early to know if my team needs to be involved. Let me try to get the IR like @jpienaar said to see if it helps narrow down what needs to happen here.

qcolombet commented 1 year ago

This is taking longer than I would have expected.

I am struggling to get a working machine with the correct setup: torch-mlir is not supported with Python 3.9, which is what I get with Debian 11.

I had to bump my Python version and now I hit:

    raise AssertionError("Torch not compiled with CUDA enabled")

I guess I need to build torch locally to get CUDA support.

ShuaiShao93 commented 1 year ago

I was using Ubuntu 20.04 + Python 3.8 + CUDA 11.8.

powderluv commented 1 year ago

@ShuaiShao93 if you have the generated .mlir file, @qcolombet could get away without anything to do with torch / torch-mlir. But if generating that is not possible, we can help set up a repro.

qcolombet commented 1 year ago

@mariecwhite pointed me to the right script to install pytorch with CUDA support: https://github.com/iree-org/iree-samples/blob/main/iree-torch/library/setup_venv.sh and the steps are:

```shell
export WITH_CUDA=1
./setup_venv.sh
unset WITH_CUDA
source torch-models.venv/bin/activate
```

qcolombet commented 1 year ago

I can reproduce the problem now:

```
torch result tensor(0.0126, device='cuda:0', dtype=torch.float16)
torch time 0.03869962692260742
iree result 0.0005054
iree time 0.004549264907836914
```

qcolombet commented 1 year ago

Here is the IR: tm_tensor_input.mlir.tgz. To compile:

```shell
iree-compile --iree-hal-cuda-llvm-target-arch=sm_75 --iree-input-type=tm_tensor -iree-hal-target-backends=cuda -o <out>.vmfb tm_tensor_input.mlir
```

qcolombet commented 1 year ago

@jpienaar how do we go from there? Ideally it would be good to bisect which dispatch is problematic by sending half of them to the reference implementation and the other half to the GPU and iterate, but I don't know if that's possible.

ShuaiShao93 commented 1 year ago

Nice! Maybe it's a good idea to build a debugging tool to locate the problematic dispatch and dump the inputs for repro?

antiagainst commented 1 year ago

You can use the SPIR-V / LLVM CPU backend as a reference, if the result there is correct. Try compiling the model with --iree-flow-trace-dispatch-tensors on. Then run the model through SPIR-V / LLVM CPU and CUDA, compare all intermediate tensors, and find the first meaningfully different dispatch.
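
For illustration, a minimal sketch of that comparison flow (using the tm_tensor_input.mlir posted above, the LLVM CPU backend as the reference, and a hypothetical input.npy holding the preprocessed image batch; exact iree-run-module flag spellings may differ slightly between IREE versions) could look like:

```shell
# Sketch only: compile the same input for a reference CPU backend and for CUDA
# with dispatch tracing enabled, then run both and compare the traced tensors.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=llvm-cpu \
  --iree-flow-trace-dispatch-tensors -o resnet18_cpu.vmfb tm_tensor_input.mlir
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-trace-dispatch-tensors -o resnet18_cuda.vmfb tm_tensor_input.mlir

# Capture the per-dispatch trace output from each run.
iree-run-module --device=local-task --module=resnet18_cpu.vmfb \
  --function=forward --input=@input.npy > cpu_trace.txt 2>&1
iree-run-module --device=cuda --module=resnet18_cuda.vmfb \
  --function=forward --input=@input.npy > cuda_trace.txt 2>&1

# Walk the two traces in order and stop at the first dispatch whose tensors
# diverge beyond FP16 tolerance.
diff cpu_trace.txt cuda_trace.txt | head
```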

qcolombet commented 1 year ago

@antiagainst thanks for the pointer.

qcolombet commented 1 year ago

Hi @qedawkins,

@MaheshRavishankar mentioned offline that you may have some tooling already available to do the bisect here. Could you share these?

Thanks!

qedawkins commented 1 year ago

Hi @qedawkins,

@MaheshRavishankar mentioned offline that you may have some tooling already available to do the bisect here. Could you share these?

Thanks!

Yes, I added a few flags for targeted tracing/bisecting, because --iree-flow-trace-dispatch-tensors produces far too much output and takes too long to be practical for full models in my experience. There are two options:

  1. --iree-flow-trace-dispatch=<string> to trace the inputs/outputs of only a specified dispatch.
  2. --iree-flow-break-dispatch=<string> to "crop" the model at the first occurrence of the specified dispatch (per function in the input module).

You can either specify @<function_name>:<index> to trace the <index>th dispatch in the model, or you can just specify a dispatch name, e.g. forward_dispatch_0_matmul_2560x2560x2560. Note that the dispatch name is treated as a regular expression, so if you specify forward_dispatch_1 it will match dispatch_1, dispatch_10, dispatch_11, and so on (after deduplication).
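
As a concrete illustration, a targeted-trace compile might look like the following sketch (the dispatch index 42 is a placeholder; the rest mirrors the iree-compile invocation posted earlier in the thread):

```shell
# Sketch: trace the inputs/outputs of a single dispatch instead of all of them.
# "@forward:42" is a placeholder; substitute the dispatch index or name you care about.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-trace-dispatch=@forward:42 \
  -o resnet18_cuda_traced.vmfb tm_tensor_input.mlir
```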

Typically my flow will look something like:

  1. Compile with --iree-flow-dump-dispatch-graph-output-file=<string> to get a dump of the graph to inform how I bisect the dispatches
  2. Start from some "middle-ish" dispatch, do --iree-flow-break-dispatch=@forward:<index>, and compare the results of the cropped model between two backends (in this case SPIR-V might be a good choice; see the sketch after this list).*
  3. Bisect the dispatches based on the flow in the graph. Reminder that the graph is non-linear so cropping/tracing will only check the branch of the model whose results are directly required for the target dispatch.

*Tracing also works, but to me only makes sense over breaking if you want to check more than one dispatch at a time (e.g. all matmuls in the model). Otherwise with breaking you get to save on compile times and you get the output values as a typical model result which I find easier to work with in python (for value comparison) than the results printed by tracing. All depends on your preferred workflow though.
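
To make step 2 concrete, one bisection iteration might look roughly like the sketch below. The dispatch index and file names are placeholders, the LLVM CPU backend stands in for the reference here, and exact iree-run-module flag names may vary by IREE version:

```shell
# Sketch of one bisection step: crop the model at a mid-point dispatch for
# both a reference backend and CUDA, run both, and compare the cropped outputs.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=llvm-cpu \
  --iree-flow-break-dispatch=@forward:60 \
  -o cropped_cpu.vmfb tm_tensor_input.mlir
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-break-dispatch=@forward:60 \
  -o cropped_cuda.vmfb tm_tensor_input.mlir

iree-run-module --device=local-task --module=cropped_cpu.vmfb \
  --function=forward --input=@input.npy > cpu_out.txt
iree-run-module --device=cuda --module=cropped_cuda.vmfb \
  --function=forward --input=@input.npy > cuda_out.txt

# If these disagree beyond FP16 noise, the bad dispatch is at or before
# index 60; otherwise it is after, and the break point moves accordingly.
diff cpu_out.txt cuda_out.txt
```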

qedawkins commented 1 year ago

This script could likely be repurposed to verify values for the above flow: https://github.com/iree-org/iree-samples/blob/main/transform_dialect/python/compile_and_compare.py

Currently the target device is shared between both compilation commands so that would have to be split into two.

@qcolombet let me know if you have any other questions, hope this helps :D

qcolombet commented 1 year ago

Thanks @qedawkins for the details. I haven't had a chance to look at it yet, but wanted to acknowledge that I saw your message.