ShuaiShao93 opened 1 year ago
@qcolombet Can you help give this a quick triage?
@rsuderman is this something someone in your team can investigate?
Adding @jpienaar @aaron-schneider to help triage!
@rsuderman @jpienaar @aaron-schneider Any update here?
No work has been scheduled here. FP16 triage here would require a GPU to even produce the IR.
Should we close as 'not planned'?
No, we first need to identify a team to do first-order triage here. This is a GPU issue; whether it is GPU specific or codegen specific is TBD. But you need at least someone with a GPU to even produce the IR here, given how torch-mlir works. We are currently generating what seem to be wrong results (with ML frameworks this is difficult to confirm, as the numerics are very flexible...). There is the other issue where @silvasean was going to look at verifying numerics, which may overlap here, but he has some other tasks in flight (feel free to correct me @silvasean if you have some time to look at this as a prelude to the other numeric-verification issue assigned to you). Otherwise this seems like it should be dispatched GPU side, so @mattwalsh or @julianwa to schedule.
Adding @qcolombet in case we need the codegen team to look at this.
It's too early to know if my team needs to be involved. Let me try to get the IR, as @jpienaar suggested, to see if it helps narrow down what needs to happen here.
This is taking longer than what I would have expected.
I am struggling to get a working machine with the correct setup.
torch-mlir is not supported with Python 3.9, which is what I get with Debian 11. I had to bump my Python version and now I hit:
raise AssertionError("Torch not compiled with CUDA enabled")
I guess I need to build torch locally to get CUDA support.
I was using Ubuntu 20.04 + python 3.8 + cuda 11.8
@ShuaiShao93 if you have the generated .mlir file, @qcolombet could get away without touching torch / torch-mlir at all. But if that is not possible to generate, we can help get a repro.
@mariecwhite pointed me to the right script to install PyTorch with CUDA support: https://github.com/iree-org/iree-samples/blob/main/iree-torch/library/setup_venv.sh and the steps are:

```sh
export WITH_CUDA=1
./setup_venv.sh
unset WITH_CUDA
source torch-models.venv/bin/activate
```
I can reproduce the problem now:
torch result tensor(0.0126, device='cuda:0', dtype=torch.float16)
torch time 0.03869962692260742
iree result 0.0005054
iree time 0.004549264907836914
Here is the IR: tm_tensor_input.mlir.tgz. To compile:

```sh
iree-compile --iree-hal-cuda-llvm-target-arch=sm_75 --iree-input-type=tm_tensor -iree-hal-target-backends=cuda -o <out>.vmfb tm_tensor_input.mlir
```
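A hedged sketch of running the resulting module; the iree-run-module flag names and the zero-splat input are assumptions (a real preprocessed image would normally be passed, e.g. via an .npy file) and may differ across IREE versions:

```sh
# Run the compiled module on the CUDA device with a placeholder all-zero fp16 input.
iree-run-module --device=cuda --module=<out>.vmfb --function=forward \
  --input="1x3x224x224xf16=0"
```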
@jpienaar how do we go from there? Ideally it would be good to bisect which dispatch is problematic by sending half of them to the reference implementation and the other half to the GPU and iterate, but I don't know if that's possible.
Nice! Maybe it's a good idea to build a debugging tool to locate the problematic dispatch and dump the inputs for repro?
You can use the SPIR-V / LLVM CPU backend as a reference, if the result there is correct. Try to compile the model with --iree-flow-trace-dispatch-tensors on. Then run the model through SPIR-V / LLVM CPU and CUDA to compare all intermediate tensors and find the first meaningfully different dispatch.
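A minimal sketch of that comparison using the IR attached above; the CPU backend name llvm-cpu is an assumption based on current IREE tooling and may differ by version:

```sh
# CUDA build with per-dispatch tensor tracing enabled.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-trace-dispatch-tensors \
  -o cuda_traced.vmfb tm_tensor_input.mlir

# Reference CPU build with the same tracing flag.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=llvm-cpu \
  --iree-flow-trace-dispatch-tensors \
  -o cpu_traced.vmfb tm_tensor_input.mlir
```

Running both modules on the same input then prints every dispatch's inputs/outputs, and the first meaningful divergence between the two logs points at the problematic dispatch.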
@antiagainst thanks for the pointer.
Hi @qedawkins,
@MaheshRavishankar mentioned offline that you may have some tooling already available to do the bisect here. Could you share these?
Thanks!
Yes, I added a few flags for targeted tracing/bisecting because --iree-flow-trace-dispatch-tensors produces far too much output and takes too long to be practical for full models in my experience. There are two options:
1) --iree-flow-trace-dispatch=<string> to trace the inputs/outputs of only a specified dispatch
2) --iree-flow-break-dispatch=<string> to "crop" the model at the first occurrence of the specified dispatch (per function in the input module).
You can either specify @<function_name>:<index> to trace the <index>th dispatch in the model, or you can just specify a dispatch name, e.g. forward_dispatch_0_matmul_2560x2560x2560. Note that the dispatch name is treated as a regular expression, so if you specify forward_dispatch_1 this will match dispatch_1, dispatch_10, dispatch_11 and so on (after deduplication).
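As a minimal sketch, the break flag could be combined with the compile command shared earlier in the thread like this (the dispatch index 10 is an arbitrary placeholder):

```sh
# Compile for CUDA, but "crop" the model at the 10th dispatch of @forward
# so the module returns that dispatch's results directly.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-break-dispatch=@forward:10 \
  -o cropped_cuda.vmfb tm_tensor_input.mlir
```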
Typically my flow will look something like:
1) --iree-flow-dump-dispatch-graph-output-file=<string> to get a dump of the graph to inform how I bisect the dispatches (a sketch follows below)
2) --iree-flow-break-dispatch=@forward:<index> and see the results of the model between two backends (in this case SPIR-V might be a good choice).

Tracing also works, but to me it only makes sense over breaking if you want to check more than one dispatch at a time (e.g. all matmuls in the model). Otherwise, with breaking you save on compile times and you get the output values as a typical model result, which I find easier to work with in Python (for value comparison) than the results printed by tracing. It all depends on your preferred workflow though.
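A minimal sketch of the graph-dump step, assuming the output is a Graphviz .dot file (as in the repro script below) and that Graphviz is installed for rendering:

```sh
# Dump the dispatch graph so you can pick which dispatch index to break on.
iree-compile --iree-input-type=tm_tensor --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_75 \
  --iree-flow-dump-dispatch-graph-output-file=dispatch_graph.dot \
  -o full_cuda.vmfb tm_tensor_input.mlir

# Render the graph for inspection (assumes Graphviz).
dot -Tsvg dispatch_graph.dot -o dispatch_graph.svg
```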
This script could likely be repurposed to verify values for the above flow: https://github.com/iree-org/iree-samples/blob/main/transform_dialect/python/compile_and_compare.py
Currently the target device is shared between both compilation commands so that would have to be split into two.
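If adapting that script is more work than needed, a minimal hand-rolled check is also possible; the sketch below assumes the outputs from the two backends have already been saved as .npy files with placeholder names:

```python
import numpy as np

# Placeholder filenames: outputs saved from the reference (CPU) and CUDA runs.
ref = np.load("forward_cpu.npy").astype(np.float32)
gpu = np.load("forward_cuda.npy").astype(np.float32)

# Report the largest absolute and relative differences; fp16 needs loose tolerances.
abs_diff = np.abs(ref - gpu)
print("max abs diff:", abs_diff.max())
print("max rel diff:", (abs_diff / (np.abs(ref) + 1e-6)).max())
print("allclose:", np.allclose(ref, gpu, rtol=1e-2, atol=1e-3))
```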
@qcolombet let me know if you have any other questions, hope this helps :D
Thanks @qedawkins for the details. I haven't had a chance to look at it yet, but wanted to acknowledge that I saw your message.
What happened?
Pytorch fp16 resnet18 model returns wrong results on RTX 4000
Steps to reproduce your issue
```python
import os
import io
import numpy as np
import time

import torch
import torch_mlir
import iree.compiler as ireec
import iree.runtime as ireert

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.eval()

# Download an example image from the pytorch website
import urllib
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
try:
    urllib.URLopener().retrieve(url, filename)
except:
    urllib.request.urlretrieve(url, filename)

# sample execution (requires torchvision)
from PIL import Image
from torchvision import transforms
input_image = Image.open(filename)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0)  # create a mini-batch as expected by the model

# move the input and model to GPU for speed
device = "cuda"
input_batch = input_batch.to(device)
model.to(device)

model.half()
input_batch = input_batch.half()

# warmup
with torch.no_grad():
    output = model(input_batch)

# timed torch reference run
start = time.time()
with torch.no_grad():
    output = model(input_batch)
print("torch result", output[0][0])
print("torch time", time.time() - start)

# IREE
mlir = torch_mlir.compile(
    model, input_batch, output_type="linalg-on-tensors", use_tracing=True)

iree_input_type = "tm_tensor"
bytecode_stream = io.BytesIO()
mlir.operation.write_bytecode(bytecode_stream)
flatbuffer = ireec.compile_str(
    bytecode_stream.getvalue(),
    target_backends=[device],
    input_type=iree_input_type,
    extra_args=[
        "--iree-hal-cuda-llvm-target-arch=sm_75",
        "--iree-flow-dump-dispatch-graph",
        "--iree-flow-dump-dispatch-graph-output-file=foo.dot",
    ])

iree_device = ireert.get_device(device)
config = ireert.Config(device=iree_device)
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.from_flatbuffer(ctx.instance, flatbuffer)
ctx.add_vm_module(vm_module)
invoker = ctx.modules.module

# warmup
iree_input_batch = ireert.asdevicearray(iree_device, input_batch.cpu().numpy())
result = invoker.forward(iree_input_batch)

start = time.time()
result = invoker.forward(iree_input_batch)
numpy_result = np.asarray(result)
print("iree result", numpy_result[0][0])
print("iree time", time.time() - start)
```
torch result tensor(0.0119, device='cuda:0', dtype=torch.float16)
torch time 0.029006481170654297
iree result 0.0005054
iree time 0.016841650009155273