iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

[GPU]: HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION on GPU while inference passes on CPU #18789

Open pdhirajkumarprasad opened 1 month ago

pdhirajkumarprasad commented 1 month ago

input.0.bin.txt input.1.bin.txt input.2.bin.txt

What happened?

For the attached IR, we are seeing the following error:

:0:rocdevice.cpp            :3006: 1267514219452d us:  Callback: Queue 0x749caff00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

while the same model passes on CPU with functional correctness. Because of the weights, the file size exceeds 25 MB, so it has been uploaded as a zip.

Steps to reproduce your issue

1. Download the zip file and unzip it with 'unzip model.torch_onnx.mlir.zip'

2. Run the following commands to reproduce the issue on MI300:

iree-opt -pass-pipeline='builtin.module(func.func(convert-torch-onnx-to-torch))' model.torch_onnx.mlir -o model.torch.mlir 

iree-opt -pass-pipeline='builtin.module(torch-lower-to-backend-contract,func.func(torch-scalarize-shapes),torch-shape-refinement-pipeline,torch-backend-to-linalg-on-tensors-backend-pipeline)' model.torch.mlir -o model.modified.mlir 

iree-compile model.modified.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o compiled_model.vmfb 

iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input='1x128xi64=@input.0.bin' --input='1x128xi64=@input.1.bin' --input='1x128xi64=@input.2.bin' --output=@'output.0.bin'  --output=@'output.1.bin' 
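
For comparison, a hedged sketch of the corresponding CPU run (the exact CPU commands are not listed here; llvm-cpu and local-task are assumed as IREE's usual CPU backend and device names, and the output file names are chosen to avoid clobbering the GPU outputs):

iree-compile model.modified.mlir --iree-hal-target-backends=llvm-cpu -o compiled_model_cpu.vmfb

iree-run-module --module='compiled_model_cpu.vmfb' --device=local-task --function='main_graph' --input='1x128xi64=@input.0.bin' --input='1x128xi64=@input.1.bin' --input='1x128xi64=@input.2.bin' --output=@'output.0.cpu.bin' --output=@'output.1.cpu.bin'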

This is impacting 600+ models, so please treat this as high priority.

model.torch_onnx.mlir.zip

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

aviator19941 commented 1 month ago

I was able to resolve this error by removing the input sizes and only using the input file, i.e. using --input='@input.0.bin' instead of --input='1x128xi64=@input.0.bin'. It seems like the GPU doesn't support the input sizes and the input file at the same time.

nirvedhmeshram commented 1 month ago

> I was able to resolve this error by removing the input sizes and only using the input file, i.e. using --input='@input.0.bin' instead of --input='1x128xi64=@input.0.bin'. It seems like the GPU doesn't support the input sizes and the input file at the same time.

Is the output at least close to the CPU output in shape? I wonder if, without the input shape, it's not actually taking the input we expect (which may be dynamically shaped) and is just producing garbage. Also, input sizes are normally required for .bin files.
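
As a rough sketch of that sanity check (file names are hypothetical and assume the CPU run wrote output.0.cpu.bin; f32 element type assumed):

# Raw .bin size is element count times element size, so matching byte sizes at least suggest matching shapes:
ls -l output.0.bin output.0.cpu.bin

python -c "import numpy as np; a = np.fromfile('output.0.bin', np.float32); b = np.fromfile('output.0.cpu.bin', np.float32); print(a.size, b.size); print(np.allclose(a, b, atol=1e-4) if a.size == b.size else 'size mismatch')"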

ScottTodd commented 1 month ago

Out of curiosity, why are the repro steps using iree-opt?

iree-opt -pass-pipeline='builtin.module(func.func(convert-torch-onnx-to-torch))' model.torch_onnx.mlir -o model.torch.mlir 
iree-opt -pass-pipeline='builtin.module(torch-lower-to-backend-contract,func.func(torch-scalarize-shapes),torch-shape-refinement-pipeline,torch-backend-to-linalg-on-tensors-backend-pipeline)' model.torch.mlir -o model.modified.mlir 

That sort of manual pipeline specification is unsupported. For any user workflows, use iree-compile and let it handle which pipelines to run.
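
As a sketch, the whole repro then collapses to a single invocation, assuming a build where iree-compile accepts torch-onnx input via --iree-input-type=onnx (check iree-compile --help on your build):

iree-compile model.torch_onnx.mlir --iree-input-type=onnx --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o compiled_model.vmfb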

aviator19941 commented 1 month ago

> I was able to resolve this error by removing the input sizes and only using the input file, i.e. using --input='@input.0.bin' instead of --input='1x128xi64=@input.0.bin'. It seems like the GPU doesn't support the input sizes and the input file at the same time.

> Is the output at least close to the CPU output in shape? I wonder if, without the input shape, it's not actually taking the input we expect (which may be dynamically shaped) and is just producing garbage.

Ah I see, not sure. I was encountering this error while benchmarking Llama on GPU last night, which does have some dynamically shaped inputs. But after removing the input shapes from iree-benchmark-module and only using numpy files as the inputs, I was able to run/benchmark without this error.

ScottTodd commented 1 month ago

> But after removing the input shapes from iree-benchmark-module and only using numpy files as the inputs, I was able to run/benchmark without this error.

Numpy files contain shape information (metadata + buffer contents). Binary files do not (just buffer contents). If using numpy, you can (should?) omit the 1x128xi64. If using binary, you need it (otherwise the runtime doesn't know how to interpret the buffer).
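
Concretely, with the flags from the repro above (input.0.npy here is a hypothetical numpy re-export of input.0.bin; remaining inputs/outputs elided):

# .npy carries dtype/shape metadata, so the type prefix can be omitted:
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input=@input.0.npy ...

# A raw .bin is only buffer contents, so the runtime needs the 1x128xi64 prefix:
iree-run-module --module='compiled_model.vmfb' --device=hip --function='main_graph' --input='1x128xi64=@input.0.bin' ...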

pdhirajkumarprasad commented 1 month ago

A few things here:

  1. When I return %296, the same set of commands (with input sizes) works fine, and the generated output matches the CPU output. The error comes when returning %297 on the GPU. The snippet (shown here returning %296, the working case) is:
     %294 = torch.operator "onnx.Mul"(%290, %293) : (!torch.vtensor<[?,2,64,?],f32>, !torch.vtensor<[1],f32>) -> !torch.vtensor<[?,2,64,?],f32>
     %295 = torch.operator "onnx.MatMul"(%292, %294) : (!torch.vtensor<[?,2,?,64],f32>, !torch.vtensor<[?,2,64,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
     %296 = torch.operator "onnx.Add"(%295, %100) : (!torch.vtensor<[?,2,?,?],f32>, !torch.vtensor<[?,?,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
     %297 = torch.operator "onnx.Softmax"(%296) {torch.onnx.axis = -1 : si64} : (!torch.vtensor<[?,2,?,?],f32>) -> !torch.vtensor<[?,2,?,?],f32>
     return %296 : !torch.vtensor<[?,2,?,?],f32>
  2. When this set of commands, i.e. inputs with sizes (which are needed for .bin files), works on CPU, we should have the same behavior on GPU as well.