nod-ai / SHARK-TestSuite

Temporary home of a test suite we are evaluating
Apache License 2.0

MiGraphx CPU/GPU Status Tracking #325

Open zjgarvey opened 3 months ago

zjgarvey commented 3 months ago

This issue will be used to track compilation failures for migraphx models on CPU and GPU. Compile failures for each model should have a link to an issue with a smaller reproducer in the notes column.

Notes:

  1. migraphx_ORT__bert_base_cased_1 fails on CPU but passes on GPU. Other adjacent models fail for similar reasons on both. Very odd.
  2. Not including the tests migraphx_sdxl__unet__model and migraphx_ORT__bert_large_uncased_1, because they cause a crash (likely OOM).
  3. Not including any of the TF models yet.

CPU Status Table

The following report was generated with IREE compiler version iree-org/iree@caacf6c8015b4344b2d9b4a82c2fddc015693831 and torch-mlir version llvm/torch-mlir@2665ed343b19713ba5c1c555b2366a93de8b9d2b.

Passing Summary

TOTAL TESTS = 30

| Stage | # Passing | % of Total | % of Attempted |
| --- | --- | --- | --- |
| Setup | 30 | 100.0% | 100.0% |
| IREE Compilation | 24 | 80.0% | 80.0% |
| Gold Inference | 22 | 73.3% | 91.7% |
| IREE Inference Invocation | 19 | 63.3% | 86.4% |
| Inference Comparison (PASS) | 15 | 50.0% | 78.9% |

Fail Summary

TOTAL TESTS = 30

| Stage | # Failed at Stage | % of Total |
| --- | --- | --- |
| Setup | 0 | 0.0% |
| IREE Compilation | 6 | 20.0% |
| Gold Inference | 2 | 6.7% |
| IREE Inference Invocation | 3 | 10.0% |
| Inference Comparison | 4 | 13.3% |
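The two summaries above are two views of the same cascade: a test that fails at one stage is not attempted at later stages, so each "# Failed at Stage" is the difference between consecutive "# Passing" counts, and "% of Attempted" divides by the previous stage's passing count. A minimal sketch (not part of the test suite) using the numbers from the tables:

```python
# Passing counts copied from the Passing Summary table above.
passing = {
    "Setup": 30,
    "IREE Compilation": 24,
    "Gold Inference": 22,
    "IREE Inference Invocation": 19,
    "Inference Comparison (PASS)": 15,
}
total = 30

stages = list(passing)
for prev, stage in zip(stages, stages[1:]):
    attempted = passing[prev]          # only tests that passed the prior stage
    failed = attempted - passing[stage]
    pct_total = 100.0 * passing[stage] / total
    pct_attempted = 100.0 * passing[stage] / attempted
    print(f"{stage}: failed {failed}, "
          f"{pct_total:.1f}% of total, {pct_attempted:.1f}% of attempted")
```

For example, Inference Comparison is attempted by the 19 tests that passed invocation, so 15 passing gives 15/19 = 78.9% of attempted but only 15/30 = 50.0% of the total.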

Test Run Detail

Test was run with the following arguments: Namespace(device='local-task', backend='llvm-cpu', iree_compile_args=None, mode='cl-onnx-iree', torchtolinalg=True, stages=None, skip_stages=None, benchmark=False, load_inputs=False, groups='all', test_filter='migraphx', testsfile=None, tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, cleanup='0', report=True, report_file='mi_10_10.md')
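The `Namespace(...)` dumps in this issue are argparse namespaces printed by the test runner. A minimal sketch of how such a dump arises; the flag names below are hypothetical, inferred only from the Namespace keys, and are not necessarily the runner's real CLI:

```python
import argparse

# Hypothetical flags mirroring a few of the Namespace keys shown above.
parser = argparse.ArgumentParser()
parser.add_argument("--device", default="local-task")
parser.add_argument("--backend", default="llvm-cpu")
parser.add_argument("--test-filter", dest="test_filter", default=None)
parser.add_argument("--report-file", dest="report_file", default="report.md")

args = parser.parse_args(["--test-filter", "migraphx"])
print(args)  # repr(args) is the Namespace(...) form seen in this report
```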

| Test | Exit Status | Mean Benchmark Time (ms) | Notes |
| --- | --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | None | iree-18268 iree-18412 torch-mlir-3651 |
| migraphx_bert__bert-large-uncased | preprocessing | None | |
| migraphx_bert__bertsquad-12 | Numerics | None | |
| migraphx_cadene__dpn92i1 | PASS | None | |
| migraphx_cadene__inceptionv4i16 | PASS | None | |
| migraphx_cadene__resnext101_64x4di1 | PASS | None | |
| migraphx_cadene__resnext101_64x4di16 | PASS | None | |
| migraphx_huggingface-transformers__bert_mrpc8 | native_inference | None | |
| migraphx_mlperf__bert_large_mlperf | Numerics | None | |
| migraphx_mlperf__resnet50_v1 | PASS | None | |
| migraphx_models__whisper-tiny-decoder | compiled_inference | None | |
| migraphx_models__whisper-tiny-encoder | native_inference | None | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | None | |
| migraphx_onnx-model-zoo__gpt2-10 | preprocessing | None | |
| migraphx_ORT__bert_base_cased_1 | PASS | None | |
| migraphx_ORT__bert_base_uncased_1 | PASS | None | |
| migraphx_ORT__bert_large_uncased_1 | PASS | None | |
| migraphx_ORT__distilgpt2_1 | compiled_inference | None | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | None | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | None | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | compiled_inference | None | |
| migraphx_pytorch-examples__wlang_gru | PASS | None | |
| migraphx_pytorch-examples__wlang_lstm | PASS | None | |
| migraphx_sdunetmodel | import_model | None | |
| migraphx_sdxlunetmodel | import_model | None | |
| migraphx_torchvision__densenet121i32 | PASS | None | |
| migraphx_torchvision__inceptioni1 | PASS | None | |
| migraphx_torchvision__inceptioni32 | PASS | None | |
| migraphx_torchvision__resnet50i1 | PASS | None | |
| migraphx_torchvision__resnet50i64 | PASS | None | |
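The exit statuses in the detail table can be tallied to cross-check the summary tables. A sketch (statuses copied from the CPU table above; the mapping of statuses onto summary stages — preprocessing/import_model/compilation under "IREE Compilation", native_inference under "Gold Inference", compiled_inference under "IREE Inference Invocation", Numerics under "Inference Comparison" — is my reading of the numbers, not documented behavior):

```python
from collections import Counter

# Exit statuses copied row-by-row from the CPU detail table.
statuses = [
    "compilation", "preprocessing", "Numerics", "PASS", "PASS", "PASS",
    "PASS", "native_inference", "Numerics", "PASS", "compiled_inference",
    "native_inference", "import_model", "preprocessing", "PASS", "PASS",
    "PASS", "compiled_inference", "Numerics", "Numerics",
    "compiled_inference", "PASS", "PASS", "import_model", "import_model",
    "PASS", "PASS", "PASS", "PASS", "PASS",
]

counts = Counter(statuses)
print(counts["PASS"])  # 15, matching "Inference Comparison (PASS)" above
```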

OLD STATUS (Will update and migrate issues to current table)

| Test | Exit Status | Notes |
| --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | |
| migraphx_bert__bert-large-uncased | compilation | iree-18269 (two IRs reported under this, depicting different behavior) |
| migraphx_bert__bertsquad-12 | compilation | iree-18267 torch-mlir-3647 |
| migraphx_cadene__dpn92i1 | PASS | |
| migraphx_cadene__inceptionv4i16 | PASS | |
| migraphx_cadene__resnext101_64x4di1 | PASS | |
| migraphx_cadene__resnext101_64x4di16 | PASS | |
| migraphx_huggingface-transformers__bert_mrpc8 | compilation | iree-18413 |
| migraphx_mlperf__bert_large_mlperf | compilation | iree-18297 |
| migraphx_mlperf__resnet50_v1 | PASS | |
| migraphx_models__whisper-tiny-decoder | compilation | torch-mlir-3647 |
| migraphx_models__whisper-tiny-encoder | compilation | torch-mlir-3647 |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | construct_inputs | ORT issue with resize with f16 inputs? |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | shark-turbine-465 torch-mlir-615 torch-mlir-3293 |
| migraphx_ORT__bert_base_cased_1 | Numerics | Passes when `--iree-input-demote-i64-to-i32` is not present; iree-18273 |
| migraphx_ORT__bert_base_uncased_1 | Numerics | Passes when `--iree-input-demote-i64-to-i32` is not present |
| migraphx_ORT__bert_large_uncased_1 | compilation | Crashes; "MatMul" fails to legalize `stream.cmd.dispatch`; https://github.com/iree-org/iree/issues/18229 https://github.com/llvm/torch-mlir/issues/3647 ?? |
| migraphx_ORT__distilgpt2_1 | Numerics | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | Numerics | |
| migraphx_pytorch-examples__wlang_gru | Numerics | iree-18441 |
| migraphx_pytorch-examples__wlang_lstm | Numerics | iree-18441 |
| migraphx_sdunetmodel | import_model | Killed during MLIR import. Too big? |
| migraphx_sdxlunetmodel | import_model | Killed during MLIR import. Too big? |
| migraphx_torchvision__densenet121i32 | PASS | |
| migraphx_torchvision__inceptioni1 | PASS | |
| migraphx_torchvision__inceptioni32 | PASS | |
| migraphx_torchvision__resnet50i1 | PASS | |
| migraphx_torchvision__resnet50i64 | PASS | |

GPU Status Table

Last generated with pip-installed IREE tools at version:

iree-compiler      20240903.1005
iree-runtime       20240903.1005

Summary

| Stage | Count |
| --- | --- |
| Total | 21 (non-crashing; see table below) |
| PASS | 12 |
| Numerics | 2 |
| results-summary | 0 |
| postprocessing | 0 |
| compiled_inference | up to 5 (not included in total; crash during this stage) |
| compilation | 4 |
| preprocessing | 0 |
| import_model | 1 |
| native_inference | 2 |
| construct_inputs | 0 |
| setup | 0 |

Test Run Detail

Test was run with the following arguments: Namespace(device='hip://1', backend='rocm', iree_compile_args=['iree-hip-target=gfx942'], mode='onnx-iree', torchtolinalg=False, stages=None, skip_stages=None, load_inputs=False, groups='all', test_filter='migraphx', tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, report=True, report_file='9_3_migraphx.md')

| Test | Exit Status | Notes |
| --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | Related: https://github.com/llvm/torch-mlir/pull/3630 |
| migraphx_bert__bert-large-uncased | compilation | Operand return type issue (see CPU table) |
| migraphx_bert__bertsquad-12 | compilation (without shape inference) / compiled_inference | 1. Failing to use the shape-inference torch-mlir passes in the torch-to-iree pipeline gives an all-dynamic squeeze-dim op. 2. Using torch-lower-to-backend-contract to get the shape information crashes during inference with an OOB memory access. |
| migraphx_cadene__dpn92i1 | PASS | |
| migraphx_cadene__inceptionv4i16 | PASS | |
| migraphx_cadene__resnext101_64x4di1 | PASS | |
| migraphx_cadene__resnext101_64x4di16 | PASS | |
| migraphx_huggingface-transformers__bert_mrpc8 | native_inference | |
| migraphx_mlperf__bert_large_mlperf | native_inference | |
| migraphx_mlperf__resnet50_v1 | PASS | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | https://github.com/nod-ai/SHARK-Turbine/issues/465 https://github.com/llvm/torch-mlir/issues/615 https://github.com/llvm/torch-mlir/issues/3293 |
| migraphx_ORT__bert_base_cased_1 | PASS | |
| migraphx_ORT__bert_base_uncased_1 | PASS | |
| migraphx_ORT__distilgpt2_1 | likely compiled_inference | Crashes with "Memory access fault by GPU node-3 (Agent handle: 0x5595fe450840) on address 0x7f1811a56000. Reason: Unknown." |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | compiled_inference | Causes a hard crash by trying to access memory out of bounds (MI300X) |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | compiled_inference | Same crash as above |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | likely compiled_inference | Crashes with "Memory access fault by GPU node-3 (Agent handle: 0x5595fe450840) on address 0x7f1811a56000. Reason: Unknown." |
| migraphx_pytorch-examples__wlang_gru | Numerics | |
| migraphx_pytorch-examples__wlang_lstm | Numerics | |
| migraphx_torchvision__densenet121i32 | PASS | |
| migraphx_torchvision__inceptioni1 | PASS | |
| migraphx_torchvision__inceptioni32 | PASS | |
| migraphx_torchvision__resnet50i1 | PASS | |
| migraphx_torchvision__resnet50i64 | PASS | |

Note: the GPU table is missing the SD model tests (they run out of memory and kill the test run). This probably happens during native inference, so it may need some looking into.

Performance data with iree-benchmark-module on GPU

Summary

| Stage | Count |
| --- | --- |
| Total | 30 |
| PASS | 13 |
| Numerics | 3 |
| results-summary | 0 |
| postprocessing | 0 |
| benchmark | 0 |
| compiled_inference | 2 |
| native_inference | 1 |
| construct_inputs | 0 |
| compilation | 8 |
| preprocessing | 0 |
| import_model | 3 |
| setup | 0 |

Test Run Detail

Test was run with the following arguments: Namespace(device='local-task', backend='llvm-cpu', iree_compile_args=None, mode='cl-onnx-iree', torchtolinalg=False, stages=None, skip_stages=None, benchmark=True, load_inputs=False, groups='all', test_filter='migraphx', testsfile=None, tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, cleanup='0', report=True, report_file='report.md')

| Test | Exit Status | Mean Benchmark Time (ms) | Notes |
| --- | --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | None | |
| migraphx_bert__bert-large-uncased | compilation | None | |
| migraphx_bert__bertsquad-12 | compilation | None | |
| migraphx_cadene__dpn92i1 | PASS | 457.4397828740378 | |
| migraphx_cadene__inceptionv4i16 | PASS | 26072.668661984306 | |
| migraphx_cadene__resnext101_64x4di1 | PASS | 995.6825857516378 | |
| migraphx_cadene__resnext101_64x4di16 | PASS | 6324.309662605326 | |
| migraphx_huggingface-transformers__bert_mrpc8 | compilation | None | |
| migraphx_mlperf__bert_large_mlperf | PASS | 8195.630943014596 | |
| migraphx_mlperf__resnet50_v1 | PASS | 219.81522629761858 | |
| migraphx_models__whisper-tiny-decoder | compiled_inference | None | |
| migraphx_models__whisper-tiny-encoder | native_inference | None | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | None | |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | None | |
| migraphx_ORT__bert_base_cased_1 | PASS | 817.4834945239127 | |
| migraphx_ORT__bert_base_uncased_1 | compilation | None | |
| migraphx_ORT__bert_large_uncased_1 | PASS | 2728.984761983156 | |
| migraphx_ORT__distilgpt2_1 | compiled_inference | None | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | 2141.3577783387154 | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | 6767.566671983029 | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | Numerics | 101.96079453453422 | |
| migraphx_pytorch-examples__wlang_gru | compilation | None | |
| migraphx_pytorch-examples__wlang_lstm | compilation | None | |
| migraphx_sdunetmodel | import_model | None | |
| migraphx_sdxlunetmodel | import_model | None | |
| migraphx_torchvision__densenet121i32 | PASS | 2639.900082334255 | |
| migraphx_torchvision__inceptioni1 | PASS | 627.4162046611309 | |
| migraphx_torchvision__inceptioni32 | PASS | 22124.727455200627 | |
| migraphx_torchvision__resnet50i1 | PASS | 284.1490000589854 | |
| migraphx_torchvision__resnet50i64 | PASS | 11100.900294492021 | |
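For a quick at-a-glance ranking of the PASS rows above by mean benchmark time, a small sketch (times copied from the table, rounded to two decimals):

```python
# Mean benchmark times (ms) for the 13 PASS rows, rounded from the table.
times_ms = {
    "migraphx_mlperf__resnet50_v1": 219.82,
    "migraphx_torchvision__resnet50i1": 284.15,
    "migraphx_cadene__dpn92i1": 457.44,
    "migraphx_torchvision__inceptioni1": 627.42,
    "migraphx_ORT__bert_base_cased_1": 817.48,
    "migraphx_cadene__resnext101_64x4di1": 995.68,
    "migraphx_torchvision__densenet121i32": 2639.90,
    "migraphx_ORT__bert_large_uncased_1": 2728.98,
    "migraphx_cadene__resnext101_64x4di16": 6324.31,
    "migraphx_mlperf__bert_large_mlperf": 8195.63,
    "migraphx_torchvision__resnet50i64": 11100.90,
    "migraphx_torchvision__inceptioni32": 22124.73,
    "migraphx_cadene__inceptionv4i16": 26072.67,
}

ranked = sorted(times_ms.items(), key=lambda kv: kv[1])
for name, t in ranked:
    print(f"{t:10.2f} ms  {name}")
```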
nirvedhmeshram commented 3 months ago

@zjgarvey added https://github.com/llvm/torch-mlir/issues/3647 to some of the models, as we need that along with https://github.com/iree-org/iree/issues/18229.

MaheshRavishankar commented 3 months ago

cc @lialan as well. Can you coordinate with Zach to track CPU codegen issues?

nirvedhmeshram commented 3 months ago

Also adding https://github.com/llvm/torch-mlir/issues/3651, which needs to be done to support a broad range of models.

zjgarvey commented 3 weeks ago

Updated benchmarks for static-dim BERT tests on MI300:

Passing Summary

TOTAL TESTS = 18

| Stage | # Passing | % of Total | % of Attempted |
| --- | --- | --- | --- |
| Setup | 18 | 100.0% | 100.0% |
| IREE Compilation | 18 | 100.0% | 100.0% |
| Gold Inference | 18 | 100.0% | 100.0% |
| IREE Inference Invocation | 18 | 100.0% | 100.0% |
| Inference Comparison (PASS) | 16 | 88.9% | 88.9% |

Fail Summary

TOTAL TESTS = 18

| Stage | # Failed at Stage | % of Total |
| --- | --- | --- |
| Setup | 0 | 0.0% |
| IREE Compilation | 0 | 0.0% |
| Gold Inference | 0 | 0.0% |
| IREE Inference Invocation | 0 | 0.0% |
| Inference Comparison | 2 | 11.1% |

Test Run Detail

Test was run with the following arguments: Namespace(device='hip://1', backend='rocm', iree_compile_args=['iree-hip-target=gfx942'], mode='cl-onnx-iree', torchtolinalg=False, stages=None, skip_stages=None, benchmark=True, load_inputs=False, groups='all', testfilter='migx', testsfile=None, tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, cleanup='0', report=True, report_file='bert-bench-11-5.md', get_metadata=False)

| Test | Exit Status | Mean Benchmark Time (ms) | Notes |
| --- | --- | --- | --- |
| migx_bench_bert-large-uncased_16_128 | PASS | 31.207363539631814 | |
| migx_bench_bert-large-uncased_16_256 | PASS | 55.50303652834816 | |
| migx_bench_bert-large-uncased_16_384 | Numerics | 73.14148765678208 | |
| migx_bench_bert-large-uncased_1_128 | PASS | 13.602430612827915 | |
| migx_bench_bert-large-uncased_1_256 | PASS | 14.240951777125396 | |
| migx_bench_bert-large-uncased_1_384 | PASS | 19.958815195908148 | |
| migx_bench_bert-large-uncased_2_128 | PASS | 13.128591842236526 | |
| migx_bench_bert-large-uncased_2_256 | PASS | 13.671312931608528 | |
| migx_bench_bert-large-uncased_2_384 | PASS | 21.517712740472167 | |
| migx_bench_bert-large-uncased_32_128 | PASS | 62.9078254498767 | |
| migx_bench_bert-large-uncased_32_256 | PASS | 101.5021381234484 | |
| migx_bench_bert-large-uncased_32_384 | Numerics | 143.94597491870323 | |
| migx_bench_bert-large-uncased_4_128 | PASS | 14.44128212411286 | |
| migx_bench_bert-large-uncased_4_256 | PASS | 17.125056890238607 | |
| migx_bench_bert-large-uncased_4_384 | PASS | 26.636395024326745 | |
| migx_bench_bert-large-uncased_8_128 | PASS | 18.925565496288442 | |
| migx_bench_bert-large-uncased_8_256 | PASS | 27.419584516722423 | |
| migx_bench_bert-large-uncased_8_384 | PASS | 41.23994989284113 | |
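The test names above appear to encode batch size and sequence length as the last two underscore-separated fields (e.g. migx_bench_bert-large-uncased_16_128 would be batch 16, sequence length 128 — my reading of the naming pattern, not documented). A sketch deriving per-sample latency for the seq-128 rows, which shows how throughput improves with batching:

```python
# Mean times (ms) for the seq-len-128 rows, rounded from the table above.
times_ms = {
    "migx_bench_bert-large-uncased_1_128": 13.60,
    "migx_bench_bert-large-uncased_8_128": 18.93,
    "migx_bench_bert-large-uncased_16_128": 31.21,
    "migx_bench_bert-large-uncased_32_128": 62.91,
}

for name, t in times_ms.items():
    # Parse "<prefix>_<batch>_<seqlen>" from the right.
    batch, seq = (int(x) for x in name.rsplit("_", 2)[1:])
    print(f"batch={batch:2d} seq={seq}: {t:6.2f} ms total, "
          f"{t / batch:5.2f} ms/sample")
```

Batch 32 at seq 128 takes roughly 4.6x the batch-1 latency for 32x the work, i.e. under 2 ms per sample versus 13.6 ms unbatched.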