iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Abort (core dumped) #18741

Open pdhirajkumarprasad opened 4 days ago

pdhirajkumarprasad commented 4 days ago

What happened?

For the attached IR, I am seeing an abort at runtime.

command:

iree-compile model.modified.mlir --iree-hal-target-backends=llvm-cpu -o compiled_model.vmfb 
iree-run-module --module='compiled_model.vmfb' --device=local-task --function='torch_jit' --input='1x3x224x224xf32=@input.0.bin' --output=@'output.0.bin' 

Attachments: input.0.txt, model.mlir.txt

Steps to reproduce your issue

Download the two attached files, rename them to 'model.modified.mlir' and 'input.0.bin', and invoke the commands mentioned above.

What component(s) does this issue relate to?

Runtime

Version information

No response

Additional context

No response

pashu123 commented 3 days ago

There are cf.assert ops in the IR, and I don't know how well they are supported by the runtime. Please add the flag --iree-opt-strip-assertions, i.e.:

iree-compile --iree-opt-strip-assertions model.mlir.txt --iree-hal-target-backends=llvm-cpu -o compiled_model.vmfb

IanWood1 commented 3 days ago

I think this might be the correct behavior (apparently cf.assert isn't required to do anything with the message), because the process is being terminated via SIGABRT. Looking at the input IR on line 251:

cf.assert %false, "mismatching contracting dimension"

So this seems like a possible lowering issue. However, I'm not sure why no message was raised.
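For reference, the failure mode should be reproducible in isolation with a minimal module whose assert condition is always false. This is a hypothetical sketch, not extracted from the attached model:

module {
  func.func @main() {
    // Constant-false condition: the assert always fires at runtime. The
    // question is whether the runtime reports the message (e.g., via a
    // vm.fail) or the process dies with SIGABRT before printing anything.
    %false = arith.constant false
    cf.assert %false, "mismatching contracting dimension"
    return
  }
}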

benvanik commented 3 days ago

If you strip assertions then you'll probably get crashes - make sure you aren't stripping them if you want the errors.

(There may also be cases where some things aren't properly guarded by the assertions, so you get death before the assertion is hit. I don't think we have bugs like that, but assertions are rarely used, so it's possible. You can use --trace_execution at runtime to see the program flow; you should see a vm.fail if the assertion is hit.)
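For example, reusing the repro invocation from above with tracing enabled (flag name taken from the comment above; exact spelling may vary by build):

iree-run-module --trace_execution --module=compiled_model.vmfb --device=local-task --function=torch_jit --input=1x3x224x224xf32=@input.0.bin --output=@output.0.bin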

zjgarvey commented 3 days ago

I think I have a resolution for many of these inference crashes at the torch level; most of them appear to be related to the shape cleanup work we've been doing.

https://github.com/llvm/torch-mlir/pull/3781 + setting up a different shape refinement pipeline on the frontend seems to be working well on the sampling of models I've been testing from https://github.com/pdhirajkumarprasad/SHARK-TestSuite/blob/feature/qa/issue/onnx-to-torch/abort-at-runtime.

I'll add some tests to the linked PR, post some changes to our pipeline in the test suite, and post a summary report of the thirty models I tried locally.

zjgarvey commented 3 days ago

This is from a sampling of distinct-sounding models from the list of runtime-crashing models, after the changes mentioned above.

The pass pipeline I used to generate linalg IR for these models:

torch-mlir-opt --convert-torch-onnx-to-torch --torch-lower-to-backend-contract --torch-scalarize-shapes --torch-shape-refinement-pipeline --torch-backend-to-linalg-on-tensors-backend-pipeline

This was run with the scalarize-shapes changes from the draft PR applied.

Passing Summary

TOTAL TESTS = 30

Stage                       | # Passing | % of Total | % of Attempted
Setup                       | 30        | 100.0%     | 100.0%
IREE Compilation            | 29        | 96.7%      | 96.7%
Gold Inference              | 29        | 96.7%      | 100.0%
IREE Inference Invocation   | 25        | 83.3%      | 86.2%
Inference Comparison (PASS) | 25        | 83.3%      | 100.0%

Fail Summary

TOTAL TESTS = 30

Stage                     | # Failed at Stage | % of Total
Setup                     | 0                 | 0.0%
IREE Compilation          | 1                 | 3.3%
Gold Inference            | 0                 | 0.0%
IREE Inference Invocation | 4                 | 13.3%
Inference Comparison      | 0                 | 0.0%

Test Run Detail

Test was run with the following arguments: Namespace(device='local-task', backend='llvm-cpu', iree_compile_args=None, mode='cl-onnx-iree', torchtolinalg=True, stages=None, skip_stages=None, benchmark=False, load_inputs=False, groups='all', test_filter=None, testsfile='inference1.txt', tolerance=None, verbose=True, rundirectory='./test-onnx', no_artifacts=False, cleanup='2', report=True, report_file='reports/inference1.md')

Test | Exit Status | Mean Benchmark Time (ms) | Notes
model--all-MiniLM-L12-v2-qa-all--LLukas22 | PASS | None |
model--bart-base-few-shot-k-1024-finetuned-squad-seed-2--anas-awadalla | compiled_inference | None |
model--bart-base-squad2--sjrhuschlee | compiled_inference | None |
model--bart-large-finetuned-squadv1--valhalla | compiled_inference | None |
model--bengali_language_NER--Suchandra | PASS | None |
model--bert-base-cased-cefr--LordCoffee | PASS | None |
model--bert-base-finetuned-nli--Jihyun22 | PASS | None |
model--bert-base-multilingual-cased-finetuned-squad--JensH | PASS | None |
model--bert-base-multilingual-uncased-finetuned-squad--Martin97Bozic | PASS | None |
model--bert-base-NER--dslim | PASS | None |
model--bert-base-qa--srcocotero | PASS | None |
model--bert-base-turkish-128k-cased-finetuned_lr-2e-05_epochs-3--husnu | PASS | None |
model--bert-base-tweetner7-2021--tner | PASS | None |
model--bert-base-uncased-few-shot-k-1024-finetuned-squad-seed-0--anas-awadalla | PASS | None |
model--Bert_Squad--johnjose223 | PASS | None |
model--BioBERT-finetuned-ner-conll2003--ViktorDo | PASS | None |
model--EstBERT128_sentiment--tartuNLP | PASS | None |
model--FinancialBERT-Sentiment-Analysis--ahmedrachid | PASS | None |
model--GPyT--Sentdex | compiled_inference | None |
model--IMDB_BERT_5E--pig4431 | PASS | None |
model--MetaQA--haritzpuerto | PASS | None |
model--MiniLM-L12-H384-uncased-squad--haritzpuerto | PASS | None |
model--MTL-bert-base-uncased-ww-squad--jgammack | PASS | None |
model--SEAD-L-6_H-384_A-12-wnli--course5i | PASS | None |
model--TinyBERT_General_4L_312D-squad--haritzpuerto | PASS | None |
model--Trial_3_Results--sunitha | PASS | None |
mvitv2_base | PASS | None |
mvitv2_large | import_model | None |
mvitv2_small | PASS | None |
mvitv2_tiny | PASS | None |