Closed messerb5467 closed 1 year ago
The comment above has been updated with a direct link to the ONNX models. I'll handle bidaf-9 once I find time, unless someone else gets to it first, since I'd also like to refactor some of the test harness code itself at the same time.
gpt2-10, bidaf-9, and inception-v2-6 all have trouble compiling or handling test data. In particular, gpt2-10 and inception-v2-6 are currently failing for unknown reasons, while the test data for bidaf-9 is not currently supported by our test harness. I'm willing to help with these in the mid term once I find time.
Can you provide more details about the failure? I just tried gpt2-10. The model was imported successfully, but the onnx-to-onnx pass generated neither the expected output file nor any error message. Did you encounter the same behavior?
I did hit a similar issue and wasn't planning to address it in detail right now. Regardless, I apologize for not providing much detail; I didn't have much to go on. gpt2-10 exits immediately when you try to compile it, leaving nothing to review; inception-v2-6 core dumps; and bidaf-9 just needs some test data changed around, which I'm planning to do. That's all I've got, but the current results can be seen through the onnx-mlir status badge on the main page of the repo.
I looked into the output of onnx-mlir --EmitONNXIR --mlir-print-ir-after-all. I do not quite understand what happened. I've put all of the pass header prints here. Evidently, the last pass, SymbolDCEPass, was never executed.
// -----// IR Dump Before (anonymous namespace)::DecomposeONNXToONNXPass (decompose-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConvOptONNXToONNXPass (conv-opt-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before (anonymous namespace)::ONNXOpTransformPass (onnx-op-transform) //----- //
// -----// IR Dump Before (anonymous namespace)::DecomposeONNXToONNXPass (decompose-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConvOptONNXToONNXPass (conv-opt-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before (anonymous namespace)::DecomposeONNXToONNXPass (decompose-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConvOptONNXToONNXPass (conv-opt-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before (anonymous namespace)::SimplifyShapeRelatedOpsPass (simplify-shape-related-ops-onnx) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
// -----// IR Dump Before (anonymous namespace)::ConstPropONNXToONNXPass (constprop-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
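When a compile exits silently like this, the last pass header announced in the dump is the main clue. A small helper sketch (stdlib only; the regex is my own assumption about the header layout, not part of onnx-mlir) that pulls the pass flags out of a dump like the one above:

```python
import re
from collections import Counter

def pass_header_stats(dump_text):
    """Extract the pass flag from each
    '// -----// IR Dump Before <Pass> (<flag>) //----- //' header,
    returning the last flag announced plus per-flag counts."""
    flags = re.findall(r"IR Dump Before .* \((\S+)\) //", dump_text)
    return (flags[-1] if flags else None), Counter(flags)

dump = """\
// -----// IR Dump Before (anonymous namespace)::DecomposeONNXToONNXPass (decompose-onnx) //----- //
// -----// IR Dump Before onnx_mlir::(anonymous namespace)::ShapeInferencePass (shape-inference) //----- //
// -----// IR Dump Before Canonicalizer (canonicalize) //----- //
"""
last, counts = pass_header_stats(dump)
print(last)                        # canonicalize
print(counts["shape-inference"])   # 1
```

On the full dump above, the last header is `canonicalize`, which is consistent with SymbolDCEPass never being reached.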
bidaf-9 failed because of unsupported Scan version as I heard from @negiyas.
FYI, these failures are also reported by our Jenkins here: https://www.onnxmlir.xyz/jenkinx/job/ONNX-MLIR-Pipeline-Docker-Build/Model_20Zoo_20Report/
too old (deprecated) or too new?
It uses an old opset (opset 8) for Scan, while onnx-mlir supports opset 16 of Scan, which is not backward compatible: https://github.com/onnx/models/issues/540
@messerb5467 is it possible to get a similar model with a newer opset? opset 8 is not really supported.
I've got a few options on my side, ranging from checking places like TensorFlow Hub (or the PyTorch equivalent), since bidaf-9 is a regular model, to using the up-converter to bring it up to a newer opset. I'd probably try the up-converter first and fall back to the original model structure if the result isn't good enough.
@messerb5467 please let us know if the current error message is clear enough when provided a model with this older opset.
If it is seen as key to our users to support this older version, let us know. But we would prefer focusing on future ops.
Will do. I haven't tried to compile this one myself and mostly saw it through the lens of its corresponding data failing to convert. Once I get the time, I'll check it out and provide that feedback.
I just tried to compile this for the different accelerator/environment combinations and it all boils down to the same issue:
error: Invalid axis value
error: shape inference failed
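An "Invalid axis value" from shape inference typically means an axis attribute fell outside the valid range for the tensor's rank. A hypothetical stdlib sketch of that kind of check (my own illustration, not onnx-mlir's actual code):

```python
def validate_axis(axis, rank):
    """ONNX-style axis check: axis must lie in [-rank, rank - 1];
    negative values count from the end, as in Python indexing."""
    if not -rank <= axis <= rank - 1:
        raise ValueError(f"Invalid axis value {axis} for rank {rank}")
    return axis % rank  # normalize to a non-negative axis

print(validate_axis(-1, 3))  # 2
print(validate_axis(2, 3))   # 2
```

If the failing model declares an axis relative to a rank that onnx-mlir infers differently, a check like this would trip and shape inference would abort, matching the two-line error above.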
Within reason, that won't be enough to help me diagnose the issue we're seeing. What would help is following the normal tenets of good First Failure Data Capture (FFDC): give me enough direction to know whether I need to do more debugging on my own or should open an issue. While I know there is plenty of good documentation within the project, I like to keep just the right amount of information in the error message itself and leave a trail of breadcrumbs out to better places to look. I'm still finding the right balance between project READMEs and longer-form comments in code, so I'm open to input on what that should look like.
This just got bumped up on my plate. I have one other small thing to finish, then I'll get back to this with full-time focus.
I just made an updated bidaf-9 available, but can't attach it here since it is too large. Let me know who needs it and I can forward it over. Sharing within the company is straightforward with our own solutions, but cloud storage for something more project-centric would be appreciated if it's cheap enough. I pay $2 a month for 100 GB on Drive, for instance, but I don't know how that would scale to enterprise users of an open source project.
Thanks for the eagerness here, everyone. I worked through bidaf-9 with @chentong319, and one of the Squeeze ops isn't compatible with onnx-mlir.
This patch, https://github.com/onnx/onnx-mlir/pull/1854, fixes the gpt2-10 compilation issue.
I've confirmed gpt2-10 on my side. Thanks for the great work!
Retested inception-v2-6 today and it now passes. bidaf-9 isn't yet supported by onnx-mlir, so I'm closing this issue.