corwinjoy opened this issue 1 year ago
The failing test case generates an error message like:
```
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /tmp//test.onnx failed:Node (XA_node) output arg (XA) type inference failed
```
@BowenBao @gramalingam @justinchuby correct me if I am wrong, but an ONNX Function does not store type information, per the spec.
If this issue is the same as https://github.com/onnx/onnx/issues/5487, then this is an ONNX spec limitation and ORT can't do much about it.
I agree this is the case.
ONNX functions do not store type information by design, yes. However, the issue here may have other confounding factors. Custom ops (just like regular ops) typically have a shape-and-type-inference function that is the primary source of type information. I guess that is missing in this case? Why is that? Is it because of limitations in the type inference for Python custom ops?
Custom ops have to declare types (as far as I know). In this case, the example custom op is explicit in its types:
```python
@ortx.onnx_op(op_type="Adder",
              inputs=[ortx.PyCustomOpDef.dt_float, ortx.PyCustomOpDef.dt_float],
              outputs=[ortx.PyCustomOpDef.dt_float])
```
In the above example, if the custom op is used directly, everything is fine. However, with `make_function` it seems that the type information is erased. I think this is because `make_function` creates a kind of template operator, but somehow it is not preserving type information, and as a result the graph cannot be compiled. (In order to compile the graph, the input and output types for the graph as a whole need to be defined.) I think that with normal operators the graph compiler is able to deduce the types, but here it fails because the types are lost.
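To make the "template" point concrete: a `GraphProto` declares typed inputs and outputs (`ValueInfoProto`s), while a `FunctionProto` declares only input/output *names*, so the types have to be recovered from the call site or from the op schemas of the ops inside the function. A small sketch of the difference (illustrative, not from the original report):

```python
from onnx import helper, TensorProto

# A graph input is a ValueInfoProto: name + element type + shape.
graph_input = helper.make_tensor_value_info("X", TensorProto.FLOAT, [None])

# A function's declared inputs/outputs are plain strings (names only),
# so type inference must fall back on the schemas of the ops it contains.
fn = helper.make_function(
    domain="local",
    fname="F",
    inputs=["X"],          # names only, no TypeProto
    outputs=["Y"],
    nodes=[helper.make_node("Identity", ["X"], ["Y"])],
    opset_imports=[helper.make_opsetid("", 18)],
)
```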
@gramalingam @justinchuby @thiagocrepaldi
Does the failure happen only when `use_function` is true, or in both cases?
There are two sources of type information when ONNX type inference happens. One is the type information explicitly included in the model itself. The second is the type-inference methods of registered ops. I assume that `ortx.onnx_op` creates and registers an ONNX op schema with the correct signature. That op schema should be sufficient, without needing to explicitly capture type information in the function within the model.
So, the failure implies something else is going wrong as well. E.g., maybe the inference logic is not getting access to the op schema for some reason. So, just trying to understand what could be causing this.
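For what it's worth, on the "access to the op-schema" theory: with onnxruntime-extensions, the custom op's schema only becomes visible to ORT once the extensions library is registered on the session, roughly like this (a sketch; the model path is a placeholder):

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

so = ort.SessionOptions()
# Registers the ai.onnx.contrib custom ops (and their schemas) with ORT;
# without this, ORT has no type information for Adder at all.
so.register_custom_ops_library(get_library_path())
sess = ort.InferenceSession("/tmp/test.onnx", so)
```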
@gramalingam The failure only happens when `use_function` is true. I set the example up this way to show that the problem is with `make_function`.
Describe the issue
When creating custom operations, if you wrap your operations inside a `make_function` call, the types get lost. This causes a type error when attempting to load the model in ONNX Runtime. This is a problem for more complex nodes with a lot of data (such as TreeEnsembleRegressor), where we want to reuse an existing node with different parameters.
The place in the onnxruntime code that generates the error indicates that "this should not happen". https://github.com/microsoft/onnxruntime/blob/d8d79521ca2b266e631ac0ba7036a682ebb58b5b/onnxruntime/core/graph/graph.cc#L2358
To reproduce
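The original repro script is not reproduced here; the sketch below is reconstructed from the details in this thread (the `Adder` custom op, the `XA_node` node name, and the `use_function` flag) and is only an approximation of the original. The `adder` body and the `AdderFn` function name are illustrative.

```python
# Reconstructed sketch -- not the original script. Assumes onnx,
# onnxruntime, and onnxruntime-extensions are installed.
import onnx
import onnxruntime as ort
from onnx import helper, TensorProto
from onnxruntime_extensions import onnx_op, PyCustomOpDef, get_library_path


@onnx_op(op_type="Adder",
         inputs=[PyCustomOpDef.dt_float, PyCustomOpDef.dt_float],
         outputs=[PyCustomOpDef.dt_float])
def adder(x, a):
    # Illustrative body; the original implementation was not shown.
    return x + a


def build_model(use_function: bool) -> onnx.ModelProto:
    # onnxruntime-extensions custom ops live in the ai.onnx.contrib domain.
    node = helper.make_node("Adder", ["X", "A"], ["XA"], name="XA_node",
                            domain="ai.onnx.contrib")
    inputs = [helper.make_tensor_value_info("X", TensorProto.FLOAT, [None]),
              helper.make_tensor_value_info("A", TensorProto.FLOAT, [None])]
    outputs = [helper.make_tensor_value_info("XA", TensorProto.FLOAT, [None])]
    opsets = [helper.make_opsetid("", 18),
              helper.make_opsetid("ai.onnx.contrib", 1)]
    if not use_function:
        graph = helper.make_graph([node], "g", inputs, outputs)
        return helper.make_model(graph, opset_imports=opsets)
    # Wrap the custom op in a FunctionProto: the function declares only
    # input/output names, which is where the type information gets lost.
    fn = helper.make_function(domain="local", fname="AdderFn",
                              inputs=["X", "A"], outputs=["XA"],
                              nodes=[node], opset_imports=opsets)
    call = helper.make_node("AdderFn", ["X", "A"], ["XA"], domain="local")
    graph = helper.make_graph([call], "g", inputs, outputs)
    return helper.make_model(
        graph,
        opset_imports=opsets + [helper.make_opsetid("local", 1)],
        functions=[fn])


model = build_model(use_function=True)  # use_function=False loads fine
onnx.save(model, "/tmp/test.onnx")

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path())
# With use_function=True, creating the session fails with:
#   [ONNXRuntimeError] : 1 : FAIL : Load model from /tmp/test.onnx failed:
#   Node (XA_node) output arg (XA) type inference failed
sess = ort.InferenceSession("/tmp/test.onnx", so)
```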
Urgency
No response
Platform
Linux
OS Version
22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU
Execution Provider Library Version
No response