Closed jfrery closed 1 year ago
I can reproduce this bug and am running into it myself as well. I ran the following code (changing the test input passed to convert from 10 to 20 rows for clarity, so the batch size doesn't coincide with the number of features).
import numpy
from xgboost.sklearn import XGBClassifier
from hummingbird.ml import convert

x, y = numpy.random.randn(1000, 10), numpy.random.randint(0, 2, 1000)
clf = XGBClassifier(n_estimators=20)
clf.fit(x, y)

extra_config = {
    "tree_implementation": "gemm",
    "onnx_target_opset": 14,
}
extra_config["n_features"] = x.shape[1]

onnx_model = convert(
    clf,
    backend="onnx",
    test_input=x[:20],
    extra_config=extra_config,
)
Notably, looking at onnx_model.graph.output, we see that while the first output variable (the labels) correctly has a symbolic dimension, the second output variable (the predictions) does not:
[name: "variable"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "onnx::ArgMax_31"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_value: 20
      }
      dim {
        dim_value: 2
      }
    }
  }
}
]
When running predict, you should also see a warning:
2023-01-17 15:17:37.789612378 [W:onnxruntime:, execution_frame.cc:828 VerifyOutputSizes] Expected shape from model of {20} does not match actual shape of {2} for output variable
I believe this issue is related to https://github.com/microsoft/hummingbird/issues/656 -- the model is assumed to have a single output, instead of separate labels and predictions. Adding some debug logic around https://github.com/microsoft/hummingbird/blob/36ebab4d1aca9b913b3a34973928a12fdf49bfd9/hummingbird/ml/_topology.py#L260 to print the input and output names yields:
input: ['input_0']
output: ['variable']
So we see that variable (the only tracked output) is correctly modified to have the symbolic dimension, but the second output isn't. It's probably possible for end users to fix this in post (by manually editing the exported ONNX graph), but I think it's worth fixing in the core library.
Hm, hacking a fix into the outputs setup, such that onnx_model.model.graph.output becomes
[name: "labels"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "predictions"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "sym"
      }
      dim {
        dim_value: 2
      }
    }
  }
}
]
still yields the same error as in the OP. This error is also present in v0.4.6 of Hummingbird, so it predates the change to dynamic_axes when converting.
For reference, here is the mentioned bad Transpose node in the graph:
node {
  input: "/_operators.0/Squeeze_output_0"
  output: "/_operators.0/Transpose_output_0"
  name: "/_operators.0/Transpose"
  op_type: "Transpose"
  attribute {
    name: "perm"
    ints: 1
    ints: 0
    type: INTS
  }
}
The torch backend works correctly:
torch_model = convert(
    clf,
    backend="torch",
    test_input=x[:20],
    extra_config=extra_config,
)
torch_model.predict_proba(x[:2])
torch_model.predict_proba(x[:1])
So my sense is that this is a bug in the torch -> ONNX export path, since torch is the intermediate representation for the ONNX conversion anyway.
Narrowing further: if you use tree_trav instead of gemm, the code works, so this is likely a problem in the GEMM -> ONNX conversion.
This is not the first time we've had trouble with GEMM and ONNX. @jfrery, can you please change the tree implementation from gemm to tree_trav or perf_tree_trav? For example, you can pass extra_config={"tree_implementation": "tree_trav"} at conversion time to switch to tree_trav. Closing this.
Actually, we need the GEMM implementation. Maybe we can reopen this and try to fix it?
It looks like this is more a problem with the ONNX export in PyTorch, since the torch converter works. Can you open an issue with them showing that torch works while the ONNX export fails? We can then use this issue to track the downstream one.
Here is the code to reproduce:
And here is the error thrown by onnxruntime.
Looking at the ONNX graph, it seems there is a Squeeze operation with no axes specified. With a single example, the input data shape would be (n_trees, 1, n_examples) with n_examples = 1. Since the axes are not specified, Squeeze removes every size-1 dimension, so for a single example the last two dimensions are both removed, and the result no longer matches what the following Transpose operation expects.
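The failure mode described above can be sketched with numpy, whose squeeze behaves like ONNX Squeeze without axes (the shapes follow the comment above; n_trees = 20 is just illustrative):

```python
import numpy as np

n_trees = 20

# Batched case: (n_trees, 1, n_examples) with n_examples > 1. Only the
# middle size-1 dimension is dropped, leaving a 2-D tensor.
batched = np.zeros((n_trees, 1, 2))
assert np.squeeze(batched).shape == (n_trees, 2)  # Transpose(perm=[1, 0]) is valid

# Single-example case: n_examples == 1, so BOTH trailing size-1 dimensions
# vanish, leaving a 1-D tensor.
single = np.zeros((n_trees, 1, 1))
assert np.squeeze(single).shape == (n_trees,)  # Transpose(perm=[1, 0]) now fails
```

Squeezing only the known singleton axis (axes=[1] in ONNX terms) would avoid the collapse regardless of batch size.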