microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License

XGBoost onnx backend inference not working with a single example #676

Closed jfrery closed 1 year ago

jfrery commented 1 year ago

Here is the code to reproduce:

import numpy
from xgboost.sklearn import XGBClassifier
from hummingbird.ml import convert

x, y = numpy.random.randn(1000, 10), numpy.random.randint(0, 2, 1000)
clf = XGBClassifier(n_estimators=20)
clf.fit(x, y)
extra_config = {
    "tree_implementation": "gemm",
    "onnx_target_opset": 14,
}
extra_config["n_features"] = x.shape[1]
onnx_model = convert(
    clf,
    backend="onnx",
    test_input=x[:10],
    extra_config=extra_config,
)
onnx_model.predict(x[:2])  # Works as expected
onnx_model.predict(x[:1])  # Does not work

And here is the error thrown by onnxruntime.

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Transpose node. Name:'/_operators.0/Transpose' Status Message: perm: [ 1 0 ] does not align with rank of input data: 1

Looking at the ONNX graph, there appears to be a Squeeze operation with no axes specified. The tensor at that point has shape (n_trees, 1, n_examples), so without an explicit axis, Squeeze drops every size-1 dimension. With a single example the last two dimensions are both removed, leaving a rank-1 tensor that no longer matches the following Transpose operation.
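The behavior described above can be sketched in NumPy, using np.squeeze as a stand-in for the ONNX Squeeze operator (which likewise drops all size-1 dimensions when no axes attribute is given); the shapes here are illustrative:

```python
import numpy as np

# Shapes mimic the intermediate tensor: (n_trees, 1, n_examples).
batch = np.zeros((20, 1, 2))   # two examples
single = np.zeros((20, 1, 1))  # one example

# Squeeze without an axis drops every size-1 dimension, like the node in the graph.
print(np.squeeze(batch).shape)   # (20, 2) -> rank 2, Transpose perm [1, 0] applies
print(np.squeeze(single).shape)  # (20,)   -> rank 1, Transpose perm [1, 0] fails

# Pinning the axis keeps the example dimension even for a single example.
print(np.squeeze(single, axis=1).shape)  # (20, 1)
```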

stillmatic commented 1 year ago

I can reproduce this bug and am running into it myself as well. I ran the following code (changing the test_input slice from 10 to 20 rows for clarity, so that the batch size is not the same as the number of features).

import numpy
from xgboost.sklearn import XGBClassifier
from hummingbird.ml import convert

x, y = numpy.random.randn(1000, 10), numpy.random.randint(0, 2, 1000)
clf = XGBClassifier(n_estimators=20)
clf.fit(x, y)
extra_config = {
    "tree_implementation": "gemm",
    "onnx_target_opset": 14,
}
extra_config["n_features"] = x.shape[1]
onnx_model = convert(
    clf,
    backend="onnx",
    test_input=x[:20],
    extra_config=extra_config,
)

Notably, inspecting onnx_model.model.graph.output, we see that while the first output variable (the labels) correctly has a symbolic batch dimension, the second output variable (the predictions) does not:

[name: "variable"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "onnx::ArgMax_31"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_value: 20
      }
      dim {
        dim_value: 2
      }
    }
  }
}
]

When running predict, you should also see a warning:

2023-01-17 15:17:37.789612378 [W:onnxruntime:, execution_frame.cc:828 VerifyOutputSizes] Expected shape from model of {20} does not match actual shape of {2} for output variable

I believe this issue is related to https://github.com/microsoft/hummingbird/issues/656 -- the model is assumed to have a single output, rather than both labels and predictions. Adding some debug logic around https://github.com/microsoft/hummingbird/blob/36ebab4d1aca9b913b3a34973928a12fdf49bfd9/hummingbird/ml/_topology.py#L260 to print the input and output names yields:

input: ['input_0']
output: ['variable']

So we see that variable, the only output the topology knows about, gets correctly modified to have the symbolic dimension, while the second output does not. End users could probably fix this in post-processing (by manually editing the exported ONNX graph), but I think it's worth fixing in the core library.

stillmatic commented 1 year ago

hm, hacking a fix into the outputs setup, such that

onnx_model.model.graph.output
Out[10]:
[name: "labels"
type {
  tensor_type {
    elem_type: 7
    shape {
      dim {
        dim_param: "sym"
      }
    }
  }
}
, name: "predictions"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_param: "sym"
      }
      dim {
        dim_value: 2
      }
    }
  }
}
]

still yields the same error as in the OP. This error is also present in v0.4.6 of Hummingbird, so it predates the change to dynamic_axes during conversion.

For reference, here is the offending Transpose node in the graph:

node {
  input: "/_operators.0/Squeeze_output_0"
  output: "/_operators.0/Transpose_output_0"
  name: "/_operators.0/Transpose"
  op_type: "Transpose"
  attribute {
    name: "perm"
    ints: 1
    ints: 0
    type: INTS
  }
}
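The failure mode of that node can be mimicked in NumPy: a permutation of length 2 requires a rank-2 input, which is exactly what a single-example batch no longer is after the unqualified Squeeze (shapes here are illustrative):

```python
import numpy as np

rank2 = np.zeros((2, 20))
print(np.transpose(rank2, axes=(1, 0)).shape)  # (20, 2): perm [1, 0] applies

rank1 = np.zeros((20,))
try:
    np.transpose(rank1, axes=(1, 0))
except ValueError as exc:
    # Same class of failure as the onnxruntime error: a length-2 perm
    # does not align with a rank-1 input.
    print("transpose failed:", exc)
```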

The torch backend works correctly:

torch_model = convert(
    clf,
    backend="torch",
    test_input=x[:20],
    extra_config=extra_config,
)
torch_model.predict_proba(x[:2])
torch_model.predict_proba(x[:1])

So my sense is that this is a bug in the torch -> ONNX export path, since torch is the intermediate representation for the ONNX conversion anyway.

Narrowing further: if you use tree_trav instead of gemm, the code works, so the problem is likely in the GEMM -> ONNX conversion.

interesaaat commented 1 year ago

This is not the first time we have had trouble with GEMM and ONNX. @jfrery, can you please change the tree implementation from gemm to tree_trav or perf_tree_trav? For example, you can pass extra_config={"tree_implementation": "tree_trav"} at conversion time to switch to tree_trav. Closing this.

jfrery commented 1 year ago

Actually, we need the GEMM implementation. Maybe we can reopen this, and we will try to fix it?

interesaaat commented 1 year ago

It looks like this is more of a problem with the ONNX export in PyTorch, since the torch converter works. Can you open an issue with them showing that torch works while the ONNX export fails? We can then use this issue to track the downstream one.