Open eddiebergman opened 1 year ago
The converter expects a single tensor as input. You can use a ColumnTransformer to concatenate all columns into one tensor. I also put the encoder first, since onnx only supports numerical values for Imputer. This is the pipeline validated in PR #1030:
```python
model = Pipeline(
    steps=[
        (
            "concat",
            ColumnTransformer(
                [("concat", "passthrough", list(range(X.shape[1])))],
                sparse_threshold=0,
            ),
        ),
        (
            "voting",
            VotingClassifier(
                flatten_transform=False,
                estimators=[
                    (
                        "est",
                        Pipeline(
                            steps=[
                                # This encoder is placed before SimpleImputer because
                                # onnx does not support text for Imputer
                                ("encoder", OrdinalEncoder()),
                                (
                                    "imputer",
                                    SimpleImputer(strategy="most_frequent"),
                                ),
                                (
                                    "rf",
                                    RandomForestClassifier(
                                        n_estimators=4,
                                        max_depth=4,
                                        random_state=0,
                                    ),
                                ),
                            ],
                        ),
                    ),
                ],
            ),
        ),
    ]
)
```
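To illustrate what the "concat" step contributes on its own, here is a minimal sketch that uses only scikit-learn and pandas. The tiny DataFrame and its column names are made up for illustration and are not the dataset from the issue; the point is only that a passthrough `ColumnTransformer` with `sparse_threshold=0` turns a multi-column frame into one dense 2-D array, which is the "single tensor" shape the converter is said to expect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical toy data with mixed column types (not the OpenML dataset).
X = pd.DataFrame(
    {
        "color": ["red", "blue", "red"],
        "size": ["S", "M", "L"],
        "weight": [1.0, 2.5, 3.0],
    }
)

# Same "concat" step as in the pipeline above: pass every column through
# unchanged and stack the result into one dense array.
concat = ColumnTransformer(
    [("concat", "passthrough", list(range(X.shape[1])))],
    sparse_threshold=0,
)

single = concat.fit_transform(X)

# One 2-D array with one row per sample and one column per original feature.
print(type(single), single.shape)
```

This sketch does not call the converter, so it says nothing about which tensor type should be declared for such an array when exporting.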
Hi there,

I am new to `onnx` in general, so apologies if the issue is misplaced or I am missing something fundamental.

I'm coming from the tool `autosklearn` and planning to introduce some basic onnx support by exporting found models after doing some optimization over possible pipelines. These pipelines will mostly consist of an ensemble (`VotingClassifier`), which itself contains `Pipeline`s with disjoint imputation strategies, feature preprocessing and estimators.

Based on the error below, it seems that using a `VotingClassifier` would require all features to be numeric (or at least of the same `TensorType`) to be viable? Is this correct? Is there something fundamental which would prevent the `SklearnVotingClassifier` operator from working with more than 1 input?

I am linking to this issue here in case anyone using `autosklearn` would like to enable `onnx` support and would be able to contribute! I've included a reproducible example and the traceback.

Reproducible Example
Apologies for using `openml`, sklearn toy datasets do not have such varied column types.

```python
from __future__ import annotations


def main():
    import openml
    from mlprodict.onnx_conv import guess_schema_from_data
    from onnxruntime import InferenceSession
    from skl2onnx import to_onnx
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OrdinalEncoder

    dataset = openml.datasets.get_dataset(31)
    X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)

    model = VotingClassifier(
        estimators=[
            (
                "est",
                Pipeline(
                    steps=[
                        ("imputer", SimpleImputer(strategy="most_frequent")),
                        ("encoder", OrdinalEncoder()),
                        ("rf", RandomForestClassifier(n_estimators=10)),
                    ],
                ),
            ),
        ],
    )
    model.fit(X, y)

    schema = guess_schema_from_data(X)

    # Errors here
    onnx_model = to_onnx(model=model, initial_types=schema)

    sess = InferenceSession(onnx_model.SerializeToString())
    inputs = {c: X[c].to_numpy().reshape((-1, 1)) for c in X.columns}
    got = sess.run(None, inputs)
    print(got)


if __name__ == "__main__":
    main()
```

Traceback
```python
/blank/.venv/lib/python3.10/site-packages/openml/datasets/functions.py:438: FutureWarning: Starting from Version 0.15 `download_data`, `download_qualities`, and `download_features_meta_data` will all be ``False`` instead of ``True`` by default to enable lazy loading. To disable this message until version 0.15 explicitly set `download_data`, `download_qualities`, and `download_features_meta_data` to a bool while calling `get_dataset`.
  warnings.warn(
Traceback (most recent call last):
  File "/blank/onnx-test.py", line 45, in