onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

Conversion fails with supported type of DictVectorizer #806

Open dafajon opened 2 years ago

dafajon commented 2 years ago

Bug

I wanted to try conversion with the various mappings that the DictVectorizer spec lists as supported. Apparently there is support for map(int64, string). However, the conversion fails.

Code

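from sklearn.datasets import make_classification
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import DictionaryType, Int64Type, StringTensorType

_, y = make_classification(n_samples=5, n_classes=2)

# Integer keys mapped to string values.
X = [{0: "triangle", 3: "europe"},
     {1: "horror", 2: "rock"},
     {0: "circle", 1: "comedy", 2: "pop", 3: "asia"},
     {1: "action", 2: "rap", 3: "australia"},
     {2: "rock"}]

pipeline = Pipeline([("dvec", DictVectorizer()),
                     ("lorgreg", LogisticRegression(C=0.123))])
# Fit step, omitted from the snippet as first posted.
pipeline.fit(X, y)

initial_type = [('dvec_inp', DictionaryType(Int64Type(), StringTensorType()))]
onx = convert_sklearn(pipeline, initial_types=initial_type)
with open("mdl.onnx", "wb") as f:
    f.write(onx.SerializeToString())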

Output

RuntimeError: All categories contain a separator '='. This case is not supported by the converter. The mapping must map to numbers not string.

System

System information

OS Platform and Distribution: MacOS 10.14.2
ONNX Runtime installed from (source or binary): PyPI 1.10.0
ONNX Runtime version: 1.10.0
Python version: 3.8.12
Visual Studio version (if applicable): None
GCC/Compiler version (if compiling from source): None
CUDA/cuDNN version: None
GPU model and memory: None

xadupre commented 2 years ago

I assume you trained the pipeline with pipeline.fit(X, y) (it does not work for me). DictVectorizer is usually used with dictionaries where the keys are strings and the values are numbers. In that case, your example works.

dafajon commented 2 years ago

It is true that I trained the pipeline with pipeline.fit(X, y), and it worked in my case. The error occurs during serialization, and I wanted to understand it. My other example input, with string keys and numeric values, works as expected, but I expected the one above to work as well.

xadupre commented 2 years ago

Which version of scikit-learn are you using?

dafajon commented 2 years ago

1.0.1

xadupre commented 2 years ago

This case is not supported. It requires concatenating strings (https://github.com/scikit-learn/scikit-learn/blob/7e1e6d09b/sklearn/feature_extraction/_dict_vectorizer.py#L163), and this operation is not available in the standard list of ONNX operators.
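To illustrate why string values need concatenation, here is a minimal pure-Python sketch of how feature names are derived. It is a simplified imitation of the linked scikit-learn code, not the library's implementation; the helper name feature_names and the sample data are made up for illustration. A string value is one-hot encoded under a name built from the key, the separator, and the value, while a numeric value keeps the key as its feature name.

```python
from numbers import Number


def feature_names(samples, separator="="):
    """Hypothetical helper: collect feature names roughly the way
    DictVectorizer does during fit (simplified sketch)."""
    names = set()
    for sample in samples:
        for key, value in sample.items():
            if isinstance(value, str):
                # String value: the name is "<key><separator><value>",
                # which is the string concatenation mentioned above.
                names.add(f"{key}{separator}{value}")
            elif isinstance(value, Number):
                # Numeric value: the key alone is the feature name.
                names.add(key)
            else:
                raise TypeError(f"Unsupported value type: {type(value)!r}")
    return sorted(names)


# Integer keys with string values, as in the failing example.
string_valued = [{0: "triangle", 3: "europe"}, {1: "horror", 2: "rock"}]
# String keys with numeric values, the usual case (illustrative data).
numeric_valued = [{"height": 1.5, "width": 2.0}, {"height": 0.5}]

print(feature_names(string_valued))   # ['0=triangle', '1=horror', '2=rock', '3=europe']
print(feature_names(numeric_valued))  # ['height', 'width']
```

In the first case every feature name contains the separator '=', which matches the condition reported in the RuntimeError. In the second case no concatenation is needed, so the mapping can be expressed with a plain key lookup.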

dafajon commented 2 years ago

Thank you very much for the explanation. In that case, will the supported types listed in the spec be changed?

xadupre commented 2 years ago

This operator is available in onnxruntime-extensions. We may use it in the converter library in a few releases.