Add converter for CountVectorizer with "char_wb" analyzer

onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX

Apache License 2.0

Add converter for CountVectorizer with "char_wb" analyzer #446

Open: cppntn opened this issue 4 years ago

cppntn commented 4 years ago

I tried to convert a CountVectorizer that uses the "char_wb" analyzer, but this error occurred:

NotImplementedError: CountVectorizer cannot be converted, only tokenizer='word' is supported. You may raise an issue at https://github.com/onnx/sklearn-onnx/issues.

That led me here to open this issue.

Thanks for your support.

xadupre commented 4 years ago

Right now, I have no easy way to fix it. Before extracting the character n-grams, scikit-learn preprocesses each string and collapses runs of whitespace into a single space: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L258. onnxruntime does not implement that behaviour, and the ONNX StringNormalizer operator only offers basic options: https://github.com/onnx/onnx/blob/master/docs/Operators.md.
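
For readers unfamiliar with the "char_wb" analyzer, here is a minimal pure-Python sketch of the preprocessing described above. It is not the scikit-learn source: the function name `char_wb_ngrams` and the constant `_WHITE_SPACES` are made up for illustration, and the sketch assumes lowercasing and accent stripping have already been applied. It only shows the two steps that matter for this issue: collapsing runs of whitespace, then extracting n-grams inside space-padded words.

```python
import re

# Assumption: two or more consecutive whitespace characters count as a "run".
_WHITE_SPACES = re.compile(r"\s\s+")


def char_wb_ngrams(text, ngram_range=(2, 2)):
    """Hypothetical helper: character n-grams restricted to word boundaries."""
    # Step 1: collapse runs of whitespace into a single space.
    text = _WHITE_SPACES.sub(" ", text)

    min_n, max_n = ngram_range
    ngrams = []
    for word in text.split():
        # Step 2: pad each word with one space on both sides.
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # A padded word no longer than n is emitted once, as a whole.
                ngrams.append(padded)
                break
            # Slide a window of width n across the padded word.
            for offset in range(len(padded) - n + 1):
                ngrams.append(padded[offset:offset + n])
    return ngrams


if __name__ == "__main__":
    # The double space between "ab" and "cd" does not change the result.
    print(char_wb_ngrams("ab  cd"))
    # [' a', 'ab', 'b ', ' c', 'cd', 'd ']
```

The whitespace collapse in step 1 and the per-word padding in step 2 are the parts that would have to be reproduced with ONNX operators for a converter to match scikit-learn's output.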