Describe the bug
This is related to https://github.com/onnx/sklearn-onnx/pull/485. onnxruntime seems to miss n-grams when a stopword sits between their tokens. Without stopwords, ngrams([a b c], (1, 2)) --> (a, ab, b, bc, c). If b is a stopword, we should have ngrams([a b c], (1, 2), stopwords=['b']) --> (a, ac, c), but onnxruntime seems to return (a, c).
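The expected behavior described above can be sketched in plain Python. This is only an illustration of the semantics the report asks for (stopwords are dropped first, then n-grams are built over the remaining tokens); the function name `ngrams` and its signature are hypothetical, mirror the notation used above, and are not part of onnxruntime or scikit-learn.

```python
def ngrams(tokens, ngram_range=(1, 2), stopwords=()):
    """Hypothetical helper: drop stopwords, then build n-grams.

    Tokens are concatenated without a separator to match the
    (a, ab, b, bc, c) notation used in this report.
    """
    # Remove stopwords before forming any n-gram.
    kept = [t for t in tokens if t not in stopwords]
    low, high = ngram_range
    out = []
    for start in range(len(kept)):
        for n in range(low, high + 1):
            if start + n <= len(kept):
                out.append("".join(kept[start:start + n]))
    return out


print(ngrams(["a", "b", "c"]))                   # ['a', 'ab', 'b', 'bc', 'c']
print(ngrams(["a", "b", "c"], stopwords=["b"]))  # ['a', 'ac', 'c']
```

With the stopword removed first, "a" and "c" become adjacent, which is why the bigram "ac" is expected.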
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
ONNX Runtime installed from (source or binary): source
ONNX Runtime version: 1.3
Python version: 3.7
Visual Studio version (if applicable): 2019
To Reproduce
import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession, __version__ as ort_version
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, SVR
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectKBest
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
import onnx
stopwords = ['the', 'and', 'is']

X_train = numpy.array([
    "This is the first document",
    "This document is the second document.",
    "And this is the third one",
    "Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words=stopwords, lowercase=True,
        ngram_range=(1, 2), max_features=30000)),
])
model_pipeline.fit(X_train.ravel(), y_train)
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
    model_pipeline, "cv", initial_types=initial_type,
    options={SVC: {'zipmap': False}})
exp = [model_pipeline.transform(X_train.ravel()).toarray()]
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
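# `verbose` was not defined in the original snippet; set it to print the details.
verbose = True
if verbose:
    # Print the vocabulary sorted by column index.
    voc = model_pipeline.steps[0][-1].vocabulary_
    voc = list(sorted([(v, k) for k, v in voc.items()]))
    for kv in voc:
        print(kv)

# Compare the scikit-learn output with the onnxruntime output.
for a, b in zip(exp, got):
    if verbose:
        print(a)
        print(b)
    assert_almost_equal(a, b)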
Output: onnxruntime does not detect the n-gram "this first".
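To pin down which vocabulary entries differ without reading the printed matrices by eye, a small helper along the following lines might be useful. It is a sketch only: `mismatched_terms` is a hypothetical name, and the toy data below is made up for illustration and is not the output of the script above.

```python
def mismatched_terms(vocabulary, expected, got):
    """Return the sorted vocabulary terms whose counts differ in any row.

    vocabulary: dict mapping term -> column index
                (the same layout as CountVectorizer.vocabulary_)
    expected, got: row-major sequences of counts with the same shape
    """
    index_to_term = {col: term for term, col in vocabulary.items()}
    bad_columns = set()
    for row_expected, row_got in zip(expected, got):
        for col, (e, g) in enumerate(zip(row_expected, row_got)):
            if e != g:
                bad_columns.add(col)
    return sorted(index_to_term[col] for col in bad_columns)


# Toy data, invented for illustration only.
vocabulary = {"first": 0, "this": 1, "this first": 2}
expected = [[1, 1, 1]]
got = [[1, 1, 0]]
print(mismatched_terms(vocabulary, expected, got))  # ['this first']
```

Assuming the variables from the reproduction script, it could be called as `mismatched_terms(model_pipeline.steps[0][-1].vocabulary_, exp[0], got[0])`.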
Expected behavior
The two last matrices must be equal.