microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

StringNormalizer+Tokenizer misses n-grams #4201

Open xadupre opened 4 years ago

xadupre commented 4 years ago

Describe the bug
This is related to https://github.com/onnx/sklearn-onnx/pull/485. onnxruntime seems to miss n-grams when stop words appear between the tokens. ngrams([a b c], (1, 2)) --> (a, ab, b, bc, c). If b is a stop word, we should get ngrams([a b c], (1, 2), stopwords=['b']) --> (a, ac, c), but onnxruntime seems to return only (a, c).
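For reference, a minimal sketch (not part of the original report) of what scikit-learn itself does: stop words are removed first and n-grams are then built from the remaining tokens, so a bigram can span a removed stop word.

from sklearn.feature_extraction.text import CountVectorizer

# Toy example (hypothetical input): 'is' plays the role of the stop word "b" above.
vect = CountVectorizer(stop_words=['is'], ngram_range=(1, 2))
vect.fit(['this is first'])
print(sorted(vect.vocabulary_))  # -> ['first', 'this', 'this first']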

System information

To Reproduce

import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

verbose = True  # print the vocabulary and both matrices
stopwords = ['the', 'and', 'is']
X_train = numpy.array([
    "This is the first document",
    "This document is the second document.",
    "And this is the third one",
    "Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])

# Single-step pipeline: unigrams and bigrams after stop-word removal.
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words=stopwords, lowercase=True,
        ngram_range=(1, 2), max_features=30000)),
])

model_pipeline.fit(X_train.ravel(), y_train)
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
    model_pipeline, "cv", initial_types=initial_type,
    options={SVC: {'zipmap': False}})

# Expected output computed with scikit-learn.
exp = [model_pipeline.transform(X_train.ravel()).toarray()]

# Output computed with onnxruntime on the converted model.
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
if verbose:
    voc = model_pipeline.steps[0][-1].vocabulary_
    voc = list(sorted([(v, k) for k, v in voc.items()]))
    for kv in voc:
        print(kv)
for a, b in zip(exp, got):
    if verbose:
        print(a)
        print(b)
    assert_almost_equal(a, b)

Output (onnxruntime does not detect some n-grams, such as 'this first'):

(0, 'document')
(1, 'document second')
(2, 'first')
(3, 'first document')
(4, 'one')
(5, 'second')
(6, 'second document')
(7, 'third')
(8, 'third one')
(9, 'this')
(10, 'this document')
(11, 'this first')
(12, 'this third')
[[1 0 1 1 0 0 0 0 0 1 0 1 0]
 [2 1 0 0 0 1 1 0 0 1 1 0 0]
 [0 0 0 0 1 0 0 1 1 1 0 0 1]
 [1 0 1 1 0 0 0 0 0 1 0 1 0]]
[[1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [2. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Expected behavior
The last two matrices should be equal.
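A small follow-up sketch, assuming the variables from the repro script above (exp, got, model_pipeline) are still in scope, that prints the vocabulary columns where the two matrices disagree; from the output above these are exactly the bigrams spanning a removed stop word ('document second', 'this first', 'this third').

import numpy

# Hypothetical helper, reusing exp/got/model_pipeline from the script above:
# find the columns where scikit-learn and onnxruntime disagree and print the
# corresponding vocabulary terms.
diff_cols = sorted({int(c) for c in numpy.argwhere(exp[0] != got[0])[:, 1]})
inv_voc = {v: k for k, v in model_pipeline.steps[0][-1].vocabulary_.items()}
for col in diff_cols:
    print(col, inv_voc[col])
# Expected to print the bigrams that span a stop word:
# 1 document second
# 11 this first
# 12 this third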

stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.