microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

StringNormalizer+Tokenizer misses n-grams #4201

Open xadupre opened 4 years ago

xadupre commented 4 years ago

Describe the bug
This is related to https://github.com/onnx/sklearn-onnx/pull/485. onnxruntime seems to miss n-grams when stop words appear between the tokens. ngrams([a b c], (1, 2)) --> (a, ab, b, bc, c). If b is a stop word, we should get ngrams([a b c], (1, 2), stopwords=['b']) --> (a, ac, c), but onnxruntime seems to return only (a, c).
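For reference, a minimal sketch (not part of the original report) of what scikit-learn itself does: stop words are removed first and n-grams are then built from the remaining tokens, so a bigram can span a removed stop word.

from sklearn.feature_extraction.text import CountVectorizer

# Toy example (hypothetical input): 'is' plays the role of the stop word "b" above.
vect = CountVectorizer(stop_words=['is'], ngram_range=(1, 2))
vect.fit(['this is first'])
print(sorted(vect.vocabulary_))  # -> ['first', 'this', 'this first']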

System information

To Reproduce

import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

verbose = True  # print the vocabulary and both matrices
stopwords = ['the', 'and', 'is']
X_train = numpy.array([
    "This is the first document",
    "This document is the second document.",
    "And this is the third one",
    "Is this the first document?",
]).reshape((4, 1))
y_train = numpy.array([0, 1, 0, 1])

# Single-step pipeline: unigrams and bigrams after stop-word removal.
model_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        stop_words=stopwords, lowercase=True,
        ngram_range=(1, 2), max_features=30000)),
])

model_pipeline.fit(X_train.ravel(), y_train)
initial_type = [('input', StringTensorType([None, 1]))]
model_onnx = convert_sklearn(
    model_pipeline, "cv", initial_types=initial_type,
    options={SVC: {'zipmap': False}})

# Expected output computed with scikit-learn.
exp = [model_pipeline.transform(X_train.ravel()).toarray()]

# Output computed with onnxruntime on the converted model.
sess = InferenceSession(model_onnx.SerializeToString())
got = sess.run(None, {'input': X_train})
if verbose:
    voc = model_pipeline.steps[0][-1].vocabulary_
    voc = list(sorted([(v, k) for k, v in voc.items()]))
    for kv in voc:
        print(kv)
for a, b in zip(exp, got):
    if verbose:
        print(a)
        print(b)
    assert_almost_equal(a, b)

Output (onnxruntime does not detect some n-grams, such as 'this first'):

(0, 'document')
(1, 'document second')
(2, 'first')
(3, 'first document')
(4, 'one')
(5, 'second')
(6, 'second document')
(7, 'third')
(8, 'third one')
(9, 'this')
(10, 'this document')
(11, 'this first')
(12, 'this third')
[[1 0 1 1 0 0 0 0 0 1 0 1 0]
 [2 1 0 0 0 1 1 0 0 1 1 0 0]
 [0 0 0 0 1 0 0 1 1 1 0 0 1]
 [1 0 1 1 0 0 0 0 0 1 0 1 0]]
[[1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [2. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0.]
 [1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Expected behavior
The last two matrices should be equal.
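A small follow-up sketch, assuming the variables from the repro script above (exp, got, model_pipeline) are still in scope, that prints the vocabulary columns where the two matrices disagree; from the output above these are exactly the bigrams spanning a removed stop word ('document second', 'this first', 'this third').

import numpy

# Hypothetical helper, reusing exp/got/model_pipeline from the script above:
# find the columns where scikit-learn and onnxruntime disagree and print the
# corresponding vocabulary terms.
diff_cols = sorted({int(c) for c in numpy.argwhere(exp[0] != got[0])[:, 1]})
inv_voc = {v: k for k, v in model_pipeline.steps[0][-1].vocabulary_.items()}
for col in diff_cols:
    print(col, inv_voc[col])
# Expected to print the bigrams that span a stop word:
# 1 document second
# 11 this first
# 12 this third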

stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.