Open nssprogrammer opened 1 year ago
Here is the specification for TfIdfVectorizer: https://github.com/onnx/onnx/blob/435ad2b1d80f67e5e85e83092bbd6d3900f40806/docs/Operators.md#tfidfvectorizer
"An n-gram which cannot be found in pool_strings/pool_int64s should be ignored and has no effect on the output." https://github.com/onnx/onnx/blob/435ad2b1d80f67e5e85e83092bbd6d3900f40806/docs/Operators.md?plain=1#L30502
Excluding the not-found n-grams from the output seems consistent with this. Feel free to open an issue in ONNX if you think the specification is not correct.
Describe the issue
A sklearn TF-IDF vectorizer model was converted to ONNX. During batch prediction it should return zero vectors for out-of-vocabulary texts, but instead it returns nothing for those texts and removes them from the output altogether.
Example: Suppose the batch contains 10 documents, and the documents at indices 3, 5, and 8 are out-of-vocabulary texts. The ONNX TF-IDF vectorizer model returns only 7 vectors, dropping the out-of-vocabulary texts from the output.
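To make the expected behavior concrete, here is a minimal pure-Python sketch of what I would expect from a batch of 10 documents. It is not the ONNX Runtime or sklearn implementation. The `count_vectorize` helper, the three-word vocabulary, and the sample texts are all hypothetical, and it only counts unigrams without any IDF weighting.

```python
def count_vectorize(docs, vocabulary):
    """Return one count vector per document; tokens outside the vocabulary are ignored."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in index:  # unknown tokens contribute nothing
                row[index[token]] += 1
        rows.append(row)  # the row is kept even when it is all zeros
    return rows


# Hypothetical vocabulary and batch: documents 3, 5 and 8 contain no known token.
vocabulary = ["apple", "banana", "cherry"]
docs = ["apple banana"] * 10
for i in (3, 5, 8):
    docs[i] = "xyzzy plugh"

vectors = count_vectorize(docs, vocabulary)
oov_rows = [i for i, row in enumerate(vectors) if not any(row)]
```

In this sketch the output keeps one row per input document, so `len(vectors)` equals `len(docs)` and the out-of-vocabulary documents show up as all-zero rows at their original indices. That row alignment is what gets lost when those documents are removed from the output.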
To reproduce
NA
Urgency
No response
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
None
Execution Provider
Other / Unknown