microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Nothing returned for out-of-vocabulary text during batch inference with a TF-IDF ONNX vectorizer model #16251

Open nssprogrammer opened 1 year ago

nssprogrammer commented 1 year ago

Describe the issue

A sklearn TfidfVectorizer model was converted to ONNX. During batch prediction it should return zero vectors for out-of-vocabulary texts, but it returns nothing for those texts and drops them from the output altogether.

Example: Suppose the batch contains 10 documents, and the documents at indices 3, 5 and 8 are out-of-vocabulary texts. The TF-IDF ONNX vectorizer model returns only 7 vectors, removing the out-of-vocabulary texts from the output.
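To make the expected output shape concrete, here is a minimal pure-Python sketch of the behavior described above. It is not the converted model and not a reproduction. The toy vocabulary, the documents, the whitespace tokenizer, and the helper `count_vectorize` are all assumptions made up for illustration, and it uses plain term counts rather than TF-IDF weights.

```python
def count_vectorize(docs, vocabulary):
    """Return one count vector per document; tokens outside the vocabulary are skipped."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in index:
                row[index[token]] += 1
        # The row is appended even when every token was unknown,
        # so an out-of-vocabulary document becomes an all-zero vector.
        rows.append(row)
    return rows


# Hypothetical vocabulary and batch: documents 3, 5 and 8 contain no known term.
vocabulary = ["cat", "dog", "fish"]
docs = [
    "cat dog", "dog", "fish cat", "zzz qqq", "cat",
    "xyz", "dog dog", "fish", "unknown words", "cat fish",
]

vectors = count_vectorize(docs, vocabulary)
zero_rows = [i for i, row in enumerate(vectors) if not any(row)]
```

With this sketch, `vectors` keeps one row per input document and `zero_rows` lists the positions of the out-of-vocabulary documents, which is the alignment the report expects from the ONNX model.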

To reproduce

NA

Urgency

No response

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

None

Execution Provider

Other / Unknown

edgchen1 commented 1 year ago

Here is the specification for TfIdfVectorizer: https://github.com/onnx/onnx/blob/435ad2b1d80f67e5e85e83092bbd6d3900f40806/docs/Operators.md#tfidfvectorizer

"An n-gram which cannot be found in pool_strings/pool_int64s should be ignored and has no effect on the output." https://github.com/onnx/onnx/blob/435ad2b1d80f67e5e85e83092bbd6d3900f40806/docs/Operators.md?plain=1#L30502

Excluding the not-found n-grams from the output seems consistent with this. Feel free to open an issue in ONNX if you think the specification is not correct.
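As an illustration of the quoted sentence, the following is a rough pure-Python sketch of one reading of it, restricted to unigrams and plain counts. It is not the ONNX Runtime kernel. The attribute names `pool_strings` and `ngram_indexes` are taken from the linked specification, while the function `tf_unigrams`, the sample values, and the output length of `max(ngram_indexes) + 1` are assumptions based on my reading of that document.

```python
def tf_unigrams(tokens, pool_strings, ngram_indexes):
    """Count unigrams from `tokens`, ignoring any token that is not in `pool_strings`."""
    # Assumed output layout: slot ngram_indexes[i] holds the count of pool_strings[i].
    output = [0.0] * (max(ngram_indexes) + 1)
    position = {s: i for i, s in enumerate(pool_strings)}
    for token in tokens:
        i = position.get(token)
        if i is None:
            # Not in the pool: ignored, no effect on the output.
            continue
        output[ngram_indexes[i]] += 1.0
    return output


# Hypothetical pool and index mapping, for illustration only.
pool_strings = ["cat", "dog", "fish"]
ngram_indexes = [2, 0, 1]

known = tf_unigrams(["dog", "cat", "dog", "bird"], pool_strings, ngram_indexes)
unknown = tf_unigrams(["bird", "horse"], pool_strings, ngram_indexes)
```

Under this reading, `known` is `[2.0, 0.0, 1.0]` and `unknown` is `[0.0, 0.0, 0.0]`: the unknown tokens add nothing to any slot, and the sketch still produces a fixed-length vector for an input made up entirely of unknown tokens.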