rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.18k stars 527 forks source link

[BUG] CountVectorizer Vocabulary Length Mismatch #5709

Open Vortexx2 opened 9 months ago

Vortexx2 commented 9 months ago

Describe the bug Upon using the code provided to fit a CountVectorizer on a given text series, it causes an error to pop up where the lengths of the calculated vocabulary and document frequencies don't match, leading to an error in the _limit_features method, when using a mask for the stop_words_ and vocabulary_ variables. The length of the document frequencies calculated using the document_frequency() method is one less compared to the length of the calculated vocabulary. Upon further inspection, the vocabulary seems to have one last entry (when sorted alphabetically) which is <NA>. I'm not sure, but it seems like this is causing the off by one error. This only occurs when the last string shown below (443) is included in the Series, otherwise this error does not occur.

Steps/Code to reproduce bug Minimum Code required to reproduce:

from cudf.core.series import Series
from cuml.feature_extraction.text import CountVectorizer

# make a random text series with 5 rows
text = Series(['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443'])
# use the text series to create a CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')
# fit the vectorizer to the text series
vectorizer.fit(text)

Expected behavior The CountVectorizer should be easily fit to even such a small Dataset.

Environment details (please complete the following information):

Vortexx2 commented 9 months ago

It seems like this is a problem occurring not in cuML, but in cuDF. I have made a PR there to fix this issue as well. cuDF issue

dantegd commented 9 months ago

Thanks for the issue @Vortexx2 and fix in cuDF! Looking forward to the review and merge process over there.