Open. Vortexx2 opened this issue 9 months ago.
It seems this problem occurs in cuDF, not in cuML. I have also opened a PR there to fix it. (cuDF issue)
Thanks for the issue @Vortexx2 and fix in cuDF! Looking forward to the review and merge process over there.
Describe the bug
Fitting a CountVectorizer on a given text Series with the code provided raises an error because the lengths of the calculated vocabulary and the document frequencies don't match. The failure occurs in the _limit_features method, when using a mask for the stop_words_ and vocabulary_ variables. The document frequencies calculated by the document_frequency() method have one entry fewer than the calculated vocabulary. On further inspection, the vocabulary has one last entry (when sorted alphabetically) which is <NA>. I'm not sure, but this seems to be what causes the off-by-one error. It only occurs when the last string shown below (443) is included in the Series; otherwise the error does not occur.
Steps/Code to reproduce bug
Minimum code required to reproduce:
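The snippet below is not the cuML reproduction. It is a minimal pure-Python sketch of the suspected mechanism, written under stated assumptions: a null token (standing in for <NA>) ends up in the vocabulary, while the document-frequency count skips nulls, so the mask built from the frequencies is one entry short. All function names here (tokenize, build_vocabulary, document_frequency, limit_features) are hypothetical stand-ins and are not the cuML or cuDF implementations; how the real <NA> entry arises is not confirmed.

```python
from collections import Counter

NA = None  # stand-in for cuDF's <NA>; an assumption made for illustration


def tokenize(doc):
    # Hypothetical tokenizer: whitespace split. A document with no tokens
    # yields a single null token, which is one assumed way a null could
    # leak into the vocabulary.
    tokens = doc.split()
    return tokens if tokens else [NA]


def build_vocabulary(docs):
    # Unique tokens sorted alphabetically, with the null entry placed last.
    seen = {tok for doc in docs for tok in tokenize(doc)}
    words = sorted(tok for tok in seen if tok is not NA)
    return words + ([NA] if NA in seen else [])


def document_frequency(docs):
    # Number of documents containing each token. Nulls are skipped here,
    # the way an aggregation that drops null keys would behave (assumed).
    counts = Counter()
    for doc in docs:
        for tok in set(tokenize(doc)):
            if tok is not NA:
                counts[tok] += 1
    return [counts[tok] for tok in sorted(counts)]


def limit_features(vocab, doc_freq, min_df=1):
    # Build a boolean mask from the frequencies and apply it to the
    # vocabulary; this fails when the two lengths disagree.
    mask = [count >= min_df for count in doc_freq]
    if len(mask) != len(vocab):
        raise ValueError(
            f"mask length {len(mask)} != vocabulary length {len(vocab)}"
        )
    return [tok for tok, keep in zip(vocab, mask) if keep]


DOCS = ["apple banana", "banana cherry", ""]  # made-up data, not the original Series

vocab = build_vocabulary(DOCS)
doc_freq = document_frequency(DOCS)
print(len(vocab), len(doc_freq))  # vocabulary is one entry longer

# Dropping the null entry before masking restores matching lengths.
clean_vocab = [tok for tok in vocab if tok is not NA]
print(limit_features(clean_vocab, doc_freq))
```

In this sketch, calling limit_features(vocab, doc_freq) on the unfiltered vocabulary raises a ValueError, which mirrors the length mismatch described above. Filtering the null entry out first is only one possible remedy and is not necessarily what the cuDF fix does.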
Expected behavior
The CountVectorizer should fit easily, even on such a small dataset.
Environment details (please complete the following information):