[BUG] CountVectorizer Vocabulary Length Mismatch

Describe the bug Fitting a CountVectorizer on the text Series below raises an error because the lengths of the computed vocabulary and the document frequencies do not match. The failure occurs in the _limit_features method, when a mask is applied to the stop_words_ and vocabulary_ variables. The document frequencies computed by the document_frequency() method have one entry fewer than the computed vocabulary. On closer inspection, the last entry of the vocabulary (when sorted alphabetically) is <NA>. I'm not sure, but this appears to be what causes the off-by-one error. The error only occurs when the last string shown below ('443') is included in the Series; without it, the error does not occur.

Steps/Code to reproduce bug Minimal code required to reproduce:

from cudf.core.series import Series
from cuml.feature_extraction.text import CountVectorizer

# make a small text series with 9 rows (two of them empty strings)
text = Series(['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443'])
# use the text series to create a CountVectorizer
vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='char')
# fit the vectorizer to the text series
vectorizer.fit(text)

Expected behavior The CountVectorizer should fit without error, even on such a small dataset.
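For reference, here is a minimal pure-Python sketch of the invariant that fitting is expected to preserve: every vocabulary entry has exactly one document-frequency entry, so both have the same length. It is not cuML's implementation; the helper name char_ngrams and the variables vocabulary and doc_freq are made up for illustration, and the n-gram extraction is a simplified stand-in for analyzer='char' with ngram_range=(2, 3).

```python
from collections import Counter

# Same strings as in the reproducer above.
docs = ['1788', '1788', 'update.zip', '1788', '1788', 'update.zip', '', '', '443']


def char_ngrams(text, min_n=2, max_n=3):
    """Return all character n-grams of `text` for n in [min_n, max_n].

    Hypothetical helper: strings shorter than n simply yield no n-grams,
    so empty strings contribute nothing to the vocabulary.
    """
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


# Vocabulary: sorted set of every n-gram seen in any document.
vocabulary = sorted({gram for doc in docs for gram in char_ngrams(doc)})

# Document frequency: number of documents containing each n-gram at least once.
doc_freq = Counter(gram for doc in docs for gram in set(char_ngrams(doc)))

# The two must line up one-to-one; a null/<NA> vocabulary entry would break this.
print(len(vocabulary), len(doc_freq))
print(len(vocabulary) == len(doc_freq))
```

In this sketch the empty strings produce no n-grams at all, so no null entry can enter the vocabulary. Whether the extra <NA> entry in cuML comes from the empty strings, from '443', or from their combination is not established here.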

Environment details (please complete the following information):

Environment location: GCP cloud, pip used
Linux Distro/Architecture: [Ubuntu 16.04 amd64]
GPU Model/Driver: T4 / 525.105.17
CUDA: 11.8

Method of cuDF & cuML install: pip

pip list:

aiohttp                   3.9.1
aiosignal                 1.3.1
anyio                     4.2.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     23.1.0
beautifulsoup4            4.12.2
bleach                    6.1.0
bokeh                     3.3.2
cachetools                5.3.2
certifi                   2023.11.17
cffi                      1.16.0
charset-normalizer        3.3.2
click                     8.1.7
click-plugins             1.1.1
cligj                     0.7.2
cloudpickle               3.0.0
colorcet                  3.0.1
comm                      0.2.0
contourpy                 1.2.0
cubinlinker-cu11          0.3.0.post1
cucim-cu11                23.12.1
cuda-python               11.8.3
cudf-cu11                 23.12.1
cugraph-cu11              23.12.0
cuml-cu11                 23.12.0
cuproj-cu11               23.12.1
cupy-cuda11x              12.3.0
cuspatial-cu11            23.12.1
cuxfilter-cu11            23.12.0
cycler                    0.12.1
dask                      2023.11.0
dask-cuda                 23.12.0
dask-cudf-cu11            23.12.0
datashader                0.16.0
debugpy                   1.8.0
decorator                 5.1.1
defusedxml                0.7.1
distributed               2023.11.0
exceptiongroup            1.2.0
executing                 2.0.1
fastjsonschema            2.19.0
fastrlock                 0.8.2
filelock                  3.13.1
fiona                     1.9.5
fonttools                 4.47.0
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2023.12.2
geopandas                 0.14.1
holoviews                 1.18.1
idna                      3.6
imageio                   2.33.1
importlib-metadata        7.0.1
iniconfig                 2.0.0
ipykernel                 6.28.0
ipython                   8.19.0
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
joblib                    1.3.2
jsonpointer               2.4
jsonschema                4.20.0
jsonschema-specifications 2023.12.1
jupyter_client            8.6.0
jupyter_core              5.6.0
jupyter-events            0.9.0
jupyter_server            2.12.1
jupyter_server_proxy      4.1.0
jupyter_server_terminals  0.5.1
jupyterlab_pygments       0.3.0
kiwisolver                1.4.5
lazy_loader               0.3
linkify-it-py             2.0.2
llvmlite                  0.40.1
locket                    1.0.0
Markdown                  3.5.1
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.2
matplotlib-inline         0.1.6
mdit-py-plugins           0.4.0
mdurl                     0.1.2
mistune                   3.0.2
msgpack                   1.0.7
multidict                 6.0.4
multipledispatch          1.0.0
nbclient                  0.9.0
nbconvert                 7.13.1
nbformat                  5.9.2
nest-asyncio              1.5.8
networkx                  3.2.1
numba                     0.57.1
numpy                     1.24.4
nvtx                      0.2.8
overrides                 7.4.0
packaging                 23.2
pandas                    1.5.3
pandocfilters             1.5.0
panel                     1.3.6
param                     2.0.1
parso                     0.8.3
partd                     1.4.1
pexpect                   4.9.0
Pillow                    10.1.0
pip                       23.0.1
platformdirs              4.1.0
pluggy                    1.3.0
polars                    0.20.2
prometheus-client         0.19.0
prompt-toolkit            3.0.43
protobuf                  4.25.1
psutil                    5.9.7
ptxcompiler-cu11          0.7.0.post1
ptyprocess                0.7.0
pure-eval                 0.2.2
pyarrow                   14.0.2
pycparser                 2.21
pyct                      0.5.0
Pygments                  2.17.2
pylibcugraph-cu11         23.12.0
pylibraft-cu11            23.12.0
pynvml                    11.4.1
pyparsing                 3.1.1
pyproj                    3.6.1
pytest                    7.4.3
python-dateutil           2.8.2
python-json-logger        2.0.7
pytz                      2023.3.post1
pyviz_comms               3.0.0
PyWavelets                1.5.0
PyYAML                    6.0.1
pyzmq                     25.1.2
raft-dask-cu11            23.12.0
rapids-dask-dependency    23.12.1
referencing               0.32.0
requests                  2.31.0
requests-file             1.5.1
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      13.7.0
rmm-cu11                  23.12.0
rpds-py                   0.15.2
scikit-image              0.21.0
scikit-learn              1.3.2
scipy                     1.11.4
Send2Trash                1.8.2
setuptools                65.5.0
shapely                   2.0.2
simpervisor               1.0.0
six                       1.16.0
sniffio                   1.3.0
sortedcontainers          2.4.0
soupsieve                 2.5
stack-data                0.6.3
tblib                     3.0.0
terminado                 0.18.0
threadpoolctl             3.2.0
tifffile                  2023.12.9
tinycss2                  1.2.1
tldextract                5.1.1
tomli                     2.0.1
toolz                     0.12.0
tornado                   6.4
tqdm                      4.66.1
traitlets                 5.14.0
treelite                  3.9.1
treelite-runtime          3.9.1
types-python-dateutil     2.8.19.14
typing_extensions         4.9.0
uc-micro-py               1.0.2
ucx-py-cu11               0.35.0
uri-template              1.3.0
urllib3                   2.1.0
wcwidth                   0.2.12
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
xarray                    2023.12.0
xyzservices               2023.10.1
yarl                      1.9.4
zict                      3.0.0
zipp                      3.17.0

rapidsai / cuml

[BUG] CountVectorizer Vocabulary Length Mismatch #5709