Describe the bug
The `str.character_ngrams` function produces an `<NA>` token for strings that are shorter than the provided `n` (shown in the image for the case of bigrams).
I have debugged this and, as far as I understand it, the cause is an empty list returned by the `libstrings.generate_character_ngrams` function. This causes `<NA>` to be a part of the result when the list is exploded in the problematic function.
This issue causes several bugs in downstream tasks (for example, when using cuml for `CountVectorizer`).
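To make the suspected mechanism concrete, here is a small pure-Python model of it. This is only a sketch, not cuDF's implementation: `character_ngrams` and `explode` below are hypothetical stand-ins, and the assumption is that exploding a list column turns an empty list into a single missing value.

```python
def character_ngrams(s, n):
    """Hypothetical stand-in: all character n-grams of s, or [] when len(s) < n."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def explode(rows):
    """Hypothetical stand-in for a list-column explode.

    Assumption: an empty list becomes one missing value (None here),
    which is how the <NA> token would enter the result.
    """
    out = []
    for row in rows:
        if row:
            out.extend(row)
        else:
            out.append(None)
    return out


strings = ["abcd", "a", "xyz"]

# "a" is shorter than n=2, so its n-gram list is empty.
per_string = [character_ngrams(s, 2) for s in strings]

# Exploding keeps a placeholder row for the empty list.
exploded = explode(per_string)

# Dropping empty lists before flattening avoids the placeholder.
filtered = [tok for row in per_string for tok in row]
```

In this model `exploded` contains a `None` entry for the string `"a"`, while `filtered` does not. Skipping empty lists before the explode (or dropping nulls afterwards) would be one possible way to avoid the extra token.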
Steps/Code to reproduce bug
Minimum code required to reproduce the bug:
![result output](https://github.com/rapidsai/cudf/assets/68988130/946aeebb-6be3-4719-91e7-25eb9e2c0091)
Expected behavior