Open VibhuJawa opened 1 year ago
Besides accepting the vocab file directly instead of the hashed input, specifically want to ask about the parameters and return values for this function: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.subword_tokenizer.subwordtokenizer.__call__/#cudf.core.subword_tokenizer.SubwordTokenizer.__call__ I believe some of these pre-date the lists-column support in cuDF and so I'm wondering if that could used as be a better result than returning the three individual arrays.
@davidwendt based on https://github.com/rapidsai/cudf/pull/17208/ should we close this if the function is going away entirely?
@davidwendt based on #17208 should we close this if the function is going away entirely?
Not yet. We need to get Morpheus to move off the subword-tokenizer I think. I'd like to deprecate the API in this release to help force this a bit.
Sounds good, early deprecation sounds like the right move.
Is your feature request related to a problem? Please describe.
We currently rely on the hashed vocab file using
cudf.utils.hash_vocab_utils.hash_vocab
, we should move to using thevocab file
directly.This will be similar to the API we added here: https://github.com/rapidsai/cudf/pull/13930
Describe the solution you'd like
Instead of earlier:
Additional context This should help the switch from hugging face like tokenizer to be easier.
CC: @davidwendt
@MarcRomeijn, @cwharris, @markmotrin - Given your experience with the cuDF tokenizer, we'd value any feedback or suggestions for enhancing the
Subword tokenizer
API and features you would like.