
[FEA] Switch cudf.SubwordTokenizer to use the vocab file directly instead of hash_vocab. #14294

Open · VibhuJawa opened 1 year ago

VibhuJawa commented 1 year ago

Is your feature request related to a problem? Please describe.

We currently rely on a hashed vocab file produced by `cudf.utils.hash_vocab_utils.hash_vocab`; we should move to using the vocab file directly.

This will be similar to the API we added here: https://github.com/rapidsai/cudf/pull/13930

Describe the solution you'd like

```python
cudf_tokenizer = cudf.core.subword_tokenizer.SubwordTokenizer(vocab_file='xyz.txt')
```

Instead of the current flow:

```python
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
cudf_tokenizer = SubwordTokenizer('voc_hash.txt')
```
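For reference, here is an end-to-end sketch of the current workflow, including the `__call__` step discussed below (parameter names follow the stable cuDF docs; the vocab filename and sample strings are placeholders):

```python
import cudf
from cudf.utils.hash_vocab_utils import hash_vocab
from cudf.core.subword_tokenizer import SubwordTokenizer

# One-time preprocessing step this issue proposes to eliminate:
hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')

tokenizer = SubwordTokenizer('voc_hash.txt', do_lower_case=False)
text = cudf.Series(['This is a test.', 'GPU tokenization is fast.'])
tokens = tokenizer(
    text,
    max_length=64,
    max_num_rows=len(text),
    padding='max_length',
    return_tensors='cp',  # CuPy arrays, kept on the GPU
    truncation=True,
)
# tokens is a dict of three flat tensors:
#   'input_ids', 'attention_mask', and 'metadata'
```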

Additional context

This should make switching from Hugging Face-style tokenizers easier.
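For comparison, the Hugging Face flow that the proposed API would mirror (a sketch; `bert-base-cased` is just an illustrative model):

```python
from transformers import BertTokenizerFast

# Hugging Face builds the tokenizer straight from pretrained vocab files:
hf_tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
encoded = hf_tokenizer(
    ['This is a test.'],
    max_length=64,
    padding='max_length',
    truncation=True,
    return_tensors='np',
)
# With a vocab-file-based constructor, the cuDF path would need only the
# same vocab file, with no separate hashing step in between.
```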

CC: @davidwendt

@MarcRomeijn, @cwharris, @markmotrin - Given your experience with the cuDF tokenizer, we'd value any feedback or suggestions for enhancing the subword tokenizer API, along with any features you'd like to see.

davidwendt commented 1 year ago

Besides accepting the vocab file directly instead of the hashed input, I specifically want to ask about the parameters and return values for this function: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.core.subword_tokenizer.subwordtokenizer.__call__/#cudf.core.subword_tokenizer.SubwordTokenizer.__call__

I believe some of these pre-date the lists-column support in cuDF, so I'm wondering if a lists column could be a better result than returning the three individual arrays.
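To make the suggestion concrete, here is a purely hypothetical contrast between the current dict-of-arrays result and a lists-column result (neither the names nor the shapes below are an existing API):

```python
import cudf

# Today, __call__ returns three fixed-width tensors shaped
# (max_num_rows, max_length): 'input_ids', 'attention_mask', 'metadata',
# which forces callers to choose padding sizes up front.

# A lists-column result could instead be ragged, one row per input string,
# with no padding parameters needed:
hypothetical_ids = cudf.Series([[101, 1188, 1110, 102], [101, 17095, 102]])
print(hypothetical_ids.dtype)  # ListDtype(int64)
```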

vyasr commented 2 weeks ago

@davidwendt based on https://github.com/rapidsai/cudf/pull/17208/, should we close this if the function is going away entirely?

davidwendt commented 2 weeks ago

> @davidwendt based on #17208, should we close this if the function is going away entirely?

Not yet. I think we need to get Morpheus to move off the subword tokenizer first. I'd like to deprecate the API in this release to help push that along a bit.
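For illustration, a deprecation warning of the sort described might look like this (a sketch only; the message text and timing are assumptions, not the actual cuDF patch):

```python
import warnings


class SubwordTokenizer:
    def __init__(self, hash_file: str, do_lower_case: bool = True):
        # Sketch: warn at construction time so downstream users such as
        # Morpheus see the notice well before the API is removed.
        warnings.warn(
            "SubwordTokenizer is deprecated and will be removed in a "
            "future release; see https://github.com/rapidsai/cudf/pull/17208.",
            FutureWarning,
        )
        self.do_lower_case = do_lower_case
        # ...existing initialization would follow here...
```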

vyasr commented 2 weeks ago

Sounds good; deprecating early seems like the right move.