wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Support for FastTokenizer in huggingface #52

Closed zhipeng-cai closed 1 month ago

zhipeng-cai commented 1 year ago

Hello, I found that there is no corresponding PLBartTokenizerFast in Hugging Face Transformers. Do you plan to implement a fast version of the tokenizer?

In fact, I need to call the word_ids() method of a fast tokenizer to get a list mapping each tokenized token back to the original word it came from: `word_ids = tokenized_inputs.word_ids(batch_index=i)`

Alternatively, is there another way to compute the original word index corresponding to each tokenized token?

Thank you very much!
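One possible workaround, until a fast tokenizer exists: tokenize each pre-split word separately with the slow tokenizer and record the word index for every subword it produces. This is a minimal sketch, not PLBART's official API; `word_ids_from_slow_tokenizer` and `ToyTokenizer` are hypothetical names, and in practice you would pass an instance from `PLBartTokenizer.from_pretrained(...)` (any object with a `tokenize(word)` method works).

```python
def word_ids_from_slow_tokenizer(tokenizer, words):
    """Approximate FastTokenizer.word_ids() for a slow tokenizer.

    Each word in `words` is tokenized on its own, so every subword
    inherits that word's index. Special tokens (BOS/EOS, language
    codes) are not handled here and would need to be mapped to None.
    """
    tokens, token_word_ids = [], []
    for idx, word in enumerate(words):
        subwords = tokenizer.tokenize(word)
        tokens.extend(subwords)
        token_word_ids.extend([idx] * len(subwords))
    return tokens, token_word_ids


# Toy stand-in tokenizer for demonstration only: splits every word
# into 4-character pieces to imitate subword tokenization.
class ToyTokenizer:
    def tokenize(self, word):
        return [word[i:i + 4] for i in range(0, len(word), 4)]


tokens, word_ids = word_ids_from_slow_tokenizer(
    ToyTokenizer(), ["tokenization", "is", "fun"]
)
# tokens   -> ['toke', 'niza', 'tion', 'is', 'fun']
# word_ids -> [0, 0, 0, 1, 2]
```

Note that tokenizing words one at a time can differ slightly from tokenizing the full sentence (SentencePiece is context-sensitive around whitespace), so results should be spot-checked against the joint tokenization.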

FahadEbrahim commented 4 months ago

I'd appreciate it if a fast tokenizer could be implemented for this model in Hugging Face Transformers.