pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Fast WordPiece Tokenization #1465

Open · minhnhat10 opened this issue 2 years ago

minhnhat10 commented 2 years ago

Hello everyone, I want to implement the Fast WordPiece Tokenization algorithm introduced by Google.

Fast WordPiece algorithm

Google introduced a new algorithm called LinMaxMatch for WordPiece tokenization that runs in O(n) time. I realized that PyTorch doesn't support it yet, so I want to implement it. This could be especially useful for mobile platforms.
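For concreteness, below is a minimal sketch of the classic greedy longest-match-first (MaxMatch) WordPiece procedure whose output LinMaxMatch reproduces in linear time by precomputing Aho-Corasick-style failure links over a vocabulary trie. The vocabulary and function name here are illustrative only, not part of any existing API:

```cpp
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first (MaxMatch) WordPiece baseline. The "##" prefix
// marks continuation pieces, following BERT's convention.
std::vector<std::string> max_match_wordpiece(
    const std::string& word,
    const std::unordered_set<std::string>& vocab) {
  std::vector<std::string> pieces;
  size_t start = 0;
  while (start < word.size()) {
    // Shrink the candidate substring from the right until it is in the vocab.
    size_t end = word.size();
    std::string match;
    while (end > start) {
      std::string piece =
          (start == 0 ? "" : "##") + word.substr(start, end - start);
      if (vocab.count(piece)) {
        match = piece;
        break;
      }
      --end;
    }
    if (match.empty()) {
      return {"[UNK]"};  // no piece matches: the whole word maps to [UNK]
    }
    pieces.push_back(match);
    start = end;
  }
  return pieces;
}

int main() {
  const std::unordered_set<std::string> vocab = {
      "un", "##aff", "##able", "##a", "##ff"};
  for (const auto& piece : max_match_wordpiece("unaffable", vocab)) {
    std::cout << piece << ' ';  // prints: un ##aff ##able
  }
  std::cout << '\n';
}
```

The inner scan that shrinks the candidate substring is what makes this baseline O(n^2) per word in the worst case; LinMaxMatch eliminates it with the precomputed failure links.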

Implementation

I'm using C++. I have read PyTorch's contributing documentation and the source code of TORCHTEXT.DATA.UTILS, and I see two options for this feature: implementing it in the c10 folder, or implementing it as a Python package. I want to know which way is better and more appropriate for the PyTorch project.

Thank you for any advice and suggestions.

bdhirsh commented 2 years ago

I don't see a label for torch text. @ejguan what label do you think this should go under?

ejguan commented 2 years ago

> I don't see a label for torch text. @ejguan what label do you think this should go under?

@bdhirsh We should transfer this issue to TorchText repo.

parmeet commented 2 years ago

Hi @minhnhat10, thank you for your proposal. This would certainly be a welcome contribution to the torchtext repo :).

Currently, we have sentencepiece and byte-level BPE (used in GPT-2) implemented and bound to Python; these could act as a starting point for getting familiar with the code-base.

I would suggest looking at the sentencepiece wrapper and the corresponding registration mechanisms: using pybind and using torchbind.
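For illustration, a rough sketch of the torchbind route, assuming a hypothetical FastWordPiece class (class and method names are placeholders, not torchtext's actual API):

```cpp
#include <torch/custom_class.h>
#include <string>
#include <vector>

// Hypothetical tokenizer class; the real implementation would hold the
// LinMaxMatch trie built from the vocabulary.
struct FastWordPiece : torch::CustomClassHolder {
  std::vector<std::string> vocab_;
  explicit FastWordPiece(std::vector<std::string> vocab)
      : vocab_(std::move(vocab)) {}
  std::vector<std::string> tokenize(std::string text) {
    return {std::move(text)};  // placeholder: real tokenization goes here
  }
};

// torchbind registration: exposes the class to Python as
// torch.classes.torchtext.FastWordPiece and makes it scriptable.
static auto registration =
    torch::class_<FastWordPiece>("torchtext", "FastWordPiece")
        .def(torch::init<std::vector<std::string>>())
        .def("tokenize", &FastWordPiece::tokenize);
```

Once the compiled library is loaded with torch.classes.load_library, the class is reachable from Python as torch.classes.torchtext.FastWordPiece and remains usable inside TorchScript; the pybind route instead exposes the class through a regular extension module for eager-mode use.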

cc: @abhinavarora

minhnhat10 commented 2 years ago

Hi @parmeet, thank you for your suggestion. I will try that.

lucifermorningstar1305 commented 1 year ago

Hey guys,

I know it's pretty late, but I have implemented a vanilla Python version of Fast WordPiece Tokenization and look forward to contributing it to torchtext with some help.

https://github.com/lucifermorningstar1305/fast_wordpiece_tokenization

Nayef211 commented 1 year ago

Hey @lucifermorningstar1305. Thanks for reaching out here. Our implementation of the BERT Tokenizer is written in C++ and can be found within these two files:

You are welcome to contribute the Fast WordPiece Tokenizer in a separate file within that folder. You can find the contribution guidelines for C++ operators here. Before we migrate over to the new tokenizer, we would also want some simple benchmarks showcasing that the new tokenizer is indeed faster than the existing BERTTokenizer. I am happy to review any PRs you make and help with questions about contributing!