eduardoleao052 opened 9 months ago
I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenisation.
I see. I've been trying to study tokenization a bit more lately, so thanks for the tiktoken tip! If you don't mind me asking, what have you moved on to in terms of interests after learning about transformers and such?
I have moved on to the production side of deep learning for freelance projects, so I'm relying on pre-trained models only. I know it's not ideal to skip the theory and just build on what others have built, but it's a lot less stressful and frees up time compared to trying to keep up with all the new stuff in detail. @eduardoleao052
That's cool! I guess it's natural, after studying something from a theoretical standpoint, to want to move on to the practical side of things.
Yes. 🙂
Have you been able to get good results with the tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think it's because it works at the word level — maybe there's no escaping a larger vocab size that way.
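To illustrate why a word-level regex inflates the vocab — this is just a toy sketch with a made-up corpus, and the regex pattern is an assumed stand-in for whatever pattern is actually in use:

```python
import re

# Hypothetical tiny corpus; morphological variants are what hurt word-level vocabs.
corpus = (
    "tokenize tokenized tokenizer tokenizing "
    "learn learned learning learner relearn"
)

# Word-level split: words and single punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", corpus)

# Every distinct surface form becomes its own vocabulary entry.
word_vocab = sorted(set(tokens))
print(len(word_vocab))  # 9 entries for what is really just 2 stems plus affixes

print(word_vocab)
```

A subword scheme (BPE, WordPiece, etc.) would cover the same corpus with shared pieces like `token`, `learn`, `ize`, `ed`, `ing`, which is why the vocab stays bounded as the corpus grows.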