jsbaan / transformer-from-scratch

Well-documented, unit-tested, type-checked, and formatted implementation of a vanilla transformer, for educational purposes.

Tokenizer #2

Open eduardoleao052 opened 9 months ago

eduardoleao052 commented 9 months ago

Have you been able to get good results with the tokenization? I've been using a regex like yours to tokenize some texts for my decoder transformer, and the vocabulary size seems to blow up! I think it's because the tokenization is at the word level; maybe there's no escaping a larger vocab size.
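To make the vocab-size concern concrete, here is a minimal stdlib-only sketch (the corpus and regex are illustrative assumptions, not taken from the repo). With word-level regex tokenization, every distinct surface form, including each inflection like "cat" vs. "cats", becomes its own vocabulary entry, so the vocab keeps growing with the corpus; character-level (or subword) tokenization bounds the vocab at the cost of longer sequences.

```python
import re

# A toy corpus (hypothetical) to illustrate vocabulary growth.
corpus = [
    "the cat sat on the mat",
    "the cats sat on the mats",
    "a dog sits on a log",
]

# Word-level: each distinct surface form ("cat", "cats", "mat", "mats")
# is a separate vocabulary entry, so the vocab grows with the corpus.
word_vocab = {tok for text in corpus for tok in re.findall(r"\w+|[^\w\s]", text)}

# Character-level: the vocab is bounded by the alphabet,
# at the cost of much longer token sequences.
char_vocab = {ch for text in corpus for ch in text}

print(sorted(word_vocab))
print(len(word_vocab), len(char_vocab))  # word vocab: 11 entries, incl. both "cat" and "cats"
```

Subword schemes like BPE sit between these two extremes: frequent words stay whole, rare words are split into reusable pieces, keeping the vocabulary at a fixed size.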

RahulBhalley commented 7 months ago

I don't know much about text pre-processing or transformers (I studied them years ago), but I think OpenAI's tiktoken library is the way to go for tokenization.
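tiktoken implements byte-pair encoding (BPE), which directly addresses the vocab blow-up above: it starts from small units and greedily merges the most frequent adjacent pair, so the vocabulary size is capped by the number of merges. Below is a toy, pure-stdlib sketch of the merge loop on a classic example corpus; it is only an illustration of the idea, not tiktoken's actual byte-level implementation.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(w): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for i in range(len(word) - 1):
                pairs[(word[i], word[i + 1])] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        merged = {}
        for word, freq in vocab.items():
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = freq
        vocab = merged
    return merges

words = ["low", "low", "lower", "lowest", "newest", "newest"]
print(bpe_train(words, 4))  # → [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

After training, encoding a new word just replays the learned merges, so frequent strings like "low" become single tokens while rare words fall back to smaller pieces; the vocab never grows past `initial symbols + num_merges`.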

eduardoleao052 commented 7 months ago

I see. I've been trying to study tokenization a bit more lately, so thanks for the tiktoken tip! If you don't mind me asking, what have you moved on to in terms of interests after learning about transformers and such?

RahulBhalley commented 7 months ago

I have moved on to the production side of deep learning for freelance projects, so I'm relying on pre-trained models only. I know it's wrong to just build upon what others have built without studying it, but it's a lot less stressful and frees up more time than trying to keep up with all the new stuff in detail. @eduardoleao052

eduardoleao052 commented 7 months ago

That's cool! I guess it's natural, after studying something from a theoretical standpoint, to want to move on to the practical side of things.

RahulBhalley commented 7 months ago

Yes. 🙂