gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked through your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of reusable functions I created in previous projects. I also share some notebook tutorials and Python code snippets.
https://gmihaila.github.io
Apache License 2.0
254 stars, 61 forks

Tutorials: Pretraining Transformers #15

Closed: jbmaxwell closed this issue 3 years ago

jbmaxwell commented 3 years ago

First of all, thanks for writing this notebook—it's been a huge help!

I have an unusual situation, in that I have a small, hand-defined vocabulary for a very specific purpose. For this reason, I've been using the BertWordPieceTokenizer for everything (whether MLM or CLM), and loading it with my fixed vocab file. Is there a way I can do this with your notebook?

Thanks in advance.

jbmaxwell commented 3 years ago

I got it worked out... I hadn't realized that BertTokenizerFast gave me an option to load my vocab directly... ugh...
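
For anyone with the same setup, here is a minimal sketch of loading a fixed, hand-defined vocabulary straight into `BertTokenizerFast` through its `vocab_file` argument, with no tokenizer training step. The thread does not show the code that was actually used, so the helper names (`write_vocab`, `load_fixed_vocab_tokenizer`), the example tokens, and the `do_lower_case=False` setting are all assumptions for illustration.

```python
from pathlib import Path

# BERT-style tokenizers expect these special tokens to exist in the vocab file.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def write_vocab(tokens, path):
    """Write one token per line; a token's line index becomes its id.

    Hypothetical helper: special tokens go first, duplicates are dropped
    while keeping the original order.
    """
    vocab = list(dict.fromkeys(SPECIAL_TOKENS + list(tokens)))
    Path(path).write_text("\n".join(vocab) + "\n", encoding="utf-8")
    return vocab


def load_fixed_vocab_tokenizer(vocab_path):
    """Build a BertTokenizerFast directly from an existing vocab file."""
    # Imported lazily so write_vocab works even without transformers installed.
    from transformers import BertTokenizerFast

    # do_lower_case=False is an assumption: it keeps a case-sensitive,
    # hand-defined vocabulary from being lowercased before lookup.
    return BertTokenizerFast(vocab_file=str(vocab_path), do_lower_case=False)
```

With `transformers` installed, something like `write_vocab(["note_on", "note_off"], "vocab.txt")` followed by `tokenizer = load_fixed_vocab_tokenizer("vocab.txt")` should give a tokenizer that can stand in for a trained one. Where exactly it plugs into the notebook is not covered in this thread.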