AngledLuffa closed this issue 6 months ago.
I may have figured out how to build a BertTokenizerFast. Basically, you just need to wrap the BertWordPieceTokenizer in a BertTokenizerFast before saving:

```python
from transformers import BertTokenizerFast

# `tokenizer` is the BertWordPieceTokenizer trained with the tokenizers library
new_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained("zzzzz")
```

Now the zzzzz directory can be loaded for training the new Bert model.
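For example, loading it back in should just be (assuming transformers is what will consume it):

```python
from transformers import BertTokenizerFast

# save_pretrained wrote out the tokenizer config and special tokens map
# alongside the vocab, so from_pretrained can reconstruct the whole thing
tokenizer = BertTokenizerFast.from_pretrained("zzzzz")
```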
Glad that you found the answer and sorry for not helping earlier!
If I run the following, then try to reload the tokenizer using from_file, I get an error of sep_token not being part of the vocabulary.
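Roughly like this (the file names and training parameters here are stand-ins, but this is the shape of it):

```python
from tokenizers import BertWordPieceTokenizer

# train a WordPiece vocab from scratch -- no vocab passed to the constructor
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["train.txt"], vocab_size=30000)

# save the full tokenizer definition as JSON
tokenizer.save("zzzzz/tokenizer.json")

# reloading this way raises: sep_token not found in the vocabulary
tokenizer = BertWordPieceTokenizer.from_file("zzzzz/tokenizer.json")
```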
I can see that some of the necessary steps, such as adding a post_processor, occur if the vocab is already specified. Surely I'm not supposed to pass in a vocab before training, though... What about adding a post_processor in some way? Except I don't see any way to get the special token ids out of the tokenizer, either before or after it's created.

Am I expected to pass in the vocab before creating the Tokenizer? Am I supposed to add the sep_token manually, or manually create the post_processor?

If I look at the .json file written out, I can see there is even a [SEP] token in there, but it's not listed as sep_token.
Even if I manually add sep_token to the json, it doesn't work. So... if I then read the tokenizer file back in with the same code path used in from_file, such as the snippet below, it doesn't properly read the tokenizer vocabulary, so probably this isn't how I'm supposed to do it.
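That is, mirroring what from_file does internally (WordPiece.read_file followed by the constructor), pointed at the saved .json:

```python
from tokenizers import BertWordPieceTokenizer
from tokenizers.models import WordPiece

# read_file expects a plain one-token-per-line vocab file, so pointing it
# at the saved tokenizer .json yields a nonsense vocabulary
vocab = WordPiece.read_file("zzzzz/tokenizer.json")
tokenizer = BertWordPieceTokenizer(vocab)
```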
If I use the results of save_model from above, the file zzzzz/test-vocab.txt, then it successfully loads the tokenizer back in. However, that directory only has the one file in it, and interesting pieces like the postprocessor have all been lost in the process. Is there something I'm doing wrong, or some bug with the save / load process in this tokenizer?
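That is, a round trip like this loads without an error (assuming save_model was given the "test" prefix, which is where zzzzz/test-vocab.txt came from):

```python
from tokenizers import BertWordPieceTokenizer

# save_model only writes the vocabulary, as zzzzz/test-vocab.txt
tokenizer.save_model("zzzzz", "test")

# this loads fine, since from_file expects exactly that plain vocab file,
# but anything beyond the vocabulary is not restored from disk
tokenizer = BertWordPieceTokenizer.from_file("zzzzz/test-vocab.txt")
```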
As a random aside, after creating the tokenizer, BertTokenizerFast and the like can be called directly on a text, such as tokenizer(text). With the BertWordPieceTokenizer, it simply isn't possible. What should I call instead?
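Concretely, the difference I mean (the variable names here are only for illustration):

```python
# a transformers tokenizer is directly callable on text
batch = bert_tokenizer_fast("some text")             # works, returns input_ids etc.

# the tokenizers BertWordPieceTokenizer is not callable
enc = word_piece_tokenizer("some text")              # TypeError
enc = word_piece_tokenizer.encode("some text")       # encode() runs, but is it the intended API?
```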
Is it possible I'm just not supposed to use BertWordPieceTokenizer, and BertTokenizerFast should be used instead? If so, how do I train that? AutoTokenizer doesn't have any way to load the zzzzz directory from above, since it only has the vocab.txt file in it and no config file.
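For example, this fails with a complaint about the missing config file:

```python
from transformers import AutoTokenizer

# the zzzzz directory only contains the bare vocab file from save_model,
# so there is no config for AutoTokenizer to dispatch on
tokenizer = AutoTokenizer.from_pretrained("zzzzz")
```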