richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Use different tokenizer (and specify special tokens) #33

Closed ghost closed 2 years ago

ghost commented 2 years ago

Thank you for this great repository. It really is a huge help. There is one thing, however, that I cannot figure out on my own: I would like to train ELECTRA for a different language and therefore use another tokenizer. Unfortunately, I cannot find where I can change the IDs of the special tokens. I trained a BPE tokenizer with `"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "<mask>": 4, ...`, but the model seems to assume that these special tokens have the IDs 100, 101, 102, and 103. Could you please tell me where I can override this assumption? Sorry if this is a basic question, but I could not find it myself. Thank you very much in advance.
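*Editor's note (not from the original thread, added for context):* the IDs 100–103 are the `[UNK]`, `[CLS]`, `[SEP]`, and `[MASK]` entries in the WordPiece vocabulary shipped with the official `google/electra-*` checkpoints; the pretraining code derives its special-token IDs from the huggingface tokenizer object it loads, which is what the follow-up comment below confirms. A minimal sketch of wrapping a custom BPE tokenizer so the IDs come from its own vocabulary instead; the file path and token strings here are assumptions based on the question, not values from the repo:

```python
from transformers import PreTrainedTokenizerFast

# Hypothetical path: assumes the trained BPE tokenizer was saved as tokenizer.json.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-bpe/tokenizer.json",
    pad_token="<pad>",
    unk_token="<unk>",
    mask_token="<mask>",
    cls_token="<s>",   # reuse <s>/</s> as the sequence delimiters
    sep_token="</s>",
)

# The special-token IDs now come from the custom vocab (0-4 in the question)
# instead of the ELECTRA/BERT WordPiece defaults (100-103).
print(hf_tokenizer.pad_token_id, hf_tokenizer.cls_token_id,
      hf_tokenizer.sep_token_id, hf_tokenizer.mask_token_id)
```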

ghost commented 2 years ago

Okay, I found the solution: it is possible to simply replace the tokenizer with the custom tokenizer, but I also had to delete the cache, which still contained the IDs produced by the old tokenizer.
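*Editor's note (not part of the original comment):* the cache matters because the pipeline stores the already-tokenized corpus on disk, so swapping the tokenizer alone leaves the old token IDs baked into the cached data. A minimal sketch of the cleanup step; the cache location is an assumption and depends on how your setup stores the processed dataset:

```python
import shutil
from pathlib import Path

# Hypothetical cache location; adjust to wherever your run caches the
# already-tokenized corpus (e.g. the huggingface `datasets` cache directory).
CACHE_DIR = Path("./datasets")

# Removing the cache forces the corpus to be re-tokenized with the new
# tokenizer (and hence the new special-token IDs) on the next run.
if CACHE_DIR.exists():
    shutil.rmtree(CACHE_DIR)
```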