Thank you for this great repository. It really is a huge help.
There is one thing, however, that I cannot figure out on my own:
I would like to train an ELECTRA model for a different language and therefore use another tokenizer.
Unfortunately, I cannot find where I can change the IDs of the special tokens. I trained a BPE tokenizer with "<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,..., but the model seems to assume that these special tokens have the IDs 100, 101, 102 and 103. Could you please tell me where I can override this assumption?
I'm really sorry for the stupid question, but I really could not find it.
Thank you very much in advance.
Okay, I found the solution:
It is possible to simply replace the tokenizer with the custom tokenizer, but I had to delete the cache, which still contained the IDs of the old tokenizer.
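For anyone hitting the same issue: a minimal sketch of training a BPE tokenizer with the HuggingFace `tokenizers` library so the special tokens land on the IDs listed above. The `BpeTrainer` assigns the tokens in `special_tokens` the first vocabulary IDs in the order given, so the ID layout is fixed at training time rather than patched afterwards. The corpus here is a toy placeholder, not part of the original setup; the IDs 100–103 the model assumed are the defaults of a BERT-style WordPiece vocabulary, which is why a fresh BPE vocabulary needs the cache cleared.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer; <unk> must match the special token below.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

# Special tokens get IDs 0..4 in exactly this order.
trainer = BpeTrainer(
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"]
)

# Toy corpus purely for illustration; use your real training data.
corpus = ["a tiny toy corpus", "just to illustrate the special token ids"]
tokenizer.train_from_iterator(corpus, trainer)

# Verify the layout before wiring the tokenizer into the model.
print(tokenizer.token_to_id("<pad>"))   # 1
print(tokenizer.token_to_id("<mask>"))  # 4
```

Checking `token_to_id` for each special token before training the model is a cheap way to catch exactly the mismatch described above, instead of discovering it via garbage predictions later.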