Closed: bedapisl closed this issue 3 years ago
Looking at your example, you never saved nboost/pt-tinybert-msmarco on huggingface.co. Did you do that?
The default way to load from a remote repo is AutoTokenizer.from_pretrained('xxx'), so your code should work.
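For reference, a minimal sketch of loading from the Hub (repo name taken from this issue; this assumes the repo actually exists there and is public):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("nboost/pt-tinybert-msmarco")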
Since I still had this issue, here is what worked for me:
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast
files = ["/data/deu10K.txt"]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(files, trainer)
# this is probably optional, but still nice to have and was mentioned in the docs
# https://huggingface.co/docs/tokenizers/quicktour#build-a-tokenizer-from-scratch
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
# just to save it, too.
tokenizer.save("/tmp/WSGerBERT.json")
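# (the raw JSON saved here can also be reloaded later with tokenizers.Tokenizer.from_file("/tmp/WSGerBERT.json"))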
# ----
# and the magic from above to add support for `transformers`
full_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# I needed to set those manually to have them picked up ...
full_tokenizer.unk_token = "[UNK]"
full_tokenizer.cls_token = "[CLS]"
full_tokenizer.sep_token = "[SEP]"
full_tokenizer.pad_token = "[PAD]"
full_tokenizer.mask_token = "[MASK]"
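# (alternatively, these special tokens can be passed as keyword arguments to the constructor,
#  e.g. PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="[UNK]", cls_token="[CLS]", ...))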
full_tokenizer.save_pretrained("/tmp/WSGerBERT", legacy_format=False)
Now I can load it as usual:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("/tmp/WSGerBERT")
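As a quick sanity check (the input sentence is just an example), this should show that the [CLS]/[SEP] template from the post-processor survives the round trip:
enc = tokenizer("Das ist ein Test.")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# expected to start with [CLS] and end with [SEP]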
Hello, I would like to create a custom tokenizer, save it, and be able to load it the same way pretrained tokenizers from the Transformers library are loaded. Right now I have this script:
But it fails with:
If I change PreTrainedTokenizerFast to AutoTokenizer, it also fails:
Is it possible to save a pretrained tokenizer such that it can be loaded in the same way as the HuggingFace tokenizers?