huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Saving tokenizer such that it is loadable by Transformers library #694

Closed: bedapisl closed this issue 3 years ago

bedapisl commented 3 years ago

Hello, I would like to create a custom tokenizer, save it, and be able to load it in a way that also works for pretrained tokenizers from the Transformers library. Right now I have this script:

from transformers import AutoTokenizer, PreTrainedTokenizerFast
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def load_tokenizer(path):
    return PreTrainedTokenizerFast.from_pretrained(path)

def main():
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[PAD]", "[MASK]"])

    tokenizer.train(['/data/sentiment_analysis/raw_data/delete.txt'], trainer=trainer)

    tokenizer.enable_truncation(max_length=256)

    full_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
    full_tokenizer.pad_token = "[PAD]"

    full_tokenizer.save_pretrained("./test_tokenizer_save", legacy_format=False)

    # test 1
    load_tokenizer('./test_tokenizer_save')

    # test 2
    load_tokenizer('nboost/pt-tinybert-msmarco')

    # test 3
    load_tokenizer('distilbert-base-multilingual-cased')

if __name__ == "__main__":
    main()

But it fails with:

Traceback (most recent call last):
  File "test_tokenizers.py", line 34, in <module>
    main()
  File "test_tokenizers.py", line 27, in main
    load_tokenizer('nboost/pt-tinybert-msmarco')
  File "test_tokenizers.py", line 6, in load_tokenizer
    return PreTrainedTokenizerFast.from_pretrained(path)
  File "/home/pislb/.local/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1698, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'nboost/pt-tinybert-msmarco'. Make sure that:

- 'nboost/pt-tinybert-msmarco' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'nboost/pt-tinybert-msmarco' is the correct path to a directory containing relevant tokenizer files

If I change PreTrainedTokenizerFast to AutoTokenizer, it also fails:

  File "test_tokenizers.py", line 34, in <module>
    main()
  File "test_tokenizers.py", line 18, in main
    full_tokenizer = AutoTokenizer(tokenizer_object=tokenizer)
TypeError: __init__() got an unexpected keyword argument 'tokenizer_object'

Is it possible to save a pretrained tokenizer such that it can be loaded in the same way as the HuggingFace tokenizers?

Narsil commented 3 years ago

Looking at your example, you never saved nboost/pt-tinybert-msmarco to huggingface.co. Did you? The default way to load from a remote repo is AutoTokenizer.from_pretrained('xxx') (your code should work).
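For reference, a minimal sketch of the two load paths, reusing the local directory and a public model id from the script above (assuming a reasonably recent transformers version). Note that AutoTokenizer is a factory class meant to be used only through from_pretrained; instantiating it directly is what triggers the TypeError above, because tokenizer_object is a keyword argument of PreTrainedTokenizerFast, not of AutoTokenizer:

from transformers import AutoTokenizer

# Load from a remote repo listed on https://huggingface.co/models
remote_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Load from a local directory written by save_pretrained(...)
local_tokenizer = AutoTokenizer.from_pretrained("./test_tokenizer_save")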

Querela commented 2 years ago

Since I still ran into this issue, here is what worked for me:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

from transformers import PreTrainedTokenizerFast

files = ["/data/deu10K.txt"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(files, trainer=trainer)

# this is probably optional, but still nice to have and was mentioned in the docs
# https://huggingface.co/docs/tokenizers/quicktour#build-a-tokenizer-from-scratch
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
# just to save it, too.
tokenizer.save("/tmp/WSGerBERT.json")

# ----
# and the magic from above to add support for `transformers`

full_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# I needed to set those manually to have them picked up ...
full_tokenizer.unk_token = "[UNK]"
full_tokenizer.cls_token = "[CLS]"
full_tokenizer.sep_token = "[SEP]"
full_tokenizer.pad_token = "[PAD]"
full_tokenizer.mask_token = "[MASK]"
full_tokenizer.save_pretrained("/tmp/WSGerBERT", legacy_format=False)
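
For what it's worth, legacy_format=False makes save_pretrained write the whole tokenizer as a single tokenizer.json (alongside tokenizer_config.json and special_tokens_map.json) instead of the legacy vocab files. A quick sketch to check what was written (the exact file list may vary between transformers versions):

import os

# Typically: special_tokens_map.json, tokenizer.json, tokenizer_config.json
print(sorted(os.listdir("/tmp/WSGerBERT")))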

Now I can load it as usual:

import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("/tmp/WSGerBERT")
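
Continuing from the snippet above, a short sanity check (a sketch; the sample sentence is arbitrary and the exact subword tokens depend on the training data) that the special tokens and the TemplateProcessing template survived the round trip:

# The post-processor should wrap every encoded sequence in [CLS] ... [SEP]
encoded = tokenizer("Guten Tag!")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

# The manually assigned special tokens should also be restored
print(tokenizer.unk_token, tokenizer.pad_token, tokenizer.mask_token)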