huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Automatically loading vocab files #59

Closed: phosseini closed this issue 4 months ago

phosseini commented 4 years ago

It would be nice if the vocab files were automatically downloaded when they don't already exist. It would also help to add a short note/comment in the README so that folks know they should manually download the vocab files. Specifically, when running the following line of code:

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

which results in the following error if the vocab file doesn't exist:

Exception: Error while initializing WordPiece
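Until that happens, a minimal sketch of the kind of guard one could add locally (assuming the vocab file is expected at a known path in the working directory) so the failure is explicit instead of the opaque WordPiece error:

import os

from tokenizers import BertWordPieceTokenizer

# Assumed local path; the file still has to be downloaded manually for now.
vocab_path = "bert-base-uncased-vocab.txt"

# Fail early with a readable message before handing the path to the tokenizer.
if not os.path.exists(vocab_path):
    raise FileNotFoundError(
        f"{vocab_path} not found. Please download the vocab file before creating the tokenizer."
    )

tokenizer = BertWordPieceTokenizer(vocab_path, lowercase=True)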

julien-c commented 4 years ago

Yes, I agree that this should at least be made clearer in the README, as other people have reported this as well.

loopdigga96 commented 4 years ago

Any updates on this?

aditya140 commented 4 years ago

In the meantime, these links can be used to download the vocab files for the BERT models:

'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt"
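For example, a minimal sketch (assuming the bert-base-uncased link above is still live) that downloads the vocab file once and then builds the fast tokenizer from the local copy:

import os
import urllib.request

from tokenizers import BertWordPieceTokenizer

url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
vocab_path = "bert-base-uncased-vocab.txt"

# Download the vocab file only if it is not already present locally.
if not os.path.exists(vocab_path):
    urllib.request.urlretrieve(url, vocab_path)

tokenizer = BertWordPieceTokenizer(vocab_path, lowercase=True)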

loopdigga96 commented 4 years ago

Thanks! I think it's a good idea to put these links somewhere in the tutorial.

mar-muel commented 4 years ago

Not sure if this is the best way, but as a workaround you can load the tokenizer from the transformers library and access its pretrained_vocab_files_map property, which contains all the download links (those should always be up to date).

Posting my method here, in case it's useful to anyone:

import os
import urllib.request

from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path):
    # Map of resource name -> {model name -> download URL} maintained by transformers.
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    os.makedirs(output_path, exist_ok=True)
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'
output_path = './my_local_vocab_files/'
tokenizer = AutoTokenizer.from_pretrained(model_type)
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path)
# BERT WordPiece models only ship a vocab file, so 'merges_file' is typically None here.
fast_tokenizer = BertWordPieceTokenizer(vocab_files.get('vocab_file'), vocab_files.get('merges_file'))

mrdvince commented 4 years ago

The links to the vocab files should be in the README; it took me a while to figure this out. mar-muel's function works great.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.