huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Automatically loading vocab files #59

Closed: phosseini closed this issue 4 months ago

phosseini commented 4 years ago

It would be nice if the vocab files were automatically downloaded when they don't already exist. It would also help to add a short note/comment in the README so that folks know they should manually download the vocab files. Specifically, when running the following line of code:

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

which results in the following error if the vocab file doesn't exist:

Exception: Error while initializing WordPiece
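Until that happens, a minimal sketch of the kind of guard one could add locally (assuming the vocab file is expected at a known path in the working directory) so the failure is explicit instead of the opaque WordPiece error:

import os

from tokenizers import BertWordPieceTokenizer

# Assumed local path; the file still has to be downloaded manually for now.
vocab_path = "bert-base-uncased-vocab.txt"

# Fail early with a readable message before handing the path to the tokenizer.
if not os.path.exists(vocab_path):
    raise FileNotFoundError(
        f"{vocab_path} not found. Please download the vocab file before creating the tokenizer."
    )

tokenizer = BertWordPieceTokenizer(vocab_path, lowercase=True)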

julien-c commented 4 years ago

Yes, I agree that this should at least be made clearer in the README, as other people have reported this as well.

loopdigga96 commented 4 years ago

Any updates on this?

aditya140 commented 4 years ago

In the meantime, these links can be used to download the vocab files for the BERT models:

'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt"
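For example, a minimal sketch (assuming the bert-base-uncased link above is still live) that downloads the vocab file once and then builds the fast tokenizer from the local copy:

import os
import urllib.request

from tokenizers import BertWordPieceTokenizer

url = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
vocab_path = "bert-base-uncased-vocab.txt"

# Download the vocab file only if it is not already present locally.
if not os.path.exists(vocab_path):
    urllib.request.urlretrieve(url, vocab_path)

tokenizer = BertWordPieceTokenizer(vocab_path, lowercase=True)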

loopdigga96 commented 4 years ago

Thanks! I think it's a good idea to put these links somewhere in the tutorial.

mar-muel commented 4 years ago

Not sure if this is the best way, but as a workaround you can load the tokenizer from the transformers library and access its pretrained_vocab_files_map property, which contains all the download links (those should always be up to date).

Posting my method here, in case it's useful to anyone:

import os
import urllib.request

from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

def download_vocab_files_for_tokenizer(tokenizer, model_type, output_path):
    # Map of resource name -> {model name -> download URL} maintained by transformers.
    vocab_files_map = tokenizer.pretrained_vocab_files_map
    vocab_files = {}
    os.makedirs(output_path, exist_ok=True)
    for resource in vocab_files_map.keys():
        download_location = vocab_files_map[resource][model_type]
        f_path = os.path.join(output_path, os.path.basename(download_location))
        urllib.request.urlretrieve(download_location, f_path)
        vocab_files[resource] = f_path
    return vocab_files

model_type = 'bert-base-uncased'
output_path = './my_local_vocab_files/'
tokenizer = AutoTokenizer.from_pretrained(model_type)
vocab_files = download_vocab_files_for_tokenizer(tokenizer, model_type, output_path)
# BERT WordPiece models only ship a vocab file, so 'merges_file' is typically None here.
fast_tokenizer = BertWordPieceTokenizer(vocab_files.get('vocab_file'), vocab_files.get('merges_file'))

mrdvince commented 4 years ago

The links to the vocab files should be in the README; it took me a while to figure this out. mar-muel's function works great.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.