huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Custom tokenizer with run_mlm script #10346

Closed shampp closed 3 years ago

shampp commented 3 years ago

Environment info

Who can help

@LysandreJik, @n1t0

Information

Model I am using (Bert):

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. I followed the official example (https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#example) to train and save a BERT WordPiece tokenizer on a custom corpus.
  2. I used this tokenizer to train a BERT model from scratch with the run_mlm script:

python run_mlm.py \
    --output_dir=../Data/model/ \
    --model_type=bert \
    --mlm_probability 0.1 \
    --tokenizer_name=../Data/tokenizer \
    --learning_rate 1e-4 \
    --do_train \
    --train_file ../Data/corpus.txt \
    --gradient_accumulation_steps=4 \
    --num_train_epochs 100 \
    --per_gpu_train_batch_size 2 \
    --save_steps 50000 \
    --seed 42 \
    --config_name=../Data/config/ \
    --line_by_line \
    --do_eval \
    --max_seq_length=8 \
    --logging_steps 5000 \
    --validation_split_percentage 20 \
    --save_steps 50000 \
    --save_total_limit 10

# Tokenizer training script used in step 1
import pandas as pd
from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

vocab_file = '../Data/tokenizer/config.json'  # the tokenizer JSON is written to this path
corpus_file = '../Data/corpus.txt'
df = pd.read_csv(corpus_file)

# WordPiece model with accent stripping and whitespace pre-tokenization
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
bert_tokenizer.pre_tokenizer = Whitespace()

# Wrap single sentences and sentence pairs in [CLS] / [SEP]
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

trainer = WordPieceTrainer(
    vocab_size=25000,
    min_frequency=3,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
bert_tokenizer.train_from_iterator(df.query_text.to_list(), trainer)
bert_tokenizer.save(vocab_file)

My training configuration is as follows:

{
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 1536,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 25000
}

Expected behavior

I am getting the error

loading configuration file ../Data/tokenizer/config.json
Traceback (most recent call last):
  File "run_mlm.py", line 457, in <module>
    main()
  File "run_mlm.py", line 276, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File ".../lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 362, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File ".../lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 379, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in ../Data/tokenizer. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: wav2vec2, convbert, led, blenderbot-small, retribert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas

LysandreJik commented 3 years ago

Hi! The issue here is that AutoTokenizer cannot tell what type your tokenizer is: it looks for the model_type specified in config.json, but it seems it cannot find it.

Could you show us the results of ls ../Data/tokenizer/, and if the file config.json is in it, could you show us the exact content of the JSON file?
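The check requested above can be sketched with the standard library alone. This is only an illustration: the helper name inspect_tokenizer_dir is hypothetical and not part of transformers. It lists the files in the folder and reports whether config.json has a top-level model_type key, which is what the error message says is missing.

```python
import json
from pathlib import Path


def inspect_tokenizer_dir(path):
    """Hypothetical helper: summarize what a tokenizer directory contains."""
    directory = Path(path)
    report = {
        "files": sorted(p.name for p in directory.iterdir()),
        "has_config": False,
        "model_type": None,
    }
    config_path = directory / "config.json"
    if config_path.is_file():
        report["has_config"] = True
        with config_path.open(encoding="utf-8") as handle:
            content = json.load(handle)
        # A model config is a JSON object with a top-level "model_type" key.
        if isinstance(content, dict):
            report["model_type"] = content.get("model_type")
    return report
```

Calling inspect_tokenizer_dir("../Data/tokenizer") on the folder from this issue would be expected to report a config.json whose model_type is None, assuming the file holds only the tokenizer data written by bert_tokenizer.save.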

Thanks a lot!

shampp commented 3 years ago

I expected both config.json and the vocabulary files to be saved by running bert_tokenizer.save(vocab_file) (please check the attached code). Unfortunately, it saves a single JSON file containing only the vocabulary. I tried bert_tokenizer.save_model, but got an error saying that Tokenizer has no such method. So there are no configuration files, only a vocabulary JSON file. If I pass a directory path to bert_tokenizer.save, I get the error Exception: Is a directory (os error 21).

LysandreJik commented 3 years ago

The bert_tokenizer.save(vocab_file) method does not save the configuration, because the configuration belongs to the model. Unfortunately, it is currently impossible to use AutoTokenizer without the model's config.json in the same folder; this is a hard limitation of AutoTokenizer.
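As a rough sketch of what "the model config.json in the same folder" could look like in practice, the snippet below writes a model configuration containing model_type into the tokenizer folder. The helper write_model_config is hypothetical, and the dictionary is a subset of the configuration quoted earlier in this issue. It only addresses the missing model_type key from the traceback; whether AutoTokenizer can then load the tokenizer file itself depends on the library version.

```python
import json
from pathlib import Path

# Subset of the model configuration quoted in this issue.
BERT_CONFIG = {
    "architectures": ["BertForMaskedLM"],
    "model_type": "bert",
    "hidden_size": 128,
    "num_attention_heads": 4,
    "num_hidden_layers": 4,
    "vocab_size": 25000,
}


def write_model_config(tokenizer_dir, config):
    """Hypothetical helper: place a model config.json in the tokenizer folder."""
    if "model_type" not in config:
        raise ValueError("config must contain a 'model_type' key")
    directory = Path(tokenizer_dir)
    directory.mkdir(parents=True, exist_ok=True)
    config_path = directory / "config.json"
    with config_path.open("w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
    return config_path
```

With this layout the tokenizer has to be saved under a different file name, for example bert_tokenizer.save("../Data/tokenizer/tokenizer.json"), so that it does not overwrite config.json.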

We are aware of this limitation and it is part of the immediate roadmap. Expect a change in the coming weeks related to that issue.

Thank you for your understanding.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

manueltonneau commented 3 years ago

Hi @LysandreJik! Has there been any change on this subject?

LysandreJik commented 3 years ago

There hasn't been any change, but we've been freeing up some time to work on this subject. I would expect this to be resolved in 2 or 3 weeks.

manueltonneau commented 3 years ago

Awesome, thanks a lot for your reply :)

hongjianyuan commented 3 years ago

I also encountered this problem. How can I solve it?

LysandreJik commented 3 years ago

Using a recent version of the library should now work for these use-cases.

Could you try using the master branch to see if it fixes your issue? You should use it to both save your tokenizer, as well as to load it in the script. If it doesn't work, please provide the code you're using as well as the full stack trace. Thank you!