Closed shampp closed 3 years ago
Hi! The issue here is that the AutoTokenizer cannot tell what type your tokenizer is: it looks for the model_type specified in the config.json, but it seems it cannot find it.
Could you show us the output of ls ../Data/tokenizer/ and, if the file config.json is in it, the exact content of that JSON file?
Thanks a lot!
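For anyone who wants to run the same check locally, here is a minimal standard-library sketch of the inspection requested above. It only imitates the two rules named in the error message (a model_type key in config.json, or a known string in the path name) and is not the library's implementation. The function name diagnose_tokenizer_dir and the shortened KNOWN_MODEL_TYPES tuple are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path

# Hypothetical, shortened subset of the strings listed in the error message.
# Longer names come first so that "distilbert" is not reported as "bert".
KNOWN_MODEL_TYPES = ("distilbert", "roberta", "bert", "gpt2", "t5")


def diagnose_tokenizer_dir(path):
    """Report what a model_type lookup would find in `path` (illustrative only)."""
    directory = Path(path)
    config_file = directory / "config.json"
    report = {
        # Equivalent of `ls`: the file names present in the directory.
        "files": sorted(p.name for p in directory.iterdir()) if directory.is_dir() else [],
        "has_config": config_file.is_file(),
        "model_type": None,
        "name_match": None,
    }
    if report["has_config"]:
        try:
            config = json.loads(config_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            config = {}
        if isinstance(config, dict):
            report["model_type"] = config.get("model_type")
    # Fallback rule from the error message: a known string in the path name.
    report["name_match"] = next(
        (name for name in KNOWN_MODEL_TYPES if name in str(path)), None
    )
    return report


if __name__ == "__main__":
    print(diagnose_tokenizer_dir("../Data/tokenizer"))
```

If both model_type and name_match come back as None, neither rule from the error message is satisfied for that directory.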
I expected the config.json and vocabulary files to be saved by running bert_tokenizer.save(vocab_file) (please check the attached code). Unfortunately, it saves a JSON file containing only the vocabulary. I tried bert_tokenizer.save_model, but got an error saying that Tokenizer has no such function. So there are no configuration files, only a vocabulary JSON file. If I pass a directory path to bert_tokenizer.save, I get the error Exception: Is a directory (os error 21).
The bert_tokenizer.save(vocab_file) method does not save the configuration, because the configuration is linked to the model. Unfortunately, it is currently impossible to use the AutoTokenizer without having the model's config.json in the same folder; this is a hard limitation of the AutoTokenizer.
We are aware of this limitation and it is part of the immediate roadmap. Expect a change in the coming weeks related to that issue.
Thank you for your understanding.
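Until that change lands, the limitation described above suggests one possible stopgap: place a copy of the model's config.json next to the tokenizer files. The sketch below does only that file copy with the standard library. The function name copy_model_config is hypothetical, the directory layout is taken from the command in the original report, and whether the tokenizer files themselves then load is a separate question that has not been verified here.

```python
import json
import shutil
from pathlib import Path


def copy_model_config(config_dir, tokenizer_dir):
    """Copy config.json from the model config folder into the tokenizer folder.

    Hypothetical helper: refuses to copy a config without a model_type key,
    since that key is what the lookup described above depends on.
    """
    source = Path(config_dir) / "config.json"
    target = Path(tokenizer_dir) / "config.json"
    config = json.loads(source.read_text(encoding="utf-8"))
    if "model_type" not in config:
        raise ValueError(f"{source} has no model_type key")
    Path(tokenizer_dir).mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    # Paths assumed from the run_mlm.py command in the original report.
    print(copy_model_config("../Data/config", "../Data/tokenizer"))
```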
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @LysandreJik! Has there been any change on this subject?
There hasn't been any change - but we've been freeing some time to work on this subject. I would expect this to be resolved in 2 or 3 weeks.
Awesome, thanks a lot for your reply :)
I also encountered this problem. How can I solve it?
Using a recent version of the library should now work for these use-cases.
Could you try using the master branch to see if it fixes your issue? You should use it both to save your tokenizer and to load it in the script. If it doesn't work, please provide the code you're using as well as the full stack trace. Thank you!
Environment info
transformers version: 4.3.2
Who can help
@LysandreJik, @n1t0
Information
Model I am using (Bert):
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
python run_mlm.py --output_dir=../Data/model/ --model_type=bert --mlm_probability 0.1 --tokenizer_name=../Data/tokenizer --learning_rate 1e-4 --do_train --train_file ../Data/corpus.txt --gradient_accumulation_steps=4 --num_train_epochs 100 --per_gpu_train_batch_size 2 --save_steps 50000 --seed 42 --config_name=../Data/config/ --line_by_line --do_eval --max_seq_length=8 --logging_steps 5000 --validation_split_percentage 20 --save_steps 50000 --save_total_limit 10
My training configuration is as follows:
{
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 256,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 1536,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.3.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 25000
}
Expected behavior
I am getting the error
loading configuration file ../Data/tokenizer/config.json
Traceback (most recent call last):
  File "run_mlm.py", line 457, in <module>
    main()
  File "run_mlm.py", line 276, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, **tokenizer_kwargs)
  File ".../lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 362, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File ".../lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 379, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in ../Data/tokenizer. Should have a model_type key in its config.json, or contain one of the following strings in its name: wav2vec2, convbert, led, blenderbot-small, retribert, mt5, t5, mobilebert, distilbert, albert, bert-generation, camembert, xlm-roberta, pegasus, marian, mbart, mpnet, bart, blenderbot, reformer, longformer, roberta, deberta, flaubert, fsmt, squeezebert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm-prophetnet, prophetnet, xlm, ctrl, electra, encoder-decoder, funnel, lxmert, dpr, layoutlm, rag, tapas