huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

can't load checkpoint file from examples/run_language_modeling.py #4338

Closed rfernand2 closed 4 years ago

rfernand2 commented 4 years ago

🐛 Bug

Information

Model I am using (Bert, XLNet ...): GPT2
Language I am using the model on (English, Chinese ...): English

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. python ../examples/language-modeling/run_language_modeling.py ^
       --output_dir=output ^
       --overwrite_output_dir ^
       --tokenizer=gpt2 ^
       --model_type=gpt2 ^
       --model_name_or_path=output/pytorch.pytorch_model.bin ^
       --do_eval ^
       --per_gpu_eval_batch_size=1 ^
       --eval_data_file=%userprofile%/.data/wikitext-2/wikitext-2/wiki.test.tokens

This gives an error because "model_name_or_path" is assumed to point to a JSON file containing pretrained model info, not to a saved checkpoint file. The error occurs when the script tries to load the CONFIG file associated with a pretrained model.

I also tried adding a new "model_checkpoint" argument and passing it to AutoModelWithLMHead.from_pretrained(), but that ends in a model/checkpoint mismatch (the hidden size appears to be 256 in the checkpoint file but 768 in the current model). I have never changed the hidden size; I just ran with the "do-train" option, which saved my checkpoints to the output directory. Now I am only trying to verify that I can evaluate a checkpoint and also continue training from one.

Expected behavior

I expected to be able to pass a checkpoint_path argument to run_language_modeling.py that would load the checkpoint file and let me continue training from it and/or evaluate it.

Environment info

julien-c commented 4 years ago

--model_name_or_path should be a folder, so you should use just ./output instead.
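A minimal sketch of the fix described above, assuming the only problem is that a weights file was passed where a folder is expected. The helper name `resolve_model_dir` is hypothetical and is not part of run_language_modeling.py or the transformers library; it only normalizes the path before it is handed to `--model_name_or_path`.

```python
from pathlib import Path


def resolve_model_dir(model_name_or_path: str) -> str:
    """Return a value that points at a folder rather than a weights file.

    Hypothetical helper (not part of transformers): if the argument is an
    existing file such as output/pytorch_model.bin, return its parent
    folder. Anything else (a model name like "gpt2", or a folder) is
    returned unchanged.
    """
    path = Path(model_name_or_path)
    if path.is_file():
        return str(path.parent)
    return model_name_or_path
```

With this, a value such as `output/pytorch_model.bin` would become `output`, which matches the advice to pass `./output`.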

rfernand2 commented 4 years ago

Thanks. Verified - that fixed it. Please add a note in the README.md to explain this. Thanks.

vincentwen1995 commented 4 years ago

Hi, may I ask how you got these checkpoint files? I tried to specify the path to the checkpoint folder that the script generates during training (containing config.json, optimizer.pt, pytorch_model.bin, scheduler.pt, training_args.bin), but I got a traceback like this:

Traceback (most recent call last):
  File "run_language_modeling.py", line 277, in <module>
    main()
  File "run_language_modeling.py", line 186, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=model_args.cache_dir)
  File "H:\Anaconda3\envs\env_name\lib\site-packages\transformers\tokenization_auto.py", line 203, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "H:\Anaconda3\envs\env_name\lib\site-packages\transformers\tokenization_utils.py", line 902, in from_pretrained
    return cls._from_pretrained(*inputs, **kwargs)
  File "H:\Anaconda3\envs\env_name\lib\site-packages\transformers\tokenization_utils.py", line 1007, in _from_pretrained
    list(cls.vocab_files_names.values()),
OSError: Model name 'C:\\path-to-ckpt\\checkpoint-17500' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). We assumed 'C:\\path-to-ckpt\\checkpoint-17500' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

which technically says that the checkpoint folder is missing some other files (the vocabulary files). I wonder where this mismatch comes from, given that I used the same script to train.
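One way to see the mismatch is to compare the folder contents with the files the loaders ask for. The sketch below is only an illustration: `missing_files` is a hypothetical helper, and the expected file names are taken from the comment and the error message above (config.json, pytorch_model.bin, vocab.txt), not from an authoritative list.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(checkpoint_dir: str, expected: Iterable[str]) -> List[str]:
    """List the expected file names that are absent from checkpoint_dir.

    Hypothetical helper: `expected` is supplied by the caller, for example
    the vocabulary file named in the OSError message.
    """
    folder = Path(checkpoint_dir)
    return [name for name in expected if not (folder / name).is_file()]
```

For the folder described above, calling `missing_files(r"C:\path-to-ckpt\checkpoint-17500", ["config.json", "pytorch_model.bin", "vocab.txt"])` would be expected to report `vocab.txt`, since the listed contents include no vocabulary file.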

prashantmore277 commented 2 years ago

For those who are new to this issue: I just figured it out, so this should save you some time 😜😀

What is this error about? ==> The first time you run the model, it downloads some files (such as pytorch_model.bin). If your internet connection drops partway through, the pipeline keeps running without pytorch_model.bin having been downloaded completely, which raises this error.

Steps:

  1. Go to C:\Users\UserName\.cache
  2. Delete the .cache folder
  3. Done. Just run the model once again.
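The steps above can be sketched in Python as follows. This is a cautious illustration, not a recommended tool: `clear_cache_dir` is a hypothetical helper, the folder to delete must be passed explicitly, and it defaults to a dry run because deleting a whole .cache folder also removes files cached by other programs.

```python
import shutil
from pathlib import Path


def clear_cache_dir(cache_dir: str, dry_run: bool = True) -> bool:
    """Delete cache_dir so the files are downloaded again on the next run.

    Hypothetical helper. Returns False if the folder does not exist.
    With dry_run=True (the default) nothing is deleted; the return value
    only reports that the folder was found.
    """
    path = Path(cache_dir)
    if not path.is_dir():
        return False
    if not dry_run:
        # Removes the folder and everything inside it.
        shutil.rmtree(path)
    return True
```

Pointing the helper at a narrower subfolder that holds only the model downloads, rather than the entire .cache folder, would limit what gets removed.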

You can connect me through @prashantmore999 { Twitter }