huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Transformers 4.21.0: Can't load XLMRoberta checkpoints #18373

Closed: kkie02 closed this issue 2 years ago

kkie02 commented 2 years ago

System Info

Transformers 4.21.0

Who can help?

@LysandreJik @sgugger

Reproduction

  1. Load the model: cont_pre_model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base')
  2. Create the training arguments: cont_pre_training_args = TrainingArguments(output_dir=temp_dir, num_train_epochs=40, per_device_train_batch_size=4, save_steps=5000, logging_steps=50, save_total_limit=3, prediction_loss_only=True, evaluation_strategy='no', learning_rate=2e-5, warmup_steps=cont_pre_warmup_steps, dataloader_num_workers=0, disable_tqdm=False, gradient_accumulation_steps=8, fp16=True)
  3. Build the trainer: cont_pre_trainer = Trainer(model=cont_pre_model, args=cont_pre_training_args, train_dataset=cont_pre_dataset, data_collator=cont_pre_collator). Then start training with cont_pre_trainer.train() and let it save some checkpoints.
  4. Interrupt the run, then try to continue from a checkpoint with cont_pre_trainer.train(resume_from_checkpoint=True). (A consolidated sketch of these steps follows the list.)
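For readability, here is a minimal, self-contained sketch of the steps above. The corpus, data collator, `temp_dir`, and `cont_pre_warmup_steps` are not included in the report, so the tiny in-memory dataset and the `DataCollatorForLanguageModeling` below are assumptions used only to make the snippet runnable.

```python
from transformers import (
    AutoTokenizer,
    XLMRobertaForMaskedLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

# 1. Load the pretrained model to continue masked-language-model pretraining.
cont_pre_model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder inputs (assumptions; the real corpus and collator are not shown
# in the report).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
texts = ["Hello world.", "Continue pretraining XLM-R on domain text."]
cont_pre_dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]
cont_pre_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
temp_dir = "./cont_pre_checkpoints"
cont_pre_warmup_steps = 100

# 2. Training arguments as given in the report (fp16=True needs a GPU).
cont_pre_training_args = TrainingArguments(
    output_dir=temp_dir,
    num_train_epochs=40,
    per_device_train_batch_size=4,
    save_steps=5000,
    logging_steps=50,
    save_total_limit=3,
    prediction_loss_only=True,
    evaluation_strategy="no",
    learning_rate=2e-5,
    warmup_steps=cont_pre_warmup_steps,
    dataloader_num_workers=0,
    disable_tqdm=False,
    gradient_accumulation_steps=8,
    fp16=True,
)

# 3. Build the trainer and train; checkpoints are written to output_dir.
cont_pre_trainer = Trainer(
    model=cont_pre_model,
    args=cont_pre_training_args,
    train_dataset=cont_pre_dataset,
    data_collator=cont_pre_collator,
)
cont_pre_trainer.train()

# 4. After interrupting and re-running steps 1-3 in a later session, resuming
#    from the latest checkpoint is where 4.21.0 raises the missing
#    "lm_head.decoder.*" keys error.
cont_pre_trainer.train(resume_from_checkpoint=True)
```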

Expected behavior

Description: I used XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base') to continue pretraining. The training process is long, so I save checkpoints regularly. I did this in Google Colab. A few days ago I became unable to load any saved checkpoint with "cont_pre_trainer.train(resume_from_checkpoint=True)"; it always fails with this error: RuntimeError: Error(s) in loading state_dict for XLMRobertaForMaskedLM: Missing key(s) in state_dict: "lm_head.decoder.weight", "lm_head.decoder.bias".

The reason: the state_dict being loaded for XLMRobertaForMaskedLM does not contain "lm_head.decoder.weight" or "lm_head.decoder.bias", and since a PyTorch module's state_dict (an OrderedDict) is loaded strictly by default, it complains about the missing keys. Maybe a call such as load_state_dict(state_dict, strict=False) should be used somewhere.
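For illustration only, here is a minimal, model-agnostic PyTorch sketch of the failure mode described above: a head whose weight is tied to the embedding, a checkpoint that omits the tied keys, a strict load that raises, and `strict=False` suppressing the complaint. This is not the actual Trainer code path, just the underlying PyTorch behavior.

```python
import torch.nn as nn

# Toy model with a decoder head whose weight is tied to the embedding,
# loosely mirroring the lm_head.decoder <-> embedding tie mentioned above.
class TinyMLM(nn.Module):
    def __init__(self, vocab=10, dim=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.decoder = nn.Linear(dim, vocab)
        self.decoder.weight = self.embed.weight  # weight tying

model = TinyMLM()

# Simulate a checkpoint that dropped the tied decoder keys on save.
state_dict = model.state_dict()
state_dict.pop("decoder.weight")
state_dict.pop("decoder.bias")

fresh = TinyMLM()
try:
    fresh.load_state_dict(state_dict)  # strict=True by default
except RuntimeError as err:
    print(err)  # Missing key(s) in state_dict: "decoder.weight", "decoder.bias"

# strict=False records the missing keys instead of raising.
result = fresh.load_state_dict(state_dict, strict=False)
print(result.missing_keys)  # ['decoder.weight', 'decoder.bias']
```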

How I worked around it: I rolled Transformers back to version 4.20.1, and resuming worked again.

Problem conclusion: Transformers version 4.21.0 can't load checkpoints trained on either version 4.20.1 or version 4.21.0. (Version 4.20.1 works normally; I use it to load checkpoints trained on either 4.20.1 or 4.21.0.)

LysandreJik commented 2 years ago

cc @sgugger

sgugger commented 2 years ago

We don't support resuming training with a different version of Transformers than the one that initiated it, as that would require freezing the whole Trainer forever: any bug fix or feature added to it won't work with a resumed checkpoint.

harshit-sethi09 commented 2 years ago

I am facing the same error with Transformers version 4.21.0: the model was trained on the same Transformers version, and loading the best model after training gives this error. I am using xlm-roberta-base with AutoModelForMaskedLM.

RuntimeError: Error(s) in loading state_dict for XLMRobertaForMaskedLM: Missing key(s) in state_dict: "lm_head.decoder.weight", "lm_head.decoder.bias".
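A hedged sketch of the setup described in this comment, under the assumption that "loading the best model after training" refers to `load_best_model_at_end`; the corpus, evaluation data, and step counts are placeholders, not the commenter's actual configuration.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

# Same model family, loaded via the Auto classes as in the comment above.
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Tiny placeholder data (assumptions; the real corpus is not shown).
texts = ["some domain text", "more domain text"]
dataset = [tokenizer(t, truncation=True, max_length=64) for t in texts]
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)

# load_best_model_at_end makes the Trainer reload the best checkpoint once
# training finishes; on 4.21.0 that reload is the strict state_dict load that
# reports the missing "lm_head.decoder.*" keys.
args = TrainingArguments(
    output_dir="./mlm_out",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=1,
    save_steps=1,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    eval_dataset=dataset,
    data_collator=collator,
)
trainer.train()  # the error surfaces when the best model is reloaded at the end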

sgugger commented 2 years ago

Thanks for reporting, @harshit-sethi09. With the initial report I thought this was a change in the XLM-RoBERTa model causing problems across versions, but the whole reload is broken in 4.21.0 because of the changes in #18221.

The PR mentioned above should fix it and we will soon make a patch release with it.
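Until that patch release is out, a quick way to confirm which version is installed before resuming a run (the exact patched version number is not stated in this thread):

```python
# Print the installed Transformers version; a 4.21.0 install is affected by
# this issue and needs either the rollback described above or the patch release.
import transformers
print(transformers.__version__)
```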