kkie02 closed this issue 2 years ago
cc @sgugger
We don't support resuming training with a different version of Transformers than the one that initiated it, as that would require freezing the whole Trainer forever: any bug fix or feature added to it wouldn't work with a resumed checkpoint.
I am facing the same error with Transformers version 4.21.0: the model was trained on the same Transformers version, and loading the best model after training gives this error. I am using xlm-roberta-base with AutoModelForMaskedLM.
RuntimeError: Error(s) in loading state_dict for XLMRobertaForMaskedLM: Missing key(s) in state_dict: "lm_head.decoder.weight", "lm_head.decoder.bias".
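For context, a minimal sketch of the setup described above, under assumptions not in the report (the toy dataset, step counts, and output path are placeholders): with load_best_model_at_end=True, the Trainer reloads the best checkpoint's state dict at the end of training, and on 4.21.0 that reload is what raises the error.

```python
import torch
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

class ToyTextDataset(torch.utils.data.Dataset):
    """Placeholder dataset; any tokenized text corpus would do."""
    def __init__(self, texts):
        self.enc = tokenizer(texts, truncation=True, padding="max_length", max_length=32)
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        return {k: torch.tensor(v[i]) for k, v in self.enc.items()}

train_ds = ToyTextDataset(["Hello world."] * 8)
eval_ds = ToyTextDataset(["Bonjour le monde."] * 4)

args = TrainingArguments(
    output_dir="xlmr-mlm",        # placeholder path
    evaluation_strategy="steps",
    eval_steps=2,
    save_steps=2,
    max_steps=4,
    load_best_model_at_end=True,  # the best-checkpoint reload fails on 4.21.0
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()  # on 4.21.0, loading the best model at the end raises the RuntimeError above
```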
Thanks for reporting @harshit-sethi09. From the initial report I thought a change in the XLM-RoBERTa model was causing problems across versions, but the whole reload is broken in 4.21.0 because of the changes in #18221.
The PR mentioned above should fix it, and we will soon make a patch release with it.
System Info
Transformers 4.21.0
Who can help?
@LysandreJik @sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Describe: I used XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-base') to continue pretraining. The training process was long, so I saved checkpoints regularly. I did this in Google Colab. A few days ago I stopped being able to load any saved checkpoint with cont_pre_trainer.train(resume_from_checkpoint=True); it always raises this error: RuntimeError: Error(s) in loading state_dict for XLMRobertaForMaskedLM: Missing key(s) in state_dict: "lm_head.decoder.weight", "lm_head.decoder.bias".
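A hedged sketch of the resume flow described above; cont_pre_trainer stands in for the reporter's Trainer, whose construction is assumed, and the explicit path is a placeholder.

```python
# resume_from_checkpoint=True tells the Trainer to look for the most recent
# checkpoint-* folder in args.output_dir and reload model/optimizer/scheduler
# state from it before continuing training.
cont_pre_trainer.train(resume_from_checkpoint=True)

# An explicit path can be passed instead, which is useful on Colab where the
# output directory may live on a mounted Drive (placeholder path below):
cont_pre_trainer.train(resume_from_checkpoint="xlmr-mlm/checkpoint-500")

# On 4.21.0, either form fails while reloading the model weights with:
# RuntimeError: Error(s) in loading state_dict for XLMRobertaForMaskedLM:
#     Missing key(s) in state_dict: "lm_head.decoder.weight", "lm_head.decoder.bias".
```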
The reason: the saved checkpoint's state_dict no longer contains "lm_head.decoder.weight" and "lm_head.decoder.bias", while XLMRobertaForMaskedLM expects them; since load_state_dict is strict by default, it complains about the missing keys. Maybe you should call load_state_dict(state_dict, strict=False) somewhere, as sketched below.
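A minimal sketch of that suggestion, assuming it is applied by hand to a checkpoint the Trainer wrote (the checkpoint path is a placeholder): loading non-strictly skips the missing keys, and re-tying the weights restores the decoder, which shares its weights with the input embeddings. Note that the library's actual fix landed via the PR referenced above; this is only a local workaround.

```python
import torch
from transformers import XLMRobertaForMaskedLM

model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder path for whatever checkpoint the Trainer wrote.
state_dict = torch.load("xlmr-mlm/checkpoint-500/pytorch_model.bin", map_location="cpu")

# strict=False returns the missing/unexpected keys instead of raising.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)        # expect lm_head.decoder.weight / .bias here
print("unexpected keys:", unexpected)

# The decoder weights are tied to the input embeddings, so re-tying restores them.
model.tie_weights()
```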
How I solved it myself: I rolled back Transformers to version 4.20.1 (pip install transformers==4.20.1) and it worked again.
Problem conclusion: Transformers version 4.21.0 can't load checkpoints trained on either version 4.20.1 or version 4.21.0. (Version 4.20.1 works normally; I use it to process checkpoints trained on either 4.20.1 or 4.21.0.)
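As an aside, one way to tell which version produced a given checkpoint: save_pretrained records the library version in the checkpoint's config.json. A small helper, assuming a standard Trainer checkpoint layout (the directory name is a placeholder):

```python
import json
import os

def checkpoint_transformers_version(checkpoint_dir: str) -> str:
    """Read the library version recorded by save_pretrained in config.json."""
    with open(os.path.join(checkpoint_dir, "config.json")) as f:
        return json.load(f).get("transformers_version", "unknown")

print(checkpoint_transformers_version("xlmr-mlm/checkpoint-500"))  # e.g. "4.20.1"
```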