huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Can't load FSMT model after resizing token embedding #20102

Closed. alex96k closed this issue 1 year ago.

alex96k commented 2 years ago

System Info

Environment info:

Who can help?

@stas00

Expected behavior / Issue

I am having issues reloading a saved FSMT model after the token embedding has been resized. This error doesn't appear with other models such as T5 or MT5. A similar error occurred previously for other models as well but has been fixed (-> #9055 or #8706). However, it doesn't seem to be fixed for the FSMT model. Currently I receive the following error:

RuntimeError: Error(s) in loading state_dict for FSMTForConditionalGeneration:
        size mismatch for model.encoder.embed_tokens.weight: copying a param with shape torch.Size([42026, 1024]) from checkpoint, the shape in current model is torch.Size([42024, 1024]).
        size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([42026, 1024]) from checkpoint, the shape in current model is torch.Size([42024, 1024]).

Any idea how to solve this? Thanks a lot and all the best!

Reproduction

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

SAVING_PATH = "/tmp/test_model_fsmt"
model_class = FSMTForConditionalGeneration
model_path = "facebook/wmt19-de-en"

# Load the pretrained model and its tokenizer.
model = model_class.from_pretrained(model_path)
tokenizer = FSMTTokenizer.from_pretrained(model_path)

# Add two new tokens and grow the embedding matrix to match.
tokenizer.add_tokens(['test1', 'test2'])
model.resize_token_embeddings(len(tokenizer))

# Save both, then try to reload the model -- the last line raises the error.
model.save_pretrained(SAVING_PATH)
tokenizer.save_pretrained(SAVING_PATH)

new_model = model_class.from_pretrained(SAVING_PATH)
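
A quick way to see where the mismatch comes from is to inspect the config that save_pretrained wrote next to the weights. This is only a diagnostic sketch under the assumption that the saved config.json still carries the pre-resize vocabulary sizes; src_vocab_size and tgt_vocab_size are FSMT's actual config keys, but the interpretation of the printed values is our assumption:

import json
import os

# Read back the config that save_pretrained produced.
with open(os.path.join(SAVING_PATH, "config.json")) as f:
    saved_config = json.load(f)

# FSMT stores two vocab sizes instead of the usual single vocab_size.
# If these still say 42024 while the saved embedding weights have 42026
# rows, from_pretrained builds the model too small and loading the
# state dict fails with the size mismatch shown above.
print(saved_config.get("src_vocab_size"), saved_config.get("tgt_vocab_size"))
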
sgugger commented 2 years ago

Thanks for the clear reproducer. Looking at the code, it looks like FSMT in general does not properly support the resize_token_embeddings API: it doesn't use the same config names for the vocab size (easily fixable), and the method resizes both the encoder and decoder embeddings when, in this case, it should probably resize only the encoder embedding.
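
If the stale config fields are indeed the cause, a stopgap on the user side might be to patch them by hand before saving. This is a minimal sketch of a workaround hypothesis, not the proper fix, and it assumes both embedding matrices really do get resized to len(tokenizer), as the traceback suggests:

from transformers import FSMTForConditionalGeneration, FSMTTokenizer

model = FSMTForConditionalGeneration.from_pretrained("facebook/wmt19-de-en")
tokenizer = FSMTTokenizer.from_pretrained("facebook/wmt19-de-en")

tokenizer.add_tokens(['test1', 'test2'])
model.resize_token_embeddings(len(tokenizer))

# resize_token_embeddings updates the generic config.vocab_size, but FSMT
# builds its embeddings from src_vocab_size / tgt_vocab_size, so keep those
# in sync by hand before saving (assumption: both embeddings were resized).
model.config.src_vocab_size = len(tokenizer)
model.config.tgt_vocab_size = len(tokenizer)

model.save_pretrained("/tmp/test_model_fsmt")
tokenizer.save_pretrained("/tmp/test_model_fsmt")

# Reload should now construct embeddings with the right shapes.
reloaded = FSMTForConditionalGeneration.from_pretrained("/tmp/test_model_fsmt")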

In any case, I don't know the model as well as @stas00 so let's wait for him to chime in and advise on the best fix!

stas00 commented 1 year ago

@alex96k, would you by chance like to tackle that?

The main difficulty with FSMT is that, unlike most models, it uses two separate dictionaries (one for the source language and one for the target language), so some generic functionality is either not possible out of the box or requires some very careful thinking in order not to break other things. I think it's the only model of this kind among the HF models.
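
To make the two-dictionary point concrete, here is a small sketch comparing configs; src_vocab_size and tgt_vocab_size are real FSMTConfig attributes, while vocab_size is the single field most other configs expose:

from transformers import FSMTConfig, T5Config

fsmt_config = FSMTConfig.from_pretrained("facebook/wmt19-de-en")
# FSMT: one vocabulary for the source language, another for the target.
print(fsmt_config.src_vocab_size, fsmt_config.tgt_vocab_size)

t5_config = T5Config.from_pretrained("t5-small")
# Most models: a single shared vocab_size, which is what the generic
# resize_token_embeddings bookkeeping knows how to update.
print(t5_config.vocab_size)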

There is an outstanding PR that was trying to bring FSMT in sync with the rest of the models: https://github.com/huggingface/transformers/pull/11218. It proved to cause a speed regression and was never merged, but perhaps it had already resolved this?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.