nomic-ai / contrastors

Train Models Contrastively in PyTorch
Apache License 2.0

_IncompatibleKeys error when loading a contrastive pre-trained model #11

Closed kuanhsieh closed 7 months ago

kuanhsieh commented 7 months ago

Hi,

I followed the steps in the README.md and ran the suggested command to do contrastive pretraining:

cd src/contrastors
torchrun --nproc-per-node=8 train.py --config=configs/train/contrastive_pretrain.yaml --dtype=bf16

The only thing I changed was the output_dir variable in configs/train/contrastive_pretrain.yaml so that the model would be stored on local disk, e.g., I set it to output_dir: "nomic-embed-text-v1-unsupervised-1st-try". I also changed the data config configs/data/contrastive_pretrain.yaml so that it only used a subset of the data (to test things out).

However, I got an _IncompatibleKeys error when I then went to run the contrastive fine-tuning step. For that step I changed the model_name in configs/train/contrastive_finetune.yaml from "nomic-ai/nomic-embed-text-v1-unsupervised" to "nomic-embed-text-v1-unsupervised-1st-try/final_model" so that I could try my own contrastively pretrained model, and I also changed the data config configs/data/finetune_triplets.yaml to use only a subset of the data.
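Concretely, the edited line in configs/train/contrastive_finetune.yaml looked like this:

model_name: "nomic-embed-text-v1-unsupervised-1st-try/final_model"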

I believe this comes from the load_state_dict function. The missing keys and unexpected keys were of the form (for all 112 keys I think):

missing_keys: ['emb_ln.weight', 'emb_ln.bias', ...]
unexpected_keys: ['trunk.emb_ln.bias', 'trunk.emb_ln.weight', ...]

i.e., I think the unexpected_keys had an extra "trunk." prefix, which caused the error (both lists had exactly 112 keys).

I tried removing the "trunk." prefix (like a simplified version of what the remap_bert_state_dict function does in contrastors/models/encoder/bert.py) and reran, but then got the following Tensor size mismatch error:

RuntimeError: Error(s) in loading state_dict for NomicBertModel:
    size mismatch for embeddings.word_embeddings.weight: copying a param with shape torch.Size([30528, 768]) from checkpoint, the shape in current model is torch.Size([50257, 768]).
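For reference, the prefix removal I tried was along these lines (a simplified sketch, not the exact code I ran):

def strip_trunk_prefix(state_dict):
    # drop the leading "trunk." from every key so the checkpoint keys
    # line up with the NomicBertModel parameter names
    return {
        key[len("trunk."):] if key.startswith("trunk.") else key: value
        for key, value in state_dict.items()
    }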

I'm not sure why there would be such a mismatch. Could someone please advise?

Many thanks.

zanussbaum commented 7 months ago

thanks for raising this! i am able to reproduce your bug. let me think a little bit about how to make this more clear for other users.

basically what's happening is: here i check if pretrained is None and then try to load a NomicBertModel from the model_name path; however, the checkpoint is saved as a BiEncoder model.

the quick fix is to replace those fields in your yaml with the following:

model_name: nomic-ai/nomic-bert-2048
pretrained: <path to checkpoint>

i'll think a little bit about how to make this cleaner. the reason i save the BiEncoder model vs. the underlying trunk object is that there are scenarios where there are learnable layers after the trunk in the BiEncoder model.
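for context, the saved object is structured roughly like this (a simplified sketch, not the actual class definition), which is why every key in the checkpoint picks up a "trunk." prefix:

import torch.nn as nn

class BiEncoderSketch(nn.Module):
    # simplified sketch of a bi-encoder wrapper: the transformer lives in
    # self.trunk, so its parameters are saved under "trunk.*" keys, and any
    # extra learnable layers after the trunk get saved alongside them
    def __init__(self, trunk):
        super().__init__()
        self.trunk = trunk

    def forward(self, **inputs):
        return self.trunk(**inputs)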

let me know if you still face any issues with this!

kuanhsieh commented 7 months ago

Hi, thank you very much for this! I sort of understand your thinking now. What you suggested worked without any issues. Thank you for being so responsive and helpful!