princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Difference in models between train and evaluation scripts. #267

Closed seanswyi closed 6 months ago

seanswyi commented 6 months ago

I noticed that the models differ between running evaluation via the Trainer's evaluate method and running the evaluation.py script. The Trainer uses a model set up for contrastive learning (e.g., BertForCL), whereas evaluation.py uses a plain HuggingFace pretrained model.

Is this intentional? I would think that the models should be the same. Not to mention that there's no checkpoint loading code in evaluation.py either. Please let me know if I'm mistaken. Thanks.

gaotianyu1350 commented 6 months ago

Hi,

The two should be equivalent if you convert the checkpoint before using evaluation.py (for how to convert the checkpoint for evaluation and inference, please refer to our README). We keep this discrepancy because we want to keep our training code flexible and the inference code as close to HuggingFace as possible.

seanswyi commented 6 months ago

Thanks for the reply. I did try running simcse_to_huggingface.py, but the script expects a pytorch_model.bin file, whereas the training script only saved a model.safetensors file. However, it seems that even without a converted file you can pass the directory containing model.safetensors to from_pretrained's pretrained_model_name_or_path argument (probably because you're using transformers 4.2.1 while I'm currently on 4.36.2).

That is:

# Assuming `model.safetensors` is saved in `/data/models`.
from transformers import AutoModel

model = AutoModel.from_pretrained("/data/models")

I think I should be able to use this, but do you know if the two would be different?

gaotianyu1350 commented 6 months ago

Hi,

Yes, if it's not converted there will be a difference. I believe there should be a way to convert safetensors to pytorch_model.bin (I'm not super familiar with the latest transformers version). Another workaround is to downgrade to 4.2.1.

seanswyi commented 6 months ago

Yeah, you're right. For anyone else wondering, you can easily convert the safetensors file to the more traditional pytorch_model.bin file as follows:

import torch
from transformers import AutoModel

# Load from the checkpoint directory containing model.safetensors
# (and config.json), then re-save the weights in the legacy format.
model = AutoModel.from_pretrained(PATH_TO_SAFETENSORS)
torch.save(model.state_dict(), PATH_TO_PYTORCH)

For anyone wondering why HuggingFace saves the model as safetensors, it's not so much HuggingFace itself but the Trainer object. When you pass a TrainingArguments object to the Trainer, there's a save_safetensors argument whose default value is set to True.

safetensors has been the default loading option since v4.30.0, but it only became the default saving option in v4.35.0. You can read more about it here.