microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Error while fine-tuning the E5 model? #1145

Open abhishekverma1997 opened 1 year ago

abhishekverma1997 commented 1 year ago

@intfloat

Hi, I'm implementing an E5 fine-tuning task similar to #1066.

I am trying to run a simple E5-large fine-tuning with BM25 hard negatives. For simplicity and debugging, I'm running the base SimLM model first and will upgrade to E5 once this simple script executes without any issues.

To implement the above I am trying to run the train_biencoder_marco.sh script.

I have reduced the parameter values for simplicity (for example, smaller train & eval batch sizes, train_n_passages reduced to 2, num_train_epochs reduced to 3, etc.).

Following is the configuration inside my train_biencoder_marco.sh script.

[screenshot of the modified train_biencoder_marco.sh configuration]

While executing, I am faced with the following error message:

[screenshot of the error traceback]

According to the error, the model inputs from the transformer model are not getting passed to the compute_loss() function within the simlm/src/trainers/biencoder_trainer.py file.

Please let me know if I am doing something wrong while executing or defining my parameters.

Thanks!

intfloat commented 1 year ago

Sorry but I was unable to reproduce your issue.

Following the instructions at https://github.com/microsoft/unilm/tree/master/simlm, I executed the following commands without changing any code:

```shell
bash scripts/download_msmarco_data.sh

export DATA_DIR=./data/msmarco_bm25_official/
export OUTPUT_DIR=./checkpoint/biencoder/

# Train bi-encoder
bash scripts/train_biencoder_marco.sh
```

Everything then works fine (I tested with both transformers versions 4.15 and 4.29).

Can you provide more details about the code changes you have made (e.g., a git diff)? It looks like you changed the train_biencoder.py file.

Also, could you try running the above command without changing any code?

abhishekverma1997 commented 1 year ago

Hey @intfloat,

Thanks for your prompt reply. The fresh git pull worked.

I had some issues with how my passage.jsonl file was structured for my own custom dataset. It's all sorted now, and fine-tuning with BM25 hard negatives works fine.

I had a question about how you calculated and assigned the "positives" and "negatives" scores to the documents in train.jsonl and dev.jsonl.

One more question: why are the "positives" scores in train.jsonl -1.0 in all cases?

Looking forward to the discussion.

Thanks!
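For readers following along, the train.jsonl layout under discussion can be sketched roughly as below. The exact field names are an assumption for illustration, inferred from the thread's description of per-document "positives"/"negatives" scores, not copied verbatim from the repo:

```python
import json

# Hypothetical sketch of one line of train.jsonl for BM25 hard negatives.
# Field names and values are illustrative assumptions only.
example = {
    "query_id": "q1",
    "query": "what is knowledge distillation",
    "positives": {
        "doc_id": ["d42"],        # human-annotated relevant passage
        "score": [-1.0],          # placeholder score, as asked about above
    },
    "negatives": {
        "doc_id": ["d7", "d19"],  # retrieved but not annotated as relevant
        "score": [21.3, 18.9],    # e.g. retrieval scores, kept for format only
    },
}

line = json.dumps(example)
print(line)
```

Each line is one self-contained JSON object, so the file can be streamed record by record during training.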

intfloat commented 1 year ago

@abhishekverma1997

The positives are human-annotated relevant passages, and the negatives are the BM25 retrieved passages that are not annotated as relevant.

Only the scores in simlm/data/msmarco_distillation/kd_train.jsonl are used for knowledge distillation; the scores in the other files are there only to keep the data format consistent.

A related issue: https://github.com/microsoft/unilm/issues/1149
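The selection rule described above, using BM25-retrieved passages that are not annotated as relevant as hard negatives, can be sketched in plain Python (the function and variable names here are mine, not from the repo):

```python
def select_hard_negatives(bm25_ranked_ids, annotated_positive_ids, k):
    """Take the top-k BM25-retrieved passage ids that are NOT
    annotated as relevant; these serve as hard negatives."""
    positives = set(annotated_positive_ids)
    negatives = []
    for doc_id in bm25_ranked_ids:
        if doc_id in positives:
            continue  # skip annotated relevant passages
        negatives.append(doc_id)
        if len(negatives) == k:
            break
    return negatives

# Example: a BM25 ranked list with one annotated positive mixed in.
ranked = ["d7", "d42", "d19", "d3", "d55"]
print(select_hard_negatives(ranked, ["d42"], k=3))  # ['d7', 'd19', 'd3']
```

Because the negatives are taken from the top of the BM25 ranking, they are lexically similar to the query and therefore "hard", which is what makes them useful for contrastive training.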

abhishekverma1997 commented 1 year ago

@intfloat ,

Thank you for your explanation regarding the BM25 hard negatives train.jsonl and dev.jsonl files.

How are the scores in kd_train.jsonl and kd_dev.jsonl calculated? I would like to compute positive and negative scores for my custom dataset so that I can create my own kd_train and kd_dev files.
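The thread ends here without a reply. One common way to produce such distillation data, shown only as a hypothetical sketch, is to score every candidate passage with a stronger teacher model (for example, a cross-encoder re-ranker) and store those scores alongside the doc ids. The scorer below is a trivial token-overlap stand-in, not the repo's actual teacher, and the field names are assumptions:

```python
import json

def teacher_score(query, passage):
    # Stand-in for a real teacher model (e.g. a cross-encoder re-ranker).
    # Here: a trivial token-overlap score, for illustration only.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def build_kd_entry(query_id, query, pos_ids, neg_ids, corpus):
    """Assemble one hypothetical kd_train.jsonl line: every passage,
    positive or negative, carries a teacher score for distillation."""
    return {
        "query_id": query_id,
        "query": query,
        "positives": {"doc_id": pos_ids,
                      "score": [teacher_score(query, corpus[d]) for d in pos_ids]},
        "negatives": {"doc_id": neg_ids,
                      "score": [teacher_score(query, corpus[d]) for d in neg_ids]},
    }

corpus = {"d1": "knowledge distillation transfers a teacher model",
          "d2": "bm25 is a lexical retrieval function"}
entry = build_kd_entry("q1", "what is knowledge distillation", ["d1"], ["d2"], corpus)
print(json.dumps(entry))
```

With a real teacher, the student bi-encoder would then be trained to match the distribution implied by these scores rather than hard 0/1 labels.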