texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

unable to reproduce repllama performance #129

Open amy-hyunji opened 1 month ago

amy-hyunji commented 1 month ago

Hello, thanks for sharing great work!

I tried training RepLLaMA myself with the repllama branch but failed to reproduce the numbers. Could you check whether any of my hyperparameters are wrong? I have added the training script below. I am currently running on 8 A100 GPUs, and the result I got is NDCG@10: 0.3959, NDCG@100: 0.4515. When I download the released model from HF I get the numbers reported in the paper, so I assume the issue is in the training, not the evaluation.

Thanks :)

deepspeed --master_port 40000 train.py \
  --deepspeed "ds_config.json" \
  --output_dir "model_repllama_lora_train.7b.re" \
  --model_name_or_path "meta-llama/Llama-2-7b-hf" \
  --save_steps 500 \
  --dataset_name "Tevatron/msmarco-passage" \
  --bf16 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --train_n_passages 16 \
  --learning_rate 1e-4 \
  --q_max_len 32 \
  --p_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --dataset_proc_num 32 \
  --negatives_x_device \
  --warmup_steps 100
MXueguang commented 1 month ago

For reproducing RepLLaMA training, I'd suggest using the main branch with the latest code:

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4

You can either use Mistral as the initialization for slightly higher effectiveness, or use Llama-2 to reproduce the paper.

Note that Tevatron/msmarco-passage-aug is the training data used to train RepLLaMA.
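
For anyone who wants to sanity-check what that dataset looks like locally, a minimal peek (nothing beyond the dataset id above is assumed; the snippet just prints whatever splits and fields the hub provides):

# Quick look at the training data referenced above. Only the dataset id comes
# from this thread; splits and field names are not assumed, just printed.
from datasets import load_dataset

ds = load_dataset("Tevatron/msmarco-passage-aug")
print(ds)                        # available splits and columns
split_name = next(iter(ds))      # first available split (likely "train")
print(ds[split_name][0].keys())  # field names of one training example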

amy-hyunji commented 4 weeks ago

Hi,

Thank you for the reply! I tried training myself with the script but still had trouble with the reproduction. I am currently training on 8 A100 80G GPUs, so I changed per_device_train_batch_size and gradient_accumulation_steps for faster training. Would this cause a problem? I kept (# of GPUs) * per_device_train_batch_size * gradient_accumulation_steps equal to the reported total batch size of 128. Below is the script I used!

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 16 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 1
MXueguang commented 4 weeks ago

I would suggest keeping the accumulation steps the same as before. If I remember correctly, @ArvinZhuang previously found that finetuning with a batch size of 128 directly may give unstable loss and may need further tuning of the temperature. @ArvinZhuang, correct me if I am wrong.

ArvinZhuang commented 4 weeks ago

Hi @MXueguang @amy-hyunji, yes, I tried a training config similar to @amy-hyunji's, with gradient_accumulation_steps set to 1, and it did not work well... Setting gradient_accumulation_steps back to 4 reproduces the results.

Note that gradient_accumulation_steps affects the number of negatives per training example. For example, setting per_device_train_batch_size 8 and gradient_accumulation_steps 4 with 4 GPUs results in a total batch size of 128, and the number of negatives per example is 4 * 8 * 16 = 512. With per_device_train_batch_size 32 and gradient_accumulation_steps 1 on 4 GPUs, the batch size is also 128, but the number of negatives is 4 * 32 * 16 = 2048 (@MXueguang correct me if I'm wrong).

This is odd to me as well, since the experience from the literature is that more negatives is usually better...
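
To make the arithmetic above concrete, here is a small check using the thread's own formulas (total batch size = GPUs * per-device batch * grad accumulation; negatives per example = GPUs * per-device batch * train_group_size); these are the formulas stated in this comment, not something verified against Tevatron's internals:

# Sanity-check of the two 4-GPU configurations discussed above,
# following the thread's formulas (not verified against the training code).
def effective_sizes(gpus, per_device_bs, grad_accum, group_size=16):
    total_batch = gpus * per_device_bs * grad_accum
    negatives_per_example = gpus * per_device_bs * group_size  # cross-device in-batch negatives
    return total_batch, negatives_per_example

print(effective_sizes(4, 8, 4))   # (128, 512)  -> the config that reproduces
print(effective_sizes(4, 32, 1))  # (128, 2048) -> same batch size, many more negatives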

riyajatar37003 commented 2 weeks ago

Traceback (most recent call last):
  File "/tmp/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3342, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/tmp/.local/lib/python3.10/site-packages/tevatron/retriever/trainer.py", line 31, in _save
    raise ValueError(f"Unsupported model class {self.model}")
ValueError: Unsupported model class DenseModel(

Why am I getting this error when saving a checkpoint?

ArvinZhuang commented 2 weeks ago

Hi @riyajatar37003, is your DenseModel a subclass of EncoderModel, i.e. DenseModel(EncoderModel)? The save logic here only supports Tevatron's EncoderModel; if DenseModel is your own implementation, you may need to add it to the supported model list.
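
For concreteness, a hedged sketch of what that would look like (the import path is an assumption based on the tevatron.retriever.* paths in the traceback above; verify it against your installed version):

# Hedged sketch: the trainer's _save only accepts Tevatron encoder models, so a
# custom dense model should inherit from EncoderModel. Import path assumed from
# the traceback in this thread; class and docstring are illustrative only.
from tevatron.retriever.modeling import EncoderModel  # assumed import path

class MyDenseModel(EncoderModel):
    """Illustrative custom model; inheriting from EncoderModel is what lets the
    trainer's supported-model check in _save accept it instead of raising
    ValueError("Unsupported model class ...")."""
    # ...custom encoding logic would go here; omitted in this sketch.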

riyajatar37003 commented 2 weeks ago

What is meant by train group size, and why is it important?

orionw commented 1 week ago

@MXueguang I also had a similar experience with failing to reproduce using that script. Using your suggested config above I get 72.95 nDCG@10 for DL19 and 70.6 nDCG@10 for DL20 (compared to the paper's 74.3 and 72.1). I used the gradient-step parameters (batch size of 8, gradient accumulation of 4) suggested by @ArvinZhuang.

Are you able to reproduce it with that code or did the recent updates to tevatron change some parameter that lowers performance? If I wanted to exactly reproduce, do you have the command that works with the November codebase?

ArvinZhuang commented 1 week ago

Hi @orionw, which base LLM were you using? Llama-2?

orionw commented 1 week ago

Yes, llama2

Full config (if helpful):

#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-llama2 \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 200 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --warmup_steps 100 \
  --gradient_accumulation_steps 4
MXueguang commented 1 week ago

Do you have MS MARCO dev set results for the checkpoint you got? Do they match?

orionw commented 1 week ago

I didn't evaluate on dev; I stopped at these two for compute reasons.

MXueguang commented 1 week ago

I see. I'll schedule a training run to see how it goes on my end...

orionw commented 1 week ago

Thanks for looking into it!

MXueguang commented 1 week ago

One difference I observed is the lora_r parameter: it was set to 32 in the original experiment, while the current default is 8. I am checking whether this affects the TREC DL results.
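
For context on what that parameter controls: in peft terms the LoRA rank is LoraConfig's r. A sketch of the original setting, assuming the adapter is built through peft with the target modules used in the training commands above (this is an illustration, not Tevatron's actual construction code):

# Illustrative only: the rank under discussion corresponds to "r" in peft's
# LoraConfig. r=32 is the value reported for the original experiment; the target
# modules match the --lora_target_modules flag used in this thread.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,  # original RepLLaMA experiment; the current default is reported as 8
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],
)
print(lora_config)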

orionw commented 1 week ago

Hmm, could be @MXueguang! Let me know if that fixes it!

I was also curious about potential discrepancies in the number of GPUs / batch size. I don't know the exact command you used, but your paper says the model was trained with 16 V100 GPUs. Perhaps distributing the batch across more devices makes it better (like @ArvinZhuang was saying).

Could also be related to the cross device negatives/group size (did that logic change in the recent version?). I unfortunately don't have access to a node with 16 GPUs to test it out on.

MXueguang commented 1 week ago

Yeah @orionw, I am running it, but I still need a day or so to get a number due to limited compute... In the original 16 V100 GPU setup, --per_device_train_batch_size was set to 2, so the setting here is equivalent in terms of batch size, and so are the cross-device negatives.
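
A quick check of that equivalence using the same formulas as earlier in the thread (GPU counts and per-device batch sizes are the ones stated here; gradient accumulation of 4 and train_group_size 16 are assumed to match the posted configs):

# Both setups give a 128 batch and 512 negatives per example
# (16 V100 GPUs x per-device 2 vs. 4 GPUs x per-device 8), assuming
# gradient_accumulation_steps 4 and train_group_size 16 in both cases.
for gpus, per_device_bs in [(16, 2), (4, 8)]:
    total_batch = gpus * per_device_bs * 4   # gradient accumulation 4 (assumed)
    negatives = gpus * per_device_bs * 16    # train_group_size 16
    print(gpus, per_device_bs, total_batch, negatives)  # -> 128 and 512 for both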

MXueguang commented 4 days ago

Hi @orionw, my reproduction with lora_r=32 (everything else kept the same) gives: dev MRR@10: 41.6, DL19 nDCG@10: 74.6, DL20 nDCG@10: 71.4.

Dev and DL19 are a bit higher, and DL20 is a bit lower, than the original experiments.

orionw commented 4 days ago

Awesome, thank you so much @MXueguang! Those differences could easily be due to random seeds. Really appreciate you looking into it :)