texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

unable to reproduce repllama performance #129

Open amy-hyunji opened 1 month ago

amy-hyunji commented 1 month ago

Hello, thanks for sharing great work!

I tried training RepLLaMA myself with the repllama branch but failed to reproduce the numbers. Could you check whether any of my hyperparameters are wrong? I have added the training script below. I am currently running on 8 A100 GPUs, and the result I got is NDCG@10: 0.3959, NDCG@100: 0.4515. When I download the released model from HF I get the numbers reported in the paper, so I assume the issue is in the training, not the evaluation.

Thanks :)

deepspeed --master_port 40000 train.py \
  --deepspeed "ds_config.json" \
  --output_dir "model_repllama_lora_train.7b.re" \
  --model_name_or_path "meta-llama/Llama-2-7b-hf" \
  --save_steps 500 \
  --dataset_name "Tevatron/msmarco-passage" \
  --bf16 \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --train_n_passages 16 \
  --learning_rate 1e-4 \
  --q_max_len 32 \
  --p_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --dataset_proc_num 32 \
  --negatives_x_device \
  --warmup_steps 100
MXueguang commented 1 month ago

For reproducing RepLLaMA training, I'd suggest using the main branch with the latest code:

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path mistralai/Mistral-7B-v0.1 \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4

You can either use Mistral as the initialization for slightly higher effectiveness, or use Llama-2 to reproduce the paper.

Note that Tevatron/msmarco-passage-aug is the training data used to train RepLLaMA.
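
For anyone who wants to sanity-check what that dataset looks like locally, a minimal peek (nothing beyond the dataset id above is assumed; the snippet just prints whatever splits and fields the hub provides):

# Quick look at the training data referenced above. Only the dataset id comes
# from this thread; splits and field names are not assumed, just printed.
from datasets import load_dataset

ds = load_dataset("Tevatron/msmarco-passage-aug")
print(ds)                        # available splits and columns
split_name = next(iter(ds))      # first available split (likely "train")
print(ds[split_name][0].keys())  # field names of one training example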

amy-hyunji commented 4 weeks ago

Hi,

Thank you for the reply! I tried training myself with the script but still had trouble with the reproduction. I am currently training on 8 A100 80G GPUs, so I changed per_device_train_batch_size and gradient_accumulation_steps for faster training. Would this cause a problem? I kept (# of GPUs) * per_device_train_batch_size * gradient_accumulation_steps equal to the reported total batch size of 128. Below is the script I used!

deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 16 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 1
MXueguang commented 4 weeks ago

I would suggest keeping the accumulation steps the same as before. If I remember correctly, @ArvinZhuang previously found that finetuning with a batch size of 128 directly may give unstable loss and may need further tuning of the temperature. @ArvinZhuang, correct me if I am wrong.

ArvinZhuang commented 4 weeks ago

Hi @MXueguang @amy-hyunji, yes, I tried a training config similar to @amy-hyunji's, with gradient_accumulation_steps set to 1, and it did not work well... Setting gradient_accumulation_steps back to 4 reproduces the results.

Note that gradient_accumulation_steps affects the number of negatives per training example. For example, setting per_device_train_batch_size 8 and gradient_accumulation_steps 4 with 4 GPUs results in a total batch size of 128, and the number of negatives per example is 4 * 8 * 16 = 512. With per_device_train_batch_size 32 and gradient_accumulation_steps 1 on 4 GPUs, the batch size is also 128, but the number of negatives is 4 * 32 * 16 = 2048 (@MXueguang correct me if I'm wrong).

This is odd to me as well, since the experience from the literature is that more negatives is usually better...
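
To make the arithmetic above concrete, here is a small check using the thread's own formulas (total batch size = GPUs * per-device batch * grad accumulation; negatives per example = GPUs * per-device batch * train_group_size); these are the formulas stated in this comment, not something verified against Tevatron's internals:

# Sanity-check of the two 4-GPU configurations discussed above,
# following the thread's formulas (not verified against the training code).
def effective_sizes(gpus, per_device_bs, grad_accum, group_size=16):
    total_batch = gpus * per_device_bs * grad_accum
    negatives_per_example = gpus * per_device_bs * group_size  # cross-device in-batch negatives
    return total_batch, negatives_per_example

print(effective_sizes(4, 8, 4))   # (128, 512)  -> the config that reproduces
print(effective_sizes(4, 32, 1))  # (128, 2048) -> same batch size, many more negatives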

riyajatar37003 commented 2 weeks ago

Traceback (most recent call last):
  File "/tmp/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3342, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/tmp/.local/lib/python3.10/site-packages/tevatron/retriever/trainer.py", line 31, in _save
    raise ValueError(f"Unsupported model class {self.model}")
ValueError: Unsupported model class DenseModel(

Why am I getting this error when saving a checkpoint?

ArvinZhuang commented 2 weeks ago

Hi @riyajatar37003, is your DenseModel a subclass of EncoderModel, i.e. DenseModel(EncoderModel)? The save logic here only supports Tevatron's EncoderModel; if DenseModel is your own implementation, you may need to add it to the supported model list.
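
For concreteness, a hedged sketch of what that would look like (the import path is an assumption based on the tevatron.retriever.* paths in the traceback above; verify it against your installed version):

# Hedged sketch: the trainer's _save only accepts Tevatron encoder models, so a
# custom dense model should inherit from EncoderModel. Import path assumed from
# the traceback in this thread; class and docstring are illustrative only.
from tevatron.retriever.modeling import EncoderModel  # assumed import path

class MyDenseModel(EncoderModel):
    """Illustrative custom model; inheriting from EncoderModel is what lets the
    trainer's supported-model check in _save accept it instead of raising
    ValueError("Unsupported model class ...")."""
    # ...custom encoding logic would go here; omitted in this sketch.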

riyajatar37003 commented 2 weeks ago

What is meant by train group size, and why is it important?

orionw commented 1 week ago

@MXueguang I also had a similar experience with failing to reproduce using that script. Using your suggested config above I get 72.95 nDCG@10 for DL19 and 70.6 nDCG@10 for DL20 (compared to the paper's 74.3 and 72.1). I used the gradient-step parameters (batch size of 8, gradient accumulation of 4) suggested by @ArvinZhuang.

Are you able to reproduce it with that code or did the recent updates to tevatron change some parameter that lowers performance? If I wanted to exactly reproduce, do you have the command that works with the November codebase?

ArvinZhuang commented 1 week ago

Hi @orionw, which base LLM were you using? Llama-2?

orionw commented 1 week ago

Yes, llama2

Full config (if helpful):

#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-llama2 \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 200 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 8 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 196 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --warmup_steps 100 \
  --gradient_accumulation_steps 4
MXueguang commented 1 week ago

Do you have MS MARCO dev set results for the checkpoint you got? Do they match?

orionw commented 1 week ago

I didn't evaluate on dev; I stopped at these two for compute reasons.

MXueguang commented 1 week ago

I see. I'll schedule a training run to see how it goes on my end...

orionw commented 1 week ago

Thanks for looking into it!

MXueguang commented 1 week ago

One difference I observed is the lora_r parameter: it was set to 32 in the original experiment, while the current default is 8. I am checking whether this affects the TREC DL results.
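
For context on what that parameter controls: in peft terms the LoRA rank is LoraConfig's r. A sketch of the original setting, assuming the adapter is built through peft with the target modules used in the training commands above (this is an illustration, not Tevatron's actual construction code):

# Illustrative only: the rank under discussion corresponds to "r" in peft's
# LoraConfig. r=32 is the value reported for the original experiment; the target
# modules match the --lora_target_modules flag used in this thread.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,  # original RepLLaMA experiment; the current default is reported as 8
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "up_proj", "gate_proj"],
)
print(lora_config)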

orionw commented 1 week ago

Hmm, could be @MXueguang! Let me know if that fixes it!

I was also curious about potential discrepancies in the number of GPUs / batch size. I don't know the exact command you used, but your paper says the model was trained with 16 V100 GPUs. Perhaps distributing the batch across more devices makes it better (like @ArvinZhuang was saying).

Could also be related to the cross device negatives/group size (did that logic change in the recent version?). I unfortunately don't have access to a node with 16 GPUs to test it out on.

MXueguang commented 1 week ago

Yeah @orionw, I am running it, but I still need a day or so to get a number due to limited compute... In the original 16 V100 GPU setup, --per_device_train_batch_size was set to 2, so the setting here is equivalent in terms of batch size, and so are the cross-device negatives.
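
A quick check of that equivalence using the same formulas as earlier in the thread (GPU counts and per-device batch sizes are the ones stated here; gradient accumulation of 4 and train_group_size 16 are assumed to match the posted configs):

# Both setups give a 128 batch and 512 negatives per example
# (16 V100 GPUs x per-device 2 vs. 4 GPUs x per-device 8), assuming
# gradient_accumulation_steps 4 and train_group_size 16 in both cases.
for gpus, per_device_bs in [(16, 2), (4, 8)]:
    total_batch = gpus * per_device_bs * 4   # gradient accumulation 4 (assumed)
    negatives = gpus * per_device_bs * 16    # train_group_size 16
    print(gpus, per_device_bs, total_batch, negatives)  # -> 128 and 512 for both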

MXueguang commented 4 days ago

Hi @orionw, my reproduction with lora_r=32 (everything else kept the same) gives: dev MRR@10: 41.6, DL19 nDCG@10: 74.6, DL20 nDCG@10: 71.4.

Dev and DL19 are a bit higher, and DL20 is a bit lower, than the original experiments.

orionw commented 4 days ago

Awesome, thank you so much @MXueguang! Those differences could easily be due to random seeds. Really appreciate you looking into it :)