Open amy-hyunji opened 1 month ago
For reproducing RepLLaMA training, I'd suggest using the latest code on the main branch:
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-mistral \
--model_name_or_path mistralai/Mistral-7B-v0.1 \
--lora \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 50 \
--dataset_name Tevatron/msmarco-passage-aug \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 8 \
--gradient_checkpointing \
--train_group_size 16 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 156 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--gradient_accumulation_steps 4
You can either use Mistral as the initialization for slightly higher effectiveness, or use Llama 2 to reproduce the paper.
Note that Tevatron/msmarco-passage-aug is the data used to train RepLLaMA.
Hi,
Thank you for the reply! I tried training with the script myself but still had trouble reproducing the results. I am currently training on 8 A100 80G GPUs, so I changed per_device_train_batch_size and gradient_accumulation_steps for faster training. Would this cause a problem? I chose them so that (# of GPUs) × per_device_train_batch_size × gradient_accumulation_steps equals the reported batch size of 128. The script I used is below:
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-mistral \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--lora \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 50 \
--dataset_name Tevatron/msmarco-passage-aug \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 16 \
--gradient_checkpointing \
--train_group_size 16 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 156 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--gradient_accumulation_steps 1
I would suggest keeping the accumulation steps the same as before. If I remember correctly, @ArvinZhuang previously found that fine-tuning with a batch size of 128 directly can give an unstable loss and may need further tuning of the temperature. @ArvinZhuang, correct me if I am wrong.
Hi @MXueguang @amy-hyunji, yes, I tried a training config similar to @amy-hyunji's, with gradient_accumulation_steps set to 1, and it didn't work well. Setting gradient_accumulation_steps back to 4 reproduces the results.
Note that gradient_accumulation_steps affects the number of negatives per training example. For example, with per_device_train_batch_size 8 and gradient_accumulation_steps 4 on 4 GPUs, the total batch size is 128 and the number of negatives per example is 4 × 8 × 16 = 512. With per_device_train_batch_size 32 and gradient_accumulation_steps 1 on 4 GPUs, the batch size is also 128 but the number of negatives is 4 × 32 × 16 = 2048 (@MXueguang, correct me if I'm wrong).
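The arithmetic in this comment can be sketched as below. This is only an illustration: `batch_stats` is a name made up here, not a Tevatron function, and it assumes passages are shared across GPUs within one forward step so that each query is scored against every passage in that step. The count follows the comment's convention and includes each query's own positive.

```python
def batch_stats(num_gpus, per_device_batch, grad_accum, group_size):
    """Return (effective batch size in queries, passages each query sees per step)."""
    queries_per_step = num_gpus * per_device_batch
    # Accumulation multiplies the queries per optimizer update...
    effective_batch = queries_per_step * grad_accum
    # ...but not the passage pool of a single forward step (assumption:
    # passages are gathered across GPUs, never across accumulation steps).
    passages_per_step = queries_per_step * group_size
    return effective_batch, passages_per_step


# 4 GPUs, per-device batch 8, accumulation 4, train_group_size 16
print(batch_stats(4, 8, 4, 16))   # (128, 512)
# 4 GPUs, per-device batch 32, accumulation 1, train_group_size 16
print(batch_stats(4, 32, 1, 16))  # (128, 2048)
```

Both settings give the same effective batch size, but the second compares each query against four times as many passages per step.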
This is odd to me as well, since the experience from the literature is that more negatives are better...
Traceback (most recent call last):
  File "/tmp/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3342, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/tmp/.local/lib/python3.10/site-packages/tevatron/retriever/trainer.py", line 31, in _save
    raise ValueError(f"Unsupported model class {self.model}")
ValueError: Unsupported model class DenseModel(
Why am I getting this error while saving a checkpoint?
Hi @riyajatar37003, is your DenseModel a subclass of EncoderModel, i.e. declared as class DenseModel(EncoderModel):? The save only supports Tevatron's EncoderModel here. If DenseModel is your own implementation, you may need to add it to the list of supported model classes.
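A minimal sketch of the kind of check the traceback points to, assuming the trainer does an isinstance test against a tuple of supported classes. The classes below are stand-ins written for this example, not Tevatron's real ones, and `check_saveable` and `SUPPORTED_CLASSES` are hypothetical names.

```python
class EncoderModel:
    """Stand-in for Tevatron's EncoderModel base class (illustration only)."""


class DenseModel(EncoderModel):
    """Inherits from EncoderModel, so it passes the check."""


class StandaloneDenseModel:
    """Same name in spirit, but with no EncoderModel base class."""


# Hypothetical list of classes the save routine accepts.
SUPPORTED_CLASSES = (EncoderModel,)


def check_saveable(model):
    # Raise the same kind of error as in the traceback for unsupported models.
    if not isinstance(model, SUPPORTED_CLASSES):
        raise ValueError(f"Unsupported model class {model}")
    return True


check_saveable(DenseModel())  # passes

try:
    check_saveable(StandaloneDenseModel())
except ValueError as err:
    print("rejected:", type(err).__name__)
```

Under this assumption, either subclassing EncoderModel or extending the supported tuple would let the save go through.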
What is meant by train group size, and how important is it?
@MXueguang I also had a similar experience, failing to reproduce with that script. Using your suggested config above I get 72.95 nDCG@10 for DL19 and 70.6 nDCG@10 for DL20 (compared to the paper's 74.3 and 72.1). I used the batch settings (batch size of 8, gradient accumulation of 4) suggested by @ArvinZhuang.
Are you able to reproduce it with that code, or did the recent updates to Tevatron change some parameter that lowers performance? If I wanted to reproduce it exactly, do you have the command that works with the November codebase?
Hi @orionw, which base LLM were you using? Llama 2?
Yes, llama2
Full config (if helpful):
#!/bin/bash
deepspeed --include localhost:0,1,2,3 --master_port 60000 --module tevatron.retriever.driver.train \
--deepspeed deepspeed/ds_zero3_config.json \
--output_dir retriever-llama2 \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--lora \
--lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
--save_steps 200 \
--dataset_name Tevatron/msmarco-passage-aug \
--query_prefix "Query: " \
--passage_prefix "Passage: " \
--bf16 \
--pooling eos \
--append_eos_token \
--normalize \
--temperature 0.01 \
--per_device_train_batch_size 8 \
--gradient_checkpointing \
--train_group_size 16 \
--learning_rate 1e-4 \
--query_max_len 32 \
--passage_max_len 196 \
--num_train_epochs 1 \
--logging_steps 10 \
--overwrite_output_dir \
--warmup_steps 100 \
--gradient_accumulation_steps 4
Do you have MS MARCO dev set results for the checkpoint you got? Do they match?
I didn't evaluate dev, I stopped at these two for compute reasons.
I see. I'll schedule a training to see how it goes on my end...
Thanks for looking into it!
One difference I observed is the lora_r parameter: it was set to 32 in the original experiment, and the default is now 8. I am checking whether this affects the TREC DL results.
Hmm, could be @MXueguang! Let me know if that fixes it!
I was also curious about potential discrepancies from the number of GPUs and the batch size. I don't know the exact command you used, but your paper said it was trained on 16 V100 GPUs. Perhaps spreading the batch across more GPUs makes it better (like @ArvinZhuang was saying).
Could also be related to the cross device negatives/group size (did that logic change in the recent version?). I unfortunately don't have access to a node with 16 GPUs to test it out on.
Yeah @orionw, I am running it, but I still need a day or so to get numbers due to limited compute. In the original run on 16 V100 GPUs, --per_device_train_batch_size was set to 2, so the batch-size setting should be equivalent here, and so should the cross-device negatives.
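A quick sanity check of that equivalence, as a sketch. It assumes the original 16-GPU run also used gradient_accumulation_steps 4 and train_group_size 16, which the thread implies but does not state; the dictionary keys and variable names are made up for this example.

```python
configs = {
    "16 x V100 (original)": {"gpus": 16, "per_device": 2},
    "4 GPUs (script in this thread)": {"gpus": 4, "per_device": 8},
}
GRAD_ACCUM = 4   # assumed identical in both runs
GROUP_SIZE = 16  # --train_group_size, assumed identical in both runs

summary = {}
for name, cfg in configs.items():
    queries_per_step = cfg["gpus"] * cfg["per_device"]
    # (effective batch size, passages available per forward step)
    summary[name] = (queries_per_step * GRAD_ACCUM, queries_per_step * GROUP_SIZE)
    print(name, summary[name])
```

Under these assumptions both setups come out to an effective batch size of 128 with 512 passages per step.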
Hi @orionw, my reproduction with lora_r=32, everything else kept the same, gives: dev mrr@10: 41.6 dl19 ndcg@10: 74.6 dl20 ndcg@10: 71.4
Dev and DL19 are a bit higher and DL20 a bit lower than in the original experiments.
Awesome, thank you so much @MXueguang! Those differences could easily be due to random seeds. Really appreciate you looking into it :)
Hello, thanks for sharing great work!
I tried training RepLLaMA myself with the repllama branch but failed to reproduce the numbers. Could you check whether any of my hyperparameters are wrong? I've added the training script below. I am currently running on 8 A100s, and the result I got is NDCG@10: 0.3959, NDCG@100: 0.4515. When I download the released model from HF I get the numbers in the paper, so I assume the issue is in the training, not the evaluation.
Thanks :)