Closed yxk9810 closed 6 months ago
deepspeed==0.14.2, transformers==4.37.0
deepspeed --include localhost:0,1 --master_port 60000 --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir retriever-mistral \
  --dataset_path /mnt/data/data/index/imp_data/train_data.jsonl \
  --model_name_or_path /mnt/data/data//index/qwen-1.5 \
  --lora \
  --lora_target_modules q_proj \
  --save_steps 50 \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --report_to none \
  --temperature 0.01 \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing \
  --train_group_size 1 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4
AssertionError: assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
Changing the reduce bucket size (reduce_bucket_size in the DeepSpeed ZeRO config) solved the problem.
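For anyone hitting the same assertion, here is a minimal sketch of how that setting could be changed programmatically. The thread does not say which value worked, so the number below is only a placeholder, and the example dict is a hypothetical stand-in for the contents of deepspeed/ds_zero3_config.json; the helper name set_reduce_bucket_size is also made up for illustration.

```python
import copy
import json


def set_reduce_bucket_size(config: dict, size: int) -> dict:
    """Return a copy of a DeepSpeed config with
    zero_optimization.reduce_bucket_size set to `size`."""
    updated = copy.deepcopy(config)
    # Create the zero_optimization section if the config lacks one.
    zero = updated.setdefault("zero_optimization", {})
    zero["reduce_bucket_size"] = size
    return updated


# Hypothetical stand-in for deepspeed/ds_zero3_config.json (assumption:
# the real file may contain different keys and values).
example_config = {"zero_optimization": {"stage": 3, "reduce_bucket_size": "auto"}}

# 10_000_000 is a placeholder, not the value used in this thread.
patched = set_reduce_bucket_size(example_config, 10_000_000)
print(json.dumps(patched, indent=2))
```

To apply this to a real file, the config could be loaded with json.load, passed through the helper, and written back with json.dump before launching the deepspeed command; trying a few different sizes may be necessary.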