texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

unable to train #131

Open riyajatar37003 opened 2 weeks ago

riyajatar37003 commented 2 weeks ago

These are the steps I followed to set up:

```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
git clone https://github.com/texttron/tevatron.git
cd tevatron
git checkout tevatron-v1   # also tried: git checkout main
pip install transformers datasets peft
pip install deepspeed accelerate
pip install faiss-cpu
pip install -e .
```
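
A quick diagnostic sketch (assuming `pip install -e .` installed whichever branch is currently checked out) is to list tevatron's top-level submodules and see whether the install exposes `driver` (the tevatron-v1 layout) or `retriever` (the main-branch layout):

```python
import pkgutil

import tevatron

# Prints the top-level submodules of the installed tevatron package,
# e.g. a list containing 'driver' for tevatron-v1 or 'retriever' for main.
print([m.name for m in pkgutil.iter_modules(tevatron.__path__)])
```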

Then I run the following command to train:

```bash
python -m torch.distributed.run --nproc_per_node=1 -m tevatron.driver.train \
  --output_dir retriever-mistral \
  --model_name_or_path "/Mixtral-7b-instruct" \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name Tevatron/msmarco-passage-aug \
  --query_prefix "Query: " \
  --passage_prefix "Passage: " \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --fp16 \
  --temperature 0.01 \
  --per_device_train_batch_size 4 \
  --gradient_checkpointing \
  --train_group_size 16 \
  --learning_rate 1e-4 \
  --query_max_len 32 \
  --passage_max_len 156 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir \
  --gradient_accumulation_steps 4
```

I always get this error:

```
/opt/conda/bin/python: Error while finding module specification for 'tevatron.driver.train' (ModuleNotFoundError: No module named 'tevatron.driver')
```
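
The module path differs between the two checkouts: `tevatron.driver.train` exists on the tevatron-v1 branch, while on the main branch the package was restructured and the trainer lives (as far as I can tell) under `tevatron.retriever`. A hedged sketch of the same invocation against the main-branch layout:

```bash
# Sketch assuming the main-branch (v2) module layout; all other
# flags are unchanged from the command above.
python -m torch.distributed.run --nproc_per_node=1 -m tevatron.retriever.driver.train \
  --output_dir retriever-mistral  # ...remaining flags as above
```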

riyajatar37003 commented 2 weeks ago

```
[rank0]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank0]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--lora', '--lora_target_modules', 'q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj', '--query_prefix', 'Query: ', '--passage_prefix', 'Passage: ', '--pooling', 'eos', '--append_eos_token', '--temperature', '0.01', '--train_group_size', '16', '--query_max_len', '32', '--passage_max_len', '156']
```
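
This `ValueError` is how `HfArgumentParser` reports flags that none of its declared argument dataclasses define, and the rejected flags (`--lora`, `--pooling`, `--train_group_size`, ...) are exactly the main-branch options, which suggests the command is hitting the v1 argument set. A minimal reproduction sketch (`TrainArgs` is a made-up stand-in, not Tevatron's real dataclass):

```python
from dataclasses import dataclass

from transformers import HfArgumentParser


@dataclass
class TrainArgs:
    output_dir: str = "retriever-mistral"


parser = HfArgumentParser(TrainArgs)
# "--lora" is not a field of TrainArgs, so parsing raises:
# ValueError: Some specified arguments are not used by the
# HfArgumentParser: ['--lora']
(train_args,) = parser.parse_args_into_dataclasses(
    args=["--output_dir", "retriever-mistral", "--lora"]
)
```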

riyajatar37003 commented 2 weeks ago

```
DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations. Parameter at index 447 with name encoder.base_model.model.layers.31.mlp.down_proj.lora_B.default.weight has been marked as ready twice.
```
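
The "marked as ready twice" failure is a known interaction between DDP and reentrant activation checkpointing: the checkpointed backward pass fires the gradient hook of the same LoRA parameter more than once, which DDP's reducer rejects. Rather than reaching for `_set_static_graph()`, one possible workaround (assuming a transformers version new enough to expose `gradient_checkpointing_kwargs`, roughly 4.35+) is to select PyTorch's non-reentrant checkpoint implementation:

```python
from transformers import TrainingArguments

# Workaround sketch: non-reentrant checkpointing lets each parameter's
# gradient hook fire exactly once per backward pass, so DDP no longer
# sees lora_B weights "marked as ready twice".
training_args = TrainingArguments(
    output_dir="retriever-mistral",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```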