texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

[BIG-REFACTOR] Adapter saving problem #108

Closed: yurinoviello closed this issue 4 months ago

yurinoviello commented 4 months ago

Hello, I am using the big-refactor branch. I am running an experiment with LoRA fine-tuning on Mistral; however, even though the training process finishes successfully, I am not able to load the resulting adapter in any way.

Error:

...
RuntimeError: Error(s) in loading state_dict for MistralModel:
    size mismatch for layers.0.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([16, 4096]).
...
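
One quick way to confirm what actually got saved is to inspect the adapter state dict in the checkpoint folder. A minimal sketch, assuming the mistral_lora output directory from the command below and the standard PEFT file names adapter_model.safetensors / adapter_model.bin (names may differ in a given run):

import os
import torch
from safetensors.torch import load_file

ckpt_dir = "mistral_lora"  # assumed: the --output_dir used for training

# PEFT checkpoints contain either adapter_model.safetensors or adapter_model.bin
st_path = os.path.join(ckpt_dir, "adapter_model.safetensors")
bin_path = os.path.join(ckpt_dir, "adapter_model.bin")
state_dict = load_file(st_path) if os.path.exists(st_path) else torch.load(bin_path, map_location="cpu")

# Under ZeRO-3 the LoRA parameters can end up saved still partitioned, i.e. with shape torch.Size([0])
for name, tensor in state_dict.items():
    if tensor.numel() == 0:
        print("empty (partitioned) parameter:", name, tuple(tensor.shape))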

Looking at different forums, I found that this can be a common problem when using DeepSpeed ZeRO-3 with LoRA. However, when I run the fine-tuning with a ZeRO-2 or ZeRO-1 config, I get the following exception instead.

AssertionError: The parameter 447 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

I am using the default configuration with a custom dataset.

deepspeed --module tevatron.retriever.driver.train \
  --deepspeed deepspeed/ds_zero3_config.json \
  --output_dir mistral_lora \
  --model_name_or_path intfloat/e5-mistral-7b-instruct \
  --lora \
  --lora_target_modules q_proj,k_proj,v_proj,o_proj,down_proj,up_proj,gate_proj \
  --save_steps 50 \
  --dataset_name yurinoviello/miracl_ita \
  --query_prefix "Instruct: Given a question, retrieve the passages that answer the question\nQuery: " \
  --passage_prefix "" \
  --bf16 \
  --pooling eos \
  --append_eos_token \
  --normalize \
  --temperature 0.01 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 2 \
  --gradient_checkpointing \
  --train_group_size 8 \
  --learning_rate 1e-4 \
  --query_max_len 64 \
  --passage_max_len 512 \
  --num_train_epochs 1 \
  --logging_steps 10 \
  --overwrite_output_dir
MXueguang commented 4 months ago

Yep, I recently tried a similar setup with Mistral.

If you remove the safetensors file in the checkpoint folder, loading should fall back to adapter_model.bin and succeed.
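
For reference, a minimal loading sketch of that workaround, using standard Transformers/PEFT APIs (the exact safetensors file name in the checkpoint may differ, and moving it aside rather than deleting it is the safer variant):

from pathlib import Path
from transformers import AutoModel
from peft import PeftModel

ckpt_dir = Path("mistral_lora")  # assumed: the training --output_dir / checkpoint folder

# Move the safetensors file aside so PEFT falls back to adapter_model.bin
st_file = ckpt_dir / "adapter_model.safetensors"
if st_file.exists():
    st_file.rename(ckpt_dir / "adapter_model.safetensors.bak")

# Load the base model, then attach the LoRA adapter from the checkpoint
base = AutoModel.from_pretrained("intfloat/e5-mistral-7b-instruct", torch_dtype="auto")
model = PeftModel.from_pretrained(base, str(ckpt_dir))
model = model.merge_and_unload()  # optional: fold the LoRA weights into the base model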

yurinoviello commented 4 months ago

Yes, you are right, removing the safetensors file is the key.

Thanks so much for the quick response.