cc @qgallouedec
Thanks for reporting @shijian2001. I've encountered this error too. I will provide a fix asap. Feel free to open a PR if you manage to fix it.
@qgallouedec Sorry, I haven't located the specific bug yet. After debugging, I don't think there is a problem with how the PEFT model is constructed. After the forward pass, each of my two 40 GB A100s was using about 15 GB of VRAM (30 GB total), and during the backward pass the VRAM ran out.
@shijian2001 can you double-check your command? When running it I get another error:
python examples/scripts/vsft_llava.py \
--dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
--model_name_or_path="llava-hf/llava-1.5-7b-hf" \
--per_device_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--output_dir="../logs/checkpoints/aug-vsft-llava-1.5-7b-hf" \
--gradient_checkpointing \
--remove_unused_columns=False \
--torch_dtype=float16 \
--fp16=True \
--use_peft=True \
--lora_r=64 \
--lora_alpha=16 \
--lora_target_modules="all-linear"
Traceback (most recent call last):
File "/fsx/qgallouedec/trl-2/examples/scripts/vsft_llava.py", line 206, in <module>
trainer.train()
File "/fsx/qgallouedec/trl-2/trl/trainer/sft_trainer.py", line 440, in train
output = super().train(*args, **kwargs)
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
return inner_training_loop(
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2314, in _inner_training_loop
_grad_norm = self.accelerator.clip_grad_norm_(
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2269, in clip_grad_norm_
self.unscale_gradients()
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2219, in unscale_gradients
self.scaler.unscale_(opt)
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
Related: https://github.com/huggingface/trl/issues/1785#issuecomment-2203393922
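For context on the traceback above: the command loads the weights in half precision (--torch_dtype=float16) and also turns on fp16 mixed precision (--fp16=True). With both set, the trainable parameters and their gradients are float16, and the gradient scaler refuses to unscale float16 gradients. A minimal pre-flight check for that flag combination could look like the sketch below. The helper name `has_fp16_conflict` is hypothetical and is not part of TRL or transformers; it only inspects the argument strings.

```python
def has_fp16_conflict(args: list[str]) -> bool:
    """Return True when weights are loaded in float16 AND fp16 AMP is enabled.

    Hypothetical helper for illustration only; it parses `--name=value`
    strings and does not call into TRL, transformers, or torch.
    """
    flags = {}
    for arg in args:
        if arg.startswith("--"):
            name, _, value = arg[2:].partition("=")
            # A bare flag such as `--gradient_checkpointing` counts as True.
            flags[name] = value if value else "True"
    loads_half = flags.get("torch_dtype") == "float16"
    amp_fp16 = flags.get("fp16", "False").lower() == "true"
    return loads_half and amp_fp16


failing = ["--torch_dtype=float16", "--fp16=True", "--use_peft=True"]
working = ["--torch_dtype=float16", "--use_peft=True"]
print(has_fp16_conflict(failing))
print(has_fp16_conflict(working))
```

The check only covers the one combination discussed in this thread; other precision settings are outside its scope.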
Removing --fp16=True solves the issue:
python examples/scripts/vsft_llava.py \
--dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
--model_name_or_path="llava-hf/llava-1.5-7b-hf" \
--per_device_train_batch_size=8 \
--gradient_accumulation_steps=1 \
--output_dir="../logs/checkpoints/aug-vsft-llava-1.5-7b-hf" \
--gradient_checkpointing \
--remove_unused_columns=False \
--torch_dtype=float16 \
--use_peft=True \
--lora_r=64 \
--lora_alpha=16 \
--lora_target_modules="all-linear"
It requires around 48 GB of VRAM. If you get an OOM error, try reducing the batch size.
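As a rough sanity check on these numbers, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below assumes a nominal 7 billion parameters for the 7b model (the exact count is not given in this thread) and 2 bytes per parameter for float16; the function name `weights_gb` is made up for illustration. It ignores activations, gradients, optimizer state, and CUDA overhead, which account for the rest of the usage.

```python
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal 7e9 parameters, float16 weights (2 bytes each).
fp16_weights = weights_gb(7e9, 2)
print(f"float16 weights: about {fp16_weights:.0f} GB")
```

This is in line with the roughly 15 GB per GPU reported after the forward pass. Note also that if the two GPUs are used in a plain data-parallel setup (an assumption, since the launch method is not shown here), each GPU holds its own full copy of the model, so two 40 GB cards do not behave like a single 80 GB card.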
@qgallouedec Thank you! However, when I followed your command and set per_device_train_batch_size to 1, I still got an OOM error on the 40 GB A100.
I used two 40 GB A100s to fine-tune llava-7b with LoRA. When I ran the LoRA vsft command you provided, the CUDA out of memory error still appeared, so it seems that LoRA did not take effect. My command is as follows (I only modified the dataset path):