huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

LoRA seems to have no effect when using vsft_llava.py #1786

Closed: shijian2001 closed this issue 1 month ago

shijian2001 commented 3 months ago

I used two 40 GB A100s to fine-tune llava-1.5-7b with LoRA. When I ran the LoRA VSFT command you provided, I still got a CUDA out-of-memory error, so it seems that LoRA is not taking effect. My command is as follows (I only modified the dataset path):


python examples/scripts/vsft_llava.py \
    --dataset_name="../subset/aug_llava_instruct_mix_vsft" \    
    --model_name_or_path="llava-hf/llava-1.5-7b-hf" \
    --report_to="wandb" \
    --learning_rate=1.4e-5 \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --output_dir="../logs/checkpoints/aug-vsft-llava-1.5-7b-hf" \
    --logging_steps=5 \
    --num_train_epochs=1 \
    --push_to_hub \
    --gradient_checkpointing \
    --remove_unused_columns=False \
    --torch_dtype=float16 \
    --fp16=True \
    --use_peft=True \
    --lora_r=64 \
    --lora_alpha=16 \
    --lora_target_modules="all-linear"
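
A quick way to check whether LoRA actually took effect is to build the PEFT model outside the script and print the trainable-parameter count. This is a minimal sketch using the same r/alpha/target settings as the command above (it assumes recent transformers and peft releases; it is not code from vsft_llava.py):

# Minimal sketch, not from vsft_llava.py; assumes transformers and peft are installed.
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",  # same LoRA settings as the command above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# With LoRA active, only a few percent of the ~7B parameters should be trainable.
model.print_trainable_parameters()

If this reports that only a small fraction of parameters is trainable, the LoRA wrapping itself is fine, and the OOM is more likely coming from activations and optimizer state than from full-model gradients.
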
kashif commented 3 months ago

cc @qgallouedec

qgallouedec commented 3 months ago

Thanks for reporting @shijian2001. I've encountered this error too. I will provide a fix asap. Feel free to open a PR if you manage to fix it.

shijian2001 commented 3 months ago

@qgallouedec Sorry, I haven't located the specific bug yet. From my debugging, the PEFT model seems to be constructed correctly. After the forward pass, each of my two 40 GB A100s used about 15 GB of VRAM (about 30 GB in total), and the backward pass then ran out of memory.
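
A minimal sketch of how that per-pass VRAM check can be done with torch.cuda (hypothetical snippet, not from the script; model and batch stand in for the PEFT-wrapped LLaVA model and one collated training batch already on the GPU, with labels included so the loss is populated):

# Hypothetical snippet for measuring per-pass VRAM; `model` and `batch` are placeholders.
import torch

torch.cuda.reset_peak_memory_stats()
outputs = model(**batch)                       # forward pass (batch must include labels)
fwd_peak = torch.cuda.max_memory_allocated() / 1024**3
outputs.loss.backward()                        # backward pass
bwd_peak = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak after forward:  {fwd_peak:.1f} GiB")
print(f"peak after backward: {bwd_peak:.1f} GiB")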

qgallouedec commented 2 months ago

@shijian2001 can you double-check your command? When running it I get another error:

python examples/scripts/vsft_llava.py \
    --dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
    --model_name_or_path="llava-hf/llava-1.5-7b-hf" \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --output_dir="../logs/checkpoints/aug-vsft-llava-1.5-7b-hf" \
    --gradient_checkpointing \
    --remove_unused_columns=False \
    --torch_dtype=float16 \
    --fp16=True \
    --use_peft=True \
    --lora_r=64 \
    --lora_alpha=16 \
    --lora_target_modules="all-linear"
Traceback (most recent call last):
  File "/fsx/qgallouedec/trl-2/examples/scripts/vsft_llava.py", line 206, in <module>
    trainer.train()
  File "/fsx/qgallouedec/trl-2/trl/trainer/sft_trainer.py", line 440, in train
    output = super().train(*args, **kwargs)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/transformers/trainer.py", line 2314, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2269, in clip_grad_norm_
    self.unscale_gradients()
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/accelerate/accelerator.py", line 2219, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")

Related: https://github.com/huggingface/trl/issues/1785#issuecomment-2203393922

Removing --fp16=True solves the issue:

python examples/scripts/vsft_llava.py \
    --dataset_name="HuggingFaceH4/llava-instruct-mix-vsft" \
    --model_name_or_path="llava-hf/llava-1.5-7b-hf" \
    --per_device_train_batch_size=8 \
    --gradient_accumulation_steps=1 \
    --output_dir="../logs/checkpoints/aug-vsft-llava-1.5-7b-hf" \
    --gradient_checkpointing \
    --remove_unused_columns=False \
    --torch_dtype=float16 \
    --use_peft=True \
    --lora_r=64 \
    --lora_alpha=16 \
    --lora_target_modules="all-linear"

It requires around 48 GB of VRAM. If you get an OOM error, try reducing the batch size.
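
For context on the traceback above: the ValueError is raised by PyTorch's GradScaler, which refuses to unscale gradients stored in float16. Passing --fp16=True puts a GradScaler in front of the optimizer, and with torch_dtype=float16 the trainable parameters' gradients here are evidently fp16, hence the error. A standalone sketch that reproduces the same ValueError with plain PyTorch (requires a CUDA GPU; unrelated to TRL):

# Standalone reproduction of "Attempting to unscale FP16 gradients."; plain PyTorch, not TRL code.
import torch

# Trainable parameters stored in float16, as with a model loaded via torch_dtype=float16.
model = torch.nn.Linear(8, 8, device="cuda", dtype=torch.float16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).float().sum()
scaler.scale(loss).backward()      # gradients are float16 because the weights are
scaler.unscale_(optimizer)         # raises ValueError: Attempting to unscale FP16 gradients.

Dropping --fp16=True removes the GradScaler from the training loop, which matches the fix above.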

shijian2001 commented 2 months ago

@qgallouedec Thank you! However, when I followed your command and set per_device_train_batch_size to 1, I still got an OOM error on the 40 GB A100s.