meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of candidate inference solutions such as HF TGI and vLLM for local or cloud deployment. Demo apps showcase Meta Llama for WhatsApp & Messenger.

prefix-tuning RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). #226

Closed hhh12hhh closed 1 month ago

hhh12hhh commented 1 year ago

System Info

PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.27.0
Libc version: glibc-2.31

Python version: 3.9.17 (main, Jul 5 2023, 20:41:20) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-71-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 495.29.05

Versions of relevant libraries:
mypy-extensions==1.0.0
numpy==1.23.5
torch==2.0.1
torchdata==0.6.1
torchtext==0.15.2
torchvision==0.15.2

Information

🐛 Describe the bug

I encountered the above error while fine-tuning the model with prefix tuning. Here is my fine-tuning script:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 examples/finetuning.py \
    --use_peft \
    --peft_method prefix \
    --model_name ../model/llama-2-7b-chat-hf \
    --use_fp16 \
    --output_dir ./output \
    --dataset alpaca_dataset \
    --data_path ./data.json \
    --batch_size_training 16 \
    --num_epochs 3 \
    --quantization
```
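
For context, the `--use_peft --peft_method prefix` flags correspond, roughly, to wrapping the base model with a PEFT prefix-tuning adapter. The sketch below only illustrates that setup with the Hugging Face `peft` API, not the exact llama-recipes internals; the `num_virtual_tokens` value is an assumed placeholder.

```python
# Illustrative sketch (not llama-recipes' actual code) of a prefix-tuning setup with PEFT.
import torch
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "../model/llama-2-7b-chat-hf",      # same local path as in the command above
    torch_dtype=torch.float16,
)

# num_virtual_tokens is an assumption; llama-recipes supplies its own default config.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,
)

model = get_peft_model(base_model, prefix_config)
model.print_trainable_parameters()  # only the prefix parameters should be trainable
```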

Error logs

```
Traceback (most recent call last):
  File "/home/zxy/llama2/llama2-lora-fine-tuning/llama-recipes-main/examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/llama_recipes/finetuning.py", line 237, in main
    results = train(
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/llama_recipes/utils/train_utils.py", line 84, in train
    scaler.scale(loss).backward()
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/anaconda3/envs/llama2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
```
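
For reference, this is the generic PyTorch error raised whenever a backward pass runs over a graph whose saved intermediate tensors were already freed by an earlier backward. The toy snippet below is unrelated to llama-recipes and only reproduces the same message, to show what the autograd engine is complaining about:

```python
# Minimal reproduction of the same RuntimeError in plain PyTorch.
import torch

x = torch.randn(4, requires_grad=True)
loss = (x * x).sum()

loss.backward()  # first backward frees the graph's saved tensors
loss.backward()  # RuntimeError: Trying to backward through the graph a second time ...
```

In the traceback above, the failing backward originates inside torch.utils.checkpoint, which suggests an interaction between prefix tuning and activation checkpointing rather than an explicit second .backward() call in user code.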

Expected behavior

I would like to know whether I wrote something wrong or whether there is another cause, and how this can be solved.

JunoLiusj commented 5 months ago

I encountered the same problem! When the fine-tuning method is switched to p-tuning or others, this problem does not occur. Is there anything wrong with peft.PrefixTuningConfig?

HamidShojanazeri commented 5 months ago

@JunoLiusj if you are using it with FSDP, unfortunately it's not supported; see https://github.com/meta-llama/llama-recipes/pull/482.
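
If prefix tuning is not specifically required, one commonly suggested alternative is to switch to a PEFT method the recipes do support with FSDP, such as LoRA (`--peft_method lora`). The sketch below is a hedged illustration of that configuration with the `peft` API; the rank and target-module choices are illustrative assumptions, not necessarily the repository's defaults.

```python
# Hedged sketch of a LoRA setup as an alternative to prefix tuning; hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained(
    "../model/llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```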