hiyouga / LLaMA-Factory

Unify Efficient Fine-Tuning of 100+ LLMs
Apache License 2.0

CUDA out of memory | QLORA | Llama 3 70B | 4 * NVIDIA A10G 24 Gb #4559

Closed · russellorv closed this 2 days ago

russellorv commented 2 days ago

System Info

Setup: 4 × NVIDIA A10G GPUs (24 GB each, 96 GB total)

llamafactory-cli train examples/train_qlora/llama3_lora_sft_gptq.yaml

### model
model_name_or_path: TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ
quantization_bit: 4

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

flash_attn: fa2
low_cpu_mem_usage: true

### dataset
dataset: identity
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/qlora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: false
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Reproduction

OUTPUT

warnings.warn(
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ec2-user/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/home/ec2-user/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/home/ec2-user/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]:   File "/home/ec2-user/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 49, in run_sft
[rank1]:     model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
[rank1]:   File "/home/ec2-user/LLaMA-Factory/src/llamafactory/model/loader.py", line 152, in load_model
[rank1]:     model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
[rank1]:   File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
[rank1]:     return model_class.from_pretrained(
[rank1]:   File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
[rank1]:     ) = cls._load_pretrained_model(
[rank1]:   File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
[rank1]:     new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
[rank1]:   File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
[rank1]:     set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
[rank1]:   File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 400, in set_module_tensor_to_device
[rank1]:     new_value = value.to(device)
[rank1]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU  has a total capacity of 21.99 GiB of which 57.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.54 GiB is allocated by PyTorch, and 5.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[ranks 0, 2, and 3 abort with the same torch.cuda.OutOfMemoryError traceback as rank 1]
E0626 11:39:55.640000 139979115779904 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 19815) of binary: /home/ec2-user/anaconda3/envs/lf4/bin/python3.10
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/lf4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/lf4/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/ec2-user/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Expected behavior

What hardware is needed to fine-tune a 70B GPTQ model with QLoRA?

Others

No response

hiyouga commented 2 days ago

Fine-tuning a 70B GPTQ model requires at least 48 GB of VRAM per GPU. You should use FSDP + QLoRA with a non-GPTQ model instead.
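
For reference, here is a minimal sketch of that route, modeled on the fsdp_qlora example configs shipped with LLaMA-Factory (the non-GPTQ 70B checkpoint name is an assumption, the remaining sections are carried over from the config above, and exact keys/paths may differ by version):

### model
model_name_or_path: meta-llama/Meta-Llama-3-70B-Instruct  # original (non-GPTQ) checkpoint, assumed
quantization_bit: 4  # bitsandbytes 4-bit quantization for QLoRA

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset / output / train / eval sections as in the config above

Under FSDP the quantized weights are sharded across the ranks instead of being replicated on every GPU, so the relevant budget becomes closer to the 96 GB total rather than the 24 GB per card.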

russellorv commented 2 days ago

@hiyouga Thanks for the very quick reply! So even if I increase the number of GPUs to 8 × A10G 24 GB, I still won't solve the problem?

Is there a way to create a QLoRA adapter for the model I'm using, TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ?

hiyouga commented 2 days ago

Fine-tuning a 70B GPTQ model requires at least 48 GB of VRAM per GPU.
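
Rough arithmetic behind that per-GPU figure (my estimate, not stated in the thread): 70B parameters at 4 bits per weight is roughly 70e9 × 0.5 bytes ≈ 35 GB of weights. A GPTQ checkpoint is not sharded across GPUs in this setup, so under the default multi-GPU (DDP-style) launch every rank tries to load the full ~35 GB onto its own card, on top of activations, LoRA parameters, optimizer state and the CUDA context. That blows past 24 GB during loading, which is exactly the OOM in the traceback, but fits in 48 GB; adding more 24 GB cards only adds more full replicas.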

russellorv commented 2 days ago

@hiyouga Sorry for asking again, but I need to figure this out: does that mean at least 48 GB of total VRAM (in my case I have 96 GB total), or 48 GB of VRAM on a single GPU?

hiyouga commented 2 days ago

48 GB on a single GPU.
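
For completeness, a sketch of the launch command for the FSDP + QLoRA route suggested above, assuming a standard LLaMA-Factory checkout with the stock examples/accelerate/fsdp_config.yaml (the SFT config filename here is hypothetical and would contain a setup like the one sketched earlier):

accelerate launch --config_file examples/accelerate/fsdp_config.yaml src/train.py path/to/llama3_70b_fsdp_qlora_sft.yaml

This replaces the torchrun-based llamafactory-cli launch seen in the traceback, so that Accelerate can apply the FSDP sharding settings from the config file.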