hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Single machine, multiple GPUs (2× T4): out-of-memory during LoRA continued pre-training #5723

Closed Shame-fight closed 2 weeks ago

Shame-fight commented 2 weeks ago

System Info

```
accelerate 0.34.2 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.9 aiosignal 1.3.1 annotated-types 0.7.0 anyio 4.6.0 async-timeout 4.0.3 attrs 24.2.0 av 13.1.0 certifi 2024.8.30 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 contourpy 1.3.0 cycler 0.12.1 datasets 2.21.0 deepspeed 0.14.4 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 docstring_parser 0.16 einops 0.8.0 exceptiongroup 1.2.2 fastapi 0.115.0 ffmpy 0.4.0 filelock 3.16.1 fire 0.7.0 fonttools 4.54.1 frozenlist 1.4.1 fsspec 2024.6.1 gguf 0.10.0 gradio 4.44.1 gradio_client 1.3.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.6 httptools 0.6.1 httpx 0.27.2 huggingface-hub 0.25.1 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 interegular 0.3.3 jieba 0.42.1 Jinja2 3.1.4 jiter 0.6.1 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 lark 1.2.2 llamafactory 0.9.1.dev0 /nanshu_data/jgx/LLaMA-Factory_yingji llvmlite 0.43.0 lm-format-enforcer 0.10.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.2 mdurl 0.1.2 mistral_common 1.4.4 modelscope 1.18.1 mpmath 1.3.0 msgpack 1.1.0 msgspec 0.18.6 multidict 6.1.0 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.9.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 openai 1.51.2 orjson 3.10.7 outlines 0.0.46 packaging 24.1 pandas 2.2.3 partial-json-parser 0.2.1.1.post4 peft 0.12.0 pillow 10.4.0 pip 24.2 prometheus_client 0.21.0 prometheus-fastapi-instrumentator 7.0.0 propcache 0.2.0 protobuf 5.28.2 psutil 6.0.0 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 17.0.0 pycountry 24.6.1 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pyparsing 3.1.4
python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.12 pytz 2024.2 PyYAML 6.0.2 pyzmq 26.2.0 ray 2.37.0 referencing 0.35.1 regex 2024.9.11 requests 2.32.3 rich 13.9.2 rouge-chinese 1.0.3 rpds-py 0.20.0 ruff 0.6.9 safetensors 0.4.5 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.3 starlette 0.38.6 sympy 1.13.3 termcolor 2.5.0 tiktoken 0.7.0 tokenizers 0.20.0 tomlkit 0.12.0 torch 2.4.0 torchvision 0.19.0 tqdm 4.66.5 transformers 4.45.2 transformers-stream-generator 0.0.5 triton 3.0.0 trl 0.9.6 typer 0.12.5 typing_extensions 4.12.2 tyro 0.8.11 tzdata 2024.2 urllib3 2.2.3 uvicorn 0.31.0 uvloop 0.20.0 vllm 0.6.2 watchfiles 0.24.0 websockets 12.0 wheel 0.44.0 xformers 0.0.27.post2 xxhash 3.5.0 yarl 1.14.0 zipp 3.20.2
```

Reproduction

Run command:

```shell
FORCE_TORCHRUN=1 llamafactory-cli train /nanshu_data/jgx/LLaMA-Factory_yingji/examples/train_lora/llama3_lora_pt_ds3.yaml
```

llama3_lora_pt_ds3.yaml:

```yaml
### model
model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

Error output (warnings emitted once per rank are shown once here):

```
[2024-10-16 17:06:41,066] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
10/16/2024 17:06:45 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:26329
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-16 17:06:52,835] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 17:06:52,839] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
10/16/2024 17:06:54 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
...
[INFO|trainer.py:2243] 2024-10-16 17:07:37,245 >> Running training
[INFO|trainer.py:2244] 2024-10-16 17:07:37,246 >> Num examples = 13,878
[INFO|trainer.py:2245] 2024-10-16 17:07:37,246 >> Num Epochs = 3
[INFO|trainer.py:2246] 2024-10-16 17:07:37,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2249] 2024-10-16 17:07:37,246 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2250] 2024-10-16 17:07:37,246 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2251] 2024-10-16 17:07:37,246 >> Total optimization steps = 2,601
[INFO|trainer.py:2252] 2024-10-16 17:07:37,249 >> Number of trainable parameters = 14,823,424
  0%|          | 0/2601 [00:00<?, ?it/s]
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 19, in launch
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/tuner.py", line 48, in run_exp
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 3518, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 254.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 172.44 MiB is free. Process 27800 has 552.00 MiB memory in use. Including non-PyTorch memory, this process has 13.71 GiB memory in use. Process 28005 has 162.00 MiB memory in use. Of the allocated memory 13.22 GiB is allocated by PyTorch, and 289.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  0%|          | 0/2601 [00:06<?, ?it/s]
W1016 17:07:44.500197 139844198057792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 28005 closing signal SIGTERM
E1016 17:07:44.915546 139844198057792 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 28004) of binary: /root/anaconda3/envs/yingji_llama/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/yingji_llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-16_17:07:44
  host      : node3.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 28004)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Expected behavior

How can this be resolved?

Others

No response
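For reference, two memory-reduction routes are commonly tried with LLaMA-Factory in this situation. This is an editorial sketch, not a verified fix for this exact setup; `ds_z3_offload_config.json` is the CPU-offload ZeRO-3 preset bundled under `examples/deepspeed/` in the repository:

```yaml
### option A: keep ZeRO-3 but offload parameters/optimizer state to CPU
deepspeed: examples/deepspeed/ds_z3_offload_config.json
cutoff_len: 512   # shorter sequences also shrink activation memory

### option B: remove the deepspeed entry and train with 4-bit QLoRA instead
### (quantized models are generally not combined with ZeRO-3)
quantization_bit: 4
```

The OOM message itself also suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, but that only helps when the failure stems from allocator fragmentation rather than genuinely exhausted memory.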
hiyouga commented 2 weeks ago

see https://github.com/hiyouga/LLaMA-Factory/issues/4614

Shame-fight commented 1 week ago

But I am using two T4 GPUs (16 GB × 2) for LoRA continued pre-training of Qwen2.5-7B. According to the parameter-count/VRAM relationship described in this project, this should not run out of memory, should it? The config file is as follows:

```yaml
### model
model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

@hiyouga
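A rough back-of-envelope estimate suggests why this configuration is tight even with LoRA (assumptions: roughly 7.6B parameters for Qwen2.5-7B, 16-bit weights at 2 bytes per parameter, and ZeRO-3 sharding the frozen base weights evenly across the 2 GPUs):

```python
# Back-of-envelope VRAM estimate for ZeRO-3 LoRA training of a ~7.6B model on 2 GPUs.
params = 7.6e9          # approximate parameter count of Qwen2.5-7B (assumption)
bytes_per_param = 2     # 16-bit weights
num_gpus = 2

weights_total_gib = params * bytes_per_param / 2**30
weights_per_gpu_gib = weights_total_gib / num_gpus

print(f"total 16-bit weights: {weights_total_gib:.1f} GiB")   # ~14.2 GiB
print(f"sharded per GPU:      {weights_per_gpu_gib:.1f} GiB") # ~7.1 GiB
# On a T4 reporting 14.58 GiB capacity, that leaves well under 8 GiB per GPU for
# activations, the temporarily re-gathered full layers during forward/backward,
# LoRA optimizer state, CUDA context, and allocator fragmentation.
```

So even though the base weights are sharded, the remaining headroom on each 16 GB T4 can easily be consumed by activations at `cutoff_len: 1024`, which is consistent with the OOM occurring in the backward pass.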