But I am running LoRA incremental pre-training (continued pre-training) of Qwen2.5-7B on two T4 GPUs (16 GB × 2). Going by the relationship between parameter count and GPU memory documented in this project, this should not hit an out-of-memory error, should it? (A rough estimate of where the memory goes is sketched after the config below.) The config file is as follows:

model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

@hiyouga
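For a rough sense of where the memory goes in this setup, here is a back-of-envelope sketch (my own approximation: the ~7.6B parameter count, the byte-per-value figures, and the per-GPU split are assumptions, not numbers taken from the project's memory table):

```python
# Rough, hand-wavy memory estimate for LoRA + DeepSpeed ZeRO-3 on 2 x T4 (16 GB each).
# Every number here is an assumption/approximation, not a measured value.

GIB = 1024 ** 3

num_params      = 7.6e9   # Qwen2.5-7B total parameters (approximate)
num_gpus        = 2
bytes_per_param = 2       # frozen base weights kept in bf16

# ZeRO-3 shards the frozen base weights across the two GPUs.
weights_per_gpu = num_params * bytes_per_param / num_gpus / GIB   # ~7.1 GiB

# LoRA adds only ~14.8M trainable parameters (the trainer log below reports 14,823,424);
# bf16 weights + bf16 grads + fp32 master weights + Adam moments are tiny in comparison.
lora_params  = 14.8e6
lora_per_gpu = lora_params * (2 + 2 + 4 + 4 + 4) / num_gpus / GIB  # ~0.1 GiB

print(f"sharded bf16 base weights per GPU: ~{weights_per_gpu:.1f} GiB")
print(f"LoRA adapter training states per GPU: ~{lora_per_gpu:.2f} GiB")
```

Even with the base weights sharded, whatever remains of a card that reports 14.58 GiB total still has to hold ZeRO-3 all-gather buffers, activations for 1024-token sequences, the CUDA context, and fragmentation, which is consistent with the 13.22 GiB that PyTorch had allocated when the OOM below was raised.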
Reminder
System Info
accelerate 0.34.2 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.9 aiosignal 1.3.1 annotated-types 0.7.0 anyio 4.6.0 async-timeout 4.0.3 attrs 24.2.0 av 13.1.0 certifi 2024.8.30 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 contourpy 1.3.0 cycler 0.12.1 datasets 2.21.0 deepspeed 0.14.4 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 docstring_parser 0.16 einops 0.8.0 exceptiongroup 1.2.2 fastapi 0.115.0 ffmpy 0.4.0 filelock 3.16.1 fire 0.7.0 fonttools 4.54.1 frozenlist 1.4.1 fsspec 2024.6.1 gguf 0.10.0 gradio 4.44.1 gradio_client 1.3.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.6 httptools 0.6.1 httpx 0.27.2 huggingface-hub 0.25.1 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 interegular 0.3.3 jieba 0.42.1 Jinja2 3.1.4 jiter 0.6.1 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 lark 1.2.2 llamafactory 0.9.1.dev0 /nanshu_data/jgx/LLaMA-Factory_yingji llvmlite 0.43.0 lm-format-enforcer 0.10.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.2 mdurl 0.1.2 mistral_common 1.4.4 modelscope 1.18.1 mpmath 1.3.0 msgpack 1.1.0 msgspec 0.18.6 multidict 6.1.0 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.9.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 openai 1.51.2 orjson 3.10.7 outlines 0.0.46 packaging 24.1 pandas 2.2.3 partial-json-parser 0.2.1.1.post4 peft 0.12.0 pillow 10.4.0 pip 24.2 prometheus_client 0.21.0 prometheus-fastapi-instrumentator 7.0.0 propcache 0.2.0 protobuf 5.28.2 psutil 6.0.0 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 17.0.0 pycountry 24.6.1 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pyparsing 3.1.4 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.12 pytz 2024.2 PyYAML 6.0.2 pyzmq 26.2.0 ray 2.37.0 referencing 0.35.1 regex 2024.9.11 requests 2.32.3 rich 13.9.2 rouge-chinese 1.0.3 rpds-py 0.20.0 ruff 0.6.9 safetensors 0.4.5 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.3 starlette 0.38.6 sympy 1.13.3 termcolor 2.5.0 tiktoken 0.7.0 tokenizers 0.20.0 tomlkit 0.12.0 torch 2.4.0 torchvision 0.19.0 tqdm 4.66.5 transformers 4.45.2 transformers-stream-generator 0.0.5 triton 3.0.0 trl 0.9.6 typer 0.12.5 typing_extensions 4.12.2 tyro 0.8.11 tzdata 2024.2 urllib3 2.2.3 uvicorn 0.31.0 uvloop 0.20.0 vllm 0.6.2 watchfiles 0.24.0 websockets 12.0 wheel 0.44.0 xformers 0.0.27.post2 xxhash 3.5.0 yarl 1.14.0 zipp 3.20.2
Reproduction
Run command: FORCE_TORCHRUN=1 llamafactory-cli train /nanshu_data/jgx/LLaMA-Factory_yingji/examples/train_lora/llama3_lora_pt_ds3.yaml
llama3_lora_pt_ds3.yaml:
### model
model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
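The examples/deepspeed/ds_z3_config.json referenced above is not reproduced here; as a rough illustration (my sketch, not the exact file shipped with the repo), a ZeRO-3 configuration of the kind DeepSpeed accepts looks like the following, with optional CPU-offload entries left commented out:

```python
# Sketch of a ZeRO-3 DeepSpeed config (an approximation, not the repo's exact
# examples/deepspeed/ds_z3_config.json). "auto" values are filled in by the
# HF Trainer from the YAML arguments above.
import json

ds_z3_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": True,
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
        # Optional: offload parameters / optimizer state to CPU RAM to relieve
        # GPU memory at the cost of speed (the "offload" variant of ZeRO-3).
        # "offload_param": {"device": "cpu", "pin_memory": True},
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

with open("ds_z3_config.json", "w") as f:
    json.dump(ds_z3_config, f, indent=2)
```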
Error output:

[2024-10-16 17:06:41,066] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
10/16/2024 17:06:45 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:26329
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779]
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779] *****************************************
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779] *****************************************
[2024-10-16 17:06:52,835] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 17:06:52,839] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
10/16/2024 17:06:54 - WARNING - llamafactory.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
...
[INFO|trainer.py:2243] 2024-10-16 17:07:37,245 >> ***** Running training *****
[INFO|trainer.py:2244] 2024-10-16 17:07:37,246 >>   Num examples = 13,878
[INFO|trainer.py:2245] 2024-10-16 17:07:37,246 >>   Num Epochs = 3
[INFO|trainer.py:2246] 2024-10-16 17:07:37,246 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2249] 2024-10-16 17:07:37,246 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2250] 2024-10-16 17:07:37,246 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2251] 2024-10-16 17:07:37,246 >>   Total optimization steps = 2,601
[INFO|trainer.py:2252] 2024-10-16 17:07:37,249 >>   Number of trainable parameters = 14,823,424
  0%|          | 0/2601 [00:00<?, ?it/s]
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
rank0: Traceback (most recent call last):
rank0:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 23, in <module>
rank0:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 19, in launch
rank0:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/tuner.py", line 48, in run_exp
rank0:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
rank0:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
rank0:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
rank0:     return inner_training_loop(
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
rank0:     tr_loss_step = self.training_step(model, inputs)
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 3518, in training_step
rank0:     self.accelerator.backward(loss, **kwargs)
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
rank0:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
rank0:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
rank0: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 254.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 172.44 MiB is free. Process 27800 has 552.00 MiB memory in use. Including non-PyTorch memory, this process has 13.71 GiB memory in use. Process 28005 has 162.00 MiB memory in use. Of the allocated memory 13.22 GiB is allocated by PyTorch, and 289.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  0%|          | 0/2601 [00:06<?, ?it/s]
W1016 17:07:44.500197 139844198057792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 28005 closing signal SIGTERM
E1016 17:07:44.915546 139844198057792 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 28004) of binary: /root/anaconda3/envs/yingji_llama/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/yingji_llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py FAILED
Failures:
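The OOM message above itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal way to retry with that hint (a sketch that reuses the same launch command as above; it only reduces fragmentation, it does not create more memory):

```python
# Sketch: relaunch the same training command with the allocator hint from the
# OOM message set in the environment. Assumes llamafactory-cli is on PATH.
import os
import subprocess

env = dict(os.environ)
env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
env["FORCE_TORCHRUN"] = "1"

subprocess.run(
    [
        "llamafactory-cli", "train",
        "/nanshu_data/jgx/LLaMA-Factory_yingji/examples/train_lora/llama3_lora_pt_ds3.yaml",
    ],
    env=env,
    check=True,
)
```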