hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Single machine, multiple GPUs (2× T4): out-of-memory during LoRA continued pre-training #5723

Closed Shame-fight closed 2 weeks ago

Shame-fight commented 2 weeks ago

System Info

```
accelerate 0.34.2 aiofiles 23.2.1 aiohappyeyeballs 2.4.3 aiohttp 3.10.9 aiosignal 1.3.1 annotated-types 0.7.0 anyio 4.6.0 async-timeout 4.0.3 attrs 24.2.0 av 13.1.0 certifi 2024.8.30 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 contourpy 1.3.0 cycler 0.12.1 datasets 2.21.0 deepspeed 0.14.4 dill 0.3.8 diskcache 5.6.3 distro 1.9.0 docstring_parser 0.16 einops 0.8.0 exceptiongroup 1.2.2 fastapi 0.115.0 ffmpy 0.4.0 filelock 3.16.1 fire 0.7.0 fonttools 4.54.1 frozenlist 1.4.1 fsspec 2024.6.1 gguf 0.10.0 gradio 4.44.1 gradio_client 1.3.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.6 httptools 0.6.1 httpx 0.27.2 huggingface-hub 0.25.1 idna 3.10 importlib_metadata 8.5.0 importlib_resources 6.4.5 interegular 0.3.3 jieba 0.42.1 Jinja2 3.1.4 jiter 0.6.1 joblib 1.4.2 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 kiwisolver 1.4.7 lark 1.2.2 llamafactory 0.9.1.dev0 /nanshu_data/jgx/LLaMA-Factory_yingji llvmlite 0.43.0 lm-format-enforcer 0.10.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.9.2 mdurl 0.1.2 mistral_common 1.4.4 modelscope 1.18.1 mpmath 1.3.0 msgpack 1.1.0 msgspec 0.18.6 multidict 6.1.0 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 nltk 3.9.1 numba 0.60.0 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.77 nvidia-nvtx-cu12 12.1.105 openai 1.51.2 orjson 3.10.7 outlines 0.0.46 packaging 24.1 pandas 2.2.3 partial-json-parser 0.2.1.1.post4 peft 0.12.0 pillow 10.4.0 pip 24.2 prometheus_client 0.21.0 prometheus-fastapi-instrumentator 7.0.0 propcache 0.2.0 protobuf 5.28.2 psutil 6.0.0 py-cpuinfo 9.0.0 pyairports 2.1.1 pyarrow 17.0.0 pycountry 24.6.1 pydantic 2.9.2 pydantic_core 2.23.4 pydub 0.25.1 Pygments 2.18.0 pyparsing 3.1.4
python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.12 pytz 2024.2 PyYAML 6.0.2 pyzmq 26.2.0 ray 2.37.0 referencing 0.35.1 regex 2024.9.11 requests 2.32.3 rich 13.9.2 rouge-chinese 1.0.3 rpds-py 0.20.0 ruff 0.6.9 safetensors 0.4.5 scipy 1.14.1 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.3 starlette 0.38.6 sympy 1.13.3 termcolor 2.5.0 tiktoken 0.7.0 tokenizers 0.20.0 tomlkit 0.12.0 torch 2.4.0 torchvision 0.19.0 tqdm 4.66.5 transformers 4.45.2 transformers-stream-generator 0.0.5 triton 3.0.0 trl 0.9.6 typer 0.12.5 typing_extensions 4.12.2 tyro 0.8.11 tzdata 2024.2 urllib3 2.2.3 uvicorn 0.31.0 uvloop 0.20.0 vllm 0.6.2 watchfiles 0.24.0 websockets 12.0 wheel 0.44.0 xformers 0.0.27.post2 xxhash 3.5.0 yarl 1.14.0 zipp 3.20.2
```

Reproduction

Run command:

```shell
FORCE_TORCHRUN=1 llamafactory-cli train /nanshu_data/jgx/LLaMA-Factory_yingji/examples/train_lora/llama3_lora_pt_ds3.yaml
```

llama3_lora_pt_ds3.yaml:

```yaml
### model
model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

Error output (warnings emitted once per rank are shown once here):

```
[2024-10-16 17:06:41,066] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
10/16/2024 17:06:45 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:26329
W1016 17:06:47.651285 139844198057792 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-16 17:06:52,835] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-16 17:06:52,839] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
10/16/2024 17:06:54 - WARNING - llamafactory.hparams.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
...
[INFO|trainer.py:2243] 2024-10-16 17:07:37,245 >> Running training
[INFO|trainer.py:2244] 2024-10-16 17:07:37,246 >> Num examples = 13,878
[INFO|trainer.py:2245] 2024-10-16 17:07:37,246 >> Num Epochs = 3
[INFO|trainer.py:2246] 2024-10-16 17:07:37,246 >> Instantaneous batch size per device = 1
[INFO|trainer.py:2249] 2024-10-16 17:07:37,246 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2250] 2024-10-16 17:07:37,246 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2251] 2024-10-16 17:07:37,246 >> Total optimization steps = 2,601
[INFO|trainer.py:2252] 2024-10-16 17:07:37,249 >> Number of trainable parameters = 14,823,424
  0%|          | 0/2601 [00:00<?, ?it/s]
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py", line 19, in launch
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/tuner.py", line 48, in run_exp
[rank0]:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/train/pt/workflow.py", line 63, in run_pt
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2052, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 2388, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/transformers/trainer.py", line 3518, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/accelerate/accelerator.py", line 2196, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank0]:   File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 254.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 172.44 MiB is free. Process 27800 has 552.00 MiB memory in use. Including non-PyTorch memory, this process has 13.71 GiB memory in use. Process 28005 has 162.00 MiB memory in use. Of the allocated memory 13.22 GiB is allocated by PyTorch, and 289.31 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/utils/checkpoint.py:295: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
  0%|          | 0/2601 [00:06<?, ?it/s]
W1016 17:07:44.500197 139844198057792 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 28005 closing signal SIGTERM
E1016 17:07:44.915546 139844198057792 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 28004) of binary: /root/anaconda3/envs/yingji_llama/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/yingji_llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/yingji_llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/nanshu_data/jgx/LLaMA-Factory_yingji/src/llamafactory/launcher.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-16_17:07:44
  host      : node3.cluster.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 28004)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Expected behavior

How can this be resolved?

Others

No response
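For reference, two memory-reduction routes are commonly tried with LLaMA-Factory in this situation. This is an editorial sketch, not a verified fix for this exact setup; `ds_z3_offload_config.json` is the CPU-offload ZeRO-3 preset bundled under `examples/deepspeed/` in the repository:

```yaml
### option A: keep ZeRO-3 but offload parameters/optimizer state to CPU
deepspeed: examples/deepspeed/ds_z3_offload_config.json
cutoff_len: 512   # shorter sequences also shrink activation memory

### option B: remove the deepspeed entry and train with 4-bit QLoRA instead
### (quantized models are generally not combined with ZeRO-3)
quantization_bit: 4
```

The OOM message itself also suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, but that only helps when the failure stems from allocator fragmentation rather than genuinely exhausted memory.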
hiyouga commented 2 weeks ago

see https://github.com/hiyouga/LLaMA-Factory/issues/4614

Shame-fight commented 1 week ago

But I am using two T4 GPUs (16 GB × 2) for LoRA continued pre-training of Qwen2.5-7B. According to the parameter-count/VRAM relationship described in this project, this should not run out of memory, should it? The config file is as follows:

```yaml
### model
model_name_or_path: /nanshu_data/jgx/LLM_Model/Qwen/Qwen2___5-7B

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: yingji_pt
template: default
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```

@hiyouga
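A rough back-of-envelope estimate suggests why this configuration is tight even with LoRA (assumptions: roughly 7.6B parameters for Qwen2.5-7B, 16-bit weights at 2 bytes per parameter, and ZeRO-3 sharding the frozen base weights evenly across the 2 GPUs):

```python
# Back-of-envelope VRAM estimate for ZeRO-3 LoRA training of a ~7.6B model on 2 GPUs.
params = 7.6e9          # approximate parameter count of Qwen2.5-7B (assumption)
bytes_per_param = 2     # 16-bit weights
num_gpus = 2

weights_total_gib = params * bytes_per_param / 2**30
weights_per_gpu_gib = weights_total_gib / num_gpus

print(f"total 16-bit weights: {weights_total_gib:.1f} GiB")   # ~14.2 GiB
print(f"sharded per GPU:      {weights_per_gpu_gib:.1f} GiB") # ~7.1 GiB
# On a T4 reporting 14.58 GiB capacity, that leaves well under 8 GiB per GPU for
# activations, the temporarily re-gathered full layers during forward/backward,
# LoRA optimizer state, CUDA context, and allocator fragmentation.
```

So even though the base weights are sharded, the remaining headroom on each 16 GB T4 can easily be consumed by activations at `cutoff_len: 1024`, which is consistent with the OOM occurring in the backward pass.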