If other models do not show uneven GPU memory usage under this framework, the model architecture may be the cause. I suggest trying different ZeRO stages and batch sizes.
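For reference, a minimal sketch of what switching ZeRO stages could look like, assuming a DeepSpeed config written as a Python dict and dumped to JSON for the trainer (the file name and offload choice here are illustrative, not LLaMA-Factory defaults):

```python
import json

# Hypothetical ZeRO-2 config; change "stage" to 3 to shard parameters as well.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,                               # try 2 vs. 3 if memory is uneven
        "offload_optimizer": {"device": "cpu"},   # optional: trade speed for memory
    },
    "bf16": {"enabled": "auto"},
}

with open("ds_z2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```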
Looking through the issues, similar problems have been reported before: they all involve relatively new models plus a recent version of the training framework hitting OOM after training for a while. I hope this problem gets some attention.
I ran into a similar problem: LoRA fine-tuning Qwen 14B on a single A400 80G, it OOMs after a while. In theory, LoRA fine-tuning a 14B model should only need about 40G of GPU memory. Also, when serving Qwen 14B with LLaMA-Factory's webchat settings on an A40 48G, inference OOMs after running for a while as well. I suspect the cache is not being cleared in time.
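If the suspicion about uncleared cache is right, one thing that might be worth trying on the inference side is explicitly releasing allocator blocks between requests. A rough sketch under that assumption (`model` and `tokenizer` are placeholders, not LLaMA-Factory internals):

```python
import gc
import torch

def generate_and_release(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():            # no autograd graph is kept
        output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    del inputs, output_ids                  # drop tensor references
    gc.collect()
    torch.cuda.empty_cache()                # return cached blocks to the driver
    return text
```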
It looks like the Qwen series is the hardest hit.
I hit this problem as well: Mistral-7b-instruct-v0.2, LoRA SFT on 4×4090, OOM after training for a while.
82%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 7060/8660 [5:05:53<1:22:15, 3.08s/it][rank3]: Traceback (most recent call last):
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
[rank3]: launch()
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
[rank3]: run_exp()
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
[rank3]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
[rank3]: return inner_training_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank3]: tr_loss_step = self.training_step(model, inputs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
[rank3]: self.accelerator.backward(loss)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/accelerate/accelerator.py", line 2121, in backward
[rank3]: self.scaler.scale(loss).backward(**kwargs)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank3]: torch.autograd.backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/function.py", line 301, in apply
[rank3]: return user_fn(self, *args)
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank3]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU has a total capacity of 23.65 GiB of which 626.50 MiB is free. Process 432723 has 16.71 GiB memory in use. Process 503573 has 6.32 GiB memory in use. Of the allocated memory 15.51 GiB is allocated by PyTorch, and 624.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0617 15:02:54.047000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601834 closing signal SIGTERM
W0617 15:02:54.048000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601835 closing signal SIGTERM
W0617 15:02:54.049000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601836 closing signal SIGTERM
E0617 15:02:55.382000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 3 (pid: 601837) of binary: /root/autodl-tmp/miniconda3/envs/fhy/bin/python
Traceback (most recent call last):
File "/root/autodl-tmp/miniconda3/envs/fhy/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-17_15:02:54
host : autodl-container-8bd44bbf43-cc7d373e
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 601837)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
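Since the OOM message above reports 624.56 MiB reserved but unallocated, the allocator setting it recommends may be worth a try. A minimal sketch; the variable must be set before CUDA is initialized, and exporting it in the shell before `torchrun` works too, since child processes inherit it:

```python
# Opt into expandable segments to reduce allocator fragmentation,
# as suggested by the OOM message. Must be set before torch touches CUDA.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var so the allocator sees it
print(torch.cuda.is_available())
```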
You could try turning off do_eval.
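To make sure evaluation is really off, a sketch using the underlying transformers arguments (LLaMA-Factory builds on `Seq2SeqTrainingArguments`, so the same fields apply; the values here are illustrative):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",           # placeholder output directory
    do_eval=False,              # skip the evaluation loop entirely
    evaluation_strategy="no",   # never trigger evaluation during training
)
assert args.do_eval is False
```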
My do_eval is already at its default value, False.
Reminder
Reproduction
The training framework is LLaMA-Factory-0.7.0.
Expected behavior
codeqwen1.5-7B uses abnormally large amounts of GPU memory during continued pretraining, and OOMs after training for a while.
System Info
When the OOM first occurred, I was using 2 nodes with 16 GPUs.
Others
I have run many training jobs before. Normally, training a 7B model with this batch size and cutoff_len would not OOM, and nvidia-smi shows that the memory allocation across GPUs is very uneven.
I'm not yet sure whether the cause is the training framework or the model architecture; I hope someone can shed light on this.
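To help pin down whether memory really is distributed unevenly or is growing over time, something like the following could be attached to the trainer (a hedged sketch; `MemoryLogCallback` is an illustrative name, not part of LLaMA-Factory):

```python
import torch
from transformers import TrainerCallback

class MemoryLogCallback(TrainerCallback):
    """Log per-rank CUDA allocator stats every N training steps."""

    def __init__(self, every_n_steps: int = 100):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every_n_steps == 0 and torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 2**30     # GiB held by tensors
            reserved = torch.cuda.memory_reserved() / 2**30   # GiB held by allocator
            print(f"step {state.global_step}: "
                  f"allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")
```

If `reserved` keeps climbing while `allocated` stays flat, that would point at fragmentation rather than a true leak.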