If other models do not show uneven GPU memory usage under this framework, the model architecture may be the cause. I suggest trying different ZeRO stages and batch sizes.
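For reference, a minimal sketch of what switching ZeRO stages could look like, assuming a DeepSpeed config written as a Python dict and dumped to JSON for the trainer (the file name and offload choice here are illustrative, not LLaMA-Factory defaults):

```python
import json

# Hypothetical ZeRO-2 config; change "stage" to 3 to shard parameters as well.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,                               # try 2 vs. 3 if memory is uneven
        "offload_optimizer": {"device": "cpu"},   # optional: trade speed for memory
    },
    "bf16": {"enabled": "auto"},
}

with open("ds_z2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```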
Looking through the issues, similar problems have been reported before: they all involve relatively new models plus a recent version of the training framework hitting OOM after training for a while. I hope this problem gets some attention.
I ran into a similar problem: LoRA fine-tuning Qwen 14B on a single A400 80G, it OOMs after a while. In theory, LoRA fine-tuning a 14B model should only need about 40G of GPU memory. Also, when serving Qwen 14B with LLaMA-Factory's webchat settings on an A40 48G, inference OOMs after running for a while as well. I suspect the cache is not being cleared in time.
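If the suspicion about uncleared cache is right, one thing that might be worth trying on the inference side is explicitly releasing allocator blocks between requests. A rough sketch under that assumption (`model` and `tokenizer` are placeholders, not LLaMA-Factory internals):

```python
import gc
import torch

def generate_and_release(model, tokenizer, prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():            # no autograd graph is kept
        output_ids = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    del inputs, output_ids                  # drop tensor references
    gc.collect()
    torch.cuda.empty_cache()                # return cached blocks to the driver
    return text
```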
It looks like the Qwen series is the hardest hit.
I hit this problem as well: Mistral-7b-instruct-v0.2, LoRA SFT on 4×4090, OOM after training for a while.
82%|████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 7060/8660 [5:05:53<1:22:15, 3.08s/it][rank3]: Traceback (most recent call last):
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
[rank3]: launch()
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
[rank3]: run_exp()
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
[rank3]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]: File "/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
[rank3]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
[rank3]: return inner_training_loop(
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank3]: tr_loss_step = self.training_step(model, inputs)
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
[rank3]: self.accelerator.backward(loss)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/accelerate/accelerator.py", line 2121, in backward
[rank3]: self.scaler.scale(loss).backward(**kwargs)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/_tensor.py", line 525, in backward
[rank3]: torch.autograd.backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/function.py", line 301, in apply
[rank3]: return user_fn(self, *args)
[rank3]: ^^^^^^^^^^^^^^^^^^^^
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 320, in backward
[rank3]: torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/__init__.py", line 267, in backward
[rank3]: _engine_run_backward(
[rank3]: File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank3]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.25 GiB. GPU has a total capacity of 23.65 GiB of which 626.50 MiB is free. Process 432723 has 16.71 GiB memory in use. Process 503573 has 6.32 GiB memory in use. Of the allocated memory 15.51 GiB is allocated by PyTorch, and 624.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0617 15:02:54.047000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601834 closing signal SIGTERM
W0617 15:02:54.048000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601835 closing signal SIGTERM
W0617 15:02:54.049000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 601836 closing signal SIGTERM
E0617 15:02:55.382000 140668053652096 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 3 (pid: 601837) of binary: /root/autodl-tmp/miniconda3/envs/fhy/bin/python
Traceback (most recent call last):
File "/root/autodl-tmp/miniconda3/envs/fhy/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/miniconda3/envs/fhy/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/autodl-tmp/fhy/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-17_15:02:54
host : autodl-container-8bd44bbf43-cc7d373e
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 601837)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
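Since the OOM message above reports 624.56 MiB reserved but unallocated, the allocator setting it recommends may be worth a try. A minimal sketch; the variable must be set before CUDA is initialized, and exporting it in the shell before `torchrun` works too, since child processes inherit it:

```python
# Opt into expandable segments to reduce allocator fragmentation,
# as suggested by the OOM message. Must be set before torch touches CUDA.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var so the allocator sees it
print(torch.cuda.is_available())
```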
You could try turning off do_eval.
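To make sure evaluation is really off, a sketch using the underlying transformers arguments (LLaMA-Factory builds on `Seq2SeqTrainingArguments`, so the same fields apply; the values here are illustrative):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="out",           # placeholder output directory
    do_eval=False,              # skip the evaluation loop entirely
    evaluation_strategy="no",   # never trigger evaluation during training
)
assert args.do_eval is False
```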
My do_eval is already at its default value, False.
Reminder
Reproduction
The training framework is LLaMA-Factory-0.7.0.
Expected behavior
codeqwen1.5-7B uses abnormally large amounts of GPU memory during continued pretraining, and OOMs after training for a while.
System Info
When the OOM first occurred, I was using 2 nodes with 16 GPUs.
Others
I have run many training jobs before. Normally, training a 7B model with this batch size and cutoff_len would not OOM, and nvidia-smi shows that the memory allocation across GPUs is very uneven.
I'm not yet sure whether the cause is the training framework or the model architecture; I hope someone can shed light on this.
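To help pin down whether memory really is distributed unevenly or is growing over time, something like the following could be attached to the trainer (a hedged sketch; `MemoryLogCallback` is an illustrative name, not part of LLaMA-Factory):

```python
import torch
from transformers import TrainerCallback

class MemoryLogCallback(TrainerCallback):
    """Log per-rank CUDA allocator stats every N training steps."""

    def __init__(self, every_n_steps: int = 100):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % self.every_n_steps == 0 and torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 2**30     # GiB held by tensors
            reserved = torch.cuda.memory_reserved() / 2**30   # GiB held by allocator
            print(f"step {state.global_step}: "
                  f"allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")
```

If `reserved` keeps climbing while `allocated` stays flat, that would point at fragmentation rather than a true leak.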