hiyouga / LLaMA-Factory

A WebUI for Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

fsdp_qlora fail #3907

Closed · etemiz closed this issue 1 month ago

etemiz commented 1 month ago

Reproduction

bash examples/extras/fsdp_qlora/single_node.sh
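
For context, single_node.sh is a thin wrapper around accelerate launch. A minimal sketch of what it invokes, assuming the repo's usual examples layout (the exact config paths, GPU selection, and any pinned dependency installs may differ between commits):

CUDA_VISIBLE_DEVICES=0,1 accelerate launch \
    --config_file examples/accelerate/fsdp_config.yaml \
    src/train.py examples/extras/fsdp_qlora/llama3_lora_sft.yaml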

After reinstalling LLaMA-Factory from the latest commits, without changing anything else, I ran the script above, which runs SFT on Llama-3-8B. It didn't work: one of the processes seems to have shut down during validation:

***** train metrics *****
  epoch                    =     2.9817
  total_flos               = 22902485GF
  train_loss               =     0.9921
  train_runtime            = 1:41:22.36
  train_samples_per_second =      0.484
  train_steps_per_second   =       0.03
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
05/26/2024 11:54:51 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
[INFO|trainer.py:3719] 2024-05-26 11:54:51,665 >> ***** Running Evaluation *****
[INFO|trainer.py:3721] 2024-05-26 11:54:51,665 >>   Num examples = 110
[INFO|trainer.py:3724] 2024-05-26 11:54:51,665 >>   Batch size = 1
 51%|█████████████████████████████████████████████████████████████████████████████████████████                                                                                      | 28/55 [00:32<00:31,  1.18s/it]
W0526 11:55:29.603000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 92950 closing signal SIGTERM
E0526 11:55:30.569000 140709944127552 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 92949) of binary: /home/dead/Desktop/ml/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
src/train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_11:55:29
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 92949)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 92949
=======================================================

Expected behavior

Training runs to completion without crashing.

System Info

GPUs: 2× RTX 3090

Others

No response

hiyouga commented 1 month ago

try disabling evaluation after training?

etemiz commented 1 month ago

how do I do that?

hiyouga commented 1 month ago

remove the eval args from the YAML config
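
In the default example config (examples/extras/fsdp_qlora/llama3_lora_sft.yaml), the block to remove should look roughly like this; the exact key names may vary between commits (newer ones use eval_strategy instead of evaluation_strategy):

### eval
val_size: 0.1
per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500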

etemiz commented 1 month ago

The same thing happened this time, during training:

{'loss': 1.0392, 'grad_norm': 1.531546711921692, 'learning_rate': 8.078577175829324e-05, 'epoch': 0.88}                                                                                                             
 32%|██████████████████████████████████████████████████████▊                                                                                                                     | 65/204 [39:24<1:23:19, 35.97s/it]
W0526 16:36:23.858000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 101092 closing signal SIGTERM
E0526 16:36:24.875000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 101091) of binary: /home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_16:36:23
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 101091)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 101091
========================================================

etemiz commented 1 month ago

I ran it again with evals disabled, and this time it ran without an issue.

[Attachment: training_loss.png]