Closed etemiz closed 1 month ago
Try disabling evaluation and training again?
How do I do that?
Remove the eval args in the YAML config.
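For reference, the evaluation-related keys in a typical LLaMA-Factory SFT YAML look roughly like this (exact key names and values may differ between versions; this is a sketch, not your config). Deleting these keys makes the run training-only:

```yaml
### eval section of the training YAML -- remove these keys to disable evaluation
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```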
The same thing happened this time during training:
{'loss': 1.0392, 'grad_norm': 1.531546711921692, 'learning_rate': 8.078577175829324e-05, 'epoch': 0.88}
32%|██████████████████████████████████████████████████████▊ | 65/204 [39:24<1:23:19, 35.97s/it]W0526 16:36:23.858000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 101092 closing signal SIGTERM
E0526 16:36:24.875000 140417675919424 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 101091) of binary: /home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/python
Traceback (most recent call last):
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1069, in launch_command
    multi_gpu_launcher(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dead/Desktop/ml/lf-071/LLaMA-Factory/v/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-26_16:36:23
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 101091)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 101091
========================================================
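A note on reading this report: the elastic launcher reports a worker killed by a signal as a negative exit code, so exitcode -11 means the process died from signal 11 (SIGSEGV), which matches the traceback line above. A quick stdlib check:

```python
import signal

# torchrun/elastic report a signal-killed worker as a negative exit code;
# -11 therefore means the worker received signal 11.
exitcode = -11
print(signal.Signals(-exitcode).name)  # SIGSEGV
```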
I ran it again and this time it completed without issue, with evals disabled.
Reproduction
After reinstalling LLaMA-Factory with the latest commits, without changing anything else, I ran the script above, which runs SFT on llama3-8b. It didn't work: one of the processes seemed to shut down during validation:
Expected behavior
Training completes without crashing.
System Info
GPUs: 2x RTX 3090
Others
No response