Open uditsharma7 opened 3 months ago
With smaller context length I got
W0904 15:27:23.932000 23280343537216 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 261407 closing signal SIGTERM
W0904 15:27:23.932000 23280343537216 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 261408 closing signal SIGTERM
W0904 15:27:23.932000 23280343537216 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 261410 closing signal SIGTERM
E0904 15:27:34.716000 23280343537216 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 2 (pid: 261409) of binary: /dccstor/udit/env/easy_context/bin/python
Traceback (most recent call last):
  File "/udit/env/easy_context/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/udit/env/easy_context/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/udit/env/easy_context/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
    deepspeed_launcher(args)
  File "/udit/env/easy_context/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
    distrib_run.run(args)
  File "/dccstor/udit/env/easy_context/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/udit/env/easy_context/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/udit/env/easy_context/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
EasyContext/train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-04_15:27:23
  host      : cccxc614.pok.ibm.com
  rank      : 2 (local_rank: 2)
  exitcode  : -11 (pid: 261409)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 261409
========================================================
I am facing this issue while using zigzag_ring_attn with 128k context length. Has anyone run into the same problem?
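
For anyone reading the log above: the `exitcode: -11` reported for rank 2 follows the POSIX convention where a negative exit code means the worker was killed by that signal number, which matches the `Signal 11 (SIGSEGV)` line. A minimal sketch of decoding such a code, assuming a Linux host; `describe_exitcode` is a hypothetical helper name for illustration, not part of torch or accelerate:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Hypothetical helper: turn a worker exit code into a readable description."""
    if exitcode < 0:
        # A negative exit code -N means the process was terminated by signal N.
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


# The exit code reported for rank 2 in the log above.
print(describe_exitcode(-11))
```

This only names the signal; it does not say where the segmentation fault happened. The crash itself occurred inside the worker process (pid 261409), not in the launcher frames shown in the traceback.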