peterschmidt85 closed this issue 1 month ago
I tried to manually update the deepspeed library, but it only caused other issues.
The same issue is reported here: https://github.com/microsoft/DeepSpeed/issues/5337. They suggested changing 'log' to 'logger'.
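For context, a minimal sketch of what that suggestion amounts to, assuming the failing reference is an import of `log` from `deepspeed.utils` (the exact file and symbol come from the linked issue and are not verified here):

```python
# Sketch of the workaround suggested in the linked DeepSpeed issue:
# swap the removed/renamed `log` symbol for `logger`, which deepspeed.utils exposes.
#
# Before (assumed failing line):
#   from deepspeed.utils import log
#
# After:
from deepspeed.utils import logger

logger.info("deepspeed logging works via `logger` instead of `log`")
```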
> I tried to manually update the deepspeed library, but it only caused other issues.
What issues are you facing with the manual upgrade? I upgraded to deepspeed==0.14.4 and haven't run into any issues yet (though I haven't experimented extensively).
@MaveriQ I had used a different version. You're right, installing deepspeed==0.14.4 solves the issue.
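In case it helps others, a quick sanity check (just a sketch) to confirm the environment actually picked up the 0.14.4 release after the upgrade:

```python
# Verify the installed deepspeed version is at least 0.14.4,
# the version reported above to resolve the original error.
import deepspeed
from packaging.version import Version

installed = Version(deepspeed.__version__)
assert installed >= Version("0.14.4"), (
    f"deepspeed {installed} found; upgrade with: pip install -U 'deepspeed==0.14.4'"
)
print(f"deepspeed {installed} OK")
```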
@MaveriQ BTW, I now see the following issue:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workflow/alignment-handbook/scripts/run_sft.py", line 233, in <module>
[rank0]: main()
[rank0]: File "/workflow/alignment-handbook/scripts/run_sft.py", line 165, in main
[rank0]: trainer = SFTTrainer(
[rank0]: ^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/workflow/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
[rank0]: return f(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 298, in __init__
[rank0]: self.dataset_num_proc = args.dataset_num_proc
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'SFTConfig' object has no attribute 'dataset_num_proc'
E0728 09:45:37.376000 134577863571264 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2588) of binary: /opt/conda/envs/workflow/bin/python3.11
Traceback (most recent call last):
File "/opt/conda/envs/workflow/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/run_sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-28_09:45:37
host : ip-172-31-26-129.us-west-2.compute.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2588)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I guess the issue above will be fixed by #179
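For anyone hitting the same traceback: SFTTrainer reads `args.dataset_num_proc`, so the config object passed to it has to come from a trl version whose SFTConfig actually defines that field. A minimal sketch of that expectation (assuming a trl release where SFTConfig has `dataset_num_proc`; this is not the alignment-handbook code itself):

```python
# Illustration of the attribute SFTTrainer expects on its config object.
from trl import SFTConfig

cfg = SFTConfig(
    output_dir="out",
    dataset_num_proc=4,  # the field missing from the SFTConfig in the traceback above
)
print(cfg.dataset_num_proc)  # -> 4 when trl's SFTConfig defines this field
```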
This issue looks similar to https://github.com/microsoft/DeepSpeed/issues/5337