huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0
4.49k stars, 388 forks

Exception: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' #180

Closed: peterschmidt85 closed this issue 1 month ago

peterschmidt85 commented 1 month ago

Steps to reproduce:

git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
pip install -q .
pip install -q flash-attn --no-build-isolation
accelerate launch --config_file ... scripts/run_sft.py ...

Actual behavior:

Traceback (most recent call last):
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/import_utils.py", line 180, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 940, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 41, in <module>
    from ..extras.dataset_formatting import get_formatting_func_from_dataset
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/extras/dataset_formatting.py", line 7, in <module>
    from ..trainer.utils import ConstantLengthDataset
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/trainer/utils.py", line 51, in <module>
    import deepspeed
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/__init__.py", line 22, in <module>
    from . import module_inject
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/module_inject/replace_module.py", line 587, in <module>
    from ..pipe import PipelineModule
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/config.py", line 41, in <module>
    from ..elasticity import (
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import log, _get_socket_with_port
ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workflow/alignment-handbook/scripts/run_sft.py", line 43, in <module>
    from trl import SFTTrainer, setup_chat_format
  File "<frozen importlib._bootstrap>", line 1229, in _handle_fromlist
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/import_utils.py", line 171, in __getattr__
    value = getattr(module, name)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/import_utils.py", line 170, in __getattr__
    module = self._get_module(self._class_to_module[name])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/import_utils.py", line 182, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import trl.trainer.sft_trainer because of the following error (look up to see its traceback):
cannot import name 'log' from 'torch.distributed.elastic.agent.server.api' (/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py)
[2024-07-25 10:31:48,629] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-25 10:31:48,630] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/opt/conda/envs/workflow/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
W0725 10:31:49.169000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2406 closing signal SIGTERM
W0725 10:31:49.170000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2409 closing signal SIGTERM
W0725 10:31:49.170000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2410 closing signal SIGTERM
W0725 10:31:49.170000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2411 closing signal SIGTERM
W0725 10:31:49.170000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2412 closing signal SIGTERM
E0725 10:31:49.284000 133818468906816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 2407) of binary: /opt/conda/envs/workflow/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/workflow/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/run_sft.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-25_10:31:49
  host      : ip-172-31-41-87.eu-north-1.compute.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2408)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-25_10:31:49
  host      : ip-172-31-41-87.eu-north-1.compute.internal
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2413)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-25_10:31:49
  host      : ip-172-31-41-87.eu-north-1.compute.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2407)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

This issue looks similar to https://github.com/microsoft/DeepSpeed/issues/5337
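
Since the trl import only fails because `import deepspeed` fails underneath it, it can help to import each package on its own and see which one raises first. A minimal sketch, assuming the three package names from the traceback; `check_import` is a hypothetical helper, not part of any of these libraries:

```python
import importlib


def check_import(module_name):
    """Try to import a module and report (ok, error_text)."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # ImportError in the trace above, but other errors can occur at import time
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    # Package names taken from the traceback; the order follows the import chain bottom-up.
    for name in ("torch", "deepspeed", "trl"):
        ok, error = check_import(name)
        print(f"{name}: {'ok' if ok else error}")
```

If `deepspeed` fails here with the same `cannot import name 'log'` message while `torch` imports fine, the problem is in the installed torch/deepspeed combination rather than in trl or the handbook scripts.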

peterschmidt85 commented 1 month ago

I tried manually updating the deepspeed library, but it only caused other issues.

vananh0905 commented 1 month ago

The same issue is reported here: https://github.com/microsoft/DeepSpeed/issues/5337. They suggest changing 'log' to 'logger'.
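
To illustrate what that rename means for an import, here is a rough sketch of a lookup that tries the old name first and then the new one. `first_available_attr` is a hypothetical helper, and the demo uses the standard-library `json` module so it runs without torch; pointing it at the torch module from the traceback is shown only as a comment and assumes the `logger` name from the linked issue:

```python
import importlib


def first_available_attr(module_name, candidates):
    """Return (name, object) for the first candidate attribute found on the module."""
    module = importlib.import_module(module_name)
    for name in candidates:
        if hasattr(module, name):
            return name, getattr(module, name)
    raise ImportError(f"none of {candidates!r} found in {module_name!r}")


# Demo with a standard-library module: `json` has no `load_string`, but it has `loads`.
name, fn = first_available_attr("json", ["load_string", "loads"])

# With PyTorch installed, the same lookup could be applied to the failing import
# (assumption: the attribute is called `logger` in newer releases, per the linked issue):
# name, log = first_available_attr(
#     "torch.distributed.elastic.agent.server.api", ["log", "logger"]
# )
```

This only shows the idea. Editing files under site-packages by hand is fragile, so installing a deepspeed release that already handles the rename is the cleaner route.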

MaveriQ commented 1 month ago

peterschmidt85 wrote: "I tried to update manually the deepspeed library but it only caused other issues"

What issues did you run into with the manual upgrade? I upgraded to deepspeed==0.14.4 and haven't faced any issues yet (though I haven't experimented extensively).

peterschmidt85 commented 1 month ago

@MaveriQ I used another version. You're right, installing deepspeed==0.14.4 solves the issue.
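
A quick way to confirm which deepspeed version an environment actually has is to read the installed package metadata. The sketch below assumes 0.14.4 as the threshold only because that is the version reported working in this thread; whether other versions also work is not established here. `parse_version` and `installed_at_least` are hypothetical helpers:

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '0.14.4' or '0.14.4+cu118' into (0, 14, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_at_least(package, minimum):
    """Return True if `package` is installed with a version >= `minimum`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # 0.14.4 is the version reported working above, not a verified minimum.
    print("deepspeed >= 0.14.4:", installed_at_least("deepspeed", "0.14.4"))
```

The comparison is deliberately simple and ignores pre-release ordering; for anything beyond a sanity check, a dedicated version parser would be more robust.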

peterschmidt85 commented 1 month ago

@MaveriQ By the way, I now see the following issue:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workflow/alignment-handbook/scripts/run_sft.py", line 233, in <module>
[rank0]:     main()
[rank0]:   File "/workflow/alignment-handbook/scripts/run_sft.py", line 165, in main
[rank0]:     trainer = SFTTrainer(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/workflow/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
[rank0]:     return f(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/workflow/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 298, in __init__
[rank0]:     self.dataset_num_proc = args.dataset_num_proc
[rank0]:                             ^^^^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'SFTConfig' object has no attribute 'dataset_num_proc'
E0728 09:45:37.376000 134577863571264 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2588) of binary: /opt/conda/envs/workflow/bin/python3.11
Traceback (most recent call last):
  File "/opt/conda/envs/workflow/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/workflow/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/run_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-28_09:45:37
  host      : ip-172-31-26-129.us-west-2.compute.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

peterschmidt85 commented 1 month ago

I guess the issue above will be fixed by #179.
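
For readers unfamiliar with this failure mode, the `AttributeError` above is what happens when code reads an attribute that the config object it was given does not define. The sketch below reproduces that with a hypothetical stand-in class (it is not trl's `SFTConfig`, and it does not describe what #179 changes):

```python
class OldStyleConfig:
    """Hypothetical stand-in for a config object that lacks `dataset_num_proc`."""

    def __init__(self):
        self.max_seq_length = 2048  # illustrative field, not taken from the thread


config = OldStyleConfig()

# Direct attribute access raises, as in the traceback above.
try:
    config.dataset_num_proc
    failed = False
    message = ""
except AttributeError as exc:
    failed = True
    message = str(exc)

# A tolerant read falls back to a default instead of raising.
num_proc = getattr(config, "dataset_num_proc", None)

print(failed, message, num_proc)
```

In practice the usual cause is a version mismatch between the script and the installed library, so aligning the two is preferable to papering over the missing attribute.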