huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ddp_timeout in TrainingArguments with deepspeed doesn't take effect #32036

Closed Nidhogg-lyz closed 1 week ago

Nidhogg-lyz commented 1 month ago

System Info

Who can help?

@muellerzr Thanks for your contributions to the DeepSpeed integration in the Trainer! I'm new to using DeepSpeed with the Trainer, so I may have gotten something wrong here.

Information

Tasks

Reproduction

I'm training on a single machine with 4*A800 GPUs, and the command is nohup deepspeed train.py > train.log. Training gets stuck after several steps and raises a timeout error: [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3623, OpType=ALLREDUCE, NumelIn=12846593, NumelOut=12846593, Timeout(ms)=600000) ran for 600005 milliseconds before timing out. I've searched for this issue online and tried setting a higher ddp_timeout in TrainingArguments, but the Timeout(ms) in the error message doesn't change with the ddp_timeout value. Specifically, the default ddp_timeout is 1800 s, while the error message shows a timeout of 600 s. Why does this keep happening, and how can I solve it (set a higher threshold?)? Or does anyone know how to control the Timeout shown in this exact error message?

The TrainingArguments I use is:

args = TrainingArguments(output_dir = "./ckpts",
                            eval_strategy = "epoch",
                            eval_delay = 0,
                            save_strategy = "epoch",
                            save_only_model = True,
                            per_device_train_batch_size = batch_size,
                            gradient_accumulation_steps = accumulation_step,
                            per_device_eval_batch_size = batch_size,
                            fp16 = using_fp16,
                            num_train_epochs = epochs,
                            dataloader_num_workers = 4,
                            dataloader_persistent_workers = True,
                            dataloader_prefetch_factor = 2,
                            ddp_timeout = 3600,
                            deepspeed="./ds_config.json",
                            logging_strategy = "epoch"
                            )
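
For reference, ddp_timeout is documented as the timeout passed to torch.distributed.init_process_group. A rough manual equivalent of what ddp_timeout = 3600 requests is sketched below (illustrative only, assuming the documented behavior; this is not the Trainer's internal code path):

# Illustrative sketch: what ddp_timeout = 3600 is documented to request.
# Not the Trainer's internal code; the 3600 s value mirrors the setting above.
import datetime
import torch.distributed as dist

if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=3600))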

The ds_config is:

{
    "fp16": {
        "enabled": false
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "torch_adam": true,
            "adam_w_mode": true,
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Expected behavior

The Timeout in the error message should be consistent with the ddp_timeout set in TrainingArguments. The problem can appear after a random number of epochs or steps, so I don't think it's related to the model or the dataset.

Nidhogg-lyz commented 1 month ago

Update: I think I found where this bug lies. The Timeout is set to 600 s, which is the default NCCL timeout that torch.distributed falls back to when no explicit timeout is passed; that default is obtained by calling _get_default_timeout() in torch/distributed/distributed_c10d.py. To confirm, I simply modified this function to raise an exception whenever it is called:

def _get_default_timeout(backend: Backend) -> timedelta:
+    raise Exception(f"### get_timeout called by {backend}")
    # see note on nccl vs other backend timeout (constants.py)
    if backend == Backend.NCCL:
        if not isinstance(default_pg_nccl_timeout, timedelta):
            # TODO moco benchmark on CPU initializes pgnccl backend today, triggered this assert in CI before it was
            # changed to be a warning.  We should fix the moco model.
            warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
            return default_pg_timeout
        return default_pg_nccl_timeout
    else:
        return default_pg_timeout

This function should not end up determining the timeout when ddp_timeout is set, but the log shows it is called anyway:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/ossfs/workspace/creative_insight/src/train.py", line 128, in <module>
[rank1]:     trainer.train(resume_from_checkpoint = None)
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/trainer.py", line 2092, in _inner_training_loop
[rank1]:     model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank1]:                                                ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/accelerator.py", line 1284, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
[rank1]:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank1]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank1]:     engine = DeepSpeedEngine(args=args,
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
[rank1]:     self._configure_distributed_model(model)
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1145, in _configure_distributed_model
[rank1]:     self.data_parallel_group = groups._get_data_parallel_group()
[rank1]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 404, in _get_data_parallel_group
[rank1]:     return _clone_world_group()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 364, in _clone_world_group
[rank1]:     _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 185, in new_group
[rank1]:     return cdb.new_group(ranks)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 351, in new_group
[rank1]:     return torch.distributed.new_group(ranks)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank1]:     func_return = func(*args, **kwargs)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank1]:     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3901, in _new_group_with_tag
[rank1]:     timeout = _get_default_timeout(backend)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 607, in _get_default_timeout
[rank1]:     raise Exception(f"### get_timeout called by {backend}")
[rank1]: Exception: ### get_timeout called by nccl

Hope this can be helpful in debugging.
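
As a minimal sketch of what the traceback suggests (hedged, based on standard torch.distributed behavior rather than on the DeepSpeed source beyond the frames above): a timeout passed to init_process_group only applies to the default group, while new_group() called without timeout= falls back to _get_default_timeout(), which for NCCL is 10 minutes, i.e. the 600000 ms seen in the watchdog error. DeepSpeed's _clone_world_group() calls dist.new_group(ranks=range(dist.get_world_size())) with no timeout, so the cloned world group would get the NCCL default regardless of ddp_timeout:

# Hedged illustration of the fallback path shown in the traceback above;
# the 3600 s value is illustrative, and standard torch.distributed behavior is assumed.
import datetime
import torch.distributed as dist

# The default group honors the explicit timeout (what ddp_timeout is meant to control).
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=3600))

# Mirrors deepspeed.utils.groups._clone_world_group(): no timeout= is passed, so
# _new_group_with_tag() falls back to _get_default_timeout(Backend.NCCL), i.e. 10 minutes.
world_clone = dist.new_group(ranks=range(dist.get_world_size()))

If that is what is happening, raising ddp_timeout only changes the default group's timeout and would not affect the group on which the collective above times out.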

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.