huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ddp_timeout in TrainingArguments with deepspeed doesn't take effect #32036

Closed Nidhogg-lyz closed 1 week ago

Nidhogg-lyz commented 1 month ago

System Info

Who can help?

@muellerzr Thanks for your contributions to the DeepSpeed integration in the Trainer! I'm new to using DeepSpeed with the Trainer, so I may have gotten something wrong here.

Information

Tasks

Reproduction

I'm training on a single machine with 4*A800 GPUs, and the command is nohup deepspeed train.py > train.log. Training gets stuck after several steps and raises a timeout error: [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3623, OpType=ALLREDUCE, NumelIn=12846593, NumelOut=12846593, Timeout(ms)=600000) ran for 600005 milliseconds before timing out. I've searched for this issue online and tried setting a higher ddp_timeout in TrainingArguments, but the Timeout(ms) in the error message doesn't change with the ddp_timeout value. Specifically, the default ddp_timeout is 1800 s, while the error message shows a timeout of 600 s. Why does this keep happening, and how can I solve it (set a higher threshold?)? Or does anyone know how to control the Timeout shown in this exact error message?

The TrainingArguments I use is:

args = TrainingArguments(output_dir = "./ckpts",
                            eval_strategy = "epoch",
                            eval_delay = 0,
                            save_strategy = "epoch",
                            save_only_model = True,
                            per_device_train_batch_size = batch_size,
                            gradient_accumulation_steps = accumulation_step,
                            per_device_eval_batch_size = batch_size,
                            fp16 = using_fp16,
                            num_train_epochs = epochs,
                            dataloader_num_workers = 4,
                            dataloader_persistent_workers = True,
                            dataloader_prefetch_factor = 2,
                            ddp_timeout = 3600,
                            deepspeed="./ds_config.json",
                            logging_strategy = "epoch"
                            )
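
For reference, ddp_timeout is documented as the timeout passed to torch.distributed.init_process_group. A rough manual equivalent of what ddp_timeout = 3600 requests is sketched below (illustrative only, assuming the documented behavior; this is not the Trainer's internal code path):

# Illustrative sketch: what ddp_timeout = 3600 is documented to request.
# Not the Trainer's internal code; the 3600 s value mirrors the setting above.
import datetime
import torch.distributed as dist

if not dist.is_initialized():
    dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=3600))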

The ds_config is:

{
    "fp16": {
        "enabled": false
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "torch_adam": true,
            "adam_w_mode": true,
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Expected behavior

The Timeout in the error message should be consistent with the ddp_timeout set in TrainingArguments. The problem can appear after a random number of epochs or steps, so I don't think it's related to the model or the dataset.

Nidhogg-lyz commented 1 month ago

Update: I think I found where this bug lies. The Timeout is set to 600 s, which is the default NCCL timeout that torch.distributed falls back to when no explicit timeout is passed; that default is obtained by calling _get_default_timeout() in torch/distributed/distributed_c10d.py. To confirm, I simply modified this function to raise an exception whenever it is called:

def _get_default_timeout(backend: Backend) -> timedelta:
+    raise Exception(f"### get_timeout called by {backend}")
    # see note on nccl vs other backend timeout (constants.py)
    if backend == Backend.NCCL:
        if not isinstance(default_pg_nccl_timeout, timedelta):
            # TODO moco benchmark on CPU initializes pgnccl backend today, triggered this assert in CI before it was
            # changed to be a warning.  We should fix the moco model.
            warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
            return default_pg_timeout
        return default_pg_nccl_timeout
    else:
        return default_pg_timeout

This function should not end up determining the timeout when ddp_timeout is set, but the log shows it is called anyway:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/ossfs/workspace/creative_insight/src/train.py", line 128, in <module>
[rank1]:     trainer.train(resume_from_checkpoint = None)
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/trainer.py", line 2092, in _inner_training_loop
[rank1]:     model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
[rank1]:                                                ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/accelerator.py", line 1284, in prepare
[rank1]:     result = self._prepare_deepspeed(*args)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
[rank1]:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank1]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank1]:     engine = DeepSpeedEngine(args=args,
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
[rank1]:     self._configure_distributed_model(model)
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1145, in _configure_distributed_model
[rank1]:     self.data_parallel_group = groups._get_data_parallel_group()
[rank1]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 404, in _get_data_parallel_group
[rank1]:     return _clone_world_group()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/utils/groups.py", line 364, in _clone_world_group
[rank1]:     _WORLD_GROUP = dist.new_group(ranks=range(dist.get_world_size()))
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 185, in new_group
[rank1]:     return cdb.new_group(ranks)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 351, in new_group
[rank1]:     return torch.distributed.new_group(ranks)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank1]:     func_return = func(*args, **kwargs)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank1]:     return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3901, in _new_group_with_tag
[rank1]:     timeout = _get_default_timeout(backend)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/envs/llava/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 607, in _get_default_timeout
[rank1]:     raise Exception(f"### get_timeout called by {backend}")
[rank1]: Exception: ### get_timeout called by nccl

Hope this can be helpful in debugging.
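
As a minimal sketch of what the traceback suggests (hedged, based on standard torch.distributed behavior rather than on the DeepSpeed source beyond the frames above): a timeout passed to init_process_group only applies to the default group, while new_group() called without timeout= falls back to _get_default_timeout(), which for NCCL is 10 minutes, i.e. the 600000 ms seen in the watchdog error. DeepSpeed's _clone_world_group() calls dist.new_group(ranks=range(dist.get_world_size())) with no timeout, so the cloned world group would get the NCCL default regardless of ddp_timeout:

# Hedged illustration of the fallback path shown in the traceback above;
# the 3600 s value is illustrative, and standard torch.distributed behavior is assumed.
import datetime
import torch.distributed as dist

# The default group honors the explicit timeout (what ddp_timeout is meant to control).
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=3600))

# Mirrors deepspeed.utils.groups._clone_world_group(): no timeout= is passed, so
# _new_group_with_tag() falls back to _get_default_timeout(Backend.NCCL), i.e. 10 minutes.
world_clone = dist.new_group(ranks=range(dist.get_world_size()))

If that is what is happening, raising ddp_timeout only changes the default group's timeout and would not affect the group on which the collective above times out.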

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.