microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.67k stars 4.15k forks

AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3 #6793

Closed 66RomanReigns closed 1 day ago

66RomanReigns commented 3 days ago

I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync is not compatible with ZeRO Stage 3. I’m not sure how to resolve this conflict.

If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!

[rank0]:   File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync
[rank0]:     assert not self.zero_optimization_partition_gradients(), \
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
  0%|          | 0/168 [00:00<?, ?it/s]
W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM
E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python
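For context, here is a minimal sketch of a gradient-accumulation loop that avoids triggering this assertion. It assumes a plain engine created with deepspeed.initialize(); micro_batches and loss_fn are hypothetical placeholders. engine.no_sync() and engine.zero_optimization_partition_gradients() are the methods visible in the traceback above:

    import contextlib

    def train_accumulated(engine, micro_batches, loss_fn):
        # Gradient-accumulation loop that only enters no_sync() when
        # gradients are NOT partitioned (i.e. not ZeRO stage 3).
        for step, (inputs, labels) in enumerate(micro_batches):
            last_micro_batch = step == len(micro_batches) - 1
            if not last_micro_batch and not engine.zero_optimization_partition_gradients():
                # Safe: gradients are not partitioned, so skipping the
                # all-reduce between micro-batches is allowed.
                sync_ctx = engine.no_sync()
            else:
                # ZeRO stage 3 (or the accumulation boundary): skip no_sync()
                # and let DeepSpeed's gradient_accumulation_steps handle it.
                sync_ctx = contextlib.nullcontext()
            with sync_ctx:
                loss = loss_fn(engine(inputs), labels)
                engine.backward(loss)
                engine.step()

In stacks where no_sync() is called by a wrapper rather than by user code, the version pin mentioned further down in this thread is the more practical workaround.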

WeiluXu commented 3 days ago

My guess was wrong; please see thehir0's reply.

thehir0 commented 3 days ago

my code snippet:

    def _broadcast_to_vllm(self, model: DeepSpeedEngine):
        # avoid OOM
        torch.cuda.empty_cache()
        model = model.module
        count, num_params = 0, len(list(model.named_parameters()))
        for name, param in model.named_parameters():
            count += 1  # empty_cache at last param

            # Fire all vllm engines for broadcast
            if torch.distributed.get_rank() == 0:
                shape = param.shape if self.accelerator.deepspeed_plugin.zero_stage != 3 else param.ds_shape
                refs = [
                    engine.update_weight.remote(name, dtype=param.dtype, shape=shape, empty_cache=count == num_params)
                    for engine in self.vllm_engines
                ]

            # For ZeRO-3, allgather sharded parameter and broadcast to all vllm engines by rank 0
            with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
                if torch.distributed.get_rank() == 0:
                    torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
                    ray.get(refs)

With deepspeed version 0.16.0 I get the same error at: deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3)

With deepspeed version 0.15.4:

_broadcast_to_vllm
    with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
    free_param(param)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2}, 'ds_tensor.shape': torch.Size([34062336])}

Everything works if grad_accum = 1; if grad_accum > 1, these errors occur.
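For reference, a minimal sketch of where that setting lives in a plain deepspeed.initialize() call; the model, learning rate, and batch sizes here are hypothetical placeholders (in an Accelerate setup the same values come from the DeepSpeed plugin / config file instead):

    import torch
    import deepspeed

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 4,  # grad_accum > 1 is what surfaces the errors above
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    }

    model = torch.nn.Linear(8, 8)  # placeholder model
    engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)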

hynnn commented 3 days ago

Same problem... My training configuration hasn't changed; it worked yesterday, but today it doesn't 🚬 Has this been resolved?

LaoWangGB commented 3 days ago

Using deepspeed==0.15.4 solves the problem.

yejoon-lee commented 3 days ago

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

yuanyangeli commented 3 days ago

Using deepspeed==0.15.4 solves the problem.

It works.

Luxanna-Real commented 2 days ago

I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

Thank you, this is very helpful.

inkcherry commented 2 days ago

Same issue in ZeRO-3 training; it is likely related to this PR: https://github.com/microsoft/DeepSpeed/pull/6675