Closed: 66RomanReigns closed this issue 1 day ago
My guess was wrong; please see thehir0's reply.
my code snippet:
```python
def _broadcast_to_vllm(self, model: DeepSpeedEngine):
    # avoid OOM
    torch.cuda.empty_cache()
    model = model.module
    count, num_params = 0, len(list(model.named_parameters()))
    for name, param in model.named_parameters():
        count += 1  # empty_cache at last param
        # Fire all vllm engines for broadcast
        if torch.distributed.get_rank() == 0:
            shape = param.shape if self.accelerator.deepspeed_plugin.zero_stage != 3 else param.ds_shape
            refs = [
                engine.update_weight.remote(name, dtype=param.dtype, shape=shape, empty_cache=count == num_params)
                for engine in self.vllm_engines
            ]
        # For ZeRO-3, allgather sharded parameter and broadcast to all vllm engines by rank 0
        with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
            if torch.distributed.get_rank() == 0:
                torch.distributed.broadcast(param.data, 0, group=self._model_update_group)
                ray.get(refs)
```
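For context, a rough sketch of what the receiving side of that broadcast could look like (hypothetical: the real `update_weight` lives in the vLLM engine wrapper, and the `load_weights` call and `_model_update_group` attribute here are assumptions, not code from this repo):

```python
import torch

# Hypothetical receiver matching the update_weight.remote(...) call above.
# Rank 0 broadcasts each fully-gathered parameter, so the engine allocates a
# buffer of the advertised shape/dtype and receives into it on the same group.
def update_weight(self, name, dtype, shape, empty_cache=False):
    weight = torch.empty(shape, dtype=dtype, device="cuda")
    torch.distributed.broadcast(weight, 0, group=self._model_update_group)
    self.model.load_weights(weights=[(name, weight)])  # however your engine consumes (name, tensor) pairs
    del weight
    if empty_cache:
        torch.cuda.empty_cache()
```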
With deepspeed version 0.16.0 I get the same error at `deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3)`.
With deepspeed version 0.15.4:
```
_broadcast_to_vllm
    with deepspeed.zero.GatheredParameters([param], enabled=self.accelerator.deepspeed_plugin.zero_stage == 3):
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
    free_param(param)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2}, 'ds_tensor.shape': torch.Size([34062336])}
```
Everything works if grad_accum = 1; if grad_accum > 1, these errors occur.
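A quick sanity check on the numbers in that `ds_summary` (illustrative only, using the values copied from the AssertionError above): the non-empty `'active_sub_modules': {2}` entry is exactly what fails `assert not param.ds_active_sub_modules` in `free_param()`.

```python
# Arithmetic on the ds_summary printed by the failing assertion above.
full_numel = 152064 * 3584          # 544_997_376, matches 'numel' and 'ds_numel'
shard_numel = 34_062_336            # 'ds_tensor.shape' -> the per-rank ZeRO-3 shard
print(full_numel // shard_numel)    # 16 -> the parameter is partitioned across 16 ranks
```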
Same problem... My training configuration hasn't changed; it worked yesterday, but today it doesn't 🚬 Has this been resolved?
Using deepspeed==0.15.4 solves the problem.
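If you pin the version, a quick startup check helps catch an accidental upgrade later (just a sanity assertion consistent with the reports in this thread, nothing framework-specific):

```python
import deepspeed

# The thread reports the assertion with deepspeed 0.16.0 but not 0.15.4,
# so fail fast if a newer version sneaks back in via a transitive upgrade.
assert deepspeed.__version__ == "0.15.4", f"unexpected deepspeed version: {deepspeed.__version__}"
```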
I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4
> Using deepspeed==0.15.4 solves the problem.

It works.
> I faced the same error with deepspeed==0.16.0, but it seems to be fine with deepspeed==0.15.4

Thank you, this is very helpful.
Same issue in ZeRO-3 training; it is likely related to https://github.com/microsoft/DeepSpeed/pull/6675
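If anyone wants to see which parameter is still marked active before the `GatheredParameters` exit tries to re-partition it, here is a minimal debugging sketch (not a fix; it only reads the `ds_active_sub_modules` / `ds_summary()` attributes that appear in the traceback above):

```python
# Print every ZeRO-3 parameter whose active-submodule bookkeeping is non-empty.
# A non-empty set is exactly what trips `assert not param.ds_active_sub_modules`
# in free_param() when GatheredParameters.__exit__ re-partitions the parameter.
def dump_active_params(model):
    for name, param in model.named_parameters():
        if getattr(param, "ds_active_sub_modules", None):
            print(name, param.ds_summary())
```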
I encountered an issue while using DeepSpeed with ZeRO Stage 3 optimization. I received the following error: no_sync is not compatible with ZeRO Stage 3. I’m not sure how to resolve this conflict.
If anyone has experience with this or knows how to resolve it, could you please guide me? Thank you in advance!
```
[rank0]:   File "/root/miniconda3/envs/llama/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1997, in no_sync
[rank0]:     assert not self.zero_optimization_partition_gradients(), \
[rank0]: AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
  0%|          | 0/168 [00:00<?, ?it/s]
W1126 23:28:07.821000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 402434 closing signal SIGTERM
E1126 23:28:11.641000 402381 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 402435) of binary: /root/miniconda3/envs/llama/bin/python
```
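For what it's worth, that assertion fires when something (typically a gradient-accumulation wrapper) calls `engine.no_sync()` on a ZeRO stage 3 engine, which DeepSpeed forbids because stage 3 partitions gradients. If you control the accumulation loop, one way to sidestep it is sketched below; the helper name is mine, not a DeepSpeed API.

```python
import contextlib

def maybe_no_sync(engine, zero_stage):
    """Use engine.no_sync() only where it is allowed; under ZeRO stage 3
    DeepSpeed asserts that no_sync is incompatible with gradient partitioning,
    so fall back to a no-op context and let DeepSpeed handle accumulation."""
    if zero_stage == 3:
        return contextlib.nullcontext()
    return engine.no_sync()

# Sketch of usage inside a gradient-accumulation micro-step:
# with maybe_no_sync(model_engine, zero_stage=3):
#     loss = model_engine(batch)
#     model_engine.backward(loss)
```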