ghost opened this issue 3 years ago
Thanks @dtkatch for reporting. Could you provide the repro steps? That would help us investigate.
Switching to stage 1 instead of stage 2 resolves it.
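For anyone hitting the same thing, a minimal sketch of that workaround (the only field relevant to this issue is `"stage": 1`; the rest are placeholders to make the sketch runnable, so keep whatever your setup already uses):

```python
import torch
import deepspeed

# Toy model just to make the sketch self-contained.
model = torch.nn.Linear(8, 8)

# Placeholder config: the one change that matters here is
# "stage": 1 instead of "stage": 2 under zero_optimization.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Assumes the usual launcher (e.g. `deepspeed train.py`) has already
# set up the distributed environment.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```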
Hello @eltonzheng, I am still facing this issue. This is the traceback I'm getting:
```
Traceback (most recent call last):
  File "/cfs/home/u021543/pheye_llavar_accelerate.py", line 68, in <module>
    accelerator.backward(loss)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 865, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1377, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 911, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor.copy_(grad_reduc.view(-1))
                          ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'view'
```
Like @HUAFOR said, changing to Stage 1 solves it, but I really need Stage 2.
Providing a code example isn't easy in my case, but I can describe what I'm doing: I added three different sets of LoRA adapters to a model. The model processes images, and for each training example I run the same model with a different LoRA adapter on the same image at a different resolution. The adapter that handles the higher resolutions needs more forward passes, which makes the backward pass much more expensive, since it has to accumulate gradients across all of those passes.
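Roughly, the step looks like the sketch below. It is heavily simplified: `backbone`, `adapters`, and the pass counts are stand-ins for my actual code, and I haven't confirmed that this toy version reproduces the error on its own.

```python
import torch

# Stand-ins: in the real code, `backbone` is a frozen vision-language
# model and `adapters` are three separate LoRA adapter sets; only the
# adapters are trained.
backbone = torch.nn.Linear(16, 16)
adapters = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(3))

# One adapter per resolution; the higher-resolution adapters run more
# forward passes, so the single backward has to flow through several
# graphs that all share `backbone`.
x = torch.randn(4, 16)
total_loss = 0.0
for adapter, n_passes in zip(adapters, (1, 2, 4)):
    for _ in range(n_passes):
        out = adapter(backbone(x))
        total_loss = total_loss + out.pow(2).mean()

# In my real run, this single backward is where the AttributeError
# above is raised under ZeRO stage 2 (stage 1 is fine).
total_loss.backward()
```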
TL;DR: my setup has to store a lot of gradients per example, which is why I wanted to use ZeRO stage 2.
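One mitigation I plan to try (unverified, and it may not match how the stage-2 gradient hooks actually behave): call backward once per pass instead of once on the summed loss, so each gradient reduction only ever sees a single graph. A sketch, assuming the accelerate + DeepSpeed setup from my traceback:

```python
import torch
from accelerate import Accelerator

# Same shape as the sketch above: one shared backbone, three adapter sets.
class MultiAdapterModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(16, 16)
        self.adapters = torch.nn.ModuleList(
            torch.nn.Linear(16, 16) for _ in range(3)
        )

accelerator = Accelerator()  # assumed configured for DeepSpeed, as in my traceback
model = MultiAdapterModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(4, 16, device=accelerator.device)
pass_counts = (1, 2, 4)
num_losses = sum(pass_counts)

# Backward once per pass, so the stage-2 reduce-and-free hooks only ever
# see one graph at a time. Whether DeepSpeed treats these as ordinary
# gradient-accumulation micro-steps depends on the config (assumption).
for adapter, n_passes in zip(model.adapters, pass_counts):
    for _ in range(n_passes):
        loss = adapter(model.backbone(x)).pow(2).mean() / num_losses
        accelerator.backward(loss)

optimizer.step()
optimizer.zero_grad()
```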
I'm trying to apply DeepSpeed ZeRO stage 2 to StyleGAN2, but I get this same error.
Here's my config:
And here's the full stack trace: