microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Grad parameters are None with bigger autograd graphs #4941

Open miguelscarv opened 6 months ago

miguelscarv commented 6 months ago

I'm training a model that uses multiple LoRA adapters (3 different sets of adapters, to be precise). For each input (which in my case is an image) I pass one version of the image through the first set of LoRA parameters, another version through the second set, and the final version through the third set. I believe this consumes a lot of memory in the form of the autograd graph.
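The setup can be sketched in miniature (pure PyTorch, with made-up module and adapter names, not my actual training code): a frozen base layer with three independent LoRA (A, B) pairs, where each image view routes through its own pair and all three branches feed one loss, so a single autograd graph holds all three forward passes:

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """Toy layer: one frozen base Linear plus three independent LoRA adapter sets."""
    def __init__(self, dim=16, rank=4, n_adapters=3):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # One low-rank (A, B) pair per adapter set; only these are trainable.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, dim)) for _ in range(n_adapters)])
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(dim, rank)) for _ in range(n_adapters)])

    def forward(self, x, adapter_idx):
        a, b = self.lora_A[adapter_idx], self.lora_B[adapter_idx]
        return self.base(x) + x @ a.T @ b.T

layer = MultiLoRALinear()
views = [torch.randn(2, 16) for _ in range(3)]  # three versions of one image
# Each view goes through its own adapter set; all branches share one loss,
# so the autograd graph retains all three forward passes at once.
loss = sum(layer(v, i).sum() for i, v in enumerate(views))
loss.backward()
```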

What is happening is that when I use the 3 sets of LoRA adapters I get errors in DeepSpeed claiming that the grad parameters are None. Here is the traceback using ZeRO stage 3:

Traceback (most recent call last):
  File "/cfs/home/u021543/pheye_llavar_accelerate.py", line 68, in <module>
    accelerator.backward(loss)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1119, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1409, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1156, in reduce_independent_p_g_buckets_and_remove_grads
    self.__add_grad_to_ipg_bucket(param)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1164, in __add_grad_to_ipg_bucket
    if self.contiguous_gradients and self.elements_in_ipg_bucket + param.grad.numel() <= self.reduce_bucket_size:
                                                                   ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'numel'

and here is the traceback using ZeRO stage 2:

Traceback (most recent call last):
  File "/cfs/home/u021543/pheye_llavar_accelerate.py", line 68, in <module>
    accelerator.backward(loss)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 865, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1377, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/cfs/home/u021543/miniconda3/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 911, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor.copy_(grad_reduc.view(-1))
                          ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'view'

My problem is very similar to https://github.com/microsoft/DeepSpeed/issues/700#issue-795318541: when using ZeRO stage 1 I get no issue, but I really need to use stage 2.

System info

miguelscarv commented 5 months ago

I've come to the conclusion that this is happening because the .requires_grad attribute of my LoRA weights is being set to False somewhere, although I am not sure where, or why this only happens when I add 3 sets of LoRA adapters to my model.
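The diagnosis can be reproduced in miniature (a minimal sketch, not my actual model): a parameter whose requires_grad has been flipped to False never receives a .grad, so it stays None, which is what ZeRO's reduction code then trips over (param.grad.numel() / grad_reduc.view(-1)):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
layer.weight.requires_grad_(False)  # simulate a LoRA weight being silently frozen

layer(torch.randn(2, 4)).sum().backward()

print(layer.weight.grad)        # None: the frozen weight never gets a gradient
print(layer.bias.grad is None)  # False: the still-trainable bias does
```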

jdchang1 commented 5 months ago

@miguelscarv I am trying to attempt a similar pipeline as you where I have multiple adapters. Have you found a solution? Thanks!

miguelscarv commented 5 months ago

@jdchang1 Unfortunately I haven't; what I am doing is simply using DeepSpeed stage 0 (DDP)
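For debugging, one thing that can at least catch the freeze early is to scan the model before backward for adapter parameters that no longer require grad (a hypothetical helper; check_trainable is not a DeepSpeed or PEFT API):

```python
import torch.nn as nn

def check_trainable(model: nn.Module, expected_substr: str = "lora"):
    """Hypothetical debugging helper: report parameters whose name suggests they
    should be trainable adapters but whose requires_grad is now False."""
    frozen = [name for name, p in model.named_parameters()
              if expected_substr in name.lower() and not p.requires_grad]
    for name in frozen:
        print(f"WARNING: {name} no longer requires grad")
    return frozen

# Toy demonstration with a module standing in for a LoRA-augmented model.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.lora_A = nn.Linear(4, 4)  # stands in for an adapter
        self.base = nn.Linear(4, 4)

toy = Toy()
toy.lora_A.weight.requires_grad_(False)  # simulate the silent freeze
print(check_trainable(toy))  # ['lora_A.weight']
```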

Andrewzh112 commented 3 months ago

> @jdchang1 Unfortunately I haven't; what I am doing is simply using DeepSpeed stage 0 (DDP)

> @miguelscarv I am trying to attempt a similar pipeline as you where I have multiple adapters. Have you found a solution? Thanks!

Have you found a solution to training multiple LoRAs? I am also doing something similar. Thanks!