hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: Low_Level_Zero plugin crashes with LoRA #5909

Open Fallqs opened 2 months ago

Fallqs commented 2 months ago

Is there an existing issue for this bug?

🐛 Describe the bug

Line 808 of zero/low_level/low_level_optim.py assumes that every parameter in model.parameters() is trainable. However, this is not true for LoRA tuning, which causes training to crash.

To solve this issue, you can simply add a shortcut right below the for statement, skipping non-trainable parameters:

for p in model.parameters():  # line 808
    if not p.requires_grad:
        continue
    ...
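
For context, here is a minimal, hypothetical sketch (not from the report) of why this matters for LoRA: the frozen base weights still appear in model.parameters(), so any loop that assumes every parameter is trainable will hit requires_grad=False entries.

import torch
import torch.nn as nn

# Hypothetical LoRA-style setup: freeze the base model, keep only an adapter weight trainable.
model = nn.Sequential(nn.Linear(1024, 1024))
for p in model.parameters():
    p.requires_grad_(False)                          # base weights are frozen under LoRA
model.lora_A = nn.Parameter(torch.zeros(8, 1024))    # adapter parameter stays trainable

# model.parameters() still yields the frozen base weights,
# which is exactly what the loop at line 808 iterates over.
print([n for n, p in model.named_parameters() if p.requires_grad])  # ['lora_A']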

Environment

CUDA 12.1 PyTorch 2.1.2 ColossalAI 0.4.0 [This BUG is not observed in 0.3.5]

botbw commented 2 months ago

Hey @Fallqs, thanks for reporting the bug; I will look into this. By the way, would it be possible to share the code you are using, or a minimal repro for the LoRA crash?

281LinChenjian commented 1 month ago

Sorry to bother you, but could you please describe this in more detail? I am using ColossalAI 0.3.6, and I added the following code in the corresponding position according to your implementation, but it didn't work. Is it because I put it in the wrong place? I also want to use LoRA tuning.

This is my code:

def _sync_grad(self):
    for group_id in range(self.num_param_groups):
        param_group = self._working_param_groups[group_id]
        for param in param_group:
            if param.requires_grad and param.grad is not None:
                self._add_to_bucket(param, group_id)

    for p in model.parameters():  # line 808
        if not p.requires_grad:
            continue
        self._run_reduction()

This is my issue:

rank0: Traceback (most recent call last):
rank0:   File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 427, in
rank0:   File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 331, in main
rank0:   File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 597, in step
rank0:     working_grads = self._grad_store.get_working_grads_by_group_id(group_id)
rank0:   File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/gradient_store.py", line 85, in get_working_grads_by_group_id
rank0:     for param_grads in self._grads_of_params[group_id].values():
rank0: KeyError: 0

Edenzzzz commented 1 month ago

Please share a minimal script to reproduce the error. Your code is wrong, as _run_reduction reduces grads for all bucketed parameters. As far as I can tell, non-trainable params are not added to the bucket for reduction. https://github.com/hpcaitech/ColossalAI/blob/4ec17a7cdf07db4ec4dd6b6e01ba9b88d61b4f9f/colossalai/zero/low_level/low_level_optim.py#L652

281LinChenjian commented 1 month ago

Please share a minimal script to reproduce the error. Your code is wrong, as _run_reduction reduces grads for all bucketed parameters. As far as I can tell, non-trainable params are not added to the bucket for reduction.

https://github.com/hpcaitech/ColossalAI/blob/4ec17a7cdf07db4ec4dd6b6e01ba9b88d61b4f9f/colossalai/zero/low_level/low_level_optim.py#L652

Thank you for your reply. Regarding the above issue, I found that I had added my code in the wrong location: the original poster was referring to line 808 of version 0.4.1. I now have a new question:

def update_ipt(self, model): 
    for n,p in model.named_parameters():
        if "lora_" in n:
            # if p.grad is not None:
            #     print("grad:",p.grad)
            # print(p.requires_grad)
            # if not p.requires_grad:
            #     p.requires_grad = True  # Ensure requires_grad is True for 'lora_' parameters
            # p.retain_grad()
            if n not in self.ipt:
                self.ipt[n] = torch.zeros_like(p)
                self.exp_avg_ipt[n] = torch.zeros_like(p) 
                self.exp_avg_unc[n] = torch.zeros_like(p) 
            with torch.no_grad():
                # Calculate sensitivity 
                print("p.grad:",p.grad)
                self.ipt[n] = (p * p.grad).abs().detach()
                # Update sensitivity 
                self.exp_avg_ipt[n] = self.beta1 * self.exp_avg_ipt[n] + \
                                    (1-self.beta1)*self.ipt[n]
                # Update uncertainty 
                self.exp_avg_unc[n] = self.beta2 * self.exp_avg_unc[n] + \
                                    (1-self.beta2)*(self.ipt[n]-self.exp_avg_ipt[n]).abs()

When I tried to use p.grad, an error occurred. After checking, I found that after using ColossalAI, I cannot directly access the gradient via p.grad. So the question is: how can I obtain the gradient information?

rank0: Traceback (most recent call last):
rank0:   File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 439, in
rank0:   File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 346, in main
rank0:     rankallocator.update_and_mask(model, epoch)
rank0:   File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 320, in update_and_mask
rank0:   File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 228, in update_ipt
rank0:     self.ipt[n] = (p * p.grad).abs().detach()
rank0: TypeError: unsupported operand type(s) for *: 'Parameter' and 'NoneType'

This is the related issue I found when searching: https://github.com/hpcaitech/Open-Sora/issues/283

Thank you again for your enthusiastic response.

Edenzzzz commented 1 month ago

You can get the grads by calling get_partitioned_gradients_by_param_id, as described in the issue you mentioned: https://github.com/hpcaitech/Open-Sora/issues/283#issuecomment-2185800300
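
For illustration, a minimal sketch of that call (an assumption, not an official example: it presumes a booster-boosted low-level ZeRO optimizer and that the LoRA params sit in parameter group 0), to be run after the backward pass while the grad store is still populated:

# Sketch: read the locally held gradient shards for each trainable LoRA parameter.
for n, p in model.named_parameters():
    if "lora_" in n and p.requires_grad:
        shards = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print(n, [s.shape for s in shards])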

281LinChenjian commented 1 month ago

You can get the grads this way, described in the issue you mentioned hpcaitech/Open-Sora#283 (comment)

I have read the code above before, but my code implementation does not involve the zero optimizer. Can you be more specific about how to implement it?

Edenzzzz commented 1 month ago

Does your training code involve an optimizer? That's what you're looking for

281LinChenjian commented 1 month ago

Does your training code involve an optimizer? That's what you're looking for

Sorry to bother you again; let me refine my question. The following is a minimal reproduction of my problem, though it imports several methods from Open-Sora. I used the optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) call mentioned above to try to access the parameters' gradients, but I did not get any values. The optimizer I use here is HybridAdam, and I use Booster, which is not used in the link you gave. My question is: how can I get the gradients with the code below? This is my code:

import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config
colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        self.lora_linear = nn.Linear(1024,1024)
        # self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)

    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss

model = Model().train().cuda()

optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
model, optimizer = booster.boost(model, optimizer)[:2]

global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)

booster.backward(loss, optimizer)
print("loss:",loss)  # loss: tensor(1088., device='cuda:0', dtype=torch.bfloat16, grad_fn=<SumBackward0>)
optimizer.step()

for n, p in model.named_parameters():
    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
    print("grad:", _grad) # output:     grad:[]

This is my run command: python3 -m torch.distributed.run --nproc_per_node 1 /home/yangl/LCJ_97/Open-Sora/scripts/little_check.py configs/opensora-v1-2/train/stage1.py

I also tried a piece of code that can successfully obtain the gradient, as follows:

import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
import loralib 
from loralib import RankAllocator
from loralib import compute_orth_regu 
from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config
colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        # self.embedding.requires_grad_(False)

        self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)

    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss

model = Model().cuda()
loralib.mark_only_lora_as_trainable(model)
optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
# model, optimizer = booster.boost(model, optimizer)[:2]
rankallocator = RankAllocator(
    model, lora_r=12, target_rank=8,
    init_warmup=500, final_warmup=1500, mask_interval=10, 
    total_step=3000, beta1=0.85, beta2=0.85, 
)
global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)
# booster.backward(loss, optimizer)
print("loss:",loss)
(loss+compute_orth_regu(model, regu_weight=0.1)).backward()
optimizer.step()

for n, p in model.named_parameters():
    if p.grad is not None:
        print("不为None")
    if "lora_" in n:
        print("n,p:",n,p.shape)
        # grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print("grad:",p.grad)
        # print("grad:",grad)
rankallocator.update_and_mask(model, global_step)

I don't know what the difference is between the two. I think the difference is that one uses booster.backward(loss, optimizer) and the other uses loss.backward() to propagate the gradients back. Is it that I can't get the gradient when I use Booster, or is there something wrong with my code? I plan to use LoRA to fine-tune a large model, so getting the gradients is very important to me. Can you provide some help here?

botbw commented 1 month ago

hey @281LinChenjian ,

Regarding the problem you've got:

Code snippet 1

Here, after the optimizer updates the params it clears the _grad_store, so you can no longer access the gradients; please access the gradient after optimizer.backward(loss) and before optimizer.step().

Code snippet 2

Don't call loss.backward() if you are using our optimizer.

Since there is no universal API for accessing gradients, it might be a bit tricky and confusing; do feel free to ask here or open another issue if you still have problems :)
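
To make the ordering concrete, here is a minimal sketch based on snippet 1 above (same names: booster, optimizer, model, loss, and group id 0 as used there), with the gradient read moved between backward and step:

booster.backward(loss, optimizer)   # reduces gradients and fills the optimizer's grad store

for n, p in model.named_parameters():
    # the partitioned grads are still available here, before step() clears the store
    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
    print("grad:", [g.shape for g in _grad])

optimizer.step()                    # updates params and empties the grad store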

281LinChenjian commented 1 month ago

Thank you for your generous help. I have thoroughly understood how to use this method. Thank you again for your patient answers!!!

281LinChenjian commented 1 month ago

hey @281LinChenjian ,

Regarding the problem you've got:

Code snippet 1

Here, after the optimizer updates the params it clears the _grad_store, so you can no longer access the gradients; please access the gradient after optimizer.backward(loss) and before optimizer.step().

Code snippet 2

Don't call loss.backward() if you are using our optimizer.

  • booster.backward(loss, optimizer) calls optimizer.backward(loss) and finally reaches here, where the loss.backward() is called.
  • after loss.backward() is called, grad reduction will execute to make sure your gradients are reduced and correct (so don't call loss.backward() since it doesn't do so).
  • after gradient reduction, the grads on the params (i.e. param.grad) are zeroed here, and that's why you see that all param.grad are None.

Since there is no universal API for accessing gradients, it might be a bit tricky and confusing; do feel free to ask here or open another issue if you still have problems :)

Sorry to bother you again. I found that when training with multiple GPUs, the shape obtained by _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) is smaller than p.shape. Specifically, when I train with two GPUs the gradient shape I get is half of p, and with four GPUs it is one quarter of p. I think this is related to the implementation of your distributed training framework. Is this normal? How should I solve this problem? The following is my code and error:

        for n,p in model.named_parameters():
            if "lora_" in n:
                if n not in self.ipt:
                    self.ipt[n] = torch.zeros_like(p)
                    self.exp_avg_ipt[n] = torch.zeros_like(p) 
                    self.exp_avg_unc[n] = torch.zeros_like(p) 
                with torch.no_grad():
                    # Calculate sensitivity 
                    print("n,p:",n,p.shape)
                    # print("p.grad:",optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)))
                    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) # meet some problems
                    print("_grad:",_grad[0].shape,len(_grad))
                    self.ipt[n] = (p * _grad[0].view(p.shape)).abs().detach()

This is the error when training with two GPUs: (screenshot not preserved)

This is the error when training with four GPUs: (screenshot not preserved)

Interestingly, 3456×1152 = 1990656×2, and 3456×1152 = 995328×4, so my gradient size and parameter size do not match, which is the problem I am currently facing. Apparently, the gradient I get this way is only a half or a quarter of the actual parameter count. Is it because each layer is also evenly distributed across the GPUs?

Edenzzzz commented 1 month ago

ZeRO splits gradients evenly across devices

281LinChenjian commented 1 month ago

ZeRO splits gradients evenly across devices

Is there any way to merge them back together, or to get the corresponding gradient shards from the different GPUs?

botbw commented 1 month ago

@281LinChenjian I guess you'll have to manually do torch.distributed.all_gather

For your reference
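
As a rough illustration only (this is an assumption, not a ColossalAI API: it treats the local shard as an even 1/world_size slice of the flattened gradient, matching the 2x/4x ratios observed above, and the helper name gather_full_grad is made up):

import torch
import torch.distributed as dist

def gather_full_grad(optimizer, param, group_id=0):
    # Hypothetical helper: reassemble a ZeRO-partitioned gradient into the param's full shape.
    shards = optimizer._grad_store.get_partitioned_gradients_by_param_id(group_id, id(param))
    if not shards:
        return None
    local_shard = shards[0].flatten()
    world_size = dist.get_world_size()
    buckets = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(buckets, local_shard)          # every rank receives every rank's shard
    full = torch.cat(buckets)[: param.numel()]     # drop any trailing padding
    return full.view_as(param)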