Open Fallqs opened 2 months ago
Hey @Fallqs thanks for reporting the bug and I will look into this. Btw will it be possible to share the code you are using or a min repro for the LoRA crash?
Sorry to bother you, but could you please describe it in more detail? I am using version 0.3.6 of colossalai, and I put the following code in the corresponding position according to your implementation, but it didn't work. Is it because I put it in the wrong position? I also want to use LoRA tuning.
this is my code:
def _sync_grad(self):
    for group_id in range(self.num_param_groups):
        param_group = self._working_param_groups[group_id]
        for param in param_group:
            if param.requires_grad and param.grad is not None:
                self._add_to_bucket(param, group_id)
    for p in model.parameters():  # line 808
        if not p.requires_grad:
            continue
    self._run_reduction()
this is my issue:
rank0: Traceback (most recent call last):
rank0: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 427, in
rank0: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 331, in main
rank0: File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/low_level_optim.py", line 597, in step
rank0: working_grads = self._grad_store.get_working_grads_by_group_id(group_id)
rank0: File "/home/yangl/.local/lib/python3.9/site-packages/colossalai/zero/low_level/bookkeeping/gradient_store.py", line 85, in get_working_grads_by_group_id
rank0: for param_grads in self._grads_of_params[group_id].values():
rank0: KeyError: 0
Please share a minimum script to reproduce the error. Your code is wrong as _run_reduction reduces grads for all bucketed parameters. As far as I can tell, non-trainable params are not added to the bucket for reduction. https://github.com/hpcaitech/ColossalAI/blob/4ec17a7cdf07db4ec4dd6b6e01ba9b88d61b4f9f/colossalai/zero/low_level/low_level_optim.py#L652
Thank you for your reply. Regarding the above issue, I found that I had added my code in the wrong location: the original fix refers to line 808 in version 0.4.1. Now I have a new question:
def update_ipt(self, model):
    for n,p in model.named_parameters():
        if "lora_" in n:
            # if p.grad is not None:
            #     print("grad:",p.grad)
            # print(p.requires_grad)
            # if not p.requires_grad:
            #     p.requires_grad = True # Ensure requires_grad is True for 'lora_' parameters
            # p.retain_grad()
            if n not in self.ipt:
                self.ipt[n] = torch.zeros_like(p)
                self.exp_avg_ipt[n] = torch.zeros_like(p)
                self.exp_avg_unc[n] = torch.zeros_like(p)
            with torch.no_grad():
                # Calculate sensitivity
                print("p.grad:",p.grad)
                self.ipt[n] = (p * p.grad).abs().detach()
                # Update sensitivity
                self.exp_avg_ipt[n] = self.beta1 * self.exp_avg_ipt[n] + \
                    (1-self.beta1)*self.ipt[n]
                # Update uncertainty
                self.exp_avg_unc[n] = self.beta2 * self.exp_avg_unc[n] + \
                    (1-self.beta2)*(self.ipt[n]-self.exp_avg_ipt[n]).abs()
When I tried to use p.grad, an error occurred. After checking, I found that once ColossalAI is used, I cannot access the gradient directly through p.grad. So the question is: how can we obtain the gradient information?
rank0: Traceback (most recent call last):
rank0: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 439, in
rank0: File "/home/yangl/LCJ_97/Open-Sora/scripts/finetune_lora.py", line 346, in main
rank0: rankallocator.update_and_mask(model, epoch)
rank0: File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 320, in update_and_mask
rank0: File "/home/yangl/LCJ_97/AdaLoRA/loralib/loralib/adalora.py", line 228, in update_ipt
rank0: self.ipt[n] = (p * p.grad).abs().detach()
rank0: TypeError: unsupported operand type(s) for *: 'Parameter' and 'NoneType'
This is the website I searched for: https://github.com/hpcaitech/Open-Sora/issues/283
Thank you again for your enthusiastic response.
You can get the grads by calling get_partitioned_gradients_by_param_id, as described in the issue you mentioned: https://github.com/hpcaitech/Open-Sora/issues/283#issuecomment-2185800300
I have read the above code before, but my implementation does not involve a zero_optimizer. Could you be more specific about how to implement it?
Does your training code involve an optimizer? That's what you're looking for
Sorry to bother you again, I will refine my question. The following is a minimal reproduction of my problem. However, it involves several methods of opensora that need to be imported. I used the code optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) mentioned above to try to access the gradient of the parameters, but I did not get any value. The optimizer I used here is HybridAdam, and I used Booster which is not used in the link you gave. My question is how can I get the gradient in the case of the above code? This is my code:
import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config

colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        self.lora_linear = nn.Linear(1024,1024)
        # self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)

    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss

model = Model().train().cuda()
optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
model, optimizer = booster.boost(model, optimizer)[:2]
global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)
booster.backward(loss, optimizer)
print("loss:",loss) # loss: tensor(1088., device='cuda:0', dtype=torch.bfloat16, grad_fn=<SumBackward0>)
optimizer.step()
for n, p in model.named_parameters():
    _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
    print("grad:", _grad) # output: grad:[]
This is my run command:
python3 -m torch.distributed.run --nproc_per_node 1 /home/yangl/LCJ_97/Open-Sora/scripts/little_check.py configs/opensora-v1-2/train/stage1.py
I also tried a version that successfully obtains the gradient, as follows:
import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam
import loralib
from loralib import RankAllocator
from loralib import compute_orth_regu
from opensora.utils.train_utils import MaskGenerator, create_colossalai_plugin, update_ema
from opensora.utils.config_utils import define_experiment_workspace, parse_configs, save_training_config

colossalai.launch_from_torch({})
cfg = parse_configs(training=True)
cfg_dtype = cfg.get("dtype", "bf16")
plugin = create_colossalai_plugin(
    plugin=cfg.get("plugin", "zero2"),
    dtype=cfg_dtype,
    grad_clip=cfg.get("grad_clip", 0),
    sp_size=cfg.get("sp_size", 1),
    reduce_bucket_size_in_m=cfg.get("reduce_bucket_size_in_m", 20),
)
booster = Booster(plugin=plugin)

class Model(nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.embedding = nn.Embedding(100, 1024)
        # self.embedding.requires_grad_(False)
        self.lora_linear = loralib.SVDLinear(1024, 1024, r=12)

    def forward(self, x):
        embed = self.embedding(x)
        transform = self.lora_linear(embed)
        loss = (transform ** 2).sum()
        return loss

model = Model().cuda()
loralib.mark_only_lora_as_trainable(model)
optimizer = HybridAdam(model.parameters(), lr=5e-5, betas=(0.9, 0.999), weight_decay=0)
# model, optimizer = booster.boost(model, optimizer)[:2]
rankallocator = RankAllocator(
    model, lora_r=12, target_rank=8,
    init_warmup=500, final_warmup=1500, mask_interval=10,
    total_step=3000, beta1=0.85, beta2=0.85,
)
global_step=0
inputs = torch.tensor([1,2,3], device="cuda")
loss = model(inputs)
# booster.backward(loss, optimizer)
print("loss:",loss)
(loss+compute_orth_regu(model, regu_weight=0.1)).backward()
optimizer.step()
for n, p in model.named_parameters():
    if p.grad is not None:
        print("not None")
    if "lora_" in n:
        print("n,p:",n,p.shape)
        # grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p))
        print("grad:",p.grad)
        # print("grad:",grad)
rankallocator.update_and_mask(model, global_step)
I don't know what the difference is between the two. I think it is that one uses booster.backward(loss, optimizer) and the other uses loss.backward() to backpropagate the gradient. Is it impossible to get the gradient when I use booster, or is there something wrong with my code? I plan to use LoRA to fine-tune a large model, so getting the gradient is very important for me. Can you provide some help here?
hey @281LinChenjian,
Regarding the problem you've got:
Code snippet 1
Here, after the optimizer updates the params, it clears the _grad_store and you can no longer access the gradient. Please access the gradient after optimizer.backward(loss) and before optimizer.step().
Code snippet 2
Don't call loss.backward() if you are using our optimizer.
- booster.backward(loss, optimizer) calls optimizer.backward(loss) and finally reaches here, where loss.backward() is called.
- After loss.backward() is called, grad reduction will execute to make sure your gradients are reduced and correct (so don't call loss.backward() yourself, since it doesn't do so).
- After gradient reduction, the grads on the params (i.e. param.grad) are zeroed here, and that's why you see that all param.grad are None.
Since there is no universal API for gradient accessing, it might be a bit tricky and confusing. Do feel free to ask here or open another issue if you still have problems :)
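The ordering described in this reply (read the gradients after the backward call and before the optimizer step) can be sketched as a small helper. This is only an illustration under assumptions: `collect_partitioned_grads` and `StubGradStore` are hypothetical names introduced here, and the stub merely mimics the `get_partitioned_gradients_by_param_id(group_id, id(p))` call quoted in this thread. `_grad_store` is a private attribute, so its behaviour may differ between ColossalAI versions.

```python
class StubGradStore:
    """Hypothetical stand-in for the optimizer's gradient store (demo only)."""

    def __init__(self):
        self._grads = {}

    def add(self, group_id, param, grad):
        self._grads.setdefault(group_id, {}).setdefault(id(param), []).append(grad)

    def get_partitioned_gradients_by_param_id(self, group_id, param_id):
        # Returns an empty list when nothing is stored for this parameter.
        return self._grads.get(group_id, {}).get(param_id, [])

    def clear(self):
        self._grads.clear()


def collect_partitioned_grads(named_params, grad_store, group_id=0):
    """Return {name: list_of_grad_shards} for parameters that currently have grads."""
    grads = {}
    for name, param in named_params:
        shards = grad_store.get_partitioned_gradients_by_param_id(group_id, id(param))
        if shards:  # an empty list means nothing is stored (e.g. the store was cleared)
            grads[name] = shards
    return grads


# Intended call order in a training loop (assumed, following the reply above):
#   booster.backward(loss, optimizer)
#   grads = collect_partitioned_grads(model.named_parameters(), optimizer._grad_store)
#   optimizer.step()   # after this point the store is cleared
```

The helper hard-codes nothing beyond the default `group_id=0` used in the thread; parameters that belong to another parameter group would need that group's id instead.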
Thank you for your generous help. I have thoroughly understood how to use this method. Thank you again for your patient answers!!!
Sorry to bother you again. I found that when training with multiple graphics cards, the shape obtained by _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) is smaller than p.shape. Specifically, when I use two graphics cards for training, the gradient I get is half the size of p, and when I use four graphics cards, it is one quarter of p. I think this is related to the implementation of your distributed training framework. Is this normal? How should I solve the current problem?
The following is my code and error:
for n,p in model.named_parameters():
    if "lora_" in n:
        if n not in self.ipt:
            self.ipt[n] = torch.zeros_like(p)
            self.exp_avg_ipt[n] = torch.zeros_like(p)
            self.exp_avg_unc[n] = torch.zeros_like(p)
        with torch.no_grad():
            # Calculate sensitivity
            print("n,p:",n,p.shape)
            # print("p.grad:",optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)))
            _grad = optimizer._grad_store.get_partitioned_gradients_by_param_id(0, id(p)) # meet some problems
            print("_grad:",_grad[0].shape,len(_grad))
            self.ipt[n] = (p * _grad[0].view(p.shape)).abs().detach()
This is the error when training with two graphics cards:
This is the error when training with four graphics cards:
Interestingly, 3456×1152 = 1990656×2 and 3456×1152 = 995328×4, so my gradient size and parameter size cannot match, which is the problem I am currently facing. Apparently, the gradient I get this way holds only a half or a quarter of the parameter's elements. Is it because the implementation of each network structure is also evenly distributed across the cards?
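The arithmetic above can be checked with a few lines. This is a minimal sketch: `shard_numel` is a hypothetical helper, and the round-up (padding) for sizes that do not divide evenly is an assumption about the partitioning, not something stated in this thread.

```python
import math


def shard_numel(shape, world_size):
    """Elements held per rank if a tensor is flattened and split evenly.

    Rounds up, assuming the last shard is padded when the element count
    is not divisible by world_size.
    """
    numel = math.prod(shape)
    return -(-numel // world_size)  # ceiling division


full = math.prod((3456, 1152))      # 3981312 elements in the full parameter
two_gpus = shard_numel((3456, 1152), 2)
four_gpus = shard_numel((3456, 1152), 4)
print(full, two_gpus, four_gpus)    # 3981312 1990656 995328
```

These match the shard sizes reported for two and four graphics cards, which is consistent with each rank holding an equal slice of the flattened gradient.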
ZeRO splits gradients evenly across devices
Is there any way to integrate them together, or is there any way to get the corresponding gradients on different graphics cards?
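One possible way to think about putting the pieces back together is sketched below as a pure-Python simulation. It assumes that each rank holds one contiguous, equally sized slice of the flattened gradient in rank order, with zero padding at the tail; that layout is an assumption and should be checked against the ColossalAI version in use. `partition` and `reassemble` are hypothetical helpers. In a real multi-GPU run each rank only sees its own shard, so the shards would first have to be exchanged with a collective such as `torch.distributed.all_gather` before being concatenated and reshaped to `p.shape`.

```python
def partition(flat, world_size):
    """Split a flat list into world_size equal shards, zero-padding the tail."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def reassemble(shards, shape):
    """Concatenate shards in rank order, drop padding, and restore a 2-D shape."""
    rows, cols = shape
    flat = [x for shard in shards for x in shard][:rows * cols]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# A 2x3 "gradient" split across 4 simulated ranks (the last shard is pure padding).
grad = [float(i) for i in range(6)]
shards = partition(grad, 4)
restored = reassemble(shards, (2, 3))
```

Only after such a gather does an element-wise product like `p * grad` have matching shapes; alternatively, the computation could be restricted to the local slice of `p` on each rank.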
Is there an existing issue for this bug?
🐛 Describe the bug
Line 808 of zero/low_level/low_level_optim.py assumes that every single parameter in model.parameters() is trainable. However, this is not true when it comes to LoRA tuning, resulting in training crashes.
To solve this issue, you may just add a shortcut below this for-loop:
Environment
CUDA 12.1 PyTorch 2.1.2 ColossalAI 0.4.0 [This BUG is not observed in 0.3.5]
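The snippet that accompanied the bug description did not survive in this copy of the thread. Judging from the code quoted in the first follow-up comment, the shortcut is a guard that skips parameters with requires_grad set to False. The sketch below shows that pattern in isolation; `FakeParam` and `trainable_only` are hypothetical names used only for this illustration, and this is not the actual body of the loop in low_level_optim.py.

```python
class FakeParam:
    """Hypothetical stand-in for a torch parameter (demo only)."""

    def __init__(self, name, requires_grad):
        self.name = name
        self.requires_grad = requires_grad


def trainable_only(params):
    """Yield only parameters that take part in training."""
    for p in params:
        if not p.requires_grad:
            continue  # the shortcut: skip frozen (non-LoRA) weights
        yield p


params = [
    FakeParam("embedding.weight", requires_grad=False),  # frozen base weight
    FakeParam("lora_A", requires_grad=True),
    FakeParam("lora_B", requires_grad=True),
]
trainable_names = [p.name for p in trainable_only(params)]
```

With LoRA, most base weights are frozen, so any per-parameter bookkeeping that assumes a gradient exists has to skip them in this way.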