Thanks for reaching out. I have taken a quick look. It seems that the lines below (which look a bit unnecessary to me)
x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1)
change the construction of the torch trace graph, in particular the type of the stem vertices, from linear or gemm to matmul. But matmul is not yet included in the supported operators, which caused the trouble.
See the dependency graph below with the plain input to the linear layers, fake_input = torch.randn((1, 1024)):
def forward(self, x):
    return self.fc(x)

fake_input = torch.randn((1, 1024))
oto = OTO(model=model, dummy_input=fake_input)
versus the dependency graph with the example's extra preprocessing operators:
def forward(self, x):
    x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
    x = x.squeeze(-1)
    return self.fc(x)

fake_input = torch.randn((1, 512, 2, 81))
oto = OTO(model=model, dummy_input=fake_input)
Please see my comments below on how to use OTO more effectively.
import torch
import torch.nn as nn
from only_train_once import OTO

class DemoNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(1024, 512),
            nn.Linear(512, 256)
        )

    def forward(self, x):
        return self.fc(x)

if __name__ == "__main__":
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 1024))
    print(f"{model(fake_input).shape}")

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()
Leverage the dependency graph visualization, which usually reveals the root cause of potential failures promptly. See oto.visualize_zigs(view=False) above, which generates a $model_name.pdf. Then check whether the operators displayed in the dependency graph are supported by OTOv2.
Hope the above helps. Meanwhile, we are working on the next generation of the library and will keep adding tutorials and documentation. Thanks for using our tool! Feel free to leave any other feedback.
@tianyic Thanks. It seems that conv1d is exactly an alternative choice. By the way, I'm wondering whether we can configure OTO with a black list, so that unsupported operators are automatically ignored and kept intact during pruning. Also, I think it would be useful for deployment on edge devices such as NPUs to add some functionality that rounds the number of pruned channels to an expected multiple (32, 16, or 8, for example); a rough sketch of the idea follows after the code below.
class DemoNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)

if __name__ == "__main__":
    model = DemoNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    print(f"{model(fake_input).shape}")

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()
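As a rough arithmetic illustration of the rounding idea (this is not an OTO API; round_to_multiple is a hypothetical helper):
def round_to_multiple(num_channels: int, divisor: int = 8) -> int:
    # Snap the kept-channel count down to a multiple of divisor,
    # but never below one full group of divisor channels.
    return max(divisor, (num_channels // divisor) * divisor)

print(round_to_multiple(57, 8))   # 56
print(round_to_multiple(5, 16))   # 16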
@tianyic How can I configure the parameters of oto.dhspg when using the AdamW optimizer?
Glad that you found alternative operators to make the library work. The black list is a good idea; we will consider it as our bandwidth allows.
An official tutorial on applications with Adam and AdamW will be provided in about 2-3 weeks. As a hotfix for your question, please try the optimizer setting below.
optimizer = oto.dhspg(
    variant="adamw",
    lr=1e-3,                   # set the same as the baseline training
    weight_decay=1e-2,         # set the same as the baseline training
    first_momentum=0.9,        # set the same as the baseline training
    second_momentum=0.999,     # set the same as the baseline training
    dampening=0.0,             # set the same as the baseline training
    target_group_sparsity=0.8, # choose based on how much you want to compress
    start_pruning_steps=X * len(trainloader),  # start pruning after X epochs; starting at about 1/5 of the total epochs is typically fine
    lmbda=1e-2,                # larger values promote group sparsity more effectively
    lmbda_amplify=20,          # larger values promote group sparsity more effectively
    hat_lmbda_coeff=1e3,       # larger values promote group sparsity more effectively
    epsilon=0.0                # enlarge it if group sparsity does not meet target_group_sparsity
)
@tianyic Thanks. When I execute step 2 of the pipeline, will the group sparsity learned in step 1 be reset from scratch? 1. OTO training -> save model & optimizer checkpoint -> stop training; 2. load checkpoint -> resume OTO training -> oto.compress.
Also, I'm wondering whether I can export the pruned ONNX model through this pipeline: 1. OTO training -> save model & optimizer checkpoint -> stop training; 2. load checkpoint -> oto.compress.
Before reaching start_pruning_steps, what are the differences between using the oto.dhspg optimizer and the original torch AdamW optimizer? How does start_pruning_steps affect the accuracy of the pruned model?
Which parameter of oto.dhspg is dominant for the accuracy of the pruned model?
Both pipelines are supported. However, for the first pipeline, to preserve the learned group sparsity you need to set fixed_zero_groups=True in the dhspg optimizer and then resume the OTO training.
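A minimal sketch of that first pipeline (assuming the dhspg arguments shown elsewhere in this thread; the checkpoint path and the training loops are placeholders):
# Stage 1: OTO training, then save a checkpoint and stop.
oto = OTO(model=model, dummy_input=fake_input)
optimizer = oto.dhspg(variant="adamw", lr=1e-3, target_group_sparsity=0.7)
# ... train for some epochs ...
torch.save(model.state_dict(), "checkpoint.pt")

# Stage 2: reload the weights, rebuild OTO, and resume with the zero groups kept fixed.
model.load_state_dict(torch.load("checkpoint.pt"))
oto = OTO(model=model, dummy_input=fake_input)
optimizer = oto.dhspg(variant="adamw", lr=1e-3, target_group_sparsity=0.7, fixed_zero_groups=True)
# ... resume training ...
oto.compress()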
One more trick, in case you run into it: during pruning, while the group sparsity is increasing, the loss may regress a bit depending on the application. If so, don't worry; once the group sparsity reaches the target, the loss will decrease again until ultimate convergence.
That is a good question about start_pruning_steps; we will come up with a detailed explanation of DHSPG, maybe a video tutorial.
In short, DHSPG is a hybrid optimizer. It applies the baseline optimizer to all variables before pruning starts, and to the variables considered potentially important during pruning. For the variables considered possibly redundant, a so-called Half-Space step is applied to project them onto zero. Once the group sparsity reaches the target, the optimizer behaves like the baseline optimizer until ultimate convergence.
The ultimate accuracy typically depends on (1) what the baseline model can achieve, (2) whether enough warm-up steps are given, and (3) whether sufficiently many steps are given after reaching the target group sparsity.
More documentation and tutorials with detailed instructions will be provided.
@tianyic Thanks. Looking forward to the tutorials. It seems that the DHSPG optimizer is slower than the torch AdamW optimizer. Could you please give some advice for speeding it up?
A good question.
The DHSPG optimizer is a hybrid optimizer, which indeed has some computational overhead during pruning (while the group sparsity is increasing). The overhead varies with the model and dataset. For the majority of models it is negligible, but not for all (in the worst case I have met, it doubled the cost). Note, however, that the overhead is temporary and disappears once the group sparsity reaches the target value (afterwards DHSPG performs the same as the baseline optimizer).
Therefore, to speed things up, I would suggest shrinking the pruning stage, i.e., making the group sparsity reach the target value faster, which can typically be achieved by tuning the hyperparameters related to group-sparsity exploration. In fact, in most of my experiments the pruning stage could be shrunk to just a few epochs, which largely mitigates the overhead. Meanwhile, there may be engineering tricks in the official torch version that could be leveraged to further speed up DHSPG.
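As a small, assumed timing sketch around the training step (criterion and trainloader are placeholders; compute_group_sparsity_omega is used the same way as elsewhere in this thread), to see how much the pruning stage actually costs:
import time

for step, (inputs, targets) in enumerate(trainloader):
    start = time.perf_counter()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        group_sparsity, omega = optimizer.compute_group_sparsity_omega()
        print(f"step {step}: {time.perf_counter() - start:.3f}s/step, group_sparsity={group_sparsity:.3f}")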
Hope the above helps.
@tianyic Hi, when I apply OTO to the c2f module used in YOLOv8, it fails with an error about the slice and concat operations. How can I solve this problem?
Traceback (most recent call last):
  File "test_oto_c2f.py", line 117, in <module>
    oto = OTO(model=model, dummy_input=fake_input)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 17, in __init__
    self.partition_zigs()
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/__init__.py", line 28, in partition_zigs
    self._graph = automated_partition_zigs(self._graph)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/zig/zig.py", line 125, in automated_partition_zigs
    graph.set_zigs(opt)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 417, in set_zigs
    dfs_helper(self, auxilary_cc, auxilary_cc.dependent_stem_ccs)
  File "/root/miniconda3/lib/python3.8/site-packages/only_train_once/graph/graph.py", line 410, in dfs_helper
    node_in = graph.nodes[node_in_id]
KeyError: 'out-28'
[debug] concat_node.inputs = ['out-28', 'out-29', 'out-35']
[debug] graph.nodes = dict_keys(['out-25', 'out-26', 'out-27', 'out-28-29', 'out-30', 'out-31', 'out-32', 'out-33', 'out-34', 'out-35', 'out-36', 'out-37', 'out-38', 'out-39'])
from typing import Callable
import torch
import torch.nn as nn
from functools import partial
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

if __name__ == "__main__":
    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()
Thanks for the above example @songkq. I will take a look during the weekdays and provide guidance later.
Thanks for the example @songkq. I have taken a quick look. We will support the slice operator better in a future release.
As a hotfix, please see the alternative below that avoids slice, where I decompose the conv whose output was sliced into two separate convs.
import torch
import torch.nn as nn
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv(c1, self.c, 1, 1)
        self.cv3 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        y = [self.cv1(x), self.cv2(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv3(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class C2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

if __name__ == "__main__":
    model = C2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 4, 80))
    print(f"{model(fake_input).shape}")

    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print("Output difference:")
    print(full_output - compress_output)
The full and compressed models yield the same outputs. Hope the above helps.
@tianyic Thanks. I will try it out.
I met another problem: group_sparsity, omega = optimizer.compute_group_sparsity_omega() returns a group_sparsity that always stays at zero during training, even after reaching the configured start_pruning_steps. I set up the oto.dhspg optimizer as follows. I'm confused about why OTO didn't take effect.
target_group_sparsity: 0.1
start_pruning_steps: 1000
hat_lmbda_coeff: 10.0
lmbda: 0.001
lmbda_amplify: 2.0
optimizer = oto.dhspg(
    variant="adamw",
    lr=1e-3,
    weight_decay=1e-2,
    first_momentum=0.9,
    second_momentum=0.999,
    dampening=0.0,
    target_group_sparsity=0.1,
    start_pruning_steps=1000,
    lmbda=1e-3,
    lmbda_amplify=2.0,
    hat_lmbda_coeff=10,
    epsilon=0.95
)
A good question @songkq . It is largely due to the settings of hyperparameter. Adamw
and sgd
they typically requires different settings for lambda (group sparsity exploration) related due to the different gradient estimation mechanisms. Please take a try as the below. We will cover it in the coming tutorials.
Meanwhile, we have ongoing plan to further optimize and simplify the hyperparameter lists to bring more convenience for the users including ourselves (since we are actively applying OTO onto a lot of DNN application-track research and products).
optimizer = oto.dhspg(
    variant="adamw",
    lr=1e-3,
    weight_decay=1e-2,
    first_momentum=0.9,
    second_momentum=0.999,
    dampening=0.0,
    target_group_sparsity=0.1,
    start_pruning_steps=1000,
    lmbda=1e-2,          # larger values promote group sparsity more effectively
    lmbda_amplify=20,    # larger values promote group sparsity more effectively
    hat_lmbda_coeff=1e3, # larger values promote group sparsity more effectively
    epsilon=0.95         # larger values promote group sparsity more effectively
)
I have updated the repo to auto-select hyperparameters for the different variants. You can now just set the optimizer up as
optimizer = oto.dhspg(
    variant="adamw",
    lr=1e-3,
    target_group_sparsity=0.1,
    start_pruning_steps=1000,
)
which should work for the majority of experiments.
@tianyic Good job. Thanks!
@tianyic Hi, I have attempted to prune my network with a target_group_sparsity of 0.1/0.35/0.5. However, I found that only the last two layers, self.conv1d, were pruned, while the cnn_backbone failed to be pruned. I'm confused about why OTO cannot prune the network globally.
class DemoNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.cnn_backbone = ...
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):
        x = self.cnn_backbone(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)
@songkq Could you please share the dependency graph with me? I can then take a quick look.
You are right that OTO aims to prune the whole network globally. In my gut feeling, your issue can typically be resolved via minor adjustments either to the network architecture or to the operator list.
In case the dependency graph is confidential, you could send it via email to Tianyi.Chen@microsoft.com.
Meanwhile, I would recommend running a sanity check before engaging in DHSPG training @songkq. The sanity check randomly sets a set of ZIGs to zero and then generates a compressed model. If the compressed model looks normal and returns exactly the same output as the full model given the same random input, the sanity check passes. Afterwards, DHSPG is triggered to train and identify the redundant groups from the view of optimization rather than by random selection.
oto.random_set_zero_groups() # Randomly set a subset of ZIGs to be zero.
oto.compress()
import onnxruntime as ort
full_ort_sess = ort.InferenceSession(oto.full_model_path)
compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
print("Output difference:")
print(full_output - compress_output) # Should be merely as zeros.
@songkq Please take a look at this newly raised issue, which I suspect might be a similar situation to yours. If so, please let me know whether your onnx version is also 1.14. Thanks.
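In case it helps, a quick way to print the relevant versions (assuming onnx and onnxruntime are installed, as in the snippets above):
import torch, onnx, onnxruntime
print(torch.__version__, onnx.__version__, onnxruntime.__version__)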
@tianyic Thanks.
I have done the sanity check. It shows that only the last two layers were pruned by oto.random_set_zero_groups() and oto.compress(). The maximum difference between full_output and compress_output is about 4.4703484e-08. I'm wondering whether the reshape and transpose operations cause the problem.
x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)
testcase:
import torch
import torch.nn as nn
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}
        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

class DemoC2fNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):
        x = self.c2f(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)

if __name__ == "__main__":
    model = DemoC2fNet()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    oto = OTO(model=model, dummy_input=fake_input)
    # oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()
    exit()
My envs:
torch == 1.8.1
onnx == 1.10.1
Thanks for sharing @songkq. I will take a look this week; I am quite occupied early this week.
@songkq Thanks for the example. I took a quick look at it. There are some tensor alignment issues due to discrepancies between dependency versions. For your case, could you please try setting bias to True? We will make more rigorous improvements so that the tensor alignment is more robust against varying dependency versions. For more reliable use of OTO, I suggest setting bias=True for layers, and, for normalization layers such as BN, setting affine=True.
self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
Then run the sanity check again:
import torch
import torch.nn as nn
from only_train_once import OTO

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p

class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    default_act = nn.LeakyReLU(inplace=True, negative_slope=0.1)  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=True)
        self.bn = nn.BatchNorm2d(c2)
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))

class Bottleneck(nn.Module):
    # Standard bottleneck
    def __init__(self, c1, c2, shortcut=True, g=1, k=(3, 3), e=0.5):  # ch_in, ch_out, shortcut, groups, kernels, expand
        super().__init__()
        c_ = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, c_, k[0], 1)
        self.cv2 = Conv(c_, c2, k[1], 1, g=g)
        self.add = shortcut and c1 == c2

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

class C2f_rep(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.kwargs = {"c1": c1, "c2": c2, "n": n, "shortcut": shortcut, "g": g, "e": e}
        self.c = int(c2 * e)  # hidden channels
        self.cv0 = Conv(c1, self.c, 1, 1)
        self.cv1 = Conv(c1, self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        # y = list(self.cv1(x).chunk(2, 1))
        y = [self.cv0(x), self.cv1(x)]
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

class rC2fModule(nn.Module):
    def __init__(self, c1=512, c2=256):
        super().__init__()
        self.c2f = C2f_rep(c1, c2, n=1, shortcut=False, g=1, e=0.5)

    def forward(self, x):
        return self.c2f(x)

class DemoC2fNet(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.c2f = rC2fModule(c1=512, c2=512)
        self.conv1d = nn.Sequential(
            nn.Conv1d(1024, 512, 1, 1, 0, bias=True),
            nn.Conv1d(512, 256, 1, 1, 0, bias=True)
        )

    def forward(self, x):
        x = self.c2f(x)
        # x: [1, 512, 2, 81]
        x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
        x = x.squeeze(-1).permute(0, 2, 1)
        return self.conv1d(x)

if __name__ == "__main__":
    model = DemoC2fNet()
    # model = rC2fModule()
    model.eval()
    fake_input = torch.randn((1, 512, 2, 81))
    oto = OTO(model=model, dummy_input=fake_input)
    oto.visualize_zigs(view=False)
    oto.random_set_zero_groups()  # Randomly set a subset of ZIGs to be zero.
    oto.compress()

    import onnxruntime as ort
    full_ort_sess = ort.InferenceSession(oto.full_model_path)
    compress_ort_sess = ort.InferenceSession(oto.compressed_model_path)
    full_output = full_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    compress_output = compress_ort_sess.run(None, {'input.1': fake_input.numpy()})[0]
    print(full_output - compress_output)
It passes on my end, where the maximum difference between the full and compressed models is 1e-7.
@songkq I attached the full and compressed models from one sanity check on Baidu Pan.
Link: https://pan.baidu.com/s/15i-8p_8Ko2R6YGzeT5FGdw Extraction code: np46
My experiment setting is torch 1.13, onnx 1.12.
@tianyic Thanks.
Since normalization layers such as BN are used in the model, the bias of nn.Conv2d is always set to False. I'm wondering whether I can set bias=True while freezing the biases to zero during training, for compatibility with OTO. Then, during inference, the bias of Conv2d can be merged with BN as usual.
Would this approach influence the model accuracy when using the oto.dhspg optimizer?
@songkq
This is a great question. In the short term, setting bias=True should not deliver worse results. Please see the explanation below.
From the view of optimization, if bias = 0 is indeed optimal, then making it trainable means it should converge to zero eventually during training. In other words, there would be no large difference between bias=False and bias=True if bias = 0 is optimal. But if the optimal bias is not 0, making it trainable could help reach a more optimal solution.
On the other hand, Conv-BN fusion works for non-zero bias as well. Thus, with respect to fusing afterwards, it does not matter much whether the bias is trainable or fixed at zero during training. We have applied OTO to a lot of low-level and high-level vision models; the majority of them achieve performance competitive with the full models with significant FLOPs reduction, and a few of them even outperform the full versions.
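For reference, a minimal sketch of standard Conv-BN fusion (not OTO-specific), showing that a non-zero conv bias simply folds into the fused bias:
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    # Fold the BN running statistics and affine parameters into the conv weight and bias.
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    # A non-zero conv bias is absorbed here exactly like the zero-bias case.
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused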
The root cause of this issue is some tensor misalignment between the onnx file and the torch model, which we will fix rigorously. Therefore, for the long term, please wait for our fix.
@tianyic Thanks. I will try it out with bias=True in my case.
By the way, is there a fast way to estimate the maximum Params and FLOPs reduction (i.e., the maximum global group sparsity) for a model with a negligible accuracy drop?
@songkq
Great! Sure, please use the commands below.
full_flops = oto.compute_flops()
compressed_flops = oto.compute_flops(compressed=True)  # call after compression, otherwise it may raise an error
full_num_params = oto.compute_num_params()
compressed_num_params = oto.compute_num_params(compressed=True)  # call after compression, otherwise it may raise an error

print("Full FLOPs (M): {f_flops:.2f}. Compressed FLOPs (M): {c_flops:.2f}. Reduction Ratio: {f_ratio:.4f}"\
    .format(f_flops=full_flops, c_flops=compressed_flops, f_ratio=1 - compressed_flops/full_flops))
print("Full # Params: {f_params}. Compressed # Params: {c_params}. Reduction Ratio: {f_ratio:.4f}"\
    .format(f_params=full_num_params, c_params=compressed_num_params, f_ratio=1 - compressed_num_params/full_num_params))
@tianyic Thanks.
Actually, what I mean is: is there a deterministic way to quickly evaluate the maximum pruning ratio that can be set while keeping the pruned model's accuracy almost unaffected compared with before pruning?
@tianyic Hi, I found that bias=False is not the root cause of this issue. Maybe the torch version (torch=1.8.1) or the default opset version causes the problem. When I try bias=False with torch=1.11.0+cu113 and onnx=1.10.1, everything is OK.
I still suspect that the transpose and reshape operations under different opset versions cause the problem. If possible, the opset version could be exposed as an optional configuration of OTO.
x = x.view(x.size(0), -1, 1, x.size(3)).permute(0, 3, 1, 2).contiguous()
x = x.squeeze(-1).permute(0, 2, 1)
torch = 1.11.0 with bias=False
torch = 1.8.1 with bias=False
However, when I check the _export_onnx_opset_version used in _optimize_trace, torch 1.11.0 and torch 1.8.1 have the same _export_onnx_opset_version. I'm quite confused by this...
def _optimize_trace(graph, operator_export_type):
    from torch.onnx import utils
    return utils._optimize_graph(graph, operator_export_type)

# utils._optimize_graph
from torch.onnx.symbolic_helper import _onnx_shape_inference, _export_onnx_opset_version
torch._C._jit_pass_onnx_scalar_type_analysis(graph, True, _export_onnx_opset_version)
torch._C._jit_pass_onnx_peephole(graph, _export_onnx_opset_version, fixed_batch_size)
if _onnx_shape_inference:
    torch._C._jit_pass_onnx_graph_shape_type_inference(graph, params_dict, _export_onnx_opset_version)

# torch 1.11.0
_default_onnx_opset_version = 9
_onnx_main_opset = 15
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12, 13, 14]
_export_onnx_opset_version = _default_onnx_opset_version
_constant_folding_opset_versions = list(range(9, _onnx_main_opset + 1))

# torch 1.8.1
_default_onnx_opset_version = 9
_onnx_main_opset = 13
_onnx_stable_opsets = [7, 8, 9, 10, 11, 12]
_export_onnx_opset_version = _default_onnx_opset_version
@songkq Thanks for your deep dive.
The root cause is the tensor misalignment, which seems to be caused by varying torch and onnx versions; I have pushed a quick fix to make the library more robust across versions. Since OTO touches a brand-new autoML area, some necessary public APIs are missing from torch and onnx, so I filled those gaps based on my own logic, which may leave corner cases. But I believe it will become more reliable and robust as the whole community develops :)
Please try again after git pull with bias=False.
I will add opt_version in the next release.
BTW, the current version requires the end user to give a target group sparsity level. We usually start with 70%, then go up to 90% or down to 50% depending on the performance that 70% group sparsity reaches. How to automatically select the target group sparsity level without sacrificing performance is left as future work.
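A rough sketch of that manual sweep (train_and_evaluate, base_model, and fake_input are hypothetical placeholders; the OTO calls follow the earlier snippets in this thread):
import copy

for target in [0.7, 0.9, 0.5]:
    oto = OTO(model=copy.deepcopy(base_model), dummy_input=fake_input)
    optimizer = oto.dhspg(variant="adamw", lr=1e-3, target_group_sparsity=target)
    accuracy = train_and_evaluate(oto, optimizer)  # user-supplied training/eval loop (hypothetical)
    oto.compress()
    flops_reduction = 1 - oto.compute_flops(compressed=True) / oto.compute_flops()
    print(f"target={target}: accuracy={accuracy:.3f}, FLOPs reduction={flops_reduction:.3f}")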
@tianyic Thanks for the fix. However, it doesn't work with torch=1.8.1 and onnx=1.10.1. Maybe it is a bug in torch 1.8.1.
Despite the bug with torch 1.8.1, I've verified the effectiveness of OTO in my case with target_group_sparsity=0.1, where the pruned model has a negligible accuracy drop. Good job~
I will try enlarging the target_group_sparsity with oto=2.0.10 and torch=1.11.0 later.
Great, glad that it works for your case. I will update the readme regarding the torch dependencies.
@tianyic Hi, do you have a plan to introduce the functionality of rounding the number of pruned channels to an expected multiple (32, 16, or 8; refer to https://github.com/VainF/Torch-Pruning)? If so, it would be very useful for speeding up inference of the pruned model on edge devices such as NPUs.
One more thing: when you say "we usually start with 70%, then up and down to 90% or 50% depending on the performance that 70% group sparsity could reach", does 70% group sparsity mean target_group_sparsity=0.7 or 0.3?
@songkq You can set group_divisible in the dhspg optimizer to 8, 16, or 32 if you want the number of remaining ZIGs to be divisible by 8, 16, or 32.
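For example, something along these lines (group_divisible as mentioned above; the other arguments follow the earlier snippets, and the exact values are assumptions to tune):
optimizer = oto.dhspg(
    variant="adamw",
    lr=1e-3,
    target_group_sparsity=0.7,
    start_pruning_steps=1000,
    group_divisible=8,  # keep the number of remaining groups divisible by 8 for NPU-friendly shapes
)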
Yes, I know and appreciate torch-pruning. You might notice that both frameworks currently have pros and cons. In short, torch-pruning can generate the pruned model in torch format but is still a multi-stage procedure, while OTO generates a product-ready pruned model in ONNX format from scratch in a one-shot manner. OTO is more like an end-to-end automatic general DNN training and compression framework.
Yes, I meant starting with a target group sparsity of 0.7.
@tianyic Thanks. I'll try group_divisible. Yeah, OTO is much more user-friendly than some other pruning tools. In my case, OTO seems more powerful than torch-pruning in terms of the pruned model's accuracy drop. Could you please provide a benchmark of the trade-off between target group sparsity and pruned model accuracy for various downstream tasks, such as the YOLO series? This work reaches a global maximum pruning ratio of 30%~40% for downstream models with negligible performance drop (https://github.com/HankYe/PAGCP). It seems that target_group_sparsity=0.7 would be amazing for real-world tasks.
@songkq Thanks for the kind words. The accuracy preservation of OTO comes from our mathematical background, especially our expertise in sparse optimization, which is the fundamental problem underlying pruning tasks. One-shot methods will eventually become the main trend since, besides user-friendliness, they offer more advantages and possibilities mathematically that cannot easily be brought about via multi-stage methods.
For benchmarking downstream tasks, our current bandwidth is limited, especially as we are focusing on the development of the next generation of OTO. Maintaining the OTOv2 open-source library has already reached our workload limit. Therefore, we may not be able to do it ourselves, perhaps until the end of this year, but we are open to contributions from the community.
@tianyic Will the next generation of OTO support compression of Transformer structures? Looking forward to it.
@songkq Thanks.
The next generation of OTO will be on another vertical. Vanilla support of transformers could be considered as an extension within the current OTOv2, and a PR for it is actually ongoing. The key is to support the matmul operator. But we have not merged this PR yet since it does not yet rigorously consider the bias stored in the add operator.
Another reason we are not urgently pushing the transformer support is that standard structured pruning causes regression on transformers more easily than on CNNs. You might notice that some recent pruning works claim negligible performance regression on transformers but are typically unstructured pruning, which is not useful in practice. We believe low-rank analysis should be leveraged in transformer pruning, so we are postponing the transformer support, or more precisely the matmul and add-bias support, until we have sufficient bandwidth to fundamentally solve that problem.
@tianyic Hi, I'm wondering whether I can recover the activation of ZIGs that have already been pruned. For example, target_group_sparsity was set to 0.7 during the first training, and now I want to relax the model's target_group_sparsity to 0.5 during fine-tuning.
@tianyic When calling optimizer.step(), a RuntimeError occurred after optimizer.load_state_dict from a checkpoint for resuming. Could you please give some advice?
"only_train_once/optimizer/dhspg.py", line 102, in get_first_momentum_grad
    buf.mul_(momentum).add_(grad, alpha=(1.0-dampening))
RuntimeError: The size of tensor a (32) must match the size of tensor b (3) at non-singleton dimension 3
@songkq That is a great point. Although we do not have that feature yet, it is definitely doable by modifying the optimizer.
Regarding the error when loading the optimizer's state_dict: that path is rarely used on our end, since OTO can be resumed via
oto = OTO(model=latest_model, dummy_input=dummy_input)
We will spend time testing and fixing it.
@tianyic Hi, thanks for the wonderful work. I noticed that nn.Upsample is listed in the supported operation list, but when I try to use it, an "Unknown op: resize" message appears, and I then hit an IndexError at graph.py, line 333 (_pruned_onnx_param = numpy_param[:, incoming_cc.non_zero_groupidxes, …]) when calling oto.compress().
Do you have any idea about this problem, or any advice on an upsample operation that can be used? Thanks!
@fordevoted Thanks for reaching out. We have tried OTOv2 on a few UNets and super-resolution models whose architectures include upsamplers, and OTOv2 worked pretty well. My gut feeling is that something else caused the errors.
If possible, please share the model script and dummy input with me, and I will take a look as my bandwidth allows. The issue can typically be resolved by slightly changing the model architecture. If anything is confidential, please share it to my email address: tiachen@microsoft.com.
@tianyic Thanks for the information. I did some further debugging and found that the problem is similar to the one above. After the modification, the issue is resolved. Thanks!
Did you find out why it was causing the issue? When the slice was removed, it worked fine, but why was it not working with the original architecture? And if I have to make it work with this architecture, how can I fix it?
class C2f(nn.Module):
    # CSP Bottleneck with 2 convolutions
    def __init__(self, c1, c2, n=1, shortcut=False, g=1, e=0.5):  # ch_in, ch_out, number, shortcut, groups, expansion
        super().__init__()
        self.c = int(c2 * e)  # hidden channels
        self.cv1 = Conv(c1, 2 * self.c, 1, 1)
        self.cv2 = Conv((2 + n) * self.c, c2, 1)  # optional act=FReLU(c2)
        self.m = nn.ModuleList(Bottleneck(self.c, self.c, shortcut, g, k=((3, 3), (3, 3)), e=1.0) for _ in range(n))

    def forward(self, x):
        # slice
        y = list(self.cv1(x).chunk(2, 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))

    def forward_split(self, x):
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, 1))
@tianyic Hi, when I tried OTO with the following case, oto.compress failed. Could you please give some advice?