Closed marsggbo closed 3 years ago
Hi @marsggbo. Thanks for raising this issue.
For parameter size: because NNI builds a supernet for training, you may run into issues if you directly use model.parameters(), which counts every parameter even if some of them are not used. This issue will be addressed later in the NNI framework. For now, a simple workaround is to do a dry run with back-propagation and count only the parameters that received a gradient, e.g.,
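import numpy as np
import torch

def param_size(model, loss_fn, input_size):
    """
    Count the parameters actually used, in millions of parameters (not megabytes).
    """
    # Dry run on a dummy batch of 2; input_size is a list such as [3, 32, 32].
    x = torch.rand([2] + input_size).cuda()
    y = model(x)
    target = torch.randint(model.n_classes, size=[2]).cuda()
    loss = loss_fn(y, target)
    loss.backward()
    # Parameters on unused paths receive no gradient, so their .grad stays None.
    n_params = sum(np.prod(v.size()) for v in model.parameters() if v.grad is not None)
    return n_params / 1e6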
As for FLOPs, PyTorch has no native support for calculating them. You need to resort to tools like torchscope.
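For intuition about what such tools count, here is a minimal sketch of the standard multiply-accumulate (MAC) count for a single 2D convolution. The function name conv2d_macs and its parameters are hypothetical, it assumes dilation 1 and ignores bias, and it is not how torchscope itself is implemented. FLOPs are commonly reported as roughly twice the MAC count.

```python
def conv2d_macs(c_in, c_out, kernel, h_in, w_in, stride=1, padding=0, groups=1):
    """Multiply-accumulates of one Conv2d layer (dilation 1, bias ignored).

    Hypothetical helper for illustration only.
    """
    # Standard output-size formula for a convolution without dilation.
    h_out = (h_in + 2 * padding - kernel) // stride + 1
    w_out = (w_in + 2 * padding - kernel) // stride + 1
    # Each output element needs (c_in / groups) * kernel * kernel multiply-adds.
    macs_per_output = (c_in // groups) * kernel * kernel
    return c_out * h_out * w_out * macs_per_output


# A regular 3x3 conv versus a depthwise 3x3 conv on the same 32x32 input.
regular = conv2d_macs(16, 16, 3, 32, 32, padding=1)
depthwise = conv2d_macs(16, 16, 3, 32, 32, padding=1, groups=16)
```

Summing this quantity over the layers that a sampled architecture actually uses gives a per-architecture cost that can be computed without running the model.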
@ultmaster Thanks for your reply. But there is still an issue: we can only get the size of the actual sampled model after loss.backward(). In ENAS, I want to add the model size as a loss term, which conflicts with your example. Are there any other solutions? Thanks!
It looks to me like you want to trace the number of parameters used for every sub-architecture sampled during the training phase. In fact, the model size is not differentiable. So how do you want to add it as a loss term? If all you want is to display the loss as a number without differentiation, you can add it after loss.backward().
The reason I add the model size as a loss term is that I want the mutator to find a smaller model that still has comparable performance. Since the model size is not differentiable, is there an alternative method? Thanks!
For sure you can add the FLOPs or parameter size as part of the reward function, so that the controller is aware of the model size; that's exactly what MnasNet, FBNet, and SPOS did. What I want to add here is that, other than loss.backward(), there is another method called the lookup table method. We provide an example, SPOS, in which the model returns its FLOPs from a lookup table once all choices are given. I hope it is helpful. Here is an example: https://github.com/microsoft/nni/blob/598d8de2a9ac9579dea962d22d428b14557da38e/examples/nas/spos/network.py#L106
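The lookup table idea can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the linked SPOS example: the table layout, the key names, and the helpers mask_to_index and cost_from_table are all assumptions, and the costs are made-up numbers.

```python
def mask_to_index(mask):
    """Index of the selected op in a one-hot mask such as [False, True, False]."""
    return list(mask).index(True)


def cost_from_table(table, masks):
    """Sum the precomputed cost of the op selected for every choice block.

    table[key][i] is the cost (FLOPs, parameters, latency, ...) of the i-th
    candidate op of choice block `key`; masks[key] is its one-hot selection.
    Both structures are hypothetical and only illustrate the technique.
    """
    return sum(table[key][mask_to_index(mask)] for key, mask in masks.items())


# Made-up costs for two choice blocks with three candidate ops each.
table = {"layer0": [10, 20, 30], "layer1": [5, 15, 25]}
masks = {"layer0": [False, True, False], "layer1": [False, False, True]}
total = cost_from_table(table, masks)
```

Because the table is filled once up front, evaluating the cost of a newly sampled architecture needs neither a forward nor a backward pass.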
Thanks for your reply! But I have a question: the FLOPs obtained with the lookup-table method are also not differentiable. Or does the loss of the ENAS controller not need to be differentiable?
The reward for an RL agent does not need to be differentiable. There are also certain tricks that can make FLOPs differentiable, as in FBNet. Please read the papers for more details.
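To make this concrete, here is a hedged sketch of a size-aware reward in the multiplicative form used by MnasNet, accuracy * (cost / target) ** w, together with a REINFORCE-style loss. The function names, the choice of parameter count as the cost, and the default exponent are assumptions for illustration; this is not NNI's ENAS implementation. The reward enters only as a constant multiplier of the log-probability, which is why it does not need to be differentiable.

```python
def size_aware_reward(accuracy, n_params, target_params, w=-0.07):
    """Soft-constraint reward: accuracy * (n_params / target_params) ** w.

    With a negative w, models larger than the target are penalised and
    smaller ones get a mild bonus. All names here are hypothetical.
    """
    if n_params <= 0 or target_params <= 0:
        raise ValueError("parameter counts must be positive")
    return accuracy * (n_params / target_params) ** w


def reinforce_loss(log_prob, reward, baseline=0.0):
    """Policy-gradient loss for one sampled architecture.

    `reward` is a plain number; gradients flow only through `log_prob`,
    which may be a float (as here) or a framework tensor.
    """
    return -(reward - baseline) * log_prob


# Same accuracy, but the second architecture is twice the target size.
on_target = size_aware_reward(0.9, 4_000_000, 4_000_000)
too_large = size_aware_reward(0.9, 8_000_000, 4_000_000)
```

The model size passed in can come from any non-differentiable source, such as a lookup table or a direct parameter count of the sampled architecture.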
I implemented a function that calculates the real model size using only the model and the mutator, i.e. we don't have to call loss.backward(). The function code is as follows:
import torch
import nni
import nni.nas.pytorch.mutables  # makes nni.nas.pytorch.mutables.LayerChoice resolvable

def calc_real_model_size(model, mutator):
    '''calculate the size of real model
    real_size = size_choice + size_non_choice
    '''
    if isinstance(model, torch.nn.DataParallel):
        model = model.module
    size_choice = 0  # the size of LayerChoice
    size_non_choice = 0  # the size of normal model part
    # the size of normal model part
    for name, param in model.named_parameters():
        if 'choice' not in name:
            size_non_choice += param.numel()
    # the mask for each operation; the masks look as follows:
    # {'down_node_0_x_op': tensor([False, True, False, False, False, False]),
    # 'down_node_0_y_op': tensor([ True, False, False, False, False, False]),
    # 'down_node_1_x_op': tensor([False, True, False, False, False, False]),
    # 'down_node_1_y_op': tensor([False, False, False, False, False, True]),
    # ...
    # 'up_node_0_x_op': tensor([False, False, False, False, True, False]),
    # 'up_node_0_y_op': tensor([False, False, True, False, False, False]),
    # 'up_node_1_x_op': tensor([False, True, False, False, False, False]),
    # 'up_node_1_y_op': tensor([False, False, False, True, False, False]),
    # ...
    masks = {}
    for key in mutator._cache:
        if 'op' in key:
            masks[key] = mutator._cache[key]
    # the real size of all LayerChoice
    for name, module in model.named_modules():
        if isinstance(module, nni.nas.pytorch.mutables.LayerChoice):
            size_ops = []
            for index, op in enumerate(module.choices):
                size_ops.append(sum([p.numel() for p in op.parameters()]))
            # parse the key for masks; this needs to be modified for different models.
            infos = name.split('.')
            layer_type = infos[0]  # down or up
            node_id = infos[4]  # 0-4
            cell_id = infos[5][-1]  # x or y
            prefix = 'down' if 'down' in layer_type else 'up'
            key = f"{prefix}_node_{node_id}_{cell_id}_op"
            # position of the single True entry in the one-hot mask
            index = masks[key].int().argmax().item()
            size_choice += size_ops[index]
    real_size = size_choice + size_non_choice
    return real_size
I think that's true, that's basically what we do in SPOS. We might want to expose cache as public API for convenience later.
Also, these two lines can be more elegant:
for name, module in model.named_modules():
    if isinstance(module, nni.nas.pytorch.mutables.LayerChoice):
->
for mutable in mutator.mutables:
    if isinstance(mutable, LayerChoice):
A more straightforward way to implement it is to define a "sample" function in your model, like:
...
    def forward(self, x):
        return self.classifier(self.tail(self.mid(self.stem(x))).flatten(1))

    def sample(self, mutator, no_els=False):
        # Build a standalone sub-network from the ops the mutator currently selects.
        mid = []
        for i in range(len(self.mid)):
            # Pick the op chosen for the i-th choice block from the mutator cache.
            m = self.mid[i][mutator._cache[self.mid[i].key]]
            if no_els and isinstance(m, ConvELS):
                continue  # leaving the module out acts like replacing it with Identity
            mid.append(m)
        mid = nn.Sequential(*mid)

        # "selff" is the inner module; "self" still refers to the enclosing supernet.
        class S(nn.Module):
            def __init__(selff):
                super(S, selff).__init__()
                selff.stem = self.stem
                selff.tail = self.tail
                selff.mid = mid
                selff.classifier = self.classifier

            def forward(selff, x):
                return selff.classifier(selff.tail(selff.mid(selff.stem(x))).flatten(1))

        return S()
The good thing is that you can choose the specific sampling method. In my case, I need to change the "ConvELS" module to "Identity" in some training phases.
@marsggbo @CuriousCat-7 I'm closing this issue as it has had no updates from the user for 3 months. Please feel free to reopen it if it is still an active issue for you.
I want to use a NAS method to search for a resource-aware model, but at each step the mutator generates a new model structure, so it is not easy to calculate the size of the generated model, especially its FLOPs. Is there an easy way to do this? Thanks!