@ssk1997 please provide a code snippet to reproduce
Test case with `requires_grad=False`:

```python
import torch
import torch.nn as nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)
for param in model.parameters():
    param.requires_grad = False

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)
```
Output:

```
TransformerEncoder(
  0, 0.000% Params, 0.0 Mac, 0.000% MACs,
  (layers): ModuleList(
    0, 0.000% Params, 0.0 Mac, 0.000% MACs,
    (0): TransformerEncoderLayer(
      0, 0.000% Params, 0.0 Mac, 0.000% MACs,
      (self_attn): MultiheadAttention(
        0, 0.000% Params, 0.0 Mac, 0.000% MACs,
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
0.0 Mac 0
```
Test case with `requires_grad=True`:

```python
import torch
import torch.nn as nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

flop1, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flop1, params)
```
Output:

```
TransformerEncoder(
  1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs,
  (layers): ModuleList(
    1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs,
    (0): TransformerEncoderLayer(
      1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs,
      (self_attn): MultiheadAttention(
        1.05 M, 66.580% Params, 71.48 MMac, 68.054% MACs,
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
105.04 MMac 1.58 M
```
Thanks for your response. @sovrasov
My output if I launch this snippet with `param.requires_grad = False` is `100.89 MMac 0`. ptflops returns 0 params in that case because it counts only parameters that have gradients (that's a natural definition of learnable parameters). Which version of ptflops do you use?
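For illustration, a minimal sketch of that definition of learnable parameters (the helper name is hypothetical, not part of the ptflops API):

```python
import torch.nn as nn

def count_learnable_params(model: nn.Module) -> int:
    # Count only parameters that require gradients, which is the
    # "learnable parameters" definition described above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```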
I ran the previous experiment with `batch_first=False`; when I set `batch_first=True`, the output is indeed zero.
The reason lies in the PyTorch internals: with `batch_first=True` the `MultiheadAttention.forward` method is not actually used (an optimized CUDA kernel is called directly instead), so ptflops can't trace it. Therefore this bug with `batch_first=True` and `param.requires_grad = False` is not fixable.
For reference, you can grep for `torch._transformer_encoder_layer_fwd` in the PyTorch source code.
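A possible workaround sketch based on the observations in this thread (an assumption on my side, not an official ptflops fix): measure the complexity before freezing the parameters, so the Python-level `MultiheadAttention.forward` can still be traced, and only then set `requires_grad = False` for actual use.

```python
import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    return dict(src=torch.randn(1, 64, 512))

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

# Measure first, while the parameters still require gradients ...
flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=False)
print(flops, params)

# ... then freeze the parameters for inference.
for p in model.parameters():
    p.requires_grad = False
```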
Thanks a lot. It works fine with `batch_first=False`.
This bug may occur in other cases as well because of PyTorch's inference optimizations: https://pytorch.org/tutorials/beginner/bettertransformer_tutorial.html https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/
Encountered the same bug. Is there any hook to fix this?
@quancs as I already wrote, this is a wontfix problem
I found that the hook function is not called for the `MultiheadAttention` module when `requires_grad=False`, which causes the FLOPs to be 0. There are no errors with `requires_grad=True`.
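For anyone who wants to check this in their own environment, a small verification sketch (assuming the same setup as in this issue; whether the hook fires depends on the PyTorch version and on whether the fused fast path is actually taken):

```python
import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1).eval()
for p in model.parameters():
    p.requires_grad = False

fired = []
# Register a plain forward hook on the attention submodule.
model.layers[0].self_attn.register_forward_hook(lambda mod, inp, out: fired.append(True))

with torch.no_grad():
    model(torch.randn(1, 64, 512))

# If the fused fast path is taken, MultiheadAttention.forward is bypassed,
# the hook never fires, and ptflops consequently reports 0 MACs for it.
print("self_attn hook fired:", bool(fired))
```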