sovrasov / flops-counter.pytorch

Flops counter for convolutional networks in pytorch framework
MIT License

There was a bug with computing MultiheadAttention flops #101

Closed ssk1997 closed 4 months ago

ssk1997 commented 2 years ago

I found that the hook function is not called for the MultiheadAttention module when its parameters have requires_grad=False, which causes the reported FLOPs to be 0. With requires_grad=True everything works as expected.

sovrasov commented 2 years ago

@ssk1997 please provide a code snippet to reproduce

ssk1997 commented 2 years ago

Test case with requires_grad=False:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    # The resolution argument is ignored; ptflops just calls this constructor
    # to build the dummy input.
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flops, params)

Output:

TransformerEncoder(
  0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
  (layers): ModuleList(
    0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
    (0): TransformerEncoderLayer(
      0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
      (self_attn): MultiheadAttention(
        0, 0.000% Params, 0.0 Mac, 0.000% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
0.0 Mac 0

Test case with requires_grad=True:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    input1 = torch.randn(1, 64, 512)
    return dict(src=input1)

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=True)
print(flops, params)

Output:

TransformerEncoder(
  1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
  (layers): ModuleList(
    1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
    (0): TransformerEncoderLayer(
      1.58 M, 99.870% Params, 105.04 MMac, 100.000% MACs, 
      (self_attn): MultiheadAttention(
        1.05 M, 66.580% Params, 71.48 MMac, 68.054% MACs, 
        (out_proj): NonDynamicallyQuantizableLinear(0, 0.000% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (dropout): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (linear2): Linear(262.66 k, 16.645% Params, 16.78 MMac, 15.973% MACs, in_features=512, out_features=512, bias=True)
      (norm1): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
      (dropout2): Dropout(0, 0.000% Params, 0.0 Mac, 0.000% MACs, p=0.1, inplace=False)
    )
  )
)
105.04 MMac 1.58 M

Thanks for your response. @sovrasov

sovrasov commented 2 years ago

My output if I launch this snippet with param.requires_grad = False is 100.89 MMac 0. ptflops reports 0 params in that case because it counts only parameters that require a gradient (that's the natural definition of learnable parameters). Which version of ptflops do you use?
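
For clarity, a minimal sketch of that convention (not the actual ptflops internals): only tensors with requires_grad=True are treated as learnable parameters, so a fully frozen model reports 0 params.

import torch.nn as nn

def count_learnable_params(model: nn.Module) -> int:
    # Count only parameters that require gradients, mirroring the convention
    # described above; frozen parameters are excluded from the total.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)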

sovrasov commented 2 years ago

I ran the previous experiment with batch_first=False; when I set batch_first=True, the output is indeed zero. The reason lies in the PyTorch internals: MultiheadAttention.forward is not actually called (an optimized fused kernel is invoked directly instead), so ptflops cannot trace it. Therefore this bug with batch_first=True and param.requires_grad = False is not fixable.
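
A minimal sketch (assuming a PyTorch build where the fused encoder fast path is available) showing that a forward hook on the attention submodule never fires under these conditions, which is exactly why ptflops sees 0 MACs:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1).eval()
for p in model.parameters():
    p.requires_grad = False

calls = []
# ptflops relies on forward hooks like this one to attribute MACs per module.
model.layers[0].self_attn.register_forward_hook(lambda mod, inp, out: calls.append(1))

with torch.no_grad():
    model(torch.randn(1, 64, 512))

# 0 if the fused kernel (torch._transformer_encoder_layer_fwd) was taken,
# 1 if the regular MultiheadAttention.forward path ran.
print(len(calls))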

sovrasov commented 2 years ago

For reference, you can grep for torch._transformer_encoder_layer_fwd in the PyTorch source code.

ssk1997 commented 2 years ago

Thanks a lot. It works fine with batch_first=False.

sovrasov commented 2 years ago

This bug may occur in other cases as well because of PyTorch's inference optimizations: https://pytorch.org/tutorials/beginner/bettertransformer_tutorial.html https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/
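
Based on the workarounds that came up in this thread (not an official fix), one practical option is to run the complexity estimation before freezing the parameters, or to build the layer with batch_first=False just for measurement, so the fused fast path is not taken and the hooks still fire:

import torch
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from ptflops import get_model_complexity_info

def prepare_input(resolution):
    return dict(src=torch.randn(1, 64, 512))

layer = TransformerEncoderLayer(d_model=512, nhead=4, dim_feedforward=512, batch_first=True)
model = TransformerEncoder(layer, 1)

# Measure first, while requires_grad is still True ...
flops, params = get_model_complexity_info(model, input_res=(1, 1, 1),
                                          input_constructor=prepare_input,
                                          as_strings=True, print_per_layer_stat=False)
print(flops, params)

# ... then freeze the parameters for inference.
for p in model.parameters():
    p.requires_grad = False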

quancs commented 1 year ago

Encountered the same bug. Is there any hook to fix this?

sovrasov commented 1 year ago

@quancs as I already wrote, this is a wontfix problem.