Open jimmy-adams opened 7 months ago
Also, another related question: when I set the input sequence length to 128, the calculated GFLOPs is about 0.65. Assuming there are 12 hidden layers in BERT-base, the total comes to less than 12 GFLOPs, which is not compatible with the profiler test result of about 20 GFLOPs.
The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?
Hello,
https://github.com/cli99/flops-profiler
https://github.com/autoliuweijie/FastBERT/issues/11
These two posts report their results; they differ somewhat, but both are around 20 GFLOPs.
Our provided simulation script only calculates the FLOPs of a single multi-head attention (MHA) module. However, an encoder layer of BERT also includes a fully-connected feed-forward network (FFN) following the MHA. The thop profiler used by FastBERT calculates the total FLOPs of all modules in a BERT model, which includes MHA, FFN, and potentially other modules not included in our calculation. Therefore, it should produce a larger FLOPs count than ours.
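As a back-of-the-envelope check (a sketch with my own assumed conventions: BERT-base dimensions, 2 FLOPs per multiply-accumulate, LayerNorm/Softmax/GELU and embeddings ignored), adding the FFN roughly closes the gap between the per-layer 0.65 GFLOPs and the ~20 GFLOPs reported by the profilers:

```python
# Rough per-layer and whole-model FLOPs for BERT-base at seq_len = 128
# (hidden size 768, FFN size 3072; 2 FLOPs per multiply-accumulate;
# LayerNorm/Softmax/GELU and embeddings ignored).
seq_len, hidden, inter = 128, 768, 3072

mha = (3 * seq_len * hidden * hidden * 2       # Q, K, V projections
       + 2 * seq_len * seq_len * hidden * 2    # QK^T and attention-weighted values
       + seq_len * hidden * hidden * 2)        # output projection
ffn = seq_len * hidden * inter * 2 + seq_len * inter * hidden * 2

print("MHA per layer:     %.2f GFLOPs" % (mha / 1e9))              # ~0.65, matches the script
print("MHA+FFN per layer: %.2f GFLOPs" % ((mha + ffn) / 1e9))      # ~1.86
print("12 layers:         %.1f GFLOPs" % (12 * (mha + ffn) / 1e9)) # ~22, same ballpark as ~20
```

The residual difference comes down to which extra modules and element-wise operations a given profiler counts, and whether it reports MACs or FLOPs.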
Hello,
Listed below is one hidden layer of BERT:

```
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(intermediate): BertIntermediate(
  (dense): Linear(in_features=768, out_features=3072, bias=True)
  (intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
  (dense): Linear(in_features=3072, out_features=768, bias=True)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
```

Do you mean the FLOPs in Sanger only cover the attention submodule of the listed BertEncoder layer? How can I calculate the other two parts based on your calculation method?
Best Regards
(1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. Besides, we do not include the FLOPs of LayerNorm and Softmax operations in our calculation.
(2) Calculating the FLOPs of fully connected layers (or Linear in PyTorch terms) in the FFN submodule should be almost identical to our script's method. Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.
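For illustration only (this is not part of the repository's script), an FFN counterpart in the same style as our simulation script might look like the sketch below; the 768/3072 dimensions follow the BertIntermediate and BertOutput modules listed above:

```python
# Sketch: FFN FLOPs per encoder layer, using the same convention as the
# attention script (2 FLOPs per multiply-accumulate; biases and GELU ignored).
def bert_base_ffn_gflops(seq_len):
    HIDDEN_SIZE = 768
    INTERMEDIATE_SIZE = 3072
    fc1_flops = seq_len * HIDDEN_SIZE * INTERMEDIATE_SIZE * 2  # BertIntermediate.dense
    fc2_flops = seq_len * INTERMEDIATE_SIZE * HIDDEN_SIZE * 2  # BertOutput.dense
    return (fc1_flops + fc2_flops) / 1e9

print("FFN GFLOPs per layer (seq_len=128): %.3f" % bert_base_ffn_gflops(128))  # ~1.208
```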
Hello,
Does that mean these two modules contain no matmul ops, only Linear ops?
From what I understand, fully connected layers, or Linear modules, are essentially affine transformations (i.e., a matmul and an element-wise addition of a bias vector with broadcast). Besides, BertIntermediate and BertOutput contain not just Linear modules, but also LayerNorm operations and element-wise activation functions. Depending on how the thop library calculates FLOPs, you may also need to include the FLOPs of these operations in the final result if you want to replicate the estimation of thop.
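If you do want to include them, the sketch below gives a rough estimate of their contribution; the per-element costs are my own assumptions, not thop's actual counting rules:

```python
# Very rough element-wise FLOPs per encoder layer (per-element costs are
# assumptions, NOT thop's exact rules), for comparison against the matmul FLOPs.
def elementwise_gflops(seq_len, hidden=768, inter=3072, ln_cost=5, act_cost=8):
    layernorm = 2 * seq_len * hidden * ln_cost   # two LayerNorms per layer
    activation = seq_len * inter * act_cost      # GELU on the intermediate output
    bias_adds = seq_len * (5 * hidden + inter)   # bias terms of the six Linear layers
    return (layernorm + activation + bias_adds) / 1e9

print("Element-wise ops: %.4f GFLOPs per layer" % elementwise_gflops(128))
# Roughly 0.005 GFLOPs -- negligible next to the matmul FLOPs of MHA + FFN.
```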
Dear author, thanks a lot for your kind reply. One further question: can Sanger efficiently process LayerNorm or element-wise activation functions?
Best Regards
Our accelerator design is primarily focused on the core attention mechanism, which does not contain LayerNorm or activation functions. Therefore, these operations are not taken into account in our work.
Hello, In this repo you provide a simulation script to calculate Sanger's processing of BERT:

```python
def bert_base_gflops(seq_len):
    HIDDEN_SIZE = 768
    linear_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2 * 3
    qk_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    pv_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    out_proj_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2
    stage1_flops = linear_flops
    stage2_flops = qk_flops + pv_flops + out_proj_flops
    stage1_gflops = stage1_flops / 1e9
    stage2_gflops = stage2_flops / 1e9
    print("The stage1_gflops: %.3f FLOPS" % (stage1_gflops * 1e9))
    print("The stage2_gflops: %.3f FLOPS" % (stage2_gflops * 1e9))
    return stage1_gflops, stage2_gflops
```

I want to ask: do these FLOPs cover all 12 hidden layers, or just a single layer of the BERT encoder?
Best Regards