hatsu3 / Sanger


bench_sanger script #3

Open jimmy-adams opened 7 months ago

jimmy-adams commented 7 months ago

Hello,

In this repo you provide a simulation script that calculates the FLOPs of Sanger's processing of BERT:

```python
def bert_base_gflops(seq_len):
    HIDDEN_SIZE = 768
    linear_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2 * 3
    qk_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    pv_flops = seq_len * seq_len * HIDDEN_SIZE * 2
    out_proj_flops = seq_len * HIDDEN_SIZE * HIDDEN_SIZE * 2
    stage1_flops = linear_flops
    stage2_flops = qk_flops + pv_flops + out_proj_flops
    stage1_gflops = stage1_flops / 1e9
    stage2_gflops = stage2_flops / 1e9
    print("The stage1_gflops: %.3f FLOPS" % (stage1_gflops * 1e9))
    print("The stage2_gflops: %.3f FLOPS" % (stage2_gflops * 1e9))
    return stage1_gflops, stage2_gflops
```

I want to ask whether these FLOPs cover all 12 hidden layers, or just a single layer of the BERT encoder.
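As a quick sanity check (not part of the original script), plugging a sequence length of 128 into the function above gives roughly the numbers below:

```python
# One attention block only, seq_len = 128
stage1_gflops, stage2_gflops = bert_base_gflops(128)
# stage1 (Q/K/V projections)               ~ 0.453 GFLOPs
# stage2 (QK^T, attn*V, output projection) ~ 0.201 GFLOPs
# total for a single MHA module            ~ 0.654 GFLOPs
```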

Best Regards

jimmy-adams commented 7 months ago

Another related question: when I set the input sequence length to 128, the calculated GFLOPs is about 0.65. Assuming there are 12 hidden layers in BERT-base, the total comes to less than 12 GFLOPs, which is not compatible with the profiler results of about 20 GFLOPs.

hatsu3 commented 7 months ago

The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?

jimmy-adams commented 7 months ago

> The FLOPs number corresponds to a single layer of BERT. Can you provide more information about the profiler?

Hello,

https://github.com/cli99/flops-profiler
https://github.com/autoliuweijie/FastBERT/issues/11

These two posts report their results; they differ somewhat, but both are around 20 GFLOPs.

hatsu3 commented 7 months ago

Our provided simulation script only calculates the FLOPs of a single multi-head attention (MHA) module. However, an encoder layer of BERT also includes a fully-connected feed-forward network (FFN) following the MHA. The thop profiler used by FastBERT calculates the total FLOPs of all modules in a BERT model, which includes MHA, FFN, and potentially other modules not included in our calculation. Therefore, it should produce a larger FLOPs count than ours.
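For a rough sense of the gap, here is a back-of-the-envelope sketch in the same style as the script in this issue (the helper name is hypothetical; it counts one multiply-add as two FLOPs and ignores LayerNorm, Softmax, GELU, embeddings, and the pooler):

```python
def bert_base_layer_gflops(seq_len, hidden=768, inter=3072, n_layers=12):
    # Multi-head attention: Q/K/V projections, QK^T, attn*V, output projection
    mha = (3 * seq_len * hidden * hidden * 2      # Q, K, V projections
           + 2 * seq_len * seq_len * hidden * 2   # QK^T and attn*V
           + seq_len * hidden * hidden * 2)       # output projection
    # Feed-forward network: two Linear layers (hidden -> inter -> hidden)
    ffn = (seq_len * hidden * inter * 2
           + seq_len * inter * hidden * 2)
    per_layer = (mha + ffn) / 1e9
    return per_layer, per_layer * n_layers

# For seq_len = 128: ~1.86 GFLOPs per encoder layer, ~22.3 GFLOPs for 12 layers,
# which is in the same ballpark as the ~20 GFLOPs figures reported by profilers.
```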

jimmy-adams commented 7 months ago

Hello,

Listed below is one hidden layer of BERT:

```
(attention): BertAttention(
  (self): BertSelfAttention(
    (query): Linear(in_features=768, out_features=768, bias=True)
    (key): Linear(in_features=768, out_features=768, bias=True)
    (value): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output): BertSelfOutput(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
(intermediate): BertIntermediate(
  (dense): Linear(in_features=768, out_features=3072, bias=True)
  (intermediate_act_fn): GELUActivation()
)
(output): BertOutput(
  (dense): Linear(in_features=3072, out_features=768, bias=True)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
```

Do you mean that the FLOPs in Sanger only cover the attention submodule of the listed encoder layer? How can I calculate the other two parts based on your calculation method?

Best Regards

hatsu3 commented 7 months ago

(1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. Note that we do not include the FLOPs of LayerNorm and Softmax operations in our calculation. (2) Calculating the FLOPs of the fully connected layers (Linear in PyTorch terms) in the FFN submodule should be almost identical to our script’s method.

Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.
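For context only (this is a generic sketch, not the thop code itself; thop's exact convention, e.g. MACs versus 2× FLOPs and how the bias is handled, should be checked at the links above), a per-Linear count in the same 2-FLOPs-per-multiply-add style could look like:

```python
def linear_flops(seq_len, in_features, out_features, count_bias=True):
    # A Linear layer computes out = x @ W.T + b for each of seq_len tokens.
    macs = seq_len * in_features * out_features   # multiply-accumulates
    flops = 2 * macs                              # 1 mul + 1 add per MAC
    if count_bias:
        flops += seq_len * out_features           # broadcast bias addition
    return flops

# FFN of one BERT-base layer at seq_len = 128:
# linear_flops(128, 768, 3072) + linear_flops(128, 3072, 768) ~ 1.21e9 FLOPs
```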

jimmy-adams commented 7 months ago

> (1) Yes, and BertIntermediate and BertOutput above correspond to the FFN submodule. Note that we do not include the FLOPs of LayerNorm and Softmax operations in our calculation. (2) Calculating the FLOPs of the fully connected layers (Linear in PyTorch terms) in the FFN submodule should be almost identical to our script’s method.
>
> Please refer to the implementation of thop (https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/profile.py and https://github.com/Lyken17/pytorch-OpCounter/blob/master/thop/vision/calc_func.py) for the formulas used.

Hello,

Does that mean these two modules contain no matmul ops and only Linear ops?

hatsu3 commented 7 months ago

From what I understand, fully connected layers, or Linear modules, are essentially affine transformations (i.e., a matmul and an element-wise addition of a bias vector with broadcast). Besides, BertIntermediate and BertOutput contain not just Linear modules, but also LayerNorm operations and element-wise activation functions. Depending on how the thop library calculates FLOPs, you may also need to include the FLOPs of these operations in the final result if you want to replicate the estimation of thop.
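A minimal illustration of the affine-transformation point (plain PyTorch, nothing Sanger-specific):

```python
import torch
import torch.nn as nn

lin = nn.Linear(768, 3072)
x = torch.randn(128, 768)  # (seq_len, hidden)

# nn.Linear is just a matmul plus a broadcast bias addition
manual = x @ lin.weight.T + lin.bias
assert torch.allclose(lin(x), manual, atol=1e-5)
```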

jimmy-adams commented 7 months ago

> From what I understand, fully connected layers, or Linear modules, are essentially affine transformations (i.e., a matmul and an element-wise addition of a bias vector with broadcast). Besides, BertIntermediate and BertOutput contain not just Linear modules, but also LayerNorm operations and element-wise activation functions. Depending on how the thop library calculates FLOPs, you may also need to include the FLOPs of these operations in the final result if you want to replicate the estimation of thop.

Dear author, thanks a lot for your kind reply. One further question: can Sanger process LayerNorm and element-wise activation functions efficiently?

Best Regards

hatsu3 commented 7 months ago

Our accelerator design is primarily focused on the core attention mechanism, which does not contain LayerNorm or activation functions. Therefore, these operations are not taken into account in our work.