intel / torch-xpu-ops


[E2E] Some models' performance on PVC 1100 is lower than 0.5× that of A100 #631

Open · chuanqi129 opened this issue 1 month ago

chuanqi129 commented 1 month ago

🐛 Describe the bug

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| huggingface_amp_fp16_training | BlenderbotForCausalLM | 1.445563 | 0 | 0 | 0 |
| huggingface_amp_fp16_training | GoogleFnet | 1.130302 | 1.7885 | 0.184332 | 0.116495 |
| huggingface_amp_fp16_training | XLNetLMHeadModel | 0.818967 | 2.0096 | 0.654382 | 0.266679 |
| huggingface_amp_fp16_training | AllenaiLongformerBase | 1.566016 | 2.7945 | 0.605366 | 0.339243 |
| huggingface_amp_fp16_training | T5Small | 0.916485 | 1.6912 | 0.639776 | 0.346704 |
| huggingface_amp_fp16_training | T5ForConditionalGeneration | 0.908585 | 1.6874 | 0.649019 | 0.349466 |
| huggingface_amp_fp16_training | ElectraForQuestionAnswering | 2.252931 | 3.0889 | 0.569206 | 0.415158 |
| huggingface_amp_fp16_training | YituTechConvBert | 1.500966 | 1.9019 | 0.550297 | 0.43429 |
| huggingface_amp_fp16_training | BertForQuestionAnswering | 2.023307 | 2.4958 | 0.55466 | 0.449654 |
| huggingface_amp_fp16_training | BartForCausalLM | 1.304644 | 1.3774 | 0.480397 | 0.455022 |
| huggingface_amp_fp16_training | CamemBert | 1.871217 | 2.4716 | 0.601443 | 0.455345 |
| huggingface_amp_fp16_training | RobertaForCausalLM | 1.918584 | 2.5157 | 0.609188 | 0.464594 |
| huggingface_amp_fp16_training | BertForMaskedLM | 1.951344 | 2.3745 | 0.569245 | 0.467801 |
| huggingface_amp_fp16_training | RobertaForQuestionAnswering | 2.019149 | 2.4921 | 0.581912 | 0.471477 |
| huggingface_amp_fp16_training | PLBartForCausalLM | 1.753479 | 2.1765 | 0.588808 | 0.474368 |
| huggingface_amp_fp16_training | LayoutLMForSequenceClassification | 2.032835 | 2.4209 | 0.567496 | 0.476528 |
| huggingface_amp_fp16_training | LayoutLMForMaskedLM | 1.957796 | 2.3533 | 0.573175 | 0.476846 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| huggingface_bfloat16_inference | GoogleFnet | 1.131774 | 1.8792 | 0.153149 | 0.092236 |
| huggingface_bfloat16_inference | BartForConditionalGeneration | 1.259094 | 1.4757 | 0.438865 | 0.374447 |
| huggingface_bfloat16_inference | PLBartForConditionalGeneration | 1.276682 | 1.9 | 0.56172 | 0.377441 |
| huggingface_bfloat16_inference | ElectraForQuestionAnswering | 1.831429 | 2.7424 | 0.578978 | 0.386653 |
| huggingface_bfloat16_inference | BertForQuestionAnswering | 1.60178 | 2.1398 | 0.545401 | 0.408268 |
| huggingface_bfloat16_inference | BartForCausalLM | 1.366255 | 1.4544 | 0.441725 | 0.414954 |
| huggingface_bfloat16_inference | BertForMaskedLM | 1.521718 | 2.0576 | 0.562541 | 0.416033 |
| huggingface_bfloat16_inference | LayoutLMForSequenceClassification | 1.658037 | 2.1311 | 0.561941 | 0.437201 |
| huggingface_bfloat16_inference | YituTechConvBert | 1.757703 | 1.9947 | 0.503386 | 0.443577 |
| huggingface_bfloat16_inference | LayoutLMForMaskedLM | 1.557921 | 2.0338 | 0.582644 | 0.446314 |
| huggingface_bfloat16_inference | CamemBert | 1.51492 | 2.2051 | 0.658283 | 0.452245 |
| huggingface_bfloat16_inference | AllenaiLongformerBase | 1.681576 | 1.8998 | 0.523111 | 0.463023 |
| huggingface_bfloat16_inference | MBartForConditionalGeneration | 1.254131 | 1.6338 | 0.606666 | 0.465686 |
| huggingface_bfloat16_inference | ElectraForCausalLM | 1.914964 | 2.3972 | 0.608062 | 0.485741 |
| huggingface_bfloat16_inference | RobertaForQuestionAnswering | 1.642645 | 2.1503 | 0.637465 | 0.486969 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| torchbench_amp_fp16_training | demucs | 2.047933 | 1.0475 | 0.08289 | 0.162055 |
| torchbench_bfloat16_inference | stable_diffusion_unet | 1.241336 | 1.3206 | 0.229972 | 0.216169 |
| torchbench_amp_fp16_training | basic_gnn_edgecnn | 0.668626 | 1.3301 | 0.524298 | 0.263558 |
| torchbench_amp_fp16_training | basic_gnn_gin | 0.462562 | 1.5465 | 0.914707 | 0.273591 |
| torchbench_bfloat16_inference | hf_Whisper | 1.626784 | 1.4091 | 0.239924 | 0.276988 |
| torchbench_bfloat16_inference | hf_distil_whisper | 1.746267 | 1.2789 | 0.206975 | 0.282613 |
| torchbench_amp_fp16_training | basic_gnn_sage | 0.431613 | 1.3836 | 0.918225 | 0.28644 |
| torchbench_amp_fp16_training | stable_diffusion_unet | 1.231067 | 1.4853 | 0.422831 | 0.350457 |
| torchbench_bfloat16_inference | timm_vision_transformer | 1.314497 | 1.1377 | 0.303696 | 0.35089 |
| torchbench_amp_fp16_training | pytorch_unet | 1.09446 | 1.5131 | 0.499519 | 0.361314 |
| torchbench_amp_fp16_training | Background_Matting | 1.086445 | 1.3335 | 0.460481 | 0.375168 |
| torchbench_amp_fp16_training | hf_Longformer | 1.458085 | 2.9905 | 0.804781 | 0.392389 |
| torchbench_amp_fp16_training | basic_gnn_gcn | 0.644957 | 1.0261 | 0.63822 | 0.401154 |
| torchbench_bfloat16_inference | BERT_pytorch | 1.391206 | 2.0482 | 0.595367 | 0.404393 |
| torchbench_amp_fp16_training | hf_Whisper | 1.390221 | 1.2352 | 0.36053 | 0.405777 |
| torchbench_amp_fp16_training | Super_SloMo | 1.032225 | 1.4135 | 0.561799 | 0.41026 |
| torchbench_amp_fp16_training | hf_T5 | 1.335135 | 1.893 | 0.582934 | 0.411144 |
| torchbench_bfloat16_inference | timm_vision_transformer_large | 1.125584 | 1.0202 | 0.376041 | 0.414885 |
| torchbench_bfloat16_inference | demucs | 1.022757 | 1.2081 | 0.495616 | 0.41958 |
| torchbench_bfloat16_inference | hf_Albert | 1.890689 | 2.6302 | 0.59021 | 0.424266 |
| torchbench_bfloat16_inference | timm_nfnet | 1.545414 | 1.8041 | 0.500001 | 0.428307 |
| torchbench_amp_fp16_training | timm_nfnet | 1.4418 | 1.5648 | 0.465152 | 0.428589 |
| torchbench_bfloat16_inference | hf_Bart | 1.272364 | 1.7281 | 0.582624 | 0.428974 |
| torchbench_bfloat16_inference | hf_Bert | 1.403547 | 2.0083 | 0.630075 | 0.440343 |
| torchbench_bfloat16_inference | pyhpc_isoneutral_mixing | 2.365045 | 6.7835 | 1.304087 | 0.454666 |
| torchbench_bfloat16_inference | hf_Longformer | 1.749718 | 1.8634 | 0.520506 | 0.488751 |
| torchbench_amp_fp16_training | hf_Albert | 2.653909 | 3.0928 | 0.571739 | 0.490605 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| timm_models_amp_fp16_training | eca_halonext26ts | 1.136842 | 1.5709 | 0.393641 | 0.284873 |
| timm_models_amp_fp16_training | eca_botnext26ts_256 | 1.364209 | 1.5703 | 0.390792 | 0.339503 |
| timm_models_amp_fp16_training | crossvit_9_240 | 1.601276 | 1.7663 | 0.432011 | 0.391649 |
| timm_models_amp_fp16_training | mobilevit_s | 1.294885 | 1.4252 | 0.432646 | 0.393087 |
| timm_models_amp_fp16_training | pit_b_224 | 1.3922 | 1.446 | 0.433992 | 0.417844 |
| timm_models_amp_fp16_training | swin_base_patch4_window7_224 | 1.286933 | 1.7502 | 0.56837 | 0.417926 |
| timm_models_amp_fp16_training | dm_nfnet_f0 | 1.434417 | 1.6076 | 0.497232 | 0.443666 |
| timm_models_amp_fp16_training | beit_base_patch16_224 | 1.287437 | 1.3895 | 0.479719 | 0.444482 |
| timm_models_amp_fp16_training | resnest101e | 1.612318 | 1.424 | 0.395661 | 0.447985 |
| timm_models_amp_fp16_training | ghostnet_100 | 2.42258 | 3.9511 | 0.733566 | 0.449779 |
| timm_models_amp_fp16_training | vit_base_patch16_224 | 1.26337 | 1.3202 | 0.473338 | 0.452962 |
| timm_models_amp_fp16_training | res2next50 | 1.497527 | 1.4005 | 0.447182 | 0.478163 |
| timm_models_amp_fp16_training | mnasnet_100 | 2.482549 | 2.8118 | 0.550493 | 0.486033 |
| timm_models_amp_fp16_training | tnt_s_patch16_224 | 3.21354 | 3.4043 | 0.516651 | 0.487701 |
| timm_models_amp_fp16_training | fbnetc_100 | 2.49366 | 2.9924 | 0.594391 | 0.495325 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| timm_models_bfloat16_inference | pit_b_224 | 1.293497 | 1.1016 | 0.281519 | 0.330559 |
| timm_models_bfloat16_inference | beit_base_patch16_224 | 1.102877 | 1.1071 | 0.339333 | 0.338039 |
| timm_models_bfloat16_inference | crossvit_9_240 | 1.326512 | 1.3763 | 0.361046 | 0.347985 |
| timm_models_bfloat16_inference | dm_nfnet_f0 | 1.547564 | 1.9984 | 0.457605 | 0.35437 |
| timm_models_bfloat16_inference | vit_base_patch16_224 | 1.229941 | 1.0655 | 0.321644 | 0.371284 |
| timm_models_bfloat16_inference | resnest101e | 1.57353 | 1.7385 | 0.420429 | 0.380533 |
| timm_models_bfloat16_inference | pnasnet5large | 1.775325 | 2.4233 | 0.581028 | 0.425665 |
| timm_models_bfloat16_inference | dpn107 | 1.624985 | 2.0272 | 0.543763 | 0.435875 |
| timm_models_bfloat16_inference | tnt_s_patch16_224 | 1.904305 | 2.4122 | 0.567356 | 0.447898 |
| timm_models_bfloat16_inference | deit_base_distilled_patch16_224 | 1.373973 | 1.0494 | 0.354831 | 0.464578 |
| timm_models_bfloat16_inference | poolformer_m36 | 1.715762 | 2.2624 | 0.615038 | 0.466434 |
| timm_models_bfloat16_inference | twins_pcpvt_base | 1.326378 | 1.4436 | 0.540626 | 0.496727 |
| timm_models_bfloat16_inference | gluon_inception_v3 | 2.15053 | 2.123 | 0.492487 | 0.498873 |

Versions

- Hardware: PVC 1100
- Driver: 803.61
- Bundle: 0.5.1
- pytorch: https://github.com/pytorch/pytorch/commit/32e74ed
- torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1dcaf3e

| Transformers | Timm | Torchbench | Torchvision | Torchaudio |
| --- | --- | --- | --- | --- |
| 243e186 | 730b907 | febcba7 | d23a6e1 | b829e93 |
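
For readers unfamiliar with the columns, here is a minimal sketch of how a single "Inductor vs. Eager [XPU]" cell could be produced. This is illustrative only, not the actual harness behind the tables above; it assumes a PyTorch build with XPU support (so that `torch.xpu.synchronize()` and `torch.compile` with the Inductor backend are available), and the model and input are placeholders.

```python
import time
import torch

def bench(fn, warmup=5, iters=20):
    """Average wall-clock latency of fn() after warm-up."""
    for _ in range(warmup):
        fn()
    torch.xpu.synchronize()  # drain queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.xpu.synchronize()  # make sure all timed work has finished
    return (time.perf_counter() - start) / iters

device = "xpu"
# Placeholder model and input; the report runs HuggingFace, timm,
# and TorchBench models instead.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(device).to(torch.bfloat16).eval()
x = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)

compiled = torch.compile(model, backend="inductor")

with torch.no_grad():
    t_eager = bench(lambda: model(x))
    # First compiled call triggers compilation; warm-up absorbs it.
    t_inductor = bench(lambda: compiled(x))

# "Inductor vs. Eager" as a speedup ratio: values > 1 mean Inductor is faster.
print(f"Inductor vs. Eager [XPU]: {t_eager / t_inductor:.4f}")
```

The CUDA columns could be measured the same way with `device = "cuda"` and `torch.cuda.synchronize()`.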
dvrogozh commented 1 month ago

What are these numbers? Latency? Is lower better?

chuanqi129 commented 2 weeks ago

@retonym will add a summary for it

retonym commented 2 weeks ago

Apart from the hardware differences between PVC and A100, we are currently facing the following software-efficiency issues.