intel / torch-xpu-ops


[E2E] Some models' performance on PVC 1100 is lower than 0.5× that of A100 #631

Open · chuanqi129 opened this issue 1 month ago

chuanqi129 commented 1 month ago

🐛 Describe the bug

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| huggingface_amp_fp16_training | BlenderbotForCausalLM | 1.445563 | 0 | 0 | 0 |
| huggingface_amp_fp16_training | GoogleFnet | 1.130302 | 1.7885 | 0.184332 | 0.116495 |
| huggingface_amp_fp16_training | XLNetLMHeadModel | 0.818967 | 2.0096 | 0.654382 | 0.266679 |
| huggingface_amp_fp16_training | AllenaiLongformerBase | 1.566016 | 2.7945 | 0.605366 | 0.339243 |
| huggingface_amp_fp16_training | T5Small | 0.916485 | 1.6912 | 0.639776 | 0.346704 |
| huggingface_amp_fp16_training | T5ForConditionalGeneration | 0.908585 | 1.6874 | 0.649019 | 0.349466 |
| huggingface_amp_fp16_training | ElectraForQuestionAnswering | 2.252931 | 3.0889 | 0.569206 | 0.415158 |
| huggingface_amp_fp16_training | YituTechConvBert | 1.500966 | 1.9019 | 0.550297 | 0.43429 |
| huggingface_amp_fp16_training | BertForQuestionAnswering | 2.023307 | 2.4958 | 0.55466 | 0.449654 |
| huggingface_amp_fp16_training | BartForCausalLM | 1.304644 | 1.3774 | 0.480397 | 0.455022 |
| huggingface_amp_fp16_training | CamemBert | 1.871217 | 2.4716 | 0.601443 | 0.455345 |
| huggingface_amp_fp16_training | RobertaForCausalLM | 1.918584 | 2.5157 | 0.609188 | 0.464594 |
| huggingface_amp_fp16_training | BertForMaskedLM | 1.951344 | 2.3745 | 0.569245 | 0.467801 |
| huggingface_amp_fp16_training | RobertaForQuestionAnswering | 2.019149 | 2.4921 | 0.581912 | 0.471477 |
| huggingface_amp_fp16_training | PLBartForCausalLM | 1.753479 | 2.1765 | 0.588808 | 0.474368 |
| huggingface_amp_fp16_training | LayoutLMForSequenceClassification | 2.032835 | 2.4209 | 0.567496 | 0.476528 |
| huggingface_amp_fp16_training | LayoutLMForMaskedLM | 1.957796 | 2.3533 | 0.573175 | 0.476846 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| huggingface_bfloat16_inference | GoogleFnet | 1.131774 | 1.8792 | 0.153149 | 0.092236 |
| huggingface_bfloat16_inference | BartForConditionalGeneration | 1.259094 | 1.4757 | 0.438865 | 0.374447 |
| huggingface_bfloat16_inference | PLBartForConditionalGeneration | 1.276682 | 1.9 | 0.56172 | 0.377441 |
| huggingface_bfloat16_inference | ElectraForQuestionAnswering | 1.831429 | 2.7424 | 0.578978 | 0.386653 |
| huggingface_bfloat16_inference | BertForQuestionAnswering | 1.60178 | 2.1398 | 0.545401 | 0.408268 |
| huggingface_bfloat16_inference | BartForCausalLM | 1.366255 | 1.4544 | 0.441725 | 0.414954 |
| huggingface_bfloat16_inference | BertForMaskedLM | 1.521718 | 2.0576 | 0.562541 | 0.416033 |
| huggingface_bfloat16_inference | LayoutLMForSequenceClassification | 1.658037 | 2.1311 | 0.561941 | 0.437201 |
| huggingface_bfloat16_inference | YituTechConvBert | 1.757703 | 1.9947 | 0.503386 | 0.443577 |
| huggingface_bfloat16_inference | LayoutLMForMaskedLM | 1.557921 | 2.0338 | 0.582644 | 0.446314 |
| huggingface_bfloat16_inference | CamemBert | 1.51492 | 2.2051 | 0.658283 | 0.452245 |
| huggingface_bfloat16_inference | AllenaiLongformerBase | 1.681576 | 1.8998 | 0.523111 | 0.463023 |
| huggingface_bfloat16_inference | MBartForConditionalGeneration | 1.254131 | 1.6338 | 0.606666 | 0.465686 |
| huggingface_bfloat16_inference | ElectraForCausalLM | 1.914964 | 2.3972 | 0.608062 | 0.485741 |
| huggingface_bfloat16_inference | RobertaForQuestionAnswering | 1.642645 | 2.1503 | 0.637465 | 0.486969 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| torchbench_amp_fp16_training | demucs | 2.047933 | 1.0475 | 0.08289 | 0.162055 |
| torchbench_bfloat16_inference | stable_diffusion_unet | 1.241336 | 1.3206 | 0.229972 | 0.216169 |
| torchbench_amp_fp16_training | basic_gnn_edgecnn | 0.668626 | 1.3301 | 0.524298 | 0.263558 |
| torchbench_amp_fp16_training | basic_gnn_gin | 0.462562 | 1.5465 | 0.914707 | 0.273591 |
| torchbench_bfloat16_inference | hf_Whisper | 1.626784 | 1.4091 | 0.239924 | 0.276988 |
| torchbench_bfloat16_inference | hf_distil_whisper | 1.746267 | 1.2789 | 0.206975 | 0.282613 |
| torchbench_amp_fp16_training | basic_gnn_sage | 0.431613 | 1.3836 | 0.918225 | 0.28644 |
| torchbench_amp_fp16_training | stable_diffusion_unet | 1.231067 | 1.4853 | 0.422831 | 0.350457 |
| torchbench_bfloat16_inference | timm_vision_transformer | 1.314497 | 1.1377 | 0.303696 | 0.35089 |
| torchbench_amp_fp16_training | pytorch_unet | 1.09446 | 1.5131 | 0.499519 | 0.361314 |
| torchbench_amp_fp16_training | Background_Matting | 1.086445 | 1.3335 | 0.460481 | 0.375168 |
| torchbench_amp_fp16_training | hf_Longformer | 1.458085 | 2.9905 | 0.804781 | 0.392389 |
| torchbench_amp_fp16_training | basic_gnn_gcn | 0.644957 | 1.0261 | 0.63822 | 0.401154 |
| torchbench_bfloat16_inference | BERT_pytorch | 1.391206 | 2.0482 | 0.595367 | 0.404393 |
| torchbench_amp_fp16_training | hf_Whisper | 1.390221 | 1.2352 | 0.36053 | 0.405777 |
| torchbench_amp_fp16_training | Super_SloMo | 1.032225 | 1.4135 | 0.561799 | 0.41026 |
| torchbench_amp_fp16_training | hf_T5 | 1.335135 | 1.893 | 0.582934 | 0.411144 |
| torchbench_bfloat16_inference | timm_vision_transformer_large | 1.125584 | 1.0202 | 0.376041 | 0.414885 |
| torchbench_bfloat16_inference | demucs | 1.022757 | 1.2081 | 0.495616 | 0.41958 |
| torchbench_bfloat16_inference | hf_Albert | 1.890689 | 2.6302 | 0.59021 | 0.424266 |
| torchbench_bfloat16_inference | timm_nfnet | 1.545414 | 1.8041 | 0.500001 | 0.428307 |
| torchbench_amp_fp16_training | timm_nfnet | 1.4418 | 1.5648 | 0.465152 | 0.428589 |
| torchbench_bfloat16_inference | hf_Bart | 1.272364 | 1.7281 | 0.582624 | 0.428974 |
| torchbench_bfloat16_inference | hf_Bert | 1.403547 | 2.0083 | 0.630075 | 0.440343 |
| torchbench_bfloat16_inference | pyhpc_isoneutral_mixing | 2.365045 | 6.7835 | 1.304087 | 0.454666 |
| torchbench_bfloat16_inference | hf_Longformer | 1.749718 | 1.8634 | 0.520506 | 0.488751 |
| torchbench_amp_fp16_training | hf_Albert | 2.653909 | 3.0928 | 0.571739 | 0.490605 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| timm_models_amp_fp16_training | eca_halonext26ts | 1.136842 | 1.5709 | 0.393641 | 0.284873 |
| timm_models_amp_fp16_training | eca_botnext26ts_256 | 1.364209 | 1.5703 | 0.390792 | 0.339503 |
| timm_models_amp_fp16_training | crossvit_9_240 | 1.601276 | 1.7663 | 0.432011 | 0.391649 |
| timm_models_amp_fp16_training | mobilevit_s | 1.294885 | 1.4252 | 0.432646 | 0.393087 |
| timm_models_amp_fp16_training | pit_b_224 | 1.3922 | 1.446 | 0.433992 | 0.417844 |
| timm_models_amp_fp16_training | swin_base_patch4_window7_224 | 1.286933 | 1.7502 | 0.56837 | 0.417926 |
| timm_models_amp_fp16_training | dm_nfnet_f0 | 1.434417 | 1.6076 | 0.497232 | 0.443666 |
| timm_models_amp_fp16_training | beit_base_patch16_224 | 1.287437 | 1.3895 | 0.479719 | 0.444482 |
| timm_models_amp_fp16_training | resnest101e | 1.612318 | 1.424 | 0.395661 | 0.447985 |
| timm_models_amp_fp16_training | ghostnet_100 | 2.42258 | 3.9511 | 0.733566 | 0.449779 |
| timm_models_amp_fp16_training | vit_base_patch16_224 | 1.26337 | 1.3202 | 0.473338 | 0.452962 |
| timm_models_amp_fp16_training | res2next50 | 1.497527 | 1.4005 | 0.447182 | 0.478163 |
| timm_models_amp_fp16_training | mnasnet_100 | 2.482549 | 2.8118 | 0.550493 | 0.486033 |
| timm_models_amp_fp16_training | tnt_s_patch16_224 | 3.21354 | 3.4043 | 0.516651 | 0.487701 |
| timm_models_amp_fp16_training | fbnetc_100 | 2.49366 | 2.9924 | 0.594391 | 0.495325 |

| Category | Name | Inductor vs. Eager [XPU] | Inductor vs. Eager [CUDA] | XPU vs. CUDA [Eager] | XPU vs. CUDA [Inductor] |
| --- | --- | --- | --- | --- | --- |
| timm_models_bfloat16_inference | pit_b_224 | 1.293497 | 1.1016 | 0.281519 | 0.330559 |
| timm_models_bfloat16_inference | beit_base_patch16_224 | 1.102877 | 1.1071 | 0.339333 | 0.338039 |
| timm_models_bfloat16_inference | crossvit_9_240 | 1.326512 | 1.3763 | 0.361046 | 0.347985 |
| timm_models_bfloat16_inference | dm_nfnet_f0 | 1.547564 | 1.9984 | 0.457605 | 0.35437 |
| timm_models_bfloat16_inference | vit_base_patch16_224 | 1.229941 | 1.0655 | 0.321644 | 0.371284 |
| timm_models_bfloat16_inference | resnest101e | 1.57353 | 1.7385 | 0.420429 | 0.380533 |
| timm_models_bfloat16_inference | pnasnet5large | 1.775325 | 2.4233 | 0.581028 | 0.425665 |
| timm_models_bfloat16_inference | dpn107 | 1.624985 | 2.0272 | 0.543763 | 0.435875 |
| timm_models_bfloat16_inference | tnt_s_patch16_224 | 1.904305 | 2.4122 | 0.567356 | 0.447898 |
| timm_models_bfloat16_inference | deit_base_distilled_patch16_224 | 1.373973 | 1.0494 | 0.354831 | 0.464578 |
| timm_models_bfloat16_inference | poolformer_m36 | 1.715762 | 2.2624 | 0.615038 | 0.466434 |
| timm_models_bfloat16_inference | twins_pcpvt_base | 1.326378 | 1.4436 | 0.540626 | 0.496727 |
| timm_models_bfloat16_inference | gluon_inception_v3 | 2.15053 | 2.123 | 0.492487 | 0.498873 |

Versions

- Hardware: PVC 1100
- Driver: 803.61
- Bundle: 0.5.1
- pytorch: https://github.com/pytorch/pytorch/commit/32e74ed
- torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1dcaf3e

| Transformers | Timm | Torchbench | Torchvision | Torchaudio |
| --- | --- | --- | --- | --- |
| 243e186 | 730b907 | febcba7 | d23a6e1 | b829e93 |
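
For readers unfamiliar with the columns, here is a minimal sketch of how a single "Inductor vs. Eager [XPU]" cell could be produced. This is illustrative only, not the actual harness behind the tables above; it assumes a PyTorch build with XPU support (so that `torch.xpu.synchronize()` and `torch.compile` with the Inductor backend are available), and the model and input are placeholders.

```python
import time
import torch

def bench(fn, warmup=5, iters=20):
    """Average wall-clock latency of fn() after warm-up."""
    for _ in range(warmup):
        fn()
    torch.xpu.synchronize()  # drain queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.xpu.synchronize()  # make sure all timed work has finished
    return (time.perf_counter() - start) / iters

device = "xpu"
# Placeholder model and input; the report runs HuggingFace, timm,
# and TorchBench models instead.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(device).to(torch.bfloat16).eval()
x = torch.randn(64, 1024, device=device, dtype=torch.bfloat16)

compiled = torch.compile(model, backend="inductor")

with torch.no_grad():
    t_eager = bench(lambda: model(x))
    # First compiled call triggers compilation; warm-up absorbs it.
    t_inductor = bench(lambda: compiled(x))

# "Inductor vs. Eager" as a speedup ratio: values > 1 mean Inductor is faster.
print(f"Inductor vs. Eager [XPU]: {t_eager / t_inductor:.4f}")
```

The CUDA columns could be measured the same way with `device = "cuda"` and `torch.cuda.synchronize()`.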
dvrogozh commented 1 month ago

What are these numbers? Latency? Is lower better?

chuanqi129 commented 2 weeks ago

@retonym will add a summary for it

retonym commented 2 weeks ago

Apart from the hardware differences between PVC and A100, we are currently facing the following software-efficiency issues.