[inductor][cpu] amp/amp_fp16 performance regression in 2024-11-18 nightly release

zxd1997066 commented 5 days ago

🐛 Describe the bug

amp static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.537466 | 0.010730228999999999 | 0.027227591259714 | 17.470544 | 64 | 2.876879 | 0.009069634 | 0.026092239592286 | 17.572227 | 0.88 | 0.96 | 0.85 | 1.01 |
| torchbench | Super_SloMo | multiple | 6 | 1.032684 | 0.229767659 | 0.237277385166756 | 27.019539 | 6 | 1.194091 | 0.199764877 | 0.23853744174180702 | 27.207285 | 0.86 | 1.01 | 0.87 | 1.01 |
| torchbench | shufflenet_v2_x1_0 | single | 1 | 3.166121 | 0.003873212 | 0.012263057850652 | 16.222689 | 1 | 3.601001 | 0.003419612 | 0.012314026231612 | 16.244421 | 0.88 | 1.0 | 0.88 | 1.0 |

amp dynamic shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | single | 1 | 3.158312 | 0.003915195 | 0.01236540735084 | 16.153606 | 1 | 3.620155 | 0.0034262109999999998 | 0.012403414882704999 | 16.179824 | 0.87 | 1.0 | 0.88 | 1.0 |

amp_fp16 static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.648366 | 0.02387262 | 0.06322343513892001 | 59.549544 | 64 | 2.966687 | 0.0206696 | 0.0613202336152 | 60.026376 | 0.89 | 0.97 | 0.87 | 1.01 |
| torchbench | basic_gnn_gcn | single | 1 | 1.837054 | 0.047721980000000004 | 0.08766785424692 | 78.320497 | 1 | 2.230323 | 0.04047754 | 0.09027798844542 | 79.910189 | 0.82 | 1.03 | 0.85 | 1.02 |
| torchbench | Background_Matting | single | 1 | 1.709493 | 3.19020124 | 5.45362668837132 | 47.723786 | 1 | 2.014685 | 2.72976672 | 5.4996200642832 | 46.431005 | 0.85 | 1.01 | 0.86 | 0.97 |
| torchbench | doctr_det_predictor | single | 1 | 4.086563 | 0.95522034 | 3.90356809829142 | 64.725399 | 1 | 4.464748 | 0.85025308 | 3.79616573842384 | 66.383939 | 0.92 | 0.97 | 0.89 | 1.03 |

amp_fp16 dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | basic_gnn_gcn | multiple | 1 | 1.12969 | 0.010413157999999999 | 0.011763640461019999 | 32.021382 | 1 | 1.146111 | 0.009186986 | 0.010529305711446 | 32.220003 | 0.99 | 0.9 | 0.88 | 1.01 |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.264811 | 0.022603352 | 0.051192320246471995 | 43.678762 | 64 | 2.48406 | 0.019864012 | 0.04934339764872 | 43.204043 | 0.91 | 0.96 | 0.88 | 0.99 |
| torchbench | Background_Matting | single | 1 | 1.680371 | 2.880739765 | 4.840711559652815 | 50.16649 | 1 | 1.956941 | 2.4907366189999998 | 4.874224609922479 | 48.728337 | 0.86 | 1.01 | 0.86 | 0.97 |
| torchbench | basic_gnn_gcn | single | 1 | 1.835839 | 0.040765593 | 0.074839065487527 | 33.268975 | 1 | 2.134891 | 0.034357286 | 0.073349060665826 | 32.15738 | 0.86 | 0.98 | 0.84 | 0.97 |
| torchbench | doctr_det_predictor | single | 1 | 4.044143 | 0.857683021 | 3.468592785596003 | 54.840956 | 1 | 4.6629 | 0.748766216 | 3.4914219885863997 | 55.272061 | 0.87 | 1.01 | 0.87 | 1.01 |
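
For readers checking the numbers, the ratio columns in the tables above can be recomputed from the raw columns. The sketch below does this for the `shufflenet_v2_x1_0` single-thread row of the amp static shape cpp wrapper table. The helper name `ratio_columns` and the rounding to two decimals are assumptions made for illustration, not taken from the benchmark tooling; the observation that `speed_up` equals `eager / inductor` is inferred from the reported values rather than stated in the issue.

```python
# Values copied from the amp static shape cpp wrapper table
# (shufflenet_v2_x1_0, single thread).
new = {
    "speed_up": 3.166121,
    "inductor": 0.003873212,
    "eager": 0.012263057850652,
    "compilation_latency": 16.222689,
}
old = {
    "speed_up": 3.601001,
    "inductor": 0.003419612,
    "eager": 0.012314026231612,
    "compilation_latency": 16.244421,
}


def ratio_columns(new, old):
    """Recompute the four ratio columns (hypothetical helper).

    Rounding to 2 decimals is an assumption about how the report
    was formatted.
    """
    return {
        "Ratio Speedup(New/old)": round(new["speed_up"] / old["speed_up"], 2),
        "Eager Ratio(old/new)": round(old["eager"] / new["eager"], 2),
        "Inductor Ratio(old/new)": round(old["inductor"] / new["inductor"], 2),
        "Compilation_latency_Ratio(old/new)": round(
            old["compilation_latency"] / new["compilation_latency"], 2
        ),
    }


ratios = ratio_columns(new, old)

# In the reported data, speed_up appears to be eager latency / inductor latency.
implied_speedup_new = new["eager"] / new["inductor"]

for column, value in ratios.items():
    print(f"{column}: {value}")
print(f"eager_new / inductor_new: {implied_speedup_new:.6f}")
```

A speedup ratio below 1 together with an eager ratio near 1 indicates that the regression comes from the Inductor-compiled path rather than from eager mode.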

the bad commit: 263a5bf95e8b0160f22f29039235e7fa523a1048

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu eval shufflenet_v2_x1_0
running benchmark: 100%|███████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 60.73it/s]
3.125x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,shufflenet_v2_x1_0,1,3.125110,3.915668,18.889813,0.825000,46.714061,56.623104,273,1,0,0,0,0,0
```

the last good commit: 29114e44fa7a17a3a2112d76937ae3b4cf9d33ce

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu eval shufflenet_v2_x1_0
running benchmark: 100%|███████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 62.93it/s]
3.558x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,shufflenet_v2_x1_0,1,3.557891,3.422257,18.912550,0.821577,46.714061,56.859034,273,1,0,0,0,0,0
```

### Versions

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
| --- | --- | --- | --- | --- |
| torchbench | main | 766a5e3a | main | 766a5e3a |
| torch | main | 2fc692b3dd42bf92c4f92dcec862bae7ae1c7995 | main | 5ef33e40b3c3fd2608552d3301c7255826c0e7f6 |
| torchvision | main | 0.19.0a0+d23a6e1 | main | 0.19.0a0+d23a6e1 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.5.0a0+332760d | main | 2.5.0a0+fa44bda |
| torchdata | main | 0.7.0a0+11bb5b8 | main | 0.7.0a0+11bb5b8 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro: [inductor_single_run.sh](https://github.com/chuanqi129/inductor-tools/blob//main/scripts/modelbench/inductor_single_run.sh)

`bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp`

Suspected guilty commit: https://github.com/pytorch/pytorch/commit/263a5bf95e8b0160f22f29039235e7fa523a1048

[torchbench-shufflenet_v2_x1_0-inference-amp-static-cpp-single-performance-drop_guilty_commit.log](https://github.com/user-attachments/files/17842867/torchbench-shufflenet_v2_x1_0-inference-amp-static-cpp-single-performance-drop_guilty_commit.log)

cc @chuanqi129

chunyuan-w commented 5 days ago

cc @Valentine233

Valentine233 commented 5 days ago

For the guilty commit, some regressions are expected because we removed a compiler flag. I resolved the large regressions before landing the PR. The remaining speedup ratios are all above 0.8, and the geomean remains almost the same.
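
As a rough check of the "all above 0.8" observation, the sketch below takes the `Ratio Speedup(New/old)` values from the thirteen rows reported in this issue and computes their minimum and geometric mean. This covers only the regressed rows listed here, so it is not the suite-wide geomean referred to in the comment, which would need the full benchmark results. The variable names are illustrative.

```python
from statistics import geometric_mean

# Ratio Speedup(New/old) values copied from the tables in this issue,
# grouped by configuration. Only regressed models are listed there.
reported_speedup_ratios = {
    "amp static shape cpp wrapper": [0.88, 0.86, 0.88],
    "amp dynamic shape cpp wrapper": [0.87],
    "amp_fp16 static shape cpp wrapper": [0.89, 0.82, 0.85, 0.92],
    "amp_fp16 dynamic shape default wrapper": [0.99, 0.91, 0.86, 0.86, 0.87],
}

# Flatten all rows into a single list of ratios.
all_ratios = [
    ratio
    for ratios in reported_speedup_ratios.values()
    for ratio in ratios
]

worst_ratio = min(all_ratios)
geomean_of_reported_rows = geometric_mean(all_ratios)

print(f"rows: {len(all_ratios)}")
print(f"worst ratio: {worst_ratio}")
print(f"geomean over reported rows: {geomean_of_reported_rows:.3f}")
```

Because the list contains only models flagged as regressions, its geomean is expected to sit well below the suite-wide figure.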

[inductor][cpu] amp/amp_fp16 performance regression in 2024-11-18 nightly release #141222