pytorch / pytorch


[inductor][cpu] amp/amp_fp16 performance regression in 2024-11-18 nightly release #141222

Open zxd1997066 opened 5 days ago

zxd1997066 commented 5 days ago

### 🐛 Describe the bug

amp static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.537466 | 0.010730228999999999 | 0.027227591259714 | 17.470544 | 64 | 2.876879 | 0.009069634 | 0.026092239592286 | 17.572227 | 0.88 | 0.96 | 0.85 | 1.01 |
| torchbench | Super_SloMo | multiple | 6 | 1.032684 | 0.229767659 | 0.237277385166756 | 27.019539 | 6 | 1.194091 | 0.199764877 | 0.23853744174180702 | 27.207285 | 0.86 | 1.01 | 0.87 | 1.01 |
| torchbench | shufflenet_v2_x1_0 | single | 1 | 3.166121 | 0.003873212 | 0.012263057850652 | 16.222689 | 1 | 3.601001 | 0.003419612 | 0.012314026231612 | 16.244421 | 0.88 | 1.0 | 0.88 | 1.0 |
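
For readers checking the ratio columns, the following is a minimal sketch of how they appear to be derived, using the shufflenet_v2_x1_0 multi-thread row above. The relationship `speedup = eager / inductor` is an assumption inferred from the numbers (it is not stated in the issue), and the helper names are hypothetical.

```python
# Values copied from the shufflenet_v2_x1_0 "multiple" row of the table above.
new = {"inductor": 0.010730228999999999, "eager": 0.027227591259714, "compile": 17.470544}
old = {"inductor": 0.009069634, "eager": 0.026092239592286, "compile": 17.572227}


def speedup(run):
    """Assumed definition: eager latency divided by inductor latency."""
    return run["eager"] / run["inductor"]


def ratios(new, old):
    """Recompute the four ratio columns; values below 1.0 mean the new run is worse."""
    return {
        "speedup_new_over_old": speedup(new) / speedup(old),
        "eager_old_over_new": old["eager"] / new["eager"],
        "inductor_old_over_new": old["inductor"] / new["inductor"],
        "compile_old_over_new": old["compile"] / new["compile"],
    }


derived = {key: round(value, 2) for key, value in ratios(new, old).items()}
print(derived)
```

Under that assumption the rounded values match the row's last four columns: the eager latency is nearly unchanged while the inductor latency ratio drops to about 0.85, which is what points at the compiled path.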

amp dynamic shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | single | 1 | 3.158312 | 0.003915195 | 0.01236540735084 | 16.153606 | 1 | 3.620155 | 0.0034262109999999998 | 0.012403414882704999 | 16.179824 | 0.87 | 1.0 | 0.88 | 1.0 |

amp_fp16 static shape cpp wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.648366 | 0.02387262 | 0.06322343513892001 | 59.549544 | 64 | 2.966687 | 0.0206696 | 0.0613202336152 | 60.026376 | 0.89 | 0.97 | 0.87 | 1.01 |
| torchbench | basic_gnn_gcn | single | 1 | 1.837054 | 0.047721980000000004 | 0.08766785424692 | 78.320497 | 1 | 2.230323 | 0.04047754 | 0.09027798844542 | 79.910189 | 0.82 | 1.03 | 0.85 | 1.02 |
| torchbench | Background_Matting | single | 1 | 1.709493 | 3.19020124 | 5.45362668837132 | 47.723786 | 1 | 2.014685 | 2.72976672 | 5.4996200642832 | 46.431005 | 0.85 | 1.01 | 0.86 | 0.97 |
| torchbench | doctr_det_predictor | single | 1 | 4.086563 | 0.95522034 | 3.90356809829142 | 64.725399 | 1 | 4.464748 | 0.85025308 | 3.79616573842384 | 66.383939 | 0.92 | 0.97 | 0.89 | 1.03 |

amp_fp16 dynamic shape default wrapper

| suite | name | thread | batch_size_new | speed_up_new | inductor_new | eager_new | compilation_latency_new | batch_size_old | speed_up_old | inductor_old | eager_old | compilation_latency_old | Ratio Speedup(New/old) | Eager Ratio(old/new) | Inductor Ratio(old/new) | Compilation_latency_Ratio(old/new) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| torchbench | basic_gnn_gcn | multiple | 1 | 1.12969 | 0.010413157999999999 | 0.011763640461019999 | 32.021382 | 1 | 1.146111 | 0.009186986 | 0.010529305711446 | 32.220003 | 0.99 | 0.9 | 0.88 | 1.01 |
| torchbench | shufflenet_v2_x1_0 | multiple | 64 | 2.264811 | 0.022603352 | 0.051192320246471995 | 43.678762 | 64 | 2.48406 | 0.019864012 | 0.04934339764872 | 43.204043 | 0.91 | 0.96 | 0.88 | 0.99 |
| torchbench | Background_Matting | single | 1 | 1.680371 | 2.880739765 | 4.840711559652815 | 50.16649 | 1 | 1.956941 | 2.4907366189999998 | 4.874224609922479 | 48.728337 | 0.86 | 1.01 | 0.86 | 0.97 |
| torchbench | basic_gnn_gcn | single | 1 | 1.835839 | 0.040765593 | 0.074839065487527 | 33.268975 | 1 | 2.134891 | 0.034357286 | 0.073349060665826 | 32.15738 | 0.86 | 0.98 | 0.84 | 0.97 |
| torchbench | doctr_det_predictor | single | 1 | 4.044143 | 0.857683021 | 3.468592785596003 | 54.840956 | 1 | 4.6629 | 0.748766216 | 3.4914219885863997 | 55.272061 | 0.87 | 1.01 | 0.87 | 1.01 |

the bad commit: 263a5bf95e8b0160f22f29039235e7fa523a1048

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu eval shufflenet_v2_x1_0
running benchmark: 100%|███████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 60.73it/s]
3.125x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,shufflenet_v2_x1_0,1,3.125110,3.915668,18.889813,0.825000,46.714061,56.623104,273,1,0,0,0,0,0
```

the last good commit: 29114e44fa7a17a3a2112d76937ae3b4cf9d33ce

```
/workspace/pytorch# bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp
Testing with cpp wrapper.
Testing with inductor.
single-thread testing....
loading model: 0it [00:00, ?it/s]
cpu eval shufflenet_v2_x1_0
running benchmark: 100%|███████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 62.93it/s]
3.558x
WARNING:common:Trying to call the empty_gpu_cache for device: cpu, which is not in list [cuda, xpu]
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips
cpu,shufflenet_v2_x1_0,1,3.557891,3.422257,18.912550,0.821577,46.714061,56.859034,273,1,0,0,0,0,0
```

### Versions

SW info

| name | target_branch | target_commit | refer_branch | refer_commit |
| --- | --- | --- | --- | --- |
| torchbench | main | 766a5e3a | main | 766a5e3a |
| torch | main | 2fc692b3dd42bf92c4f92dcec862bae7ae1c7995 | main | 5ef33e40b3c3fd2608552d3301c7255826c0e7f6 |
| torchvision | main | 0.19.0a0+d23a6e1 | main | 0.19.0a0+d23a6e1 |
| torchtext | main | 0.16.0a0+b0ebddc | main | 0.16.0a0+b0ebddc |
| torchaudio | main | 2.5.0a0+332760d | main | 2.5.0a0+fa44bda |
| torchdata | main | 0.7.0a0+11bb5b8 | main | 0.7.0a0+11bb5b8 |
| dynamo_benchmarks | main | nightly | main | nightly |

Repro: [inductor_single_run.sh](https://github.com/chuanqi129/inductor-tools/blob//main/scripts/modelbench/inductor_single_run.sh)

`bash inductor_single_run.sh single inference performance torchbench shufflenet_v2_x1_0 amp first static cpp`

Suspected guilty commit: https://github.com/pytorch/pytorch/commit/263a5bf95e8b0160f22f29039235e7fa523a1048

[torchbench-shufflenet_v2_x1_0-inference-amp-static-cpp-single-performance-drop_guilty_commit.log](https://github.com/user-attachments/files/17842867/torchbench-shufflenet_v2_x1_0-inference-amp-static-cpp-single-performance-drop_guilty_commit.log)

cc @chuanqi129
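
When repeating the repro on two commits, the two CSV result rows can be compared directly. The sketch below parses the rows quoted in this issue for the bad and last good commits and computes the speedup and latency ratios. The helper `parse_row` is hypothetical, and the unit of `abs_latency` is not stated in the log, so only ratios are computed.

```python
import csv
import io

# Header and result rows copied from the benchmark output quoted in this issue.
HEADER = (
    "dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,"
    "eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,"
    "unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips"
)
BAD = "cpu,shufflenet_v2_x1_0,1,3.125110,3.915668,18.889813,0.825000,46.714061,56.623104,273,1,0,0,0,0,0"
GOOD = "cpu,shufflenet_v2_x1_0,1,3.557891,3.422257,18.912550,0.821577,46.714061,56.859034,273,1,0,0,0,0,0"


def parse_row(header, row):
    """Return one CSV result row as a dict keyed by column name."""
    return next(csv.DictReader(io.StringIO(header + "\n" + row)))


bad = parse_row(HEADER, BAD)
good = parse_row(HEADER, GOOD)

# Below 1.0 means the bad commit is slower than the last good commit.
speedup_ratio = float(bad["speedup"]) / float(good["speedup"])
latency_ratio = float(good["abs_latency"]) / float(bad["abs_latency"])

print(f"speedup ratio (bad/good): {speedup_ratio:.2f}")
print(f"latency ratio (good/bad): {latency_ratio:.2f}")
```

Both ratios land close to the 0.88 reported for this model in the amp static shape table, so the single-run repro is consistent with the nightly numbers.
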
chunyuan-w commented 5 days ago

cc @Valentine233

Valentine233 commented 5 days ago

For the guilty commit, some regressions are expected because we removed a compiler flag. I had already resolved the large regressions before landing the PR. The remaining speedup ratios (new/old) are all above 0.8, and the geomean remains almost the same.
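
As a rough check of the "above 0.8" statement, the sketch below takes the `Ratio Speedup(New/old)` values from the tables in this issue and computes their minimum and geometric mean. It covers only the regressed models listed here, not the full benchmark suite that the geomean in the comment refers to, so its result is not comparable to that suite-wide figure. The `THRESHOLD` name is an illustrative choice.

```python
import math

# "Ratio Speedup(New/old)" values copied from the four tables in this issue,
# in order: amp static, amp dynamic, amp_fp16 static, amp_fp16 dynamic.
reported_ratios = [
    0.88, 0.86, 0.88,
    0.87,
    0.89, 0.82, 0.85, 0.92,
    0.99, 0.91, 0.86, 0.86, 0.87,
]

THRESHOLD = 0.8  # floor mentioned in the comment above


def geomean(values):
    """Geometric mean computed via the mean of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


worst = min(reported_ratios)
listed_geomean = geomean(reported_ratios)

print(f"worst ratio: {worst}, above threshold: {worst > THRESHOLD}")
print(f"geomean of listed regressions: {listed_geomean:.3f}")
```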