So far, we haven't been able to reproduce this behavior on the current llvm-target branch. Will investigate further.
Work on this ticket was blocked by #1647 for most of last week. We obtained data from a CI run at the end of last week and will present a summary offline in the meeting.
As discussed offline in chat, varying performance between different CI runs based on the same commit is still a problem for this investigation.
We'll still try to identify a consistent outlier in the performance comparison across the different sub-group sizes and then investigate what causes that performance difference.
After comparing CI runs 4 and 5, three outlier models were identified for which sub-group size 16 consistently performed worse than sub-group size 32 across both runs:
- AllenaiLongformerBase: training with float16 and amp_fp16
- XLNetLMHeadModel: training with float16 and amp_fp16
- BlenderbotSmallForCausalLM: inference with amp_fp16
For XLNetLMHeadModel, neither sub-group size provides a speedup over PyTorch execution, so the model was excluded from further investigation.
AllenaiLongformerBase
Comparing device timing from unitrace for both SG-sizes shows two Triton kernels that are among the GPU kernels taking up the most time.
For SG-size 32:
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
"triton_poi_fused_index_add_new_zeros_13", 179, 480605600, 8.665852, 2684947, 28800, 2943360
"triton_poi_fused_index_add_new_zeros_25", 178, 443019040, 7.988125, 2488871, 6080, 2830400
For SG-size 16:
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
"triton_poi_fused_index_add_new_zeros_13", 178, 631492960, 10.590560, 3547713, 2720, 4099520
"triton_poi_fused_index_add_new_zeros_25", 178, 629485440, 10.556891, 3536435, 6080, 4015200
Whereas these two kernels each take up ~8% of the overall execution time of the model with SG-size 32, they each take up ~10.5% with SG-size 16.
The average execution time of each kernel also increases from roughly 2.6 ms to 3.5 ms, a ~1.35x slowdown.
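As an aside, the per-kernel comparison above can be reproduced from the two unitrace device-timing exports with a short script. The sketch below assumes the exports were saved as CSV files with the column layout shown above; the file names are hypothetical:

```python
import csv

def load_kernel_averages(path):
    """Return {kernel name: average time in ns} from a unitrace device-timing CSV.

    Assumes the columns shown above: Kernel, Calls, Time (ns), Time (%),
    Average (ns), Min (ns), Max (ns).
    """
    with open(path, newline="") as f:
        return {row["Kernel"]: float(row["Average (ns)"]) for row in csv.DictReader(f)}

# Hypothetical file names for the SG-size 32 and SG-size 16 runs.
sg32 = load_kernel_averages("unitrace_sg32.csv")
sg16 = load_kernel_averages("unitrace_sg16.csv")

for kernel in sorted(sg32.keys() & sg16.keys()):
    slowdown = sg16[kernel] / sg32[kernel]
    if slowdown > 1.2:  # only report kernels noticeably slower with SG-size 16
        print(f"{kernel}: {sg32[kernel] / 1e6:.2f} ms -> {sg16[kernel] / 1e6:.2f} ms "
              f"({slowdown:.2f}x)")
```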
The kernels are attached to this comment. They are both rather simple, but both use tl.atomic_add. @chengjunlu confirmed in an offline chat that the performance of atomic operations is a known issue, so the performance difference potentially stems from this operation.
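For illustration, the core pattern of those kernels is a scatter/index_add into a zero-initialized buffer via atomics. The sketch below is a simplified stand-in, not the attached kernels; the kernel name, shapes, and block size are made up:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def index_add_kernel(idx_ptr, src_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Scatter one block of source values into `out` at positions given by `idx`.
    # Atomics are needed because indices may repeat across programs; this
    # tl.atomic_add is the operation suspected of being slower with SG-size 16.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)
    val = tl.load(src_ptr + offs, mask=mask, other=0.0)
    tl.atomic_add(out_ptr + idx, val, mask=mask)

# Usage sketch ("xpu" device assumed to be available).
n, out_size, BLOCK = 1 << 16, 1024, 1024
idx = torch.randint(0, out_size, (n,), device="xpu", dtype=torch.int64)
src = torch.rand(n, device="xpu", dtype=torch.float32)
out = torch.zeros(out_size, device="xpu", dtype=torch.float32)  # the "new_zeros" part
index_add_kernel[(triton.cdiv(n, BLOCK),)](idx, src, out, n, BLOCK=BLOCK)
```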
BlenderbotSmallForCausalLM
Comparing device timing from unitrace for both SG-sizes shows one Triton kernel that is among the GPU kernels taking up the most time.
For SG-size 32:
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
"triton_red_fused__log_softmax__to_copy_view_8", 15, 42188320, 6.827563, 2812554, 2708160, 2900960
For SG-size 16:
Kernel, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
"triton_red_fused__log_softmax__to_copy_view_8", 15, 60684640, 9.484119, 4045642, 3979680, 4196800
Whereas the kernel takes up less than 7% of the overall execution time of the model with SG-size 32, it takes up 9.5% with SG-size 16.
The average execution time of the kernel also increases from roughly 2.8 ms to 4.0 ms, a ~1.43x slowdown.
The kernel is attached to this comment and is also rather simple; however, it uses a reduction. From previous investigations by @victor-eds, it is known that the pattern currently used for reductions in the XPU backend is less efficient for SG-size 16, as it generates significantly more assembly instructions, so the performance difference most likely stems from this known issue.
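For illustration, a minimal sketch of the reduction pattern involved (a row-wise log-softmax). This is a simplified stand-in for the attached kernel and assumes each row fits into a single block, unlike the actual persistent-reduction kernel:

```python
import triton
import triton.language as tl

@triton.jit
def log_softmax_kernel(x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # One program per row. The row-wise max and sum below are the reductions
    # whose lowering in the XPU backend is less efficient for SG-size 16,
    # generating significantly more assembly instructions than for SG-size 32.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=float("-inf")).to(tl.float32)
    x = x - tl.max(x, axis=0)                    # reduction 1: row maximum
    log_sum = tl.log(tl.sum(tl.exp(x), axis=0))  # reduction 2: row sum
    tl.store(out_ptr + row * n_cols + cols, x - log_sum, mask=mask)
```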
Attachments: add_new_zeros_13.txt, add_new_zeros_25.txt, softmax_to_copy_view_8.txt
I filed #1867 and #1868 as follow-ups to investigate the root cause of the two outliers. That investigation currently has lower priority.
After running huggingface benchmarks with subgroup size 16 (https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9499093865), we saw some cases in which subgroup size 16 reported worse performance:
- huggingface amp_bf16 inference PLBartForCausalLM: 3.58x vs 2.36x speedup
- huggingface bf16 inference BartForCausalLM: 2.52x vs 1.19x speedup
- huggingface f32 training T5Small: 0.65x vs 0.50x speedup

Investigate and create follow-up issues if needed, or write a report in this issue.
Env: