intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[SYCL][HIP] Poor Memory Bandwidth due to Unnecessary Memory Write Traffic, 50% slower than OpenSYCL #10624

Closed by biergaizi 11 months ago

biergaizi commented 1 year ago

Describe the bug

When targeting AMD HIP gfx906, simple SYCL code compiled by DPC++ often shows poor memory bandwidth, 50% below the hardware peak performance. Profiling shows the memory write traffic is amplified roughly 2.4x for no clear reason, and this is suspected to be the cause.

To Reproduce

The problem seems to exist in many simple memory read-write kernels.

To illustrate the point, consider the following A[X] = A[X] * B[X] + C[X] vector triad kernel, test1.cpp.

#include <chrono>
#include <cstdio>
#include <sycl/sycl.hpp>

void benchmark(
    float *__restrict x,
    float *__restrict y,
    float *__restrict z,
    sycl::range<1> global_size,
    sycl::queue Q
)
{
    int timesteps = 1000;
    sycl::range local_size{1024};

    auto t1 = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Bandwidth>(
                sycl::nd_range<1>{global_size, local_size}, [=](sycl::nd_item<1> item) {
                    size_t i = item.get_global_id()[0];

                    float local_x = x[i];
                    float local_y = y[i];
                    float local_z = z[i];

                    local_x *= local_y;
                    local_x += local_z;

                    x[i] = local_x;
                }
            );
        });
    }
    Q.wait_and_throw();

    auto t2 = std::chrono::high_resolution_clock::now();
    double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;

    size_t bytes_per_iteration = global_size[0] * 4 * sizeof(float);  // 3 reads (x, y, z) + 1 write (x)
    fprintf(stderr, "speed: %.0f MB/s\n",
        (double) bytes_per_iteration * timesteps / dt / 1024 / 1024
    );
}

int main(int argc, char** argv)
{
    sycl::queue Q({sycl::property::queue::in_order()});

    sycl::range global_size{256 * 256 * 256};

    size_t size = global_size[0];

    float *x = sycl::malloc_shared<float>(size, Q);
    Q.memset(x, 0, sizeof(float) * size);
    Q.prefetch(x, sizeof(float) * size);

    float *y = sycl::malloc_shared<float>(size, Q);
    Q.memset(y, 0, sizeof(float) * size);
    Q.prefetch(y, sizeof(float) * size);

    float *z = sycl::malloc_shared<float>(size, Q);
    Q.memset(z, 0, sizeof(float) * size);
    Q.prefetch(z, sizeof(float) * size);

    Q.wait_and_throw();

    benchmark(x, y, z, global_size, Q);

    return 0;
}

Compile the code via Intel DPC++ using the command clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra.

Running test1_dpcpp.elf shows:

speed: 486333 MB/s

This is about 35% below the OpenSYCL result shown next, and roughly 50% below the hardware peak.

Compile the same code via OpenSYCL using the command syclcc test1.cpp -o test1_opensycl.elf --opensycl-targets=hip:gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra --save-temps.

Running test1_opensycl.elf shows:

speed: 746092 MB/s

This is close to the realizable peak memory bandwidth on AMD Radeon Pro VII / Instinct MI50.

To profile the kernels via AMD rocprof, create a file named rocprof_input.txt with the following content:

pmc: Wavefronts
pmc: FETCH_SIZE
pmc: WRITE_SIZE
pmc: L2CacheHit

Then run rocprof -i rocprof_input.txt -o rocprof_dpcpp.csv ./test1_dpcpp.elf and rocprof -i rocprof_input.txt -o rocprof_opensycl.csv ./test1_opensycl.elf.

According to rocprof_opensycl.csv, around 65536 kilobytes are written to memory in each iteration, as expected for an FP32 array with 16777216 elements (16777216 x 4 bytes = 65536 KiB). The L2CacheHit rate is around 0%, as expected, since there is no data reuse.

Wavefronts         FETCH_SIZE         WRITE_SIZE        L2CacheHit
131072.0000000000  1916.1250000000    65536.0000000000  0.8920779599
131072.0000000000  1938.1250000000    65536.0000000000  0.2148528725
131072.0000000000  2099.6875000000    65536.0000000000  0.2160909461
262144.0000000000  196851.1250000000  65536.0000000000  0.0043812751
262144.0000000000  196863.4375000000  65536.0000000000  0.0021905332
262144.0000000000  196892.8125000000  65536.0000000000  0.0021903406
262144.0000000000  196888.1875000000  65536.0000000000  0.0021904032
262144.0000000000  196890.9375000000  65536.0000000000  0.0021903774
262144.0000000000  196888.4375000000  65536.0000000000  0.0021903733

But according to rocprof_dpcpp.csv, around 159336 kilobytes are written to memory in each iteration (not including the first three rows, which are the memset() calls). That is roughly 2.4x the theoretical value. The L2CacheHit rate is around 48%, which also indicates that there are redundant loads or stores somewhere.

Wavefronts         FETCH_SIZE         WRITE_SIZE         L2CacheHit
131072.0000000000  1996.3750000000    65536.0000000000   0.6253965377
131072.0000000000  1858.9375000000    65536.0000000000   0.2319405694
131072.0000000000  1827.9375000000    65536.0000000000   0.2351738241
262144.0000000000  196781.1875000000  159211.2187500000  49.2742437845
262144.0000000000  196806.5000000000  159336.3125000000  48.5368960684
262144.0000000000  196822.6250000000  160237.6875000000  48.5208721942
262144.0000000000  196833.9375000000  158601.8750000000  48.5464310007
262144.0000000000  196827.8125000000  159001.8125000000  48.5544115839
262144.0000000000  196823.6250000000  161945.0625000000  48.5346288889


jinz2014 commented 1 year ago

Could you attach the GPU assembly code generated by the compilers?

biergaizi commented 1 year ago

OpenSYCL generates the following code:

Click to Expand OpenSYCL Code ``` .text .amdgcn_target "amdgcn-amd-amdhsa--gfx906" .section .text._Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,#alloc,#execinstr .globl _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; -- Begin function _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .p2align 8 .type _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,@function _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv: ; @_Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; %bb.0: s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 0 .amdhsa_user_sgpr_count 4 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 0 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 0 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 1 .amdhsa_next_free_sgpr 1 .amdhsa_reserve_vcc 0 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,#alloc,#execinstr .Lfunc_end0: .size _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv, .Lfunc_end0-_Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 4 ; NumSgprs: 0 ; NumVgprs: 0 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 0 ; VGPRBlocks: 0 ; NumSGPRsForWavesPerEU: 1 ; NumVGPRsForWavesPerEU: 1 ; 
Occupancy: 10 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 4 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .section .text._Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv,#alloc,#execinstr .globl _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv ; -- Begin function _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv .p2align 8 .type _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv,@function _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv: ; @_Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv ; %bb.0: s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 0 .amdhsa_user_sgpr_count 4 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 0 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 0 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 1 .amdhsa_next_free_sgpr 1 .amdhsa_reserve_vcc 0 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv,#alloc,#execinstr .Lfunc_end1: .size _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv, .Lfunc_end1-_Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 4 ; NumSgprs: 0 ; NumVgprs: 0 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 0 ; VGPRBlocks: 0 ; NumSGPRsForWavesPerEU: 1 ; NumVGPRsForWavesPerEU: 1 ; Occupancy: 10 ; 
WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 4 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .section .text._Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,#alloc,#execinstr .globl _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; -- Begin function _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .p2align 8 .type _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,@function _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv: ; @_Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; %bb.0: s_load_dword s0, s[4:5], 0x4 s_load_dword s1, s[4:5], 0xc s_load_dwordx8 s[12:19], s[6:7], 0x0 s_waitcnt lgkmcnt(0) s_and_b32 s0, s0, 0xffff s_mul_i32 s2, s8, s0 s_sub_i32 s1, s1, s2 s_min_u32 s0, s1, s0 s_mul_i32 s0, s0, s8 v_add_u32_e32 v0, s0, v0 v_ashrrev_i32_e32 v1, 31, v0 v_mov_b32_e32 v2, s19 v_add_co_u32_e32 v0, vcc, s18, v0 v_addc_co_u32_e32 v1, vcc, v1, v2, vcc v_lshlrev_b64 v[0:1], 2, v[0:1] v_mov_b32_e32 v3, s13 v_add_co_u32_e32 v2, vcc, s12, v0 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc v_mov_b32_e32 v5, s15 v_add_co_u32_e32 v4, vcc, s14, v0 v_addc_co_u32_e32 v5, vcc, v5, v1, vcc global_load_dword v4, v[4:5], off v_mov_b32_e32 v5, s17 v_add_co_u32_e32 v0, vcc, s16, v0 v_addc_co_u32_e32 v1, vcc, v5, v1, vcc global_load_dword v6, v[2:3], off s_nop 0 global_load_dword v0, v[0:1], off s_waitcnt vmcnt(0) v_fmac_f32_e32 v0, v4, v6 global_store_dword v[2:3], v0, off s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 32 .amdhsa_user_sgpr_count 8 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 7 .amdhsa_next_free_sgpr 20 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 
.amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv,#alloc,#execinstr .Lfunc_end2: .size _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv, .Lfunc_end2-_Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 164 ; NumSgprs: 22 ; NumVgprs: 7 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 2 ; VGPRBlocks: 1 ; NumSGPRsForWavesPerEU: 22 ; NumVGPRsForWavesPerEU: 7 ; Occupancy: 10 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 8 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .ident "clang version 15.0.7" .section ".note.GNU-stack" .addrsig .amdgpu_metadata --- amdhsa.kernels: - .args: [] .group_segment_fixed_size: 0 .kernarg_segment_align: 4 .kernarg_segment_size: 0 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .private_segment_fixed_size: 0 .sgpr_count: 0 .sgpr_spill_count: 0 .symbol: _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv.kd .uses_dynamic_stack: false .vgpr_count: 0 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: [] .group_segment_fixed_size: 0 .kernarg_segment_align: 4 .kernarg_segment_size: 0 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv .private_segment_fixed_size: 0 .sgpr_count: 0 .sgpr_spill_count: 0 .symbol: _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPfS0_S0_N7hipsycl4sycl5rangeILi1EEENS2_5queueEENKUlRNS2_7handlerEE_clES7_EUlNS2_7nd_itemILi1EEEE_Evv.kd .uses_dynamic_stack: false .vgpr_count: 0 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: - .offset: 0 .size: 24 .value_kind: by_value - .offset: 24 .size: 8 .value_kind: by_value .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 32 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv .private_segment_fixed_size: 0 .sgpr_count: 22 .sgpr_spill_count: 0 .symbol: _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPfS3_S3_NS0_4sycl5rangeILi1EEENS4_5queueEENKUlRNS4_7handlerEE_clES9_E9BandwidthEEEvv.kd .uses_dynamic_stack: false .vgpr_count: 7 .vgpr_spill_count: 0 .wavefront_size: 64 amdhsa.target: amdgcn-amd-amdhsa--gfx906 amdhsa.version: - 1 - 1 ... .end_amdgpu_metadata ```

Intel DPC++ generates the following code:

Click to Expand Intel DPC++ Code ``` .text .amdgcn_target "amdgcn-amd-amdhsa--gfx906" .section .text._ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth,#alloc,#execinstr .protected _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth ; -- Begin function _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth .weak _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth .p2align 8 .type _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth,@function _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth: ; @_ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth ; %bb.0: ; %entry s_load_dword s4, s[4:5], 0x4 s_add_u32 s0, s0, s11 s_addc_u32 s1, s1, 0 v_mov_b32_e32 v1, 0 v_mov_b32_e32 v2, s10 s_waitcnt lgkmcnt(0) s_and_b32 s4, s4, 0xffff v_mad_u64_u32 v[2:3], s[4:5], s4, v2, v[0:1] s_load_dwordx4 s[8:11], s[6:7], 0x0 s_load_dwordx2 s[4:5], s[6:7], 0x10 v_lshlrev_b64 v[2:3], 2, v[2:3] s_waitcnt lgkmcnt(0) v_mov_b32_e32 v0, s9 v_add_co_u32_e32 v4, vcc, s8, v2 v_addc_co_u32_e32 v5, vcc, v0, v3, vcc v_mov_b32_e32 v7, s11 v_add_co_u32_e32 v6, vcc, s10, v2 v_addc_co_u32_e32 v7, vcc, v7, v3, vcc global_load_dword v6, v[6:7], off v_mov_b32_e32 v7, s5 v_add_co_u32_e32 v2, vcc, s4, v2 v_addc_co_u32_e32 v3, vcc, v7, v3, vcc global_load_dword v0, v[4:5], off s_nop 0 global_load_dword v2, v[2:3], off s_nop 0 buffer_store_dword v1, off, s[0:3], 0 offset:8 buffer_store_dword v1, off, s[0:3], 0 offset:4 buffer_store_dword v1, off, s[0:3], 0 offset:12 s_waitcnt vmcnt(3) v_fmac_f32_e32 v2, v6, v0 global_store_dword v[4:5], v2, off s_endpgm .section .rodata,#alloc .p2align 6, 0x0 .amdhsa_kernel _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 16 .amdhsa_kernarg_size 24 .amdhsa_user_sgpr_count 10 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 1 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 1 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 8 .amdhsa_next_free_sgpr 12 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth,#alloc,#execinstr .Lfunc_end0: .size _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth, .Lfunc_end0-_ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth ; -- End function .section 
.AMDGPU.csdata ; Kernel info: ; codeLenInByte = 188 ; NumSgprs: 16 ; NumVgprs: 8 ; ScratchSize: 16 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 1 ; VGPRBlocks: 1 ; NumSGPRsForWavesPerEU: 16 ; NumVGPRsForWavesPerEU: 8 ; Occupancy: 8 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 1 ; COMPUTE_PGM_RSRC2:USER_SGPR: 10 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .text .protected _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset ; -- Begin function _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset .weak _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset .p2align 8 .type _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset,@function _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset: ; @_ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset ; %bb.0: ; %entry s_load_dword s8, s[4:5], 0x4 s_load_dwordx8 s[12:19], s[6:7], 0x0 s_add_u32 s0, s0, s11 s_addc_u32 s1, s1, 0 v_mov_b32_e32 v1, 0 s_waitcnt lgkmcnt(0) s_and_b32 s4, s8, 0xffff v_mov_b32_e32 v2, s10 v_mad_u64_u32 v[0:1], s[4:5], s4, v2, v[0:1] v_mov_b32_e32 v3, s13 v_mov_b32_e32 v5, s15 v_add_co_u32_e32 v0, vcc, s18, v0 v_addc_co_u32_e32 v1, vcc, 0, v1, vcc v_lshlrev_b64 v[0:1], 2, v[0:1] s_load_dword s4, s[6:7], 0x20 v_add_co_u32_e32 v2, vcc, s12, v0 v_addc_co_u32_e32 v3, vcc, v3, v1, vcc v_add_co_u32_e32 v4, vcc, s14, v0 v_addc_co_u32_e32 v5, vcc, v5, v1, vcc global_load_dword v4, v[4:5], off v_mov_b32_e32 v5, s17 v_add_co_u32_e32 v0, vcc, s16, v0 v_addc_co_u32_e32 v1, vcc, v5, v1, vcc global_load_dword v6, v[2:3], off s_nop 0 global_load_dword v0, v[0:1], off s_waitcnt lgkmcnt(0) v_mov_b32_e32 v1, s4 buffer_store_dword v1, off, s[0:3], 0 offset:12 v_mov_b32_e32 v1, s19 buffer_store_dword v1, off, s[0:3], 0 offset:8 v_mov_b32_e32 v1, s18 buffer_store_dword v1, off, s[0:3], 0 offset:4 s_waitcnt vmcnt(3) v_fmac_f32_e32 v0, v4, v6 global_store_dword v[2:3], v0, off s_endpgm .section .rodata,#alloc .p2align 6, 0x0 .amdhsa_kernel _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 16 .amdhsa_kernarg_size 36 .amdhsa_user_sgpr_count 10 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 1 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 1 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 7 .amdhsa_next_free_sgpr 20 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 
.amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .text .Lfunc_end1: .size _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset, .Lfunc_end1-_ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 204 ; NumSgprs: 24 ; NumVgprs: 7 ; ScratchSize: 16 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 2 ; VGPRBlocks: 1 ; NumSGPRsForWavesPerEU: 24 ; NumVGPRsForWavesPerEU: 7 ; Occupancy: 8 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 1 ; COMPUTE_PGM_RSRC2:USER_SGPR: 10 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .ident "clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)" .ident "clang version 15.0.7" .section ".note.GNU-stack" .addrsig .amdgpu_metadata --- amdhsa.kernels: - .args: - .address_space: global .name: _arg_x .offset: 0 .size: 8 .value_kind: global_buffer - .address_space: global .name: _arg_y .offset: 8 .size: 8 .value_kind: global_buffer - .address_space: global .name: _arg_z .offset: 16 .size: 8 .value_kind: global_buffer .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 24 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth .private_segment_fixed_size: 16 .sgpr_count: 16 .sgpr_spill_count: 0 .symbol: _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth.kd .vgpr_count: 8 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: - .address_space: global .offset: 0 .size: 8 .value_kind: global_buffer - .address_space: global .offset: 8 .size: 8 .value_kind: global_buffer - .address_space: global .offset: 16 .size: 8 .value_kind: global_buffer - .offset: 24 .size: 12 .value_kind: by_value .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 36 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset .private_segment_fixed_size: 16 .sgpr_count: 24 .sgpr_spill_count: 0 .symbol: _ZTSZZ9benchmarkPfS_S_N4sycl3_V15rangeILi1EEENS1_5queueEENKUlRNS1_7handlerEE_clES6_E9Bandwidth_with_offset.kd .vgpr_count: 7 .vgpr_spill_count: 0 .wavefront_size: 64 amdhsa.target: amdgcn-amd-amdhsa--gfx906 amdhsa.version: - 1 - 1 ... .end_amdgpu_metadata ```
hdelan commented 1 year ago

Does the code show the same performance gap if malloc_device is used instead of malloc_shared?

biergaizi commented 1 year ago

Does the code show the same performance gap if malloc_device is used instead of malloc_shared?

Yes, both OpenSYCL and DPC++ show performance degradation when using malloc_device(), likely because of an issue related to memory channel/bank conflicts. From my observation, on the HIP backend malloc_device() returns 1 MiB-aligned pointers, while malloc_shared() returns 4 KiB-aligned ones. However, allocating 3 separate arrays, each with 1 MiB alignment, via malloc_device() seems to cause a memory channel/bank conflict on the GFX906, which appears to be a poorly documented problem. Thus, counterintuitively, malloc_device() is slower than malloc_shared() in some cases.
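
(Side note: the alignment claim above can be checked by inspecting the lowest set bit of the returned address; this is a minimal sketch, not part of the original reproducer.)

#include <cstdint>
#include <cstdio>
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue Q;
    float *p = sycl::malloc_device<float>(1024, Q);
    // The lowest set bit of the address is the largest power-of-two
    // alignment the pointer is guaranteed to satisfy.
    std::uintptr_t addr = (std::uintptr_t)p;
    std::printf("effective alignment: %zu bytes\n",
                (size_t)(addr & (~addr + 1)));
    sycl::free(p, Q);
    return 0;
}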

On OpenSYCL, I found it can be worked around by using a single large malloc_device() allocation with manual padding between the three arrays. However, I have not been able to reproduce the same workaround with DPC++ yet. I'm not sure whether the performance degradation seen in DPC++ has the same root cause or is a separate problem.
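
(For illustration, a minimal sketch of that single-allocation padding workaround; the padding of 1024 floats below is a hypothetical placeholder, not a measured optimum for gfx906.)

#include <sycl/sycl.hpp>

int main()
{
    sycl::queue Q({sycl::property::queue::in_order()});
    size_t n = 256 * 256 * 256;
    // Hypothetical padding (in floats) between the sub-arrays, meant to
    // shift their base addresses onto different memory channels/banks.
    constexpr size_t pad = 1024;

    // One large allocation instead of three separate 1 MiB-aligned ones.
    float *base = sycl::malloc_device<float>(3 * n + 2 * pad, Q);
    float *x = base;
    float *y = base + n + pad;
    float *z = base + 2 * (n + pad);

    // ... run the benchmark kernels on x, y and z as before ...

    sycl::free(base, Q);
    return 0;
}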

biergaizi commented 1 year ago

I have a better and even more minimalist example that shows the memory bandwidth problem. Memory channel/bank conflicts are ruled out, because both versions use malloc_device() with the same memory alignment.

#include <chrono>
#include <cstdio>
#include <sycl/sycl.hpp>

void benchmark(
    sycl::float4* __restrict array,
    sycl::range<1> global_size,
    sycl::range<1> local_size,
    sycl::queue Q
)
{
    int timesteps = 1000;

    sycl::range global_range{global_size[0]};
    sycl::range local_range{local_size[0]};

    auto t1 = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Bandwidth>(
                sycl::nd_range<1>{global_range, local_range}, [=](sycl::nd_item<1> item) {
                    uint32_t i = item.get_global_id()[0];

                    sycl::float4 elem = array[i];
                    elem += 1;
                    array[i] = elem;
                }
            );
        });
    }
    Q.wait_and_throw();

    auto t2 = std::chrono::high_resolution_clock::now();
    double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;

    size_t bytes_per_iteration = global_size[0] * sizeof(sycl::float4);
    bytes_per_iteration *= 2;  // read + write
    fprintf(stderr, "speed: %.0f MB/s\n",
        (double) bytes_per_iteration * timesteps / dt / 1024 / 1024
    );
}

int main(int argc, char** argv)
{
    size_t wg_min = 10240;
    size_t wg_max = 65536;

    sycl::queue Q({sycl::property::queue::in_order()});
    sycl::range local_size{1024};

    /*
     * Note that the pointer returned by sycl::malloc_device() on AMD
     * HIP backend is always 1 MiB aligned.
     */
    size_t size = 1024 * wg_max;
    sycl::float4 *buf = sycl::malloc_device<sycl::float4>(size, Q);  // 256 float4 = 4096 bytes
    Q.memset(buf, 0, sizeof(sycl::float4) * size);

    for (size_t wg_num = wg_min; wg_num < wg_max; wg_num += 128) {
        sycl::range global_size{1024 * wg_num};

        printf("workgroups: %zu, p: %p\n", wg_num, buf);

        benchmark(buf, global_size, local_size, Q);
        benchmark(buf, global_size, local_size, Q);
        benchmark(buf, global_size, local_size, Q);

        Q.wait_and_throw();
    }

    return 0;
}

Using OpenSYCL, typical bandwidth is around 700 to 800 GB/s; using Intel DPC++, it is only about 500 GB/s.

Click to Expand OpenSYCL Assembly Code ``` .text .amdgcn_target "amdgcn-amd-amdhsa--gfx906" .section .text._Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,#alloc,#execinstr .globl _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; -- Begin function _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .p2align 8 .type _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,@function _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv: ; @_Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; %bb.0: s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 0 .amdhsa_user_sgpr_count 4 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 0 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 0 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 1 .amdhsa_next_free_sgpr 1 .amdhsa_reserve_vcc 0 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,#alloc,#execinstr .Lfunc_end0: .size _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv, 
.Lfunc_end0-_Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 4 ; NumSgprs: 0 ; NumVgprs: 0 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 0 ; VGPRBlocks: 0 ; NumSGPRsForWavesPerEU: 1 ; NumVGPRsForWavesPerEU: 1 ; Occupancy: 10 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 4 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .section .text._Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv,#alloc,#execinstr .globl _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv ; -- Begin function _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv .p2align 8 .type _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv,@function _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv: ; @_Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv ; %bb.0: s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 0 .amdhsa_user_sgpr_count 4 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 0 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 0 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 1 .amdhsa_next_free_sgpr 1 .amdhsa_reserve_vcc 0 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 
.end_amdhsa_kernel .section .text._Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv,#alloc,#execinstr .Lfunc_end1: .size _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv, .Lfunc_end1-_Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 4 ; NumSgprs: 0 ; NumVgprs: 0 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 0 ; VGPRBlocks: 0 ; NumSGPRsForWavesPerEU: 1 ; NumVGPRsForWavesPerEU: 1 ; Occupancy: 10 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 4 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .section .text._Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,#alloc,#execinstr .globl _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; -- Begin function _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .p2align 8 .type _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,@function _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv: ; @_Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; %bb.0: s_load_dword s9, s[4:5], 0x4 s_load_dword s10, s[4:5], 0xc s_load_dwordx4 s[0:3], s[6:7], 0x0 v_mov_b32_e32 v1, 0 s_waitcnt lgkmcnt(0) s_and_b32 s3, s9, 0xffff s_mul_i32 s4, s8, s3 s_sub_i32 s4, s10, s4 s_min_u32 s3, s4, s3 s_mul_i32 s3, s3, s8 v_add_u32_e32 v0, s2, v0 v_add_u32_e32 v0, s3, v0 v_lshlrev_b64 v[0:1], 4, v[0:1] v_mov_b32_e32 v2, s1 v_add_co_u32_e32 v4, vcc, s0, v0 v_addc_co_u32_e32 v5, vcc, v2, v1, vcc global_load_dwordx4 v[0:3], v[4:5], off s_waitcnt vmcnt(0) v_add_f32_e32 v0, 1.0, v0 v_add_f32_e32 v1, 1.0, v1 v_add_f32_e32 v2, 1.0, v2 v_add_f32_e32 v3, 1.0, v3 global_store_dwordx4 v[4:5], v[0:3], off s_endpgm .section .rodata,#alloc .p2align 6 .amdhsa_kernel _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 0 .amdhsa_kernarg_size 16 .amdhsa_user_sgpr_count 8 
.amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 0 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_uses_dynamic_stack 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 0 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 6 .amdhsa_next_free_sgpr 11 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv,#alloc,#execinstr .Lfunc_end2: .size _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv, .Lfunc_end2-_Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 124 ; NumSgprs: 13 ; NumVgprs: 6 ; ScratchSize: 0 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 1 ; VGPRBlocks: 1 ; NumSGPRsForWavesPerEU: 13 ; NumVGPRsForWavesPerEU: 6 ; Occupancy: 10 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 0 ; COMPUTE_PGM_RSRC2:USER_SGPR: 8 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .ident "clang version 15.0.7" .section ".note.GNU-stack" .addrsig .amdgpu_metadata --- amdhsa.kernels: - .args: [] .group_segment_fixed_size: 0 .kernarg_segment_align: 4 .kernarg_segment_size: 0 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .private_segment_fixed_size: 0 .sgpr_count: 0 .sgpr_spill_count: 0 .symbol: _Z30__hipsycl_kernel_name_templateIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv.kd .uses_dynamic_stack: false .vgpr_count: 0 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: [] .group_segment_fixed_size: 0 .kernarg_segment_align: 4 .kernarg_segment_size: 0 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: 
_Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv .private_segment_fixed_size: 0 .sgpr_count: 0 .sgpr_spill_count: 0 .symbol: _Z30__hipsycl_kernel_name_templateIZZ9benchmarkPN7hipsycl4sycl3vecIfLi4ENS1_6detail11vec_storageIfLi4EEEEENS1_5rangeILi1EEES9_NS1_5queueEENKUlRNS1_7handlerEE_clESC_EUlNS1_7nd_itemILi1EEEE_Evv.kd .uses_dynamic_stack: false .vgpr_count: 0 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: - .address_space: global .offset: 0 .size: 8 .value_kind: global_buffer - .offset: 8 .size: 8 .value_kind: by_value .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 16 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv .private_segment_fixed_size: 0 .sgpr_count: 13 .sgpr_spill_count: 0 .symbol: _Z16__hipsycl_kernelIN7hipsycl4glue20complete_kernel_nameIZZ9benchmarkPNS0_4sycl3vecIfLi4ENS3_6detail11vec_storageIfLi4EEEEENS3_5rangeILi1EEESB_NS3_5queueEENKUlRNS3_7handlerEE_clESE_E9BandwidthEEEvv.kd .uses_dynamic_stack: false .vgpr_count: 6 .vgpr_spill_count: 0 .wavefront_size: 64 amdhsa.target: amdgcn-amd-amdhsa--gfx906 amdhsa.version: - 1 - 1 ... .end_amdgpu_metadata ```
Click to Expand DPC++ Assembly Code ``` .text .amdgcn_target "amdgcn-amd-amdhsa--gfx906" .section .text._ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth,#alloc,#execinstr .protected _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth ; -- Begin function _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth .weak _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth .p2align 8 .type _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth,@function _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth: ; @_ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth ; %bb.0: ; %entry s_add_u32 s0, s0, s11 s_load_dword s11, s[4:5], 0x4 s_load_dwordx2 s[8:9], s[6:7], 0x0 s_addc_u32 s1, s1, 0 v_mov_b32_e32 v4, 0 s_waitcnt lgkmcnt(0) s_and_b32 s4, s11, 0xffff s_mul_i32 s4, s4, s10 v_add_u32_e32 v3, s4, v0 v_lshlrev_b64 v[0:1], 4, v[3:4] v_mov_b32_e32 v2, s9 v_add_co_u32_e32 v5, vcc, s8, v0 v_addc_co_u32_e32 v6, vcc, v2, v1, vcc global_load_dwordx4 v[0:3], v[5:6], off s_nop 0 buffer_store_dword v4, off, s[0:3], 0 offset:8 buffer_store_dword v4, off, s[0:3], 0 offset:4 buffer_store_dword v4, off, s[0:3], 0 offset:12 s_waitcnt vmcnt(3) v_add_f32_e32 v3, 1.0, v3 v_add_f32_e32 v2, 1.0, v2 v_add_f32_e32 v1, 1.0, v1 v_add_f32_e32 v0, 1.0, v0 global_store_dwordx4 v[5:6], v[0:3], off s_endpgm .section .rodata,#alloc .p2align 6, 0x0 .amdhsa_kernel _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 16 .amdhsa_kernarg_size 8 .amdhsa_user_sgpr_count 10 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 1 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 1 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 7 .amdhsa_next_free_sgpr 12 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .section .text._ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth,#alloc,#execinstr .Lfunc_end0: .size _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth, .Lfunc_end0-_ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 136 ; NumSgprs: 16 ; NumVgprs: 7 ; ScratchSize: 
16 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 1 ; VGPRBlocks: 1 ; NumSGPRsForWavesPerEU: 16 ; NumVGPRsForWavesPerEU: 7 ; Occupancy: 8 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 1 ; COMPUTE_PGM_RSRC2:USER_SGPR: 10 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .text .protected _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset ; -- Begin function _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset .weak _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset .p2align 8 .type _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset,@function _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset: ; @_ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset ; %bb.0: ; %entry s_load_dword s8, s[4:5], 0x4 s_load_dwordx4 s[12:15], s[6:7], 0x0 s_load_dword s9, s[6:7], 0x10 s_add_u32 s0, s0, s11 s_addc_u32 s1, s1, 0 s_waitcnt lgkmcnt(0) s_and_b32 s4, s8, 0xffff s_mul_i32 s4, s4, s10 s_add_i32 s4, s14, s4 v_add_u32_e32 v0, s4, v0 v_mov_b32_e32 v1, 0 v_lshlrev_b64 v[0:1], 4, v[0:1] v_mov_b32_e32 v2, s13 v_add_co_u32_e32 v4, vcc, s12, v0 v_addc_co_u32_e32 v5, vcc, v2, v1, vcc global_load_dwordx4 v[0:3], v[4:5], off v_mov_b32_e32 v6, s9 v_mov_b32_e32 v7, s15 v_mov_b32_e32 v8, s14 buffer_store_dword v6, off, s[0:3], 0 offset:12 buffer_store_dword v7, off, s[0:3], 0 offset:8 buffer_store_dword v8, off, s[0:3], 0 offset:4 s_waitcnt vmcnt(3) v_add_f32_e32 v3, 1.0, v3 v_add_f32_e32 v2, 1.0, v2 v_add_f32_e32 v1, 1.0, v1 v_add_f32_e32 v0, 1.0, v0 global_store_dwordx4 v[4:5], v[0:3], off s_endpgm .section .rodata,#alloc .p2align 6, 0x0 .amdhsa_kernel _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset .amdhsa_group_segment_fixed_size 0 .amdhsa_private_segment_fixed_size 16 .amdhsa_kernarg_size 20 .amdhsa_user_sgpr_count 10 .amdhsa_user_sgpr_private_segment_buffer 1 .amdhsa_user_sgpr_dispatch_ptr 1 .amdhsa_user_sgpr_queue_ptr 0 .amdhsa_user_sgpr_kernarg_segment_ptr 1 .amdhsa_user_sgpr_dispatch_id 0 .amdhsa_user_sgpr_flat_scratch_init 1 .amdhsa_user_sgpr_private_segment_size 0 .amdhsa_system_sgpr_private_segment_wavefront_offset 1 .amdhsa_system_sgpr_workgroup_id_x 1 .amdhsa_system_sgpr_workgroup_id_y 0 .amdhsa_system_sgpr_workgroup_id_z 0 .amdhsa_system_sgpr_workgroup_info 0 .amdhsa_system_vgpr_workitem_id 0 .amdhsa_next_free_vgpr 9 .amdhsa_next_free_sgpr 16 .amdhsa_reserve_flat_scratch 0 .amdhsa_reserve_xnack_mask 1 .amdhsa_float_round_mode_32 0 .amdhsa_float_round_mode_16_64 0 .amdhsa_float_denorm_mode_32 3 .amdhsa_float_denorm_mode_16_64 3 .amdhsa_dx10_clamp 1 .amdhsa_ieee_mode 1 .amdhsa_fp16_overflow 0 .amdhsa_exception_fp_ieee_invalid_op 0 .amdhsa_exception_fp_denorm_src 0 .amdhsa_exception_fp_ieee_div_zero 0 .amdhsa_exception_fp_ieee_overflow 0 .amdhsa_exception_fp_ieee_underflow 0 .amdhsa_exception_fp_ieee_inexact 0 .amdhsa_exception_int_div_zero 0 .end_amdhsa_kernel .text .Lfunc_end1: .size 
_ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset, .Lfunc_end1-_ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset ; -- End function .section .AMDGPU.csdata ; Kernel info: ; codeLenInByte = 156 ; NumSgprs: 20 ; NumVgprs: 9 ; ScratchSize: 16 ; MemoryBound: 0 ; FloatMode: 240 ; IeeeMode: 1 ; LDSByteSize: 0 bytes/workgroup (compile time only) ; SGPRBlocks: 2 ; VGPRBlocks: 2 ; NumSGPRsForWavesPerEU: 20 ; NumVGPRsForWavesPerEU: 9 ; Occupancy: 8 ; WaveLimiterHint : 0 ; COMPUTE_PGM_RSRC2:SCRATCH_EN: 1 ; COMPUTE_PGM_RSRC2:USER_SGPR: 10 ; COMPUTE_PGM_RSRC2:TRAP_HANDLER: 0 ; COMPUTE_PGM_RSRC2:TGID_X_EN: 1 ; COMPUTE_PGM_RSRC2:TGID_Y_EN: 0 ; COMPUTE_PGM_RSRC2:TGID_Z_EN: 0 ; COMPUTE_PGM_RSRC2:TIDIG_COMP_CNT: 0 .ident "clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)" .ident "clang version 15.0.7" .section ".note.GNU-stack" .addrsig .amdgpu_metadata --- amdhsa.kernels: - .args: - .address_space: global .name: _arg_array .offset: 0 .size: 8 .value_kind: global_buffer .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 8 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth .private_segment_fixed_size: 16 .sgpr_count: 16 .sgpr_spill_count: 0 .symbol: _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth.kd .vgpr_count: 7 .vgpr_spill_count: 0 .wavefront_size: 64 - .args: - .address_space: global .offset: 0 .size: 8 .value_kind: global_buffer - .offset: 8 .size: 12 .value_kind: by_value .group_segment_fixed_size: 0 .kernarg_segment_align: 8 .kernarg_segment_size: 20 .language: OpenCL C .language_version: - 2 - 0 .max_flat_workgroup_size: 1024 .name: _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset .private_segment_fixed_size: 16 .sgpr_count: 20 .sgpr_spill_count: 0 .symbol: _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset.kd .vgpr_count: 9 .vgpr_spill_count: 0 .wavefront_size: 64 amdhsa.target: amdgcn-amd-amdhsa--gfx906 amdhsa.version: - 1 - 1 ... .end_amdgpu_metadata ```
tom91136 commented 1 year ago

Just to add, and thanks to @biergaizi for bringing this up with me: we've verified the lower performance on our end as well. Here's BabelStream compiled with ICPX compared to quite a few other models on an MI100 (bandwidth normalised to theoretical peak; a value of 1 is 1228 GB/s, higher is better):

[figure: babelstream-mi100-no-utpx — BabelStream bandwidth on MI100 across programming models]

The only compiler that showed a severe bandwidth limit was ICPX, and we see identical results on a Radeon VII as well. APU and some RDNA numbers are blocked by https://github.com/intel/llvm/issues/11203, which will be resolved once we merge https://github.com/intel/llvm/pull/11254 (thanks @al42and!). The ICPX used here is 2023.2.1 with the Codeplay vendor plugin; StdPar on ICPX uses the oneDPL library.

(FYI, @tomdeakin)

GeorgeWeb commented 11 months ago

@biergaizi Could you please try with the latest changes? We attempted to close the gap by getting rid of unnecessary stack stores in commit 00cf4c29740b2ec8d027e59e49e376a211599a58 (see intel/llvm/pull/11674).

We found that a lot of the perf degradation came from those stack stores. To mitigate it for now, we have provided a way to switch off the culprit compiler transformation by compiling with -mllvm -enable-global-offset=false.
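
For reference, the flag is passed straight through the clang driver, so it can simply be appended to the usual DPC++ HIP invocation. A sketch (adjust the target, architecture, and ROCm paths to your setup):

    clang++ test1.cpp -o test1_dpcpp.elf -fsycl \
        -fsycl-targets=amdgcn-amd-amdhsa \
        -Xsycl-target-backend --offload-arch=gfx906 \
        -mllvm -enable-global-offset=false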

We are looking into more possible causes for such performance degradation.

Thanks!

biergaizi commented 11 months ago

We found a lot of the perf degradation came from the above and to mitigate it now, we have provided a way to switch off the culprit compiler transformation by compiling with the -mllvm -enable-global-offset=false.

I confirm that -mllvm -enable-global-offset=false fixed the performance issue in both cases.

The original low-performance object code was:

; _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth

        s_add_u32 s0, s0, s11
        s_load_dword s11, s[4:5], 0x4
        s_load_dwordx2 s[8:9], s[6:7], 0x0
        s_addc_u32 s1, s1, 0
        v_mov_b32_e32 v4, 0
        s_waitcnt lgkmcnt(0)
        s_and_b32 s4, s11, 0xffff
        s_mul_i32 s4, s4, s10
        v_add_u32_e32 v3, s4, v0
        v_lshlrev_b64 v[0:1], 4, v[3:4]
        v_mov_b32_e32 v2, s9
        v_add_co_u32_e32 v5, vcc, s8, v0
        v_addc_co_u32_e32 v6, vcc, v2, v1, vcc
        global_load_dwordx4 v[0:3], v[5:6], off
        s_nop 0
        buffer_store_dword v4, off, s[0:3], 0 offset:8
        buffer_store_dword v4, off, s[0:3], 0 offset:4
        buffer_store_dword v4, off, s[0:3], 0 offset:12
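        ; ^ annotation (not compiler output): the global-offset pass spills the
        ;   implicit offset components (here zero, in v4) to per-thread scratch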
        s_waitcnt vmcnt(3)
        v_add_f32_e32 v3, 1.0, v3
        v_add_f32_e32 v2, 1.0, v2
        v_add_f32_e32 v1, 1.0, v1
        v_add_f32_e32 v0, 1.0, v0
        global_store_dwordx4 v[5:6], v[0:3], off
        s_endpgm

; _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth_with_offset

        s_load_dword s8, s[4:5], 0x4
        s_load_dwordx4 s[12:15], s[6:7], 0x0
        s_load_dword s9, s[6:7], 0x10
        s_add_u32 s0, s0, s11
        s_addc_u32 s1, s1, 0
        s_waitcnt lgkmcnt(0)
        s_and_b32 s4, s8, 0xffff
        s_mul_i32 s4, s4, s10
        v_add_u32_e32 v0, s14, v0
        v_add_u32_e32 v0, s4, v0
        v_mov_b32_e32 v1, 0
        v_lshlrev_b64 v[0:1], 4, v[0:1]
        v_mov_b32_e32 v2, s13
        v_add_co_u32_e32 v4, vcc, s12, v0
        v_addc_co_u32_e32 v5, vcc, v2, v1, vcc
        global_load_dwordx4 v[0:3], v[4:5], off
        v_mov_b32_e32 v6, s9
        v_mov_b32_e32 v7, s15
        v_mov_b32_e32 v8, s14
        buffer_store_dword v6, off, s[0:3], 0 offset:12
        buffer_store_dword v7, off, s[0:3], 0 offset:8
        buffer_store_dword v8, off, s[0:3], 0 offset:4
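        ; ^ annotation (not compiler output): the same per-thread scratch stores,
        ;   here writing the actual offset components passed via kernel arguments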
        s_waitcnt vmcnt(3)
        v_add_f32_e32 v3, 1.0, v3
        v_add_f32_e32 v2, 1.0, v2
        v_add_f32_e32 v1, 1.0, v1
        v_add_f32_e32 v0, 1.0, v0
        global_store_dwordx4 v[4:5], v[0:3], off
        s_endpgm

After using -mllvm -enable-global-offset=false, the high-performance object code is shown below. Note that the buffer_store_dword writes to per-thread scratch are gone, which is consistent with the amplified WRITE_SIZE seen in the profile:

        s_load_dword s2, s[4:5], 0x4
        s_load_dwordx2 s[0:1], s[6:7], 0x0
        v_mov_b32_e32 v1, 0
        s_waitcnt lgkmcnt(0)
        s_and_b32 s2, s2, 0xffff
        s_mul_i32 s2, s2, s8
        v_add_u32_e32 v0, s2, v0
        v_lshlrev_b64 v[0:1], 4, v[0:1]
        v_mov_b32_e32 v2, s1
        v_add_co_u32_e32 v4, vcc, s0, v0
        v_addc_co_u32_e32 v5, vcc, v2, v1, vcc
        global_load_dwordx4 v[0:3], v[4:5], off
        s_waitcnt vmcnt(0)
        v_add_f32_e32 v3, 1.0, v3
        v_add_f32_e32 v2, 1.0, v2
        v_add_f32_e32 v1, 1.0, v1
        v_add_f32_e32 v0, 1.0, v0
        global_store_dwordx4 v[4:5], v[0:3], off
        s_endpgm

This is consistent with my AMDGPU micro-benchmarking experience that global_load_dwordx4 is a high-performance pattern.

Running the executables showed that both examples now run at above 700 GB/s. Thus, I confirm that the workaround works.


If anyone else is investigating memory bandwidth on AMD GPUs, here are my own findings on another related memory bandwidth issue of general interest, which I tracked down in a downstream project. Basically, in addition to a global_load_dwordx4 in isolation, another high-performance pattern on AMDGPU is four contiguous global_load_dwordx4 instructions with offsets 0, 16, 32, 48 (with the offset either increasing or decreasing), all pointing into the same 64-byte cache line, such as:

    global_load_dwordx4 v[15:18], v[19:20], off
    global_load_dwordx4 v[11:14], v[19:20], off offset:16
    global_load_dwordx4 v[7:10], v[19:20], off offset:32
    global_load_dwordx4 v[3:6], v[19:20], off offset:48

If, for some reason, the compiler doesn't generate the 4 contiguous loads but instead interleaves them with other instructions, the performance degradation can be significant. I recently pinpointed an OpenCL performance regression from upstream LLVM 15 to LLVM 16 caused by LLVM starting to interleave loads and compute instructions more aggressively. The same kind of problem also affects SYCL.
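
To make the pattern concrete, here is a minimal SYCL sketch (hypothetical example, not from the affected project) of the kind of access that should lower to the four adjacent loads above; whether the backend actually keeps the loads adjacent depends on the instruction scheduler, which is exactly the regression described:

    #include <sycl/sycl.hpp>

    // Each work-item touches 64 contiguous bytes (four sycl::float4 values,
    // i.e. one cache line), which the AMDGPU backend can lower to four
    // adjacent global_load_dwordx4 instructions at offsets 0/16/32/48.
    void scale_line(sycl::queue &q, sycl::float4 *__restrict data, size_t n_items)
    {
        q.parallel_for(sycl::range<1>{n_items}, [=](sycl::id<1> idx) {
            size_t base = idx[0] * 4;       // 4 x float4 = 64 bytes per item
            sycl::float4 v[4];
            for (int j = 0; j < 4; j++)     // contiguous 16-byte loads
                v[j] = data[base + j];
            for (int j = 0; j < 4; j++)     // contiguous 16-byte stores
                data[base + j] = v[j] * 2.0f;
        }).wait();
    }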

More information can be found at: https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1143

GeorgeWeb commented 11 months ago

@biergaizi Thanks for the proactive response!

If you would like to open a new issue with a reproducer for your latter findings on memory bandwidth degradation, we can track it and look into it.

Otherwise, let me know if we can further assist on this one; if not, we'd be happy to close it. Also double-checking with @tom91136. :)

GeorgeWeb commented 11 months ago

I am closing this now, but please feel free to reopen if the originally discussed issue isn't fully addressed. As suggested, for any follow-up findings you can file a new ticket that we can track and address. Thanks a lot for these!