huawei-noah / bolt

Bolt is a deep learning library with high performance and heterogeneous flexibility.
https://huawei-noah.github.io/bolt/
MIT License
909 stars 158 forks

Unrolling the scalar dot operations in OCL kernels yields higher GFLOPs #113

Open chillingche opened 2 years ago

chillingche commented 2 years ago

Before unrolling:

#define DOT_A4B16C4(a, b, c)                                        \
    {                                                               \
        c.x += (a.x * b.s0 + a.y * b.s1 + a.z * b.s2 + a.w * b.s3); \
        c.y += (a.x * b.s4 + a.y * b.s5 + a.z * b.s6 + a.w * b.s7); \
        c.z += (a.x * b.s8 + a.y * b.s9 + a.z * b.sa + a.w * b.sb); \
        c.w += (a.x * b.sc + a.y * b.sd + a.z * b.se + a.w * b.sf); \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor start
[DEBUG] thread 15285 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor end
[DEBUG] thread 15285 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2797.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1689.088000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1257.984000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1140.992000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1051.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 1120.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 1175.040000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1026.048000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2488.832000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1725.952000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1430.016000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1312.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5136.896000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3611.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 3038.976000 us
[DEBUG] thread 15285 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 13.056000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 77.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 80.128000 us
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1022.976000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1000.960000 us
[DEBUG] thread 15285 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 860.928000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 860.160000 us
[INFO] thread 15285 min_time = 0.860160
[INFO] thread 15285 max_time = 0.860160
[INFO] thread 15285 avg_time = -0.000000
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 140.032000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 79.104000 us
[INFO] thread 15285 16bit,          Convolution,            (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128),     TIME    0.860ms,        GFLOPS  351.695
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 13129
abs(diff) >= 1.000000e-03f, number = 339363
abs(diff) >= 1.000000e-04f, number = 123968
abs(diff) >= 1.000000e-05f, number = 681
abs(diff) >= 0.000000e+00f, number = 47147
maxabs = 0.046875, a = 4.781250, b = 4.828125 @ 357254
maxrel = 10498.046875, a = 0.002625, b = -0.002625 @ 278147
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor start
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor end

After unrolling:

#define DOT_A4B16C4(a, b, c) \
    {                        \
        c.x += (a.x * b.s0); \
        c.x += (a.y * b.s1); \
        c.x += (a.z * b.s2); \
        c.x += (a.w * b.s3); \
        c.y += (a.x * b.s4); \
        c.y += (a.y * b.s5); \
        c.y += (a.z * b.s6); \
        c.y += (a.w * b.s7); \
        c.z += (a.x * b.s8); \
        c.z += (a.y * b.s9); \
        c.z += (a.z * b.sa); \
        c.z += (a.w * b.sb); \
        c.w += (a.x * b.sc); \
        c.w += (a.y * b.sd); \
        c.w += (a.z * b.se); \
        c.w += (a.w * b.sf); \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor start
[DEBUG] thread 17343 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor end
[DEBUG] thread 17343 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2744.832000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1667.072000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1198.080000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1105.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1036.032000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 944.896000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 958.976000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 907.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2529.024000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1652.992000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1390.848000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1227.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5095.936000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3202.048000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 2576.896000 us
[DEBUG] thread 17343 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 12.800000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 68.864000 us
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 78.080000 us
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 914.944000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 895.232000 us
[DEBUG] thread 17343 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 760.064000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 768.000000 us
[INFO] thread 17343 min_time = 0.768000
[INFO] thread 17343 max_time = 0.768000
[INFO] thread 17343 avg_time = -0.000000
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 139.008000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 77.056000 us
[INFO] thread 17343 16bit,          Convolution,            (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128),     TIME    0.768ms,        GFLOPS  393.899
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 7769
abs(diff) >= 1.000000e-03f, number = 349884
abs(diff) >= 1.000000e-04f, number = 118162
abs(diff) >= 1.000000e-05f, number = 814
abs(diff) >= 0.000000e+00f, number = 47659
maxabs = 0.039062, a = -3.292969, b = -3.253906 @ 68999
maxrel = 11718.750000, a = 0.002930, b = -0.002930 @ 386530
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor start
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor end
chillingche commented 2 years ago

./test_convolution_ocl 64 256 256 32 3 3 1 1 0

yuxianzhi commented 2 years ago

[image]