Open chillingche opened 2 years ago
Before expansion:
#define DOT_A4B16C4(a, b, c)                                        \
    {                                                               \
        c.x += (a.x * b.s0 + a.y * b.s1 + a.z * b.s2 + a.w * b.s3); \
        c.y += (a.x * b.s4 + a.y * b.s5 + a.z * b.s6 + a.w * b.s7); \
        c.z += (a.x * b.s8 + a.y * b.s9 + a.z * b.sa + a.w * b.sb); \
        c.w += (a.x * b.sc + a.y * b.sd + a.z * b.se + a.w * b.sf); \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor start
[DEBUG] thread 15285 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 constructor end
[DEBUG] thread 15285 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2797.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1689.088000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1257.984000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1140.992000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1051.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 1120.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 1175.040000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1026.048000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2488.832000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1725.952000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1430.016000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1312.000000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5136.896000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3611.136000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 3038.976000 us
[DEBUG] thread 15285 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 13.056000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 77.056000 us
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 80.128000 us
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1022.976000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 1000.960000 us
[DEBUG] thread 15285 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 860.928000 us
[DEBUG] thread 15285 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 860.160000 us
[INFO] thread 15285 min_time = 0.860160
[INFO] thread 15285 max_time = 0.860160
[INFO] thread 15285 avg_time = -0.000000
[DEBUG] thread 15285 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 140.032000 us
[DEBUG] thread 15285 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 79.104000 us
[INFO] thread 15285 16bit, Convolution, (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128), TIME 0.860ms, GFLOPS 351.695
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 13129
abs(diff) >= 1.000000e-03f, number = 339363
abs(diff) >= 1.000000e-04f, number = 123968
abs(diff) >= 1.000000e-05f, number = 681
abs(diff) >= 0.000000e+00f, number = 47147
maxabs = 0.046875, a = 4.781250, b = 4.828125 @ 357254
maxrel = 10498.046875, a = 0.002625, b = -0.002625 @ 278147
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor start
[DEBUG] thread 15285 gcl_kernel_source 0xb4000074402203c0 constructor
[DEBUG] thread 15285 OCLContext 0x589b080390 deconstructor end
After expansion:
#define DOT_A4B16C4(a, b, c)    \
    {                           \
        c.x += (a.x * b.s0);    \
        c.x += (a.y * b.s1);    \
        c.x += (a.z * b.s2);    \
        c.x += (a.w * b.s3);    \
        c.y += (a.x * b.s4);    \
        c.y += (a.y * b.s5);    \
        c.y += (a.z * b.s6);    \
        c.y += (a.w * b.s7);    \
        c.z += (a.x * b.s8);    \
        c.z += (a.y * b.s9);    \
        c.z += (a.z * b.sa);    \
        c.z += (a.w * b.sb);    \
        c.w += (a.x * b.sc);    \
        c.w += (a.y * b.sd);    \
        c.w += (a.z * b.se);    \
        c.w += (a.w * b.sf);    \
    }
./test_convolution_ocl 32 128 128 32 3 3 1 1 0
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor start
[DEBUG] thread 17343 try to dlopen libQUALCOMM_Adreno_650_map.so failed, dlopen failed: library "libQUALCOMM_Adreno_650_map.so" not found, create kernel from source code
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 constructor end
[DEBUG] thread 17343 get forward run info from cache fail, try to find best forward run info
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3311 runInfo: ls <0 0 0> executeTime = 2744.832000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3321 runInfo: ls <0 0 0> executeTime = 1667.072000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3331 runInfo: ls <0 0 0> executeTime = 1198.080000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3341 runInfo: ls <0 0 0> executeTime = 1105.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3351 runInfo: ls <0 0 0> executeTime = 1036.032000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3361 runInfo: ls <0 0 0> executeTime = 944.896000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3371 runInfo: ls <0 0 0> executeTime = 958.976000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 907.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3312 runInfo: ls <0 0 0> executeTime = 2529.024000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3322 runInfo: ls <0 0 0> executeTime = 1652.992000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3332 runInfo: ls <0 0 0> executeTime = 1390.848000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3342 runInfo: ls <0 0 0> executeTime = 1227.008000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3314 runInfo: ls <0 0 0> executeTime = 5095.936000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3324 runInfo: ls <0 0 0> executeTime = 3202.048000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3334 runInfo: ls <0 0 0> executeTime = 2576.896000 us
[DEBUG] thread 17343 enqueue_fill_image runInfo: executeTime = 17.920000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_trans_flt_hw_44 runInfo: executeTime = 12.800000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_write_buffer runInfo: executeTime = 68.864000 us
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_om_nchw_to_nchwc4 runInfo: executeTime = 78.080000 us
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 Get memory val without allocated, the capacitySize is 0
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 914.944000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <0 0 0> executeTime = 895.232000 us
[DEBUG] thread 17343 SELECT LS KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: best ls = 8 1 8 executeTime = 760.064000 us
[DEBUG] thread 17343 KERNEL>>> unknow_conv_direct_sh1_qc_iom_3381 runInfo: ls <8 1 8> executeTime = 768.000000 us
[INFO] thread 17343 min_time = 0.768000
[INFO] thread 17343 max_time = 0.768000
[INFO] thread 17343 avg_time = -0.000000
[DEBUG] thread 17343 KERNEL>>> unknow_mem_trans_im_nchwc4_to_nchw runInfo: executeTime = 139.008000 us
[DEBUG] thread 17343 DATATRANS>>> enqueue_read_buffer runInfo: executeTime = 77.056000 us
[INFO] thread 17343 16bit, Convolution, (1 32 1 128 128)+(32 32 1 3 3)/(1 1 1 1 0 0 1 1 1 1)=(1 32 1 128 128), TIME 0.768ms, GFLOPS 393.899
abs(diff) >= 1.000000e+00f, number = 0
abs(diff) >= 1.000000e-01f, number = 0
abs(diff) >= 1.000000e-02f, number = 7769
abs(diff) >= 1.000000e-03f, number = 349884
abs(diff) >= 1.000000e-04f, number = 118162
abs(diff) >= 1.000000e-05f, number = 814
abs(diff) >= 0.000000e+00f, number = 47659
maxabs = 0.039062, a = -3.292969, b = -3.253906 @ 68999
maxrel = 11718.750000, a = 0.002930, b = -0.002930 @ 386530
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor start
[DEBUG] thread 17343 gcl_kernel_source 0xb400007ab98203c0 constructor
[DEBUG] thread 17343 OCLContext 0x5e124b4390 deconstructor end
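For anyone who wants to compare the two forms outside the kernel, here is a minimal host-side sketch in plain C. The `vec4` and `vec16` structs, the macro names `DOT_FUSED` and `DOT_EXPANDED`, and the helper functions are hypothetical stand-ins that are not part of the library; they only mimic the OpenCL vector component names so the two macro bodies can be copied unchanged. It uses `float` rather than the 16-bit type from the benchmark, so it illustrates the difference in summation order, not the GPU behaviour or timing.

```c
#include <assert.h>

/* Hypothetical host-side stand-ins for the OpenCL 4- and 16-wide vectors.
   Only the component names match; the element type here is float. */
typedef struct { float x, y, z, w; } vec4;
typedef struct {
    float s0, s1, s2, s3, s4, s5, s6, s7;
    float s8, s9, sa, sb, sc, sd, se, sf;
} vec16;

/* Form before expansion: sum the four products, then add once to c. */
#define DOT_FUSED(a, b, c)                                          \
    {                                                               \
        c.x += (a.x * b.s0 + a.y * b.s1 + a.z * b.s2 + a.w * b.s3); \
        c.y += (a.x * b.s4 + a.y * b.s5 + a.z * b.s6 + a.w * b.s7); \
        c.z += (a.x * b.s8 + a.y * b.s9 + a.z * b.sa + a.w * b.sb); \
        c.w += (a.x * b.sc + a.y * b.sd + a.z * b.se + a.w * b.sf); \
    }

/* Form after expansion: accumulate each product into c separately. */
#define DOT_EXPANDED(a, b, c)                                       \
    {                                                               \
        c.x += (a.x * b.s0); c.x += (a.y * b.s1);                   \
        c.x += (a.z * b.s2); c.x += (a.w * b.s3);                   \
        c.y += (a.x * b.s4); c.y += (a.y * b.s5);                   \
        c.y += (a.z * b.s6); c.y += (a.w * b.s7);                   \
        c.z += (a.x * b.s8); c.z += (a.y * b.s9);                   \
        c.z += (a.z * b.sa); c.z += (a.w * b.sb);                   \
        c.w += (a.x * b.sc); c.w += (a.y * b.sd);                   \
        c.w += (a.z * b.se); c.w += (a.w * b.sf);                   \
    }

/* c is passed by value, so each helper returns the updated accumulator. */
vec4 dot_fused(vec4 a, vec16 b, vec4 c)    { DOT_FUSED(a, b, c);    return c; }
vec4 dot_expanded(vec4 a, vec16 b, vec4 c) { DOT_EXPANDED(a, b, c); return c; }

/* Small integer inputs are exactly representable, so both summation
   orders must give bit-identical results here. */
void check_forms_agree(void)
{
    vec4  a = {1, 2, 3, 4};
    vec16 b = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    vec4  c = {1, 1, 1, 1};
    vec4  f = dot_fused(a, b, c);
    vec4  e = dot_expanded(a, b, c);
    assert(f.x == e.x && f.y == e.y && f.z == e.z && f.w == e.w);
}
```

Calling `check_forms_agree()` from a `main` passes because the inputs are small integers. With general inputs the two forms round in a different order, which might contribute to the slightly different `abs(diff)` histograms in the two logs above, although that is only a guess from this sketch.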
./test_convolution_ocl 64 256 256 32 3 3 1 1 0
Before expansion:
After expansion: