Open PENGUINLIONG opened 3 years ago
I tried this patch on my local PC. This is the output on the pocl platform:
Platform: Portable Computing Language
  Device: pthread-Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    Driver version  : 1.5 (Linux x64)
    Compute units   : 12
    Clock frequency : 4100 MHz

    Global memory bandwidth (GBPS)
      float   : 27.44
      float2  : 29.97
      float4  : 31.91
      float8  : 31.71
      float16 : 27.86

    Single-precision compute (GFLOPS)
      float   : 23.49
      float2  : 47.32
      float4  : 95.59
      float8  : 190.82
      float16 : 343.63

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 23.37
      double2  : 46.87
      double4  : 94.91
      double8  : 175.67
      double16 : 280.08

    Integer compute (GIOPS)
      int   : 27183.34
      int2  : 8061.12
      int4  : 5482.94
      int8  : 2874.55
      int16 : 2959.33

    Integer compute Fast 24bit (GIOPS)
      int   : 27745.27
      int2  : 8186.09
      int4  : 5211.50
      int8  : 2745.67
      int16 : 3050.98

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 15.34
      enqueueReadBuffer               : 15.32
      enqueueWriteBuffer non-blocking : 15.39
      enqueueReadBuffer non-blocking  : 15.37
      enqueueMapBuffer(for read)      : 11645.79
        memcpy from mapped ptr        : 15.22
      enqueueUnmap(after write)       : 16989.59
        memcpy to mapped ptr          : 15.26

    Kernel launch latency : 17.60 us
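The integer compute numbers look abnormally high: scalar int reports 27183.34 GIOPS, over a thousand times the scalar float figure of 23.49 GFLOPS. My guess is that the compiler is optimising away the redundant calculations on the right-hand side in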
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;
Wouldn't this also invalidate all previous benchmark results?
No. These results are from this patchset. Introducing the z variable removes the dependency between statements: x and y are never written, so the compiler is free to optimise out the repeated (y*x) + y calculation.
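To make the concern concrete, here is a minimal sketch in plain C (not OpenCL C, and not code from the patch) showing that the macro quoted above computes the same value as a single multiply followed by a few additions. The function names `mad4_as_written`, `mad4_folded`, and `check_equivalence` are hypothetical, and unsigned 32-bit integers are assumed so that overflow wraps in a well-defined way.

```c
#include <assert.h>
#include <stdint.h>

/* Macro exactly as quoted in the thread. */
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;

/* Applies the macro once. x and y are only read, so both products are
 * identical and loop-invariant from the compiler's point of view. */
uint32_t mad4_as_written(uint32_t x, uint32_t y, uint32_t z)
{
    MAD_4(x, y, z)
    return z;
}

/* A form an optimising compiler may legally reduce the macro to:
 * one multiply instead of four. (Assumed illustration; the actual
 * code generated by any given compiler may differ.) */
uint32_t mad4_folded(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t p = x * y;
    return z + 2u * (p + y) + 2u * (p + x);
}

/* Spot-checks that both forms agree, including on wrapping inputs. */
void check_equivalence(void)
{
    assert(mad4_as_written(3u, 5u, 7u) == mad4_folded(3u, 5u, 7u));
    assert(mad4_as_written(0xFFFFFFFFu, 12345u, 1u) ==
           mad4_folded(0xFFFFFFFFu, 12345u, 1u));
}
```

If a compiler performs this kind of folding, the benchmark would count four multiply-adds per macro expansion while executing far fewer, which would be consistent with the inflated integer figures above.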
I understand the intention here. We need a better way to optimise this.
Profiling data shows that the current implementation reaches only 70% of the full ALU capacity (on Adreno 640) because it is limited by the data dependency between instructions. The provided implementation can reach 100% ALU utilisation and so reflects the actual peak performance of an OpenCL platform.
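One possible direction, sketched below in plain C under stated assumptions, is to keep each chain's data dependency (so no result can be folded away) while interleaving several independent chains so the ALU always has work that does not wait on the previous instruction. This is not the patch's implementation. `chain_single` and `chain_dual` are hypothetical names, and the dependent update `x = (y*x) + y; y = (x*y) + x;` is an assumed form of a dependency-carrying kernel body.

```c
#include <stdint.h>

/* One dependent chain: each statement consumes the previous result,
 * so nothing can be hoisted or merged, but instructions serialise. */
uint32_t chain_single(uint32_t x, uint32_t y, int iters)
{
    for (int i = 0; i < iters; ++i) {
        x = (y * x) + y;
        y = (x * y) + x;
    }
    return x ^ y; /* fold both values so neither is dead code */
}

/* Two chains interleaved: chain 0 and chain 1 never read each other's
 * values, so adjacent statements are independent and can overlap,
 * while every multiply still feeds a live result. */
uint32_t chain_dual(uint32_t x0, uint32_t y0,
                    uint32_t x1, uint32_t y1, int iters)
{
    for (int i = 0; i < iters; ++i) {
        x0 = (y0 * x0) + y0;
        x1 = (y1 * x1) + y1;
        y0 = (x0 * y0) + x0;
        y1 = (x1 * y1) + x1;
    }
    return (x0 ^ y0) + (x1 ^ y1);
}
```

By construction, `chain_dual(a, b, c, d, n)` returns the same value as `chain_single(a, b, n) + chain_single(c, d, n)`, so the interleaved version does exactly twice the arithmetic and the operation count stays honest. Whether two chains are enough to saturate a given device is an open question that would need profiling.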