Maximize ALU utilization by avoiding pipeline bubbles

PENGUINLIONG commented 3 years ago

Profiling data shows that the current implementation reaches only 70% of the full ALU capacity (on Adreno 640) because of data dependencies between consecutive instructions. The provided implementation can reach 100% ALU utilization and reflects the actual maximum performance of an OpenCL platform.
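To illustrate the bottleneck, here is a minimal C sketch. It is not the code from this patch, and the function names `mad_chain` and `mad_interleaved` are hypothetical. The first function forms a single dependent chain, so each multiply-add has to wait for the previous result. The second keeps four independent chains in flight, which gives a pipelined ALU something to issue while earlier results are still pending.

```c
#include <assert.h>

/* Hypothetical illustration: one dependent chain.
   Each iteration reads the value written by the previous one,
   so the next multiply-add cannot start until that result is ready. */
float mad_chain(float a, float b, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        a = a * b + b;
    }
    return a;
}

/* Hypothetical illustration: four independent chains.
   a0..a3 never read each other, so their multiply-adds can overlap
   in the pipeline. Each chain still depends on its own previous value,
   which keeps the work from being hoisted out of the loop. */
void mad_interleaved(float acc[4], float b, int n) {
    assert(n >= 0);
    float a0 = acc[0], a1 = acc[1], a2 = acc[2], a3 = acc[3];
    for (int i = 0; i < n; ++i) {
        a0 = a0 * b + b;
        a1 = a1 * b + b;
        a2 = a2 * b + b;
        a3 = a3 * b + b;
    }
    acc[0] = a0;
    acc[1] = a1;
    acc[2] = a2;
    acc[3] = a3;
}
```

Each lane of `mad_interleaved` computes exactly what `mad_chain` would compute for the same starting value; only the scheduling freedom differs. How many independent chains are needed to hide the latency depends on the device.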

krrishnarraj commented 3 years ago

I tried this patch on my local PC. This is the output on the pocl platform:

Platform: Portable Computing Language
  Device: pthread-Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
    Driver version  : 1.5 (Linux x64)
    Compute units   : 12
    Clock frequency : 4100 MHz

Global memory bandwidth (GBPS)
  float   : 27.44
  float2  : 29.97
  float4  : 31.91
  float8  : 31.71
  float16 : 27.86

Single-precision compute (GFLOPS)
  float   : 23.49
  float2  : 47.32
  float4  : 95.59
  float8  : 190.82
  float16 : 343.63

No half precision support! Skipped

Double-precision compute (GFLOPS)
  double   : 23.37
  double2  : 46.87
  double4  : 94.91
  double8  : 175.67
  double16 : 280.08

Integer compute (GIOPS)
  int   : 27183.34
  int2  : 8061.12
  int4  : 5482.94
  int8  : 2874.55
  int16 : 2959.33

Integer compute Fast 24bit (GIOPS)
  int   : 27745.27
  int2  : 8186.09
  int4  : 5211.50
  int8  : 2745.67
  int16 : 3050.98

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 15.34
  enqueueReadBuffer               : 15.32
  enqueueWriteBuffer non-blocking : 15.39
  enqueueReadBuffer non-blocking  : 15.37
  enqueueMapBuffer(for read)      : 11645.79
    memcpy from mapped ptr        : 15.22
  enqueueUnmap(after write)       : 16989.59
    memcpy to mapped ptr          : 15.26

Kernel launch latency : 17.60 us

The integer compute numbers look abnormally high. My guess is that the compiler is optimising away the redundant calculations on the right-hand side in #define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;

doe300 commented 3 years ago

Wouldn't this also invalidate all previous benchmark results?

krrishnarraj commented 3 years ago

No. These results are from this patchset. Introducing the z variable removes the dependency between statements, which allows the compiler to optimise out the repeated (y*x) + y calculation.
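A small C sketch of why this can happen, assuming unsigned integer arithmetic and a hypothetical loop around the MAD_4 macro quoted above (the function names `mad4_as_written` and `mad4_folded` are made up for illustration). Because x and y are never modified, every right-hand side is loop-invariant, so a compiler is free to compute the per-iteration increment once and reduce the whole loop to a single multiply-add. Whether a given OpenCL compiler actually performs this transformation is an assumption, not something measured here.

```c
#include <assert.h>

/* The macro as quoted in this thread: only z is written. */
#define MAD_4(x, y, z) z += (y*x) + y; z += (x*y) + x; z += (y*x) + y; z += (x*y) + x;

/* Hypothetical benchmark loop: what the source code asks for. */
unsigned mad4_as_written(unsigned x, unsigned y, unsigned z, int iters) {
    assert(iters >= 0);
    for (int i = 0; i < iters; ++i) {
        MAD_4(x, y, z)
    }
    return z;
}

/* What an optimiser may legally turn it into: x and y never change,
   so the increment per iteration is a constant that can be hoisted,
   and the loop collapses to one multiplication and one addition.
   Unsigned arithmetic wraps, so the two forms agree modulo 2^N. */
unsigned mad4_folded(unsigned x, unsigned y, unsigned z, int iters) {
    assert(iters >= 0);
    unsigned step = 2u * ((y * x) + y) + 2u * ((x * y) + x);
    return z + (unsigned)iters * step;
}
```

Both functions return the same value for any inputs, but the second does almost no arithmetic, which would make an operations-per-second figure based on the nominal operation count look far higher than the hardware can really deliver.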

krrishnarraj commented 3 years ago

I understand the intention here. We need a better way to optimise this.

krrishnarraj / clpeak

Maximize ALU utilization by avoiding pipeline bubbles #72