sampsyo / cs6120

advanced compilers
https://www.cs.cornell.edu/courses/cs6120/2023fa/

Project Proposal: Memory Optimization and Profiling for MLIR-based HeteroCL with Polyhedral Model #311

Closed chhzh123 closed 2 years ago

chhzh123 commented 2 years ago

## Background

### HeteroCL and MLIR

HeteroCL is a programming infrastructure composed of a Python-based domain-specific language (DSL) and a compilation flow that targets heterogeneous hardware platforms. HeteroCL provides a clean abstraction that decouples algorithm specifications from three important types of hardware customization: compute, data types, and memory architectures. MLIR (Multi-Level Intermediate Representation) is a modular compiler infrastructure, part of the LLVM project, that enables optimizations at different levels of abstraction.

### Roofline Model

The roofline model gives a performance upper bound for a compute kernel or program running on a given platform. It relates attainable compute performance (GOPs/s) to operational intensity (OPs/Byte), i.e., the number of arithmetic operations per byte of off-chip memory traffic. It is often used to determine whether a compute kernel is memory-bound or compute-bound.
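
Concretely, attainable performance is bounded by min(peak compute, memory bandwidth × operational intensity). A back-of-the-envelope sketch of this bound (the device numbers below are made up, not from any particular platform):

```python
# Roofline bound: attainable = min(peak, bandwidth * intensity).
PEAK_GOPS = 100.0   # assumed peak compute throughput (GOPs/s)
BANDWIDTH = 10.0    # assumed off-chip memory bandwidth (GB/s)

def attainable_gops(intensity_ops_per_byte):
    """Upper bound on performance for a kernel of the given intensity."""
    return min(PEAK_GOPS, BANDWIDTH * intensity_ops_per_byte)

print(attainable_gops(0.25))  # 2.5 GOPs/s   -> memory-bound
print(attainable_gops(50.0))  # 100.0 GOPs/s -> compute-bound
```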

### Polyhedral Model

In the programming sense, the polyhedral model represents the iteration space of a nested loop as a mathematical object called a polyhedron; each loop iteration is a lattice point inside it. The polyhedral model is widely used in loop optimization, program scheduling, and memory access analysis.
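
For instance (a plain-Python illustration, not an actual polyhedral library), the iteration space of a triangular loop nest is a polyhedron whose lattice points are exactly the executed iterations:

```python
# Iteration space of
#   for (i, 0, 4)
#     for (j, 0, i + 1)
# is the polyhedron { (i, j) : 0 <= i < 4, 0 <= j <= i }.
domain = [(i, j) for i in range(4) for j in range(i + 1)]
print(domain)
# [(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2), (3, 0), ...]
```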

### Read and Write Buffer Generation

To ameliorate the cost of off-chip data accesses, we can build read buffers that store data for future reuse, and write buffers that decouple memory write-back from computation. FPGAs let us build customized memory hierarchies to suit the needs of compute kernels. An example of a read buffer is the line/window buffer in a typical 2D convolution kernel implementation on an FPGA.
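
As a minimal illustration (plain Python standing in for the generated HLS code), a read buffer for the 1D stencil B[i] = A[i-1] + A[i] + A[i+1] can be a 3-element shift register, so each element of A is fetched from off-chip memory exactly once:

```python
A = list(range(10))   # input array in off-chip memory
B = [0] * len(A)      # output (boundary elements left as 0 here)
buf = [0, 0, 0]       # on-chip window buffer, size = stencil width

for i in range(len(A)):
    buf[0], buf[1] = buf[1], buf[2]  # shift the window
    buf[2] = A[i]                    # one off-chip read per iteration
    if i >= 2:
        # buf now holds A[i-2], A[i-1], A[i]
        B[i - 1] = buf[0] + buf[1] + buf[2]

print(B)  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 0]
```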

## What will you do?

The original HeteroCL implementation of the memory primitives (1) only supports read patterns, (2) is not systematic for high-dimensional stencils, and (3) does not provide guidance to users on how to optimize their programs. To make HeteroCL target more general programs and further improve performance, we plan to leverage the polyhedral model to analyze memory access patterns, generate on-chip buffers, and profile performance. Concretely, we will work on the following three subtasks:

  1. Generate read buffers for the `reuse_at` primitive: Many applications have memory access patterns that incur data reuse when traversing arrays. For example, the 1D stencil kernel A[i-1] + A[i] + A[i+1] reads two of the same elements in two consecutive iterations. For the convolution pattern A[i+rx][j+ry] * B[rx][ry] with a 3×3 kernel, six elements of array A can be reused from the previous iteration. Thus, generating buffers for those arrays and reusing the data can greatly reduce redundant memory accesses and improve the final performance.
  2. Generate write buffers for the `buffer_at` primitive: For the GEMM example, C[i][j] += A[i][k] * B[k][j], if no buffer is specified, we need to load C from off-chip memory, compute and add the result, and store it back to memory, which incurs large overheads. Instead, we can create an intermediate on-chip buffer for C to reduce off-chip memory accesses.
  3. Memory profiling: We can generate a roofline model from the machine specifications and the calculated kernel performance, and see where different design points land on the figure.

## How will you do it?

We have already designed and implemented the HeteroCL MLIR dialect facilities in a private repository, so we only need to write passes for those read/write optimizations. As mentioned above, we will leverage the polyhedral model to analyze read/write instructions and generate corresponding read/write buffers to reduce memory access from off-chip memory. More specifically, we will implement the above optimizations in the following way:

  1. Read buffer generation (`reuse_at`): We extract read patterns from the polyhedral model and generate reuse buffers axis by axis. Specifically, suppose we have an input tensor of shape (a1, a2, …, aN) and a stencil kernel that can be covered by a rectangular hull of shape (s1, s2, …, sN), and reuse patterns exist along each dimension; then we will generate a sequence of hypercube buffers with the following shapes. The figure below shows a 2D example of the generated line buffers and window buffers. For more information, please refer to the Memory Customization Tutorial.
    (s1, a2, ..., aN) // reuse at the 1st axis
    (s1, s2, ..., aN) // reuse at the 1st and 2nd axis
    ...
    (s1, s2, ..., sN) // reuse at all the axes
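
    A small sketch of how this shape sequence can be enumerated (`reuse_buffer_shapes` is a hypothetical helper for illustration, not the actual MLIR pass):

    ```python
    def reuse_buffer_shapes(tensor_shape, stencil_shape):
        # After reusing along the first d axes, the buffer keeps the
        # stencil extent s_k on those axes and the full tensor
        # extent a_k on the remaining axes.
        assert len(tensor_shape) == len(stencil_shape)
        n = len(tensor_shape)
        return [tuple(stencil_shape[:d]) + tuple(tensor_shape[d:])
                for d in range(1, n + 1)]

    # 2D example: an (8, 8) input with a (3, 3) stencil yields a
    # line buffer (3, 8) and a window buffer (3, 3).
    print(reuse_buffer_shapes((8, 8), (3, 3)))  # [(3, 8), (3, 3)]
    ```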

  2. Write buffer generation (`buffer_at`): We extract write patterns from the polyhedral model. Based on the user-specified axis, we generate write buffers at the corresponding loop level. For example, for the classical GEMM case, we can generate a write buffer at the 1st axis, as shown below.
    
    // before transformation
    for (i, 0, 2)
      for (j, 0, 2)
        for (k, 0, 2)
          C[i, j] += A[i, k] * B[k, j]

    // after transformation
    for (i, 0, 2) {          // non-reduction
      allocate buf_C[2]      // on-chip buffer
      // 1. initialization
      for (j, 0, 2)          // non-reduction
        buf_C[j] = 0
      // 2. computation
      for (j, 0, 2)
        for (k, 0, 2)        // reduction
          buf_C[j] += A[i, k] * B[k, j]
      // 3. write back
      for (j, 0, 2)
        C[i, j] = buf_C[j]
    }


  3. Profiling: We can obtain the memory access instructions from the polyhedral analysis and count the off-chip memory accesses of a specific kernel. The kernel's arithmetic intensity `I` is the ratio of MAC operations to total data movement (MACs/Byte). Given the memory bandwidth and peak performance of the target device, we can then plot the roofline and place each design point on it, as sketched below.
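
    A sketch of this profiling flow (the device specs and traffic counts below are illustrative assumptions; the real pass would derive the access counts from the polyhedral analysis):

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    PEAK_GOPS, BANDWIDTH = 100.0, 10.0  # assumed device specs
    M = N = K = 64                      # GEMM sizes
    BYTES = 4                           # bytes per element
    macs = M * N * K                    # one MAC per (i, j, k) point

    # Off-chip traffic: naively, A, B, and C are loaded and C is
    # stored on every iteration; with buffer_at, C stays on-chip
    # except for one write-back per element.
    naive = 4 * macs * BYTES
    buffered = (2 * macs + M * N) * BYTES

    xs = np.logspace(-2, 2, 100)
    plt.loglog(xs, np.minimum(PEAK_GOPS, BANDWIDTH * xs), label="roofline")
    for name, traffic in [("naive GEMM", naive), ("with buffer_at", buffered)]:
        I = macs / traffic  # MACs/Byte
        plt.plot(I, min(PEAK_GOPS, BANDWIDTH * I), "o", label=name)
    plt.xlabel("operational intensity (MACs/Byte)")
    plt.ylabel("attainable performance (GOPs/s)")
    plt.legend()
    plt.savefig("roofline.png")
    ```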

## How will you empirically measure success?
### Read buffer generation
We measure the success of read buffer generation by generating reuse buffers for stencil programs such as 2D convolution and pipelining the kernel to an initiation interval (II) of 1 cycle. The output of our compilation framework is Vivado HLS code. We will validate the kernels with read buffers through simulation and synthesis.

### Write buffer generation
Similarly, we measure the success of write buffer generation with a case study of 2D convolution with [interleaving accumulation](https://ieeexplore.ieee.org/document/9264692/). The goal is to interleave the matrix multiplication or convolution operations by creating an output line buffer and reordering loops so that the design can also achieve II = 1.

### Roofline model generation
The goal of roofline model generation is two-fold. First, we plot the roofline given the off-chip memory bandwidth and hardware resources. Second, we profile the compute kernel's memory accesses and arithmetic operations to estimate its operational intensity, which is plotted as a data point under the roofline.

## Team members
Hongzheng Chen @chhzh123 
Niansong Zhang @zzzDavid 
Jiajie Li @tonyjie 
sampsyo commented 2 years ago

Awesome; this seems super cool, everybody!!!

Here's one question I couldn't quite get from your empirical evaluation plan: what are the two things you will be comparing against one another? Will it be HeteroCL with and without your polyhedral extensions? Or will it be HeteroCL against directly writing Vivado HLS code? Or some combination of the above?

It would be useful to visualize the set of bar charts you will imagine drawing at the end of the project—this will make it clear how to prioritize the implementation work.

chhzh123 commented 2 years ago

I think those three features are not (fully) supported by the previous HalideIR-based HeteroCL. Thus, our main comparison target is the original HeteroCL, and some experiments may also compare against unoptimized HLS code (i.e., the program without an optimization schedule specified).

  1. For the `reuse_at` part, we can demonstrate that our MLIR compilation flow generates designs with the same performance as the original HeteroCL, and that it supports more patterns than the original HeteroCL can.
  2. For `buffer_at`, a new primitive that does not exist in the original HeteroCL, we can demonstrate the performance of GEMM or deep neural networks with and without this primitive. This effectively compares our HCL-MLIR against the original HeteroCL.
  3. For memory profiling, there is no comparison target. It is meant to guide programmers in optimizing their designs, and can be viewed as an enhancement to HeteroCL.
sampsyo commented 2 years ago

OK, sounds good. It seems like a pre-optimization comparison between Halide-HeteroCL and MLIR-HeteroCL would be useful to put all these pieces into context too.

chhzh123 commented 2 years ago

Sure. We can add that to our experiments.