sampsyo / cs6120

advanced compilers
https://www.cs.cornell.edu/courses/cs6120/2023fa/

Project Proposal: Banked Memory Compiler Backend in MLIR #307

Closed: andrewb1999 closed this issue 1 year ago

andrewb1999 commented 2 years ago

Background

Banked memories, or logical memories that are divided into multiple physical memories, are an important part of achieving high-performance FPGA designs. Unfortunately, designing efficient banked memories is challenging in traditional high-level synthesis (HLS) tools, and it is also difficult to explore the design space of banked memories and their optimizations within HLS tools alone. RTL languages such as Verilog are no better: they require specialized hardware design knowledge and a large amount of manual effort, which makes them inaccessible to most computer scientists.

Another important piece of background is MLIR. MLIR stands for Multi-Level Intermediate Representation and provides a standard way to create extensions on top of an LLVM-like SSA IR. These extensions, called dialects, define sets of operations, regions, and types that can be mixed with other dialects to represent a wide range of concepts. CIRCT is a project that builds on top of MLIR to simplify compilation for hardware targets, including FPGAs. CIRCT contains an in-development HLS system that compiles untimed MLIR programs to hardware circuits.

What will you do?

To address the challenge of designing and optimizing banked memories in an HLS tool, we will implement a banked memory compiler using MLIR. The compiler will start from an interface that describes the features of the memory exposed to the HLS frontend, such as dimensions, number of banks, number of ports, port-to-bank connections, latency, etc. This interface will then be compiled down through multiple levels of MLIR, resulting in a final RTL design. Optimizations can then be inserted throughout the compilation stack to improve the area efficiency and performance of the final memory.
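As a rough illustration only (this is not an actual interface definition, and every name in it is ours), the kind of information the interface will carry might look like:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a banked-memory specification; the real interface
// will be expressed as MLIR operations and attributes rather than a struct.
struct BankedMemorySpec {
  uint32_t elementWidthBits;             // width of each element
  std::vector<uint32_t> shape;           // logical dimensions
  uint32_t numBanks;                     // number of physical banks
  uint32_t readPorts, writePorts;        // externally visible ports
  std::vector<std::pair<uint32_t, uint32_t>> portToBank; // port -> bank wiring
  uint32_t readLatencyCycles;            // read latency in cycles
};
```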

The core of this project is the IR design at the multiple stages of the compiler. We likely will not have time to implement all the possible optimizations this compiler enables. One optimization that we would like to have an initial implementation of is mapping to specific memory primitives. By default, the compiler can generate an unannotated RTL memory array. Performance and area can then be optimized by mapping certain sizes of memory banks to memory primitives such as BRAMs and UltraRAMs.
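As a minimal sketch of the mapping decision (the function, thresholds, and names are ours; the capacities are the usual Xilinx UltraScale+ figures of 36 Kb per block RAM and 288 Kb per UltraRAM, and a real pass would also consider aspect ratios and port counts):

```cpp
#include <cstdint>
#include <string>

// Pick a memory primitive for one bank based on its total size.
std::string choosePrimitive(uint64_t depth, uint64_t widthBits) {
  const uint64_t bits = depth * widthBits;
  if (bits <= 1024)
    return "LUTRAM"; // tiny banks map to distributed RAM built from LUTs
  if (bits <= 36 * 1024)
    return "BRAM";   // fits in a single 36 Kb block RAM
  return "URAM";     // larger banks go to 288 Kb UltraRAMs (or cascades)
}
```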

How will you do it?

From the banked memory interface, the compiler will generate an untimed IR that describes the internal structure of the memory and its arbitration logic. After performing some optimization on this untimed IR, the compiler will lower it to a timed physical implementation, likely represented in the Calyx MLIR dialect. Once it reaches a timed physical implementation, the memory can be linked with other Calyx programs or used to generate SystemVerilog directly.
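As a minimal model of what the untimed IR has to capture (plain C++ for illustration, not CIRCT code): for a cyclically banked memory, each logical address resolves to a bank plus an offset within that bank, and two accesses that resolve to the same bank in the same cycle are exactly what the arbitration logic must serialize.

```cpp
#include <cstdint>

struct BankAddress {
  uint32_t bank;   // physical bank that services the access
  uint32_t offset; // address within that bank
};

// Cyclic (interleaved) banking: consecutive addresses rotate across banks.
BankAddress mapCyclic(uint32_t logicalAddr, uint32_t numBanks) {
  return {logicalAddr % numBanks, logicalAddr / numBanks};
}

// Two same-cycle accesses conflict iff they land in the same bank.
bool needsArbitration(uint32_t addrA, uint32_t addrB, uint32_t numBanks) {
  return mapCyclic(addrA, numBanks).bank == mapCyclic(addrB, numBanks).bank;
}
```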

The entire compilation flow will be implemented within the MLIR/CIRCT framework, and optimizations will be written as MLIR passes. We will likely have to implement our own MLIR dialect to represent much of the compilation flow, which we will define using MLIR's TableGen framework. Our optimization that maps banks to specific memory primitives will be general and will use a primitive library of some form to describe what memories are available on a physical target.
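One possible shape for such a primitive library, sketched here with names and rough UltraScale+ figures of our own choosing (exact configurations vary by device family); the mapping pass would consult a per-target table like this rather than hard-coding capacities as in the earlier sketch:

```cpp
// A primitive library entry describing one physical memory resource.
struct MemPrimitive {
  const char *name;   // primitive identifier
  unsigned depth;     // native depth in one common aspect ratio
  unsigned widthBits; // native width for that aspect ratio
  unsigned ports;     // independently addressable ports
};

static const MemPrimitive kUltraScalePlus[] = {
    {"LUTRAM", 64, 1, 2},     // distributed RAM built from LUTs
    {"BRAM36", 512, 72, 2},   // 36 Kb block RAM
    {"URAM288", 4096, 72, 2}, // 288 Kb UltraRAM
};
```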

How will you empirically measure success?

Since we are (to our knowledge) the first to attempt this problem, it is hard to define an empirical measure at the moment. Instead, we define a set of "deliverables" that we plan to accomplish by the end of the semester. They include:

Team members: Andrew Butt (@andrewb1999) and Nikita Lazarev (@barabanshek)

sampsyo commented 2 years ago

Sounds like a great project! Before you get started, I think it would be good to think a little bit more about exactly how you will do the benchmarking. Comparing to RTL and Vivado HLS as baselines would be fine, but what benchmarks will you use? If it's standard HLS benchmarks like Rosetta, how will you disentangle the cost of the memories themselves from the accelerator logic that uses them? Maybe by post-processing the HLS report or something? Thinking carefully about what an apples-to-apples comparison will look like will be essential to designing a project that can demonstrate success early on.

Also, may I recommend that you consider chatting with @EclecticGriffin at some point? I recall that they implemented a "banked memory synthesizer" for a class project on program synthesis last semester, so they might have some thoughts or insights to share.

barabanshek commented 2 years ago

Hi Adrian, thanks for the feedback! Here are our thoughts.

We think it should be relatively straightforward to disentangle the cost of the memories from the rest of the design as long as we can actually synthesize things. The synthesis tools usually report utilization at per-component granularity, which should be enough to check.

However, another issue is which benchmarks to use. We think Rosetta makes limited use of banked memories, so we are planning to start with some hand-written microbenchmarks first and then see how it goes.

One interesting but more ambitious idea would be to compile real applications, for example data sketch algorithms, which (1) are well suited to FPGA acceleration and (2) require a decent amount of memory with non-trivial access patterns. This is one example of such work: https://arxiv.org/pdf/2005.13332.pdf. Another option is to use more complicated benchmarks with more realistic, non-trivial memory requirements, such as HiSparse. The biggest issue is that we won't be able to compile the whole design with our MLIR dialect. One idea we can try is decoupling the memory part from the rest of the design, synthesizing the rest with, for example, Vivado HLS, and synthesizing the memories with both Vivado and our flow. The open questions are whether such a decoupling is possible and whether we can integrate our separately synthesized memories with the rest of the design compiled in a different flow.

sampsyo commented 2 years ago

We think it should be relatively straightforward to disentangle the cost of the memories from the rest of the design as long as we can actually synthesize things. The synthesis tools usually report utilization at per-component granularity, which should be enough to check.

OK! Let's get concrete about what you will measure, though: is it the number of BRAMs? Or some latency number? I'm just not sure what the quantity is that you'll optimize for—and picking this will inform how the project goes.

We think Rosetta makes limited use of banked memories, so we are planning to start with some hand-written microbenchmarks first and then see how it goes.

Got it; sounds good. I guess the big thing to answer here is how much the actual application "logic" is relevant to the compiler at hand. You could, for example, not target proper applications at all: just specifications of the memory interface. That is, a "benchmark" could be a specification like this:
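```
# illustrative numbers only; the real fields would mirror whatever the
# banked-memory interface exposes
memory  : 1024 x 32 bits
banks   : 4, cyclic partitioning
ports   : 2 read, 1 write
latency : 1-cycle reads
```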

…or something like that. Then you'd just be compiling the memory itself, divorced from the application that consumes it. Or you could profile applications that use these memories to see if they go faster or slower.

Anyway, nailing down exactly what the benchmarks will look like would be a good first step!

andrewb1999 commented 2 years ago

OK! Let's get concrete about what you will measure, though: is it the number of BRAMs? Or some latency number? I'm just not sure what the quantity is that you'll optimize for—and picking this will inform how the project goes.

Some combination of BRAM, UltraRAM, and LUTRAM usage is likely what we will measure. As the goal of this system (at least initially) is more to increase productivity, it's unlikely we will be able to beat the area results of hand-optimized HLS or RTL. However, being able to match the number of RAMs generated by an equivalent HLS design would probably be a good initial goal.

There are potential optimizations that could reduce RAM usage compared to HLS, such as packing multiple small banks into one BRAM/UltraRAM or using double pumping to meet the port requirements with fewer RAMs. These are the kinds of research topics we want to investigate with this compilation flow in the future, though, not necessarily something in the scope of this project itself.

Got it; sounds good. I guess the big thing to answer here is how much the actual application "logic" is relevant to the compiler at hand. You could, for example, not target proper applications at all: just specifications of the memory interface.

I think separating out the memory from the compute as much as possible would be the goal for this project. Actually using these memories in the CIRCT-HLS flow will require additional work and probably isn't in scope for the project. I think our benchmarks would be something like the interface specification you suggested. We could then write some HLS code by hand to meet these specifications as a comparison baseline. If we are able to match the memory area of the hand-written HLS code in a number of cases, that should be enough to demonstrate that the flow works.

sampsyo commented 2 years ago

OK! One thing to worry about a bit here is that, in my experience, controlling an HLS tool to do anything specific with memories can be pretty tricky. Vivado HLS will often ignore your requests (in the form of pragmas, etc.) and do what it thinks is best. So it might be a good idea to get started now crafting the HLS code for a few example memory setups so you're not surprised later by not having a baseline.

andrewb1999 commented 2 years ago

Yeah, I definitely know the frustrations of memory in Vivado HLS. My hope is that our initial benchmarks will be simple enough (straight array partitioning or two banks with arbitration) that they will not be too difficult to write in Vivado HLS. Getting started on writing that HLS code is a good idea, though.
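For concreteness, here is roughly what the simplest baseline (straight cyclic array partitioning) might look like; all names are made up, and pragma spellings vary a bit between Vivado HLS and Vitis HLS versions:

```cpp
#define N 1024

// Toy kernel: a 1024-element buffer cyclically partitioned into 4 banks so
// the unrolled loops can issue 4 accesses per iteration, one per bank.
void copy_plus_one(const int in[N], int out[N]) {
  int buf[N];
  #pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1

  for (int i = 0; i < N; i++) {
    #pragma HLS UNROLL factor=4
    buf[i] = in[i];
  }
  for (int i = 0; i < N; i++) {
    #pragma HLS UNROLL factor=4
    out[i] = buf[i] + 1;
  }
}
```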