[HandshakeToFIRRTL] Lowering of MemoryOp/LoadOp/StoreOp

hanchenye commented 3 years ago

See issue #286 and the discussion here.

I plan to support the lowering of these operations in several PRs:

[x] MemoryOp
[x] LoadOp
[x] StoreOp

mikeurbach commented 3 years ago

@hanchenye thanks for opening this issue. I have only spent a little bit of time reviewing the original Handshake slides and the StandardToHandshake pass to see how the MemoryOp is generated. I am going to spend some time understanding the FIRRTL memories next. I'll share any ideas or questions here as they come up.

For starters, what is a good, minimal test case to start with? I was thinking this one, compiled with circt-opt -create-dataflow: https://github.com/llvm/circt/blob/e42d43925d5a8304ff51eec0b252aca90177e399/test/handshake-runner/loadstore.mlir#L5-L13

JuanEsco063 commented 3 years ago

Thank you @hanchenye and @mikeurbach for working on this! I am not much of a compiler person but I can help whichever way I can. Coming up with testcases or testing the updates.

mikeurbach commented 3 years ago

I have a small update on this issue, and I wanted to share before going too much further. I did some research into the FIRRTL spec and the Handshake slides about memories, and I have a prototype that lowers a Handshake MemoryOp by creating a FIRRTL MemOp. It does not support LSQ at the moment. I have the functionality to wire up the memory's data and address signals for every load/store port, and now I am working on connecting the control logic.

I think I am close to being able to share a PR. Before that, I want to make sure I am understanding the Handshake slides about connecting to memory (slides 55-64). I'm drawing up a diagram of how I plan to implement the logic that I'll share here for review before I jump into coding. Are there any other resources that provide some more detail on how to wire the control logic?

hanchenye commented 3 years ago

@mikeurbach I don't have other materials on my hand about the control logic implementation. Therefore, I think a logic diagram would be very helpful for us to discuss the implementation details.

According to my understanding, since the load port of handshake::MemOp is a 1-to-2 component, the logic should be similar to a fork operation (Fig. 10 of this document) with the address and data connected to the firrtl::MemoryOp combinationally. For the store port which is 2-to-1 component, it should be similar to a join operation with a register inserted to the valid path for aligning with the write latency (suppose a write latency of 1). These observations may be helpful to you as reference.

JuanEsco063 commented 3 years ago

@mikeurbach We would be happy to check the code when it is ready and provide feedback. Unfortunately we don't have any additional material that could help with the control logic but we could provide a few relatively simple examples to test the lowering steps.

mikeurbach commented 3 years ago

@hanchenye thanks for the feedback. It sounds like we are on the same page about the handling the load and adding a buffer to align with the store's write latency. We can set up the FIRRTL MemOp with read latency 0, write latency 1, and read under write behavior set to return the old value. With those settings, I think we can achieve the desired implementation. I'll share a circuit diagram this afternoon.

@JuanEsco063 thanks for the offer. Once the implementation shapes up, I'll be sure to reach out for some examples.

mikeurbach commented 3 years ago

@hanchenye and anyone else who is interested, here is my proposed lowering for the MemoryOp: https://circuitverse.org/users/62563/projects/handshake-to-firrtl-memory. I've never used that tool before but that circuit should be public for you to play around with.

I feel pretty good about that implementation in general, but I'm not sure if it is complete. Some notes:

Read latency = 0, write latency = 1, and read under write = old, as stated above.
This includes one load port and one store port to illustrate both sub-circuits. These can be repeated for multiple ports.
We are not storing bundles in memories, so the bit mask is always a single bit, which is just driven by the enable bit.
The load sub-circuit is currently fully combinational, and waits for both the loaded data output and control signal output to be ready. We could also implement the design from the referenced fork op, which is able to pass data to one downstream consumer and wait for the other. This would require a buffer on each output and a little more logic. Is that behavior required for correctness, or better for performance? It seems to me that the current design might be correct as-is, but I am still thinking this through.
The store sub-circuit buffers the valid signal to delay it by one cycle, but the ready signal is fully combinational and depends on the buffered valid signal. Is the ready logic correct in this case? I'm still thinking this through but in some quick tests in the UI it seemed to be correct.

This is an interesting form of review... let me know what you think. Feel free to fork and tweak the circuit if you want, that should be an option in the UI. There is also a way to export the circuit as SV, so I might give that a test in Verilator.

hanchenye commented 3 years ago

@mikeurbach The issue of the current load logic is: the load_data_valid and load_complete_valid are directly driven by load_addr_valid, which means once one of the outputs is ready, handshake happens and will possibly happen more than once (which is incorrect). Therefore, a LazyFork must wait all the outputs to ready before it can set high the output valids.

However, valid waits for ready will cause dead lock in some cases, which is also the issue of the current LazyFork lowering. Therefore, I'd recommend the referenced fork op implementation. Maybe we could encapsulate the fork logic builder into a method as it may also be used in other operations.

mikeurbach commented 3 years ago

Therefore, a LazyFork must wait all the outputs to ready before it can set high the output valids.

My original intent was to wait for both outputs to be ready before asserting load_data_valid and load_ready_valid. I definitely don't want to create deadlocks when they can be avoided, so I will update this.

Therefore, I'd recommend the referenced fork op implementation. Maybe we could encapsulate the fork logic builder into a method as it may also be used in other operations.

I will go this route, and I see you have already factored out the logic to build the forks in your PR. Thanks for that!

Antonyt80 commented 3 years ago

All, somewhat related to this, for the SODA synthesizer we (Marco) has finalized designs for lowering load and stores to FIRRTL (note that we assume AXI interfacing for all class of memories). The caveat of course is that right now we are working directly with LLVM IR in our synthesizer. We are a little bit swamped in this period, but happy to share/discuss the design at some point. Also, we do have around designs ready for decoupling memory-datapath effectively for dataflow designs that we have played with in Bambu...

mikeurbach commented 3 years ago

@Antonyt80 it would be great to hear more about SODA and the lessons learned. Would you be interested in sharing some time in one of the weekly design meetings?

mikeurbach commented 3 years ago

@hanchenye I can start looking into the StoreOp and LoadOp next. Did you have any thoughts or work in progress there?

hanchenye commented 3 years ago

@mikeurbach Not yet. Which one are you planning to look into? I can work on the other one.

mikeurbach commented 3 years ago

How about I take the StoreOp and you take the LoadOp?

Before I dive in, can I double check the logic?

It looks like we would want to implement the valid/ready logic similar to a JoinOp for both the LoadOp and StoreOp. Otherwise, the address/data operands are connected directly to the address/data results. Does that sound about right?

I'm wondering about the multiple results... do they to be separately buffered like a ForkOp, or can they be implemented more like a LazyForkOp?

hanchenye commented 3 years ago

It looks like we would want to implement the valid/ready logic similar to a JoinOp for both the LoadOp and StoreOp. Otherwise, the address/data operands are connected directly to the address/data results. Does that sound about right?

I think this should be the intention here. The input control channel ensures the memory access order is correctly maintained.

I'm wondering about the multiple results... do they to be separately buffered like a ForkOp, or can they be implemented more like a LazyForkOp?

I don't have a clear sense on this. For StoreOp, I understand that the addr_valid && data_valid signal are buffered in the MemoryOp thereby resolving the possible dead lock. But before we have more experiences, how about we start from ForkOp?

BTW, as we are very closing to a completed HandshakeToFIRRTL lowering, I think a strict verification will be required for each elastic component before we go further. I'm really not an expert on this, should we open up an issue or a forum thread for discussing this?

mikeurbach commented 3 years ago

But before we have more experiences, how about we start from ForkOp?

Sounds good to me.

BTW, as we are very closing to a completed HandshakeToFIRRTL lowering, I think a strict verification will be required for each elastic component before we go further. I'm really not an expert on this, should we open up an issue or a forum thread for discussing this?

I agree, let's start a discussion on the forum about this topic.

mikeurbach commented 3 years ago

For StoreOp, I understand that the addr_valid && data_valid signal are buffered in the MemoryOp thereby resolving the possible dead lock. But before we have more experiences, how about we start from ForkOp

I have more thoughts about this, please let me know if it makes sense to you. The reason ForkOp has to buffer each result individually is to account for different downstream components being ready at other times. In the case of a StoreOp, the two results are the address and data sent to the MemoryOp. We know from our implementation of MemoryOp that the address and data ready signals are always driven at the same time.

As long as that holds, I think we can implement StoreOp combinationally, just like you did for LoadOp. Does that make sense? It is easy enough to call buildForkLogic if we still want to have the results buffered, but it seemed to me that it would be better to avoid the buffering if we don't need it.

mikeurbach commented 3 years ago

@hanchenye another unrelated question: it looks like both load and store have variadic address indices as arguments and results. It seems like #433 is assuming there is a single index. Frankly, that is what I assumed as well until I took another look at the ODS file.

If we really wanted to, we could support variadic arguments and results for these ops, but I'm not sure if there is a use-case for this. From all the examples I've looked at, there was only one index. Is there some more complex use-case that uses loads/stores with multiple indices? If we don't need to support multiple indices, perhaps we can remove Variadic<> from the ODS definitions to make this explicit.

hanchenye commented 3 years ago

@mikeurbach Oh, I didn't realize this before. My understanding is this is prepared for accessing multi-dimensional memories. I think a possible approach is to enhance the MemoryOp conversion (generate multipliers and an adder tree for calculating the single-dimentional address), where the shape information could be obtained from MemRefTypeAttr.

I'll also update the logic of LoadOp. Does it make sense to factor out the JoinOp logic builder as well?

mikeurbach commented 3 years ago

Got it thanks for the background.

Does it make sense to factor out the JoinOp logic builder as well?

I thought about this with memories, and decided the logic was simple enough to repeat. Now that we are repeating it again, I am starting to think it would be nice to re-use.

By the way, I just pushed the naive implementation of StoreOp here: https://github.com/llvm/circt/pull/441. I wasn't sure about the above so I just implemented it for one address with combinational logic to get the ball rolling. Let me know here or on that PR what you think.

mikeurbach commented 3 years ago

@hanchenye looks like all the support is merged now. That's super exciting.

I guess the next step is to try writing some integration tests along the lines of this one?

JuanEsco063 commented 3 years ago

This is very exciting! I will try some sample kernels tomorrow!

JuanEsco063 commented 3 years ago

@mikeurbach @hanchenye The example you mention can be lowered to FIRRTL no problem ( the Verilog module ports still have issues unrolling the bundles so I might be missing an additional step but that is another issue).

Maybe I haven't fully grasped the scope of the operations supported by new lowering passes because I am getting an error when trying to lower the code from the file "code.txt" into FIRRTL ( you can see the error log in "error.txt").

The code accumulates the elements of a vector "v" into an accumulator "a".

So in summary (after converting code.txt into code.mlir), circt-opt -create-dataflow code.mlir > code_HS.mlir ----> Works circt-opt -lower-handshake-to-firrtl code_HS.mlir ----> Fails

error.txt code.txt

mikeurbach commented 3 years ago

I can reproduce the error you are showing. This should be supported, but clearly something is going wrong. It is in the middle of generating a fairly complex piece of FIRRTL, so we will have to unravel what is going wrong.

JuanEsco063 commented 3 years ago

@mikeurbach Let me know if I can do anything on my end to help

mikeurbach commented 3 years ago

The failing example is super helpful. I'm not sure when either of us will be able to debug this and resolve it, so feel free to take a look yourself if you are feeling ambitious. Otherwise, we can report here once we have a chance to dig further.

hanchenye commented 3 years ago

@JuanEsco063 @mikeurbach I've found the reason of this issue and will fix it later tonight. Thanks for catching this!

JuanEsco063 commented 3 years ago

@mikeurbach @hanchenye If I am keeping up with all the updates correctly, right now there are passes to lower standard code using memory operations to Verilog with the caveat the memories are single port (not a big issue) and the FIRRTL bundles at the port interfaces are still not lowered correctly. Once this issue is fixed, the bundles should be able to be lowered correctly. Did I miss something?

mikeurbach commented 3 years ago

I think you are pretty close, but I have a couple clarifications.

I don't think there is anything preventing memories with more than a single port from working. I haven't tested it, but I don't see why it would fail. Is there an example you have that fails with multiple ports? There was previously some discussion about multiple indices in a load/store, which is a separate issue currently not supported. That would entail supporting ops like load[0, 0] in addition to load[0]. The current state should support 1-dimensional memories with as many ports as you want.

About memories with FIRRTL bundles at the port interfaces. Yes, we will need to resolve #478 before that can work, but I don't think this situation should arise when lowering the Standard dialect through Handshake to FIRRTL. The bundles that Handshake creates are only ever at the FIRRTL module ports, and we should be selecting the data field out of such bundles before connecting to a memory. Do you have an example using the Handshake lowering that connects a bundle to a memory's port?

Also, note that the crash mentioned in #478 can occur even when primitive FIRRTL types are connected to a memory, not just bundles. This is the case with FIRRTL generated from Handshake, so yes, it is a blocker. As I mentioned there, resolving #479 should define away the issue that leads to #478 for primitive types. Once #479 is done, we should be able to lower FIRRTL generated from Handshake with memories.

seldridge commented 3 years ago

I should have both #478 and #479 fixed by next week (in that #488). Weirdly, returning multiple types started to expose problems with lowering. With that in place, you should be able to have arbitrary aggregates as memory types and these will be flattened to one memory per leaf aggregate (this splitting is necessary due to restrictions on how FIRRTL IR requires that the write mask matches the aggregate structure of the memory). Alternatively, #493 will provide a path to blackbox these memories instead of lowering them.

JuanEsco063 commented 3 years ago

@hanchenye @mikeurbach @seldridge This is amazing! Great job everyone. I have a few sample test kernels written in standard I have been testing and I am generating Verilog no problem now!

However, there might still be a few bugs in the generated logic.

1) In the generated Verilog sometimes the signal "reset" is used asynchronously and sometimes is used as a clock. i.e. sometimes you find "if(reset) ...." and others "always(posedge clk or posedge reset)". This confuses the synthesis tool (Vivado to be precise) and throws an error saying there is an ambiguous clock event. Removing the "or posedge reset" part solves this issue.

2) This one might also be a problem with Vivado. The Verilog code generated for the Counter code attached (just counts from 0 to 100) seems to be using or assuming 3 port memories. It has an address dedicated for reading (mem0_load0_addr) used here:

assign mem0_load0_data = mem0[mem0_load0_addr]; // Counter_f.mlir:4:49

but also 2 store address, as seen here

always @(posedge mem0_store0_clk) begin if (mem0_store0_en & mem0_store0_mask) mem0[mem0_store0_addr] <= mem0_store0_data; // Counter_f.mlir:4:49 end // always @(posedge) always @(posedge mem0_store1_clk) begin if (mem0_store1_en & mem0_store1_mask) mem0[mem0_store1_addr] <= mem0_store1_data; // Counter_f.mlir:4:49 end // always @(posedge)

This causes Vivado to throw the error "[Synth 8-2913] Unsupported Dual Port Block-RAM template for mem0_reg". Commenting one of the two aforementioned always blocks seems to make the error go away but I am still not quite sure of the effect in the functionality.

3) Last but not least, there seems to be some duplicated logic. Some of the control signals are seemingly generated more than once (I could be missing something). For example, the signals with + are duplicated:

... wire mem0_load0_addr; // Counter_f.mlir:4:49 + wire mem0_load0_en; // Counter_f.mlir:4:49 + wire mem0_load0_clk; // Counter_f.mlir:4:49 + wire [31:0] mem0_load0_data; // Counter_f.mlir:4:49 reg [31:0] mem0[0:0]; // Counter_f.mlir:4:49 wire mem0_store0_addr; // Counter_f.mlir:4:49 + wire mem0_store0_en; // Counter_f.mlir:4:49 + wire mem0_store0_clk; // Counter_f.mlir:4:49 + wire [31:0] mem0_store0_data; // Counter_f.mlir:4:49 wire mem0_store0_mask; // Counter_f.mlir:4:49 wire mem0_store1_addr; // Counter_f.mlir:4:49 + wire mem0_store1_en; // Counter_f.mlir:4:49 + wire mem0_store1_clk; // Counter_f.mlir:4:49 + wire [31:0] mem0_store1_data; // Counter_f.mlir:4:49 wire mem0_store1_mask; // Counter_f.mlir:4:49 wire mem0_load0_addr; // Counter_f.mlir:5:26 + wire mem0_load0_en; // Counter_f.mlir:8:24 + wire mem0_load0_clk; // Counter_f.mlir:11:25 + wire mem0_store0_addr; // Counter_f.mlir:15:27 + wire mem0_store0_en; // Counter_f.mlir:18:25 + wire mem0_store0_clk; // Counter_f.mlir:21:26 + wire mem0_store1_addr; // Counter_f.mlir:26:27 + wire mem0_store1_en; // Counter_f.mlir:29:25 + wire mem0_store1_clk; // Counter_f.mlir:32:26 + ...

I believe this duplication is indeed not for any practical purposes because you can find the following assignments later in the code where you just make them equal:

assign mem0_load0_addr = mem0_load0_addr; // Counter_f.mlir:6:12, :7:7 assign mem0_load0_en = mem0_load0_en; // Counter_f.mlir:9:12, :10:7 assign mem0_load0_clk = mem0_load0_clk; // Counter_f.mlir:12:12, :13:7 assign mem0_store0_addr = mem0_store0_addr; // Counter_f.mlir:16:12, :17:7 assign mem0_store0_en = mem0_store0_en; // Counter_f.mlir:19:12, :20:7 assign mem0_store0_clk = mem0_store0_clk; // Counter_f.mlir:22:12, :23:7 assign mem0_store1_addr = mem0_store1_addr; // Counter_f.mlir:27:12, :28:7 assign mem0_store1_en = mem0_store1_en; // Counter_f.mlir:30:13, :31:7 assign mem0_store1_clk = mem0_store1_clk; // Counter_f.mlir:33:13, :34:7

Counter.txt

mikeurbach commented 3 years ago

I just took a look at the latest main and I can also lower some examples with memories all the way to Verilog. That is super exciting.

I'm starting to look into the correctness of the generated circuits as well. @JuanEsco063 those issues are helpful, but it seems they will require further inspection, and may not even be related to this Handshake lowering per-se. Do you mind filing them separately?

JuanEsco063 commented 3 years ago

@mikeurbach Sure, I will make a new issue with those and additional problems I found

mikeurbach commented 3 years ago

Should we close this issue, since the initial lowering is implemented and we have been discussing some of the end-to-end issues in #543?

JuanEsco063 commented 3 years ago

@mikeurbach I second closing this issue @hanchenye

llvm / circt

[HandshakeToFIRRTL] Lowering of MemoryOp/LoadOp/StoreOp #337