llvm / circt

Circuit IR Compilers and Tools
https://circt.org

[HandshakeToFIRRTL] Issues with Verilog generated from Standard involving memories #543

Closed JuanEsco063 closed 3 years ago

JuanEsco063 commented 3 years ago

This is a follow-up to #337. Now that we can generate Verilog from Standard code that uses memories, I have noticed the following issues:

1) In the generated Verilog, the "reset" signal is sometimes used synchronously and sometimes asynchronously, i.e. sometimes you find

always @(posedge clk) if (reset)

and other times

always @(posedge clk or posedge reset)

This confuses the synthesis tool (Vivado, to be precise), which throws an error about an ambiguous clock event. Removing the "or posedge reset" part resolves the issue.
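
For reference, the two styles look like this in isolation (count is just a made-up register here, and the two blocks are alternatives for the same register, not meant to coexist):

// Synchronous reset: reset is only sampled on the clock edge.
always @(posedge clk) begin
  if (reset)
    count <= 32'd0;
  else
    count <= count + 32'd1;
end

// Asynchronous reset: reset is in the sensitivity list and takes effect immediately.
always @(posedge clk or posedge reset) begin
  if (reset)
    count <= 32'd0;
  else
    count <= count + 32'd1;
end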

2) This one might also be a problem with Vivado. The Verilog generated for the attached Counter code (which just counts from 0 to 100) seems to assume a 3-port memory. It has an address dedicated to reading (mem0_load0_addr), used here:

assign mem0_load0_data = mem0[mem0_load0_addr]; // Counter_f.mlir:4:49

but also two store addresses, as seen here:

always @(posedge mem0_store0_clk) begin
  if (mem0_store0_en & mem0_store0_mask)
    mem0[mem0_store0_addr] <= mem0_store0_data; // Counter_f.mlir:4:49
end // always @(posedge)
always @(posedge mem0_store1_clk) begin
  if (mem0_store1_en & mem0_store1_mask)
    mem0[mem0_store1_addr] <= mem0_store1_data; // Counter_f.mlir:4:49
end // always @(posedge)

This causes Vivado to throw the error "[Synth 8-2913] Unsupported Dual Port Block-RAM template for mem0_reg". Commenting out one of the two aforementioned always blocks makes the error go away, but I am still not quite sure what effect that has on the functionality.
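
For what it's worth, a single-write-port version might look something like the sketch below, assuming both stores share one clock and the control logic never asserts both enables in the same cycle (which I have not verified):

// Hypothetical merge of the two store ports into one write port.
always @(posedge clk) begin
  if (mem0_store0_en & mem0_store0_mask)
    mem0[mem0_store0_addr] <= mem0_store0_data;
  else if (mem0_store1_en & mem0_store1_mask)
    mem0[mem0_store1_addr] <= mem0_store1_data;
end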

3) There seems to be some duplicated logic. Some of the control signals are seemingly generated more than once (I could be missing something). For example, the signals with + are duplicated:

...
wire mem0_load0_addr; // Counter_f.mlir:4:49 +
wire mem0_load0_en; // Counter_f.mlir:4:49 +
wire mem0_load0_clk; // Counter_f.mlir:4:49 +
wire [31:0] mem0_load0_data; // Counter_f.mlir:4:49
reg [31:0] mem0[0:0]; // Counter_f.mlir:4:49
wire mem0_store0_addr; // Counter_f.mlir:4:49 +
wire mem0_store0_en; // Counter_f.mlir:4:49 +
wire mem0_store0_clk; // Counter_f.mlir:4:49 +
wire [31:0] mem0_store0_data; // Counter_f.mlir:4:49
wire mem0_store0_mask; // Counter_f.mlir:4:49
wire mem0_store1_addr; // Counter_f.mlir:4:49 +
wire mem0_store1_en; // Counter_f.mlir:4:49 +
wire mem0_store1_clk; // Counter_f.mlir:4:49 +
wire [31:0] mem0_store1_data; // Counter_f.mlir:4:49
wire mem0_store1_mask; // Counter_f.mlir:4:49
wire mem0_load0_addr; // Counter_f.mlir:5:26 +
wire mem0_load0_en; // Counter_f.mlir:8:24 +
wire mem0_load0_clk; // Counter_f.mlir:11:25 +
wire mem0_store0_addr; // Counter_f.mlir:15:27 +
wire mem0_store0_en; // Counter_f.mlir:18:25 +
wire mem0_store0_clk; // Counter_f.mlir:21:26 +
wire mem0_store1_addr; // Counter_f.mlir:26:27 +
wire mem0_store1_en; // Counter_f.mlir:29:25 +
wire mem0_store1_clk; // Counter_f.mlir:32:26 +
...

I believe this duplication serves no practical purpose, because later in the code you can find the following assignments that simply tie them together:

assign mem0_load0_addr = mem0_load0_addr; // Counter_f.mlir:6:12, :7:7
assign mem0_load0_en = mem0_load0_en; // Counter_f.mlir:9:12, :10:7
assign mem0_load0_clk = mem0_load0_clk; // Counter_f.mlir:12:12, :13:7
assign mem0_store0_addr = mem0_store0_addr; // Counter_f.mlir:16:12, :17:7
assign mem0_store0_en = mem0_store0_en; // Counter_f.mlir:19:12, :20:7
assign mem0_store0_clk = mem0_store0_clk; // Counter_f.mlir:22:12, :23:7
assign mem0_store1_addr = mem0_store1_addr; // Counter_f.mlir:27:12, :28:7
assign mem0_store1_en = mem0_store1_en; // Counter_f.mlir:30:13, :31:7
assign mem0_store1_clk = mem0_store1_clk; // Counter_f.mlir:33:13, :34:7

4) There seems to be an issue where the input and output ports of the Verilog modules are not connected to internal signals. For example, in the attached MemRW code, we have a simple Standard program with three input arguments (write address, write data, read address) that returns mem[read address]. The generated Verilog has no syntax errors and structurally looks correct, but the input and output ports are not connected to any internal logic (see attached image).

image

5) I am still in the process of debugging this one and gathering more precise information, but I thought I would mention it as well. There seems to be some issue with net assignments where some wires are driven by multiple sources (see the screenshot below).

image

Counter.txt MemRW.txt

mikeurbach commented 3 years ago

I tried generating some Verilog as well, and linted it with Verilator. I think I can at least reproduce 4) and 5):

4) I see warnings from Verilator about unused and undriven signals that indeed seem to indicate these are not connected:

%Warning-UNDRIVEN: examples/linalg_dot.sv:350:15: Signal is not driven: 'mem0'
... other warnings
%Warning-UNUSED: examples/linalg_dot.sv:1037:17: Signal is not used: 'arg0_data'

5) I see similar warnings from Verilator:

%Warning-UNOPTFLAT: examples/linalg_dot.sv:1620:15: Signal unoptimizable: Feedback to clock or circular logic: 'dot._T_69_arg3_data'
... many more warnings

I am not able to reproduce 1) and 3) using the latest main. Is it possible these have been resolved already? I'm not sure what to do about 2) here...

There are also some other Verilator warnings I'm finding, but for now I think we should debug the above two issues and see where we are after they are resolved.

JuanEsco063 commented 3 years ago

@mikeurbach I just double checked and rebuilt everything with the latest main and I still see issues 1) and 3). I am using the following lowering steps and passes to generate Verilog:

circt-opt counter_std.mlir -create-dataflow -lower-handshake-to-firrtl | circt-opt -pass-pipeline='firrtl.circuit(firrtl.module(firrtl-lower-types))' | firtool --format=mlir -verilog

Using that command on the code from Counter.txt, you can see in the generated Verilog (attached) that on line 173 there is an "if (reset)" inside an initial block, which in itself is a bit odd. However, on line 193 you see the asynchronous behavior, where there is a line with always @(posedge clock or posedge reset).

Similarly, lines 79 and 94 both define the wire _mem0_load0addr. You can see this duplicated logic on other signals too.

Regarding 2), who would know more about why the logic requests 1 read port and 2 write ports even for a simple circuit like this?

counter.txt

mikeurbach commented 3 years ago

@JuanEsco063 I think there is another command to transform the FIRRTL to Verilog using CIRCT, which actually produces different results. Can you try using the firtool directly, like:

circt-opt counter_std.mlir -create-dataflow -lower-handshake-to-firrtl | firtool -format=mlir -lower-to-rtl -enable-lower-types -verilog

The -enable-lower-types flag is invoking the firrtl.circuit(firrtl.module(firrtl-lower-types)) pass. The difference is in the -lower-to-rtl flag.

The command you shared (which I think you got from me) goes from the FIRRTL dialect to the Verilog language directly. I think the preferred way to lower FIRRTL is actually to first transform it to the RTL dialect, and then emit it to Verilog. This is where the FIRRTL folks have been focusing effort, but the older path from FIRRTL directly to Verilog was the only way that worked for this case until recently.

If it doesn't crash, the command I pasted above should be the best way to get the lowering all the way to Verilog. When I said I couldn't reproduce the issues 1) and 3), it was using this command. The fact that the lowering produces different Verilog depending on whether it goes from FIRRTL to Verilog directly, or through the RTL dialect, is probably a non-issue (it's not my call, but I expect this is a "won't fix"). The ability to produce Verilog from FIRRTL directly isn't really being maintained... it's been mentioned that this path should go away.

I think the focus is on lowering FIRRTL to RTL, and outputting good Verilog from RTL, which is the path taken by the command above.

darthscsi commented 3 years ago

Lowering to RTL first will use always_ff blocks and merge them. It should give consistent reset behavior.

JuanEsco063 commented 3 years ago

@mikeurbach Thank you very much for those clarifications! Yes, that command works perfectly, and it solves all but issue 2), where Vivado can't synthesize the memory element as is. However, there is an easy fix. The TCL console gives this message:

image

So simply changing the definition of the "mem0" register from [31:0] mem0 [0:0] to [31:0] mem0[0:1] gets rid of this issue too. Of course, this is not "ideal" since we create a piece of memory that is larger than really needed, so the generated circuit is not optimal. This seems to indicate that a single register (i.e. an array of flip-flops, to be precise about what I have in mind, especially after last week's discussion) is still not fully supported.
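
In other words, the change is just:

reg [31:0] mem0 [0:0]; // as generated: a one-element memory
reg [31:0] mem0 [0:1]; // workaround: two elements, even though only index 0 is ever used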

@darthscsi See above. After lowering to RTL as @mikeurbach suggested I see the correct reset behavior as you mention.

mikeurbach commented 3 years ago

@JuanEsco063 glad to hear the preferred lowering path is almost working for you out of the box.

The log from Vivado is also really helpful, and I think I'm understanding now what is going on. The generated circuit has a "memory" that is really just storing a single 32-bit value. Vivado seems to handle this by trying to lower it to a single 32-bit register instead, and then fails because it isn't able to deal with the fact that there are multiple places this "memory"/register is written to.

I wonder if it might be possible to handle this as an optimization in CIRCT, rather than leaving it up to the synthesis tool (and watching the tool fail). We already have the ability to represent a 32-bit register like [31:0] mem0 in CIRCT, so we should be able to detect when the memory has only one element and transform the circuit to just use a register. This might require some finesse to handle the multiple writers (the thing that Vivado couldn't handle), and other issues, but it would be nice to see this work.
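
As a rough sketch of what I mean (reusing the port names from the generated Verilog, and assuming the handshake control never fires both stores in the same cycle, which I have not verified), the one-element memory could become something like:

// Hypothetical lowering of the one-element memory to a plain register.
reg [31:0] mem0;

assign mem0_load0_data = mem0; // read latency 0: a combinational read

always @(posedge clk) begin
  if (mem0_store0_en & mem0_store0_mask)
    mem0 <= mem0_store0_data;
  else if (mem0_store1_en & mem0_store1_mask)
    mem0 <= mem0_store1_data;
end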

There might also be another way to resolve this problem in CIRCT... the hack to change [31:0] mem0 [0:0] to [31:0] mem0[0:1] should really not be necessary....

JuanEsco063 commented 3 years ago

@mikeurbach Interesting. Do you have some sample code that would generate a single 32-bit register? Unless all I do is read and write back to a piece of memory, no matter how simple I make my code, the generated memory element has 1 read and 2 write ports, even for something that should just be interpreted as a register, such as a MAC kernel.

mikeurbach commented 3 years ago

I only mean that we have the ability to represent a register in CIRCT, not that there is any way to necessarily use it in the lowering from Standard to Handshake to FIRRTL right now. We can have registers in FIRRTL:

circuit Foo:
  module Foo:
    input clock: Clock
    reg myReg: UInt<32>, clock

These become registers in the generated Verilog when using firtool -format=fir -lower-to-rtl -verilog:

module Foo(
  input clock);

  reg [31:0] myReg; // <stdin>:4:5

I am looking at the memory case with one element as well and I am trying to see what we can do in the short term.

mikeurbach commented 3 years ago

I guess the simplest example that can work is something like this:

module {
  func @main() -> i8 {
    %c42 = constant 42 : i8
    %idx0 = constant 0 : index
    %reg = alloc() : memref<1xi8>
    store %c42, %reg[%idx0] : memref<1xi8>
    %val = load %reg[%idx0] : memref<1xi8>
    return %val : i8
  }
}

The single-element memory generated for the %reg memref is only read once and written once, and Vivado seems to be able to handle this case in the generated Verilog. However, if there is more than one store to %reg, like in Counter.txt, you will hit the issue.

JuanEsco063 commented 3 years ago

@mikeurbach I see what you are saying. Take a look at the attached code (MultiRW.txt). As you uncomment more store/load pairs, the generated Verilog has more and more store ports instead of detecting these are sequential assignments and old ports can be reused. Curiously, by the time you uncomment the 4th pair, you start getting this error (error.txt) when you try to generate Verilog.

MultiRW.txt error.txt

mikeurbach commented 3 years ago

detecting these are sequential assignments and old ports can be reused

That's a good point. The control logic should be generated in a way that memory accesses happen in the right order, so I think it should also be possible to share ports at the FIRRTL level. If that's valid, maybe this could even be a transformation at the Handshake level to merge the loads and stores.

JuanEsco063 commented 3 years ago

Of course, the compiler might be trying to squeeze out every ounce of data parallelism, and that is why it creates multiple ports, which in general is great. But in this case the address is set statically, so it should not even try to parallelize; it should detect sequential reads/writes to a single register/memory. Or, worst case, detect that only the last assignment is reflected in the return value and keep just that one. But this is amazing progress. Everything is almost working perfectly.

JuanEsco063 commented 3 years ago

It seems there is a deeper bug in the handshake signal generation when there is a loop in the standard code, resulting in one-shot behavior. That is, only a single load/store operation is performed during the execution of the program.

For example, the mac_std code works as intended (see image below) because there are no loops involved; all the signals are triggered externally. Note that I had to remove the initialization of the register in Standard and do it manually in Verilog to avoid the multiple store -> multiple store ports issue discussed previously, but that is not a big issue right now.

image

However, if the memory is used in a loop, whether it is a single-element memory whose stored value we want to update (as in counter_std) or an array we want to initialize to a certain value by sweeping all the memory locations (as in vect_init_std), the action is only performed once (see images below).

counter_std image

vect_std image

I have traced the bug down to this: at some point, the output data "ready" signal of the producing module is never asserted by the consumer modules receiving the data down the dataflow.

mac_std.txt counter_std.txt vect_init_std.txt

mikeurbach commented 3 years ago

Thanks for the detailed test and your digging @JuanEsco063. I will post here if I can reproduce this and if there is any bug/mitigation I find.

mikeurbach commented 3 years ago

the output data "ready" signal of the producing module is never asserted by the consumer modules receiving the data down the dataflow.

I think I'm seeing the same thing in simulation. Trying to follow the dataflow and see who isn't correctly propagating readiness upstream.

mikeurbach commented 3 years ago

I don't know the exact cause, but I'm still seeing some notable warnings from Verilator that I'm looking into based on vect_init_std.txt:

%Warning-BLKSEQ: /tmp/a:89:11: Blocking assignments (=) in sequential (flop or latch) block

This one seems like a special case, since other registers are generated as you'd expect. In the memory module, some wires got generated for the clock, which seems to have confused the logic as to whether the generated assign should be blocking.
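
For context, this class of warning is just about which assignment operator ends up inside a clocked block (q and d here are made-up names, not the generated ones):

// What BLKSEQ flags: a blocking assignment inside a sequential block.
always @(posedge clock)
  q = d;   // blocking: q updates immediately, which can race with other readers

// The expected form for sequential logic uses a non-blocking assignment.
always @(posedge clock)
  q <= d;  // non-blocking: right-hand sides are sampled before any updates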

%Warning-UNOPTFLAT: /tmp/a:940:8: Signal unoptimizable: Feedback to clock or circular logic: 'main._T_8.win'

Then, there are a lot of warnings like this. I have a feeling these are what's jamming up the ready network.

I'm trying to isolate at what level these issues really exist and if there is something to fix I'll file issues.

mikeurbach commented 3 years ago

The remaining issue is related to the Feedback to clock or circular logic warnings. I found that even with the -handshake-insert-buffer pass, there are still cycles. It appears the ready network is not having its cycle broken by the buffer. I think this can be its own issue to resolve, so I'll open a ticket.

mikeurbach commented 3 years ago

I made a patch to address the above buffer issue and fix another bug, but now I am running up against a Handshake scenario I'm not sure about. The attached file contains the sample program I'm using, which is basically the example vect_init_std above.

I'm compiling it like this before lowering to FIRRTL:

circt-opt -create-dataflow -canonicalize -cse -handshake-insert-buffer -canonicalize -cse -analyze-dataflow

I've taken the graph from -analyze-dataflow and outlined the part of the circuit I'm having issues with in black.

output

This is the top control merge, which takes control either from the entry point or from the loop after the store. The control result is forked to several MuxOps. My concern is specifically with handshake.mux_17. When I debug it in a simulator, I find that the select input, which comes from the fork, is valid for one cycle, and the mux consumes the select token from the fork as well as the selected data input, as expected.

The problem is that the select value can't be updated until all the other uses of the fork have been consumed, and one of those uses leads to the other data input of handshake.mux_17. The path I highlighted in red never becomes ready, and that initial fork is never fully consumed. This prevents the select of handshake.mux_17 from ever updating to index the result of the AddOp, and the circuit appears to deadlock after the first "loop iteration".

I'm not sure what to do about this. If I try to run this circuit through handshake-runner, that crashes. Notably, if I don't run the -handshake-insert-buffer pass, handshake-runner does not crash. But if I leave off this pass before lowering to FIRRTL, I get warnings from Verilator about feedback loops, and the simulation still deadlocks in the same way.

Is there something to do differently here during the -create-dataflow pass? It seems like somewhere along the red path we should just sink the initial token from the fork... but I'm not really sure.

test.txt

mikeurbach commented 3 years ago

If I try to run this circuit through handshake-runner, that crashes. Notably, if I don't run the -handshake-insert-buffer pass, handshake-runner does not crash. But if I leave off this pass before lowering to FIRRTL, I get warnings from Verilator about feedback loops, and the simulation still deadlocks in the same way.

For what it's worth, this is the same behavior with this test case as well: https://github.com/llvm/circt/blob/main/test/handshake-runner/simple_loop.mlir.

So, I don't think it has to do with the memory operations necessarily.

hanchenye commented 3 years ago

@mikeurbach This is interesting. Are there any hints as to why the red data input of handshake.mux_17 cannot become ready? Is this because the output of handshake.mux_17 is also not ready? What about the valid signal on this path?

So, I don't think it has to do with the memory operations necessarily.

If the simple_loop test case has the same behavior, maybe we can debug simple_loop to find a minimal piece of code that reproduces this issue.

mikeurbach commented 3 years ago

The output of handshake.mux_17 is valid and ready for the first cycle, so the mux fires and consumes the select token from the fork. After that, it is waiting for a new select token, but the fork cannot yet provide one because the red path will never become ready. Our lowering for MuxOp only marks the selected input ready, so the path in red cannot become ready until it has been selected, but the select won't change because the fork won't fire again until the red path is ready.
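
To make the circular wait concrete, the ready side of a two-input mux behaves roughly like this (an illustrative sketch with made-up signal names, not the actual lowering):

// Only the data input picked by the select sees a ready; the other stays low.
assign in0_ready = select_valid && (select_data == 1'b0) && out_ready;
assign in1_ready = select_valid && (select_data == 1'b1) && out_ready;
// The fork that produces select also feeds in1 via the red path, and it will
// not produce a new select token until in1_ready has gone high, so the two
// sides wait on each other forever.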

I will try to reduce a test case based on simple_loop to debug.

mikeurbach commented 3 years ago

@hanchenye I just shared the patches I've been carrying locally. I think both of those are necessary, but not sufficient.

I am able to reproduce the deadlock with the simple_loop example even when I include both patches.

I will keep digging on the simple example with no memories, but I wanted to share the patches first in the meantime.

Here's a preview of where I'm at: it looks to me that we are getting stuck in the initial states of the program. Notably, section 4 of this reference has some mention of buffering that I think may be relevant:

Such buffering is mandatory on loops in the network and on channels with initial tokens

I have been invoking the -handshake-insert-buffer pass to add buffers to loops, but it doesn't do anything about initial tokens, which sounds related to our situation. I am currently experimenting with how we can add buffering to prevent deadlock after the initial loop through the circuit.

mikeurbach commented 3 years ago

Following on from my previous comment, I made some tests with different buffering strategies. I tried buffering the initial valid token in each Handshake FuncOp, but the circuits deadlocked. I tried buffering the output of every fork (not sure why, just a guess), and the circuits deadlocked.

Finally, I made the following observation about "channels with initial tokens": essentially every channel in our circuits may see an initial token. So I tried buffering every single channel in the -handshake-insert-buffer pass, leading to the following circuit:

loop

And that worked! I was able to test the generated circuit in Vivado simulator, and it computed the correct result:

Screenshot from 2021-02-16 16-56-22

Obviously buffering every channel isn't ideal (it takes 8 cycles to compute the next counter value in this example), but it helps frame the solution space. The previous logic was not sufficient to prevent deadlocks, so the optimal buffering must lie somewhere between the previous implementation and this new extreme. I will share the patch I used to generate the complete buffering of every channel. This might be too conservative to land, but at least I will share it for pedagogical purposes.
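
For anyone following along, a single-slot buffer on one channel amounts to roughly the following piece of hardware (a sketch of the idea, not the code the pass actually generates):

module buffer1 #(parameter WIDTH = 32) (
  input              clock,
  input              reset,
  input              in_valid,
  output             in_ready,
  input  [WIDTH-1:0] in_data,
  output             out_valid,
  input              out_ready,
  output [WIDTH-1:0] out_data
);
  // One token of storage: "full" says whether the slot currently holds a token.
  reg             full;
  reg [WIDTH-1:0] data;

  assign in_ready  = ~full;  // accept a new token only when the slot is empty
  assign out_valid = full;   // offer the stored token downstream
  assign out_data  = data;

  always @(posedge clock) begin
    if (reset)
      full <= 1'b0;
    else if (in_valid & in_ready)
      full <= 1'b1;          // fill the slot
    else if (out_valid & out_ready)
      full <= 1'b0;          // drain the slot
    if (in_valid & in_ready)
      data <= in_data;       // capture the payload
  end
endmodule

A buffer like this cuts the combinational valid/ready path between producer and consumer, which is also part of why sprinkling buffers on every channel costs extra cycles.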

Now, I am going to take all three patches together and try out some more complex circuits with memories, loads, and stores.

mikeurbach commented 3 years ago

I have an update on this topic. In the patch I shared above, I updated the HandshakeInsertBuffer pass to only create a single buffer on each channel, when it was previously creating two buffers. The circuit I have been testing with vect_init was still deadlocking using that patch. I set it back to generating two buffers per channel, and kept the logic to generate buffers on every channel. With that, the vect_init program can work.

In the simulator, you can see that the circuit's top-level output is marked valid after some cycles, and the constant 2 has indeed been stored at each index in the memory:

Screenshot from 2021-02-18 12-08-57

I will keep testing with some more complex examples, but it is exciting to see the memory operations working in hardware. @JuanEsco063 if you want to try it out, you can do so by checking out this branch: https://github.com/llvm/circt/tree/handshake-buffer-all.

I've been thinking about this topic of "buffering" in Handshake and dataflow networks in general. There is obviously a lot of interesting research in this topic that we can apply here, and I'm personally planning on exploring some optimization approaches. For now, I'm wondering what is the simplest and most conservative approach that leads to correct circuits. I'm reviewing the literature, but does anyone involved here have a better grasp of what is theoretically necessary and sufficient?

My hope would be to have that approach implemented in the HandshakeInsertBuffer pass, and then add flags to try out other approaches as potential optimizations. I'm still digging on this scenario and specifically looking into how to handle the initial tokens. Hopefully the approach can be less heavy handed than #602.

I'm also considering the possibility that one or more of the Handshake ops simply isn't lowered properly, but I've been looking at the logic and simulation timing a lot recently, and I've only found that one bug in the MuxOp's decoder so far.

Finally, I have some ideas for potential canonicalizations at the Handshake level that I've discovered while staring at the GraphViz output from -analyze-dataflow. I'll open separate issues for those.

JuanEsco063 commented 3 years ago

@mikeurbach Sorry it took me so long to get back to you. I have been testing the branch you mentioned and with the new buffer insertion my test cases seem to be working so yay!

I even tested a simple matmul kernel (see attached code) and I am getting the right results (see below).

image

My next step is to test kernels with more exotic memory access patterns, such as SpMV. I still see the small bug where the generated memory RTL has as many ports as there are write operations to that memory in the code. I.e., if writes to memory A appear on 3 different lines, then the RTL will assume 3 different ports; if there are 4 lines with writes, then the memory will have 4 ports, etc.

matmul_std_lin.txt

mikeurbach commented 3 years ago

@JuanEsco063 thanks for testing it out. It's great you are seeing correct results! Hopefully we can get some integration tests for this flow established.

I still see the small bug where the generated memory RTL has as many ports as there are write operations to that memory

I think this can be considered the expected behavior right now, and we can treat adjusting this as an optimization. After a functional integration test is checked in, it would be easier to rework this and be confident in the correctness.

JuanEsco063 commented 3 years ago

I found what may possibly be a bug (I am not sure if you are aware of it and it is expected behavior). The logic seems to rely on asynchronous reads to function correctly. However, on the Xilinx architecture BRAM reads are synchronous, so if you replace the register-based memory with a BRAM block, the circuit gives the wrong results.

mikeurbach commented 3 years ago

The logic seems to rely on asynchronous reads to function correctly

This is exactly right. When we create a FIRRTL memory, it is currently set to have a write latency of 1 cycle, and a read latency of 0 cycles: https://github.com/llvm/circt/blob/main/lib/Conversion/HandshakeToFIRRTL/HandshakeToFIRRTL.cpp#L1684

Until two days ago, this was the only kind of memory that FIRRTL supported: https://github.com/llvm/circt/pull/585.

There was also a pass that landed yesterday to introduce "blackbox" external modules for memories: https://github.com/llvm/circt/pull/493.

I think there are a lot of interesting things we can do with memories these days, and I'm looking forward to the discussion in the weekly meeting today. As far as the Handshake lowering goes, I think we can consider it expected behavior for now. If you want to try something different, feel free to send a PR.

JuanEsco063 commented 3 years ago

I see! Those two are really helpful! Do you know if your fork from #602 works correctly with these two new merges? (Is there a way to get all 3? I don't know if your fork is synced with the main repo) Is the information about the syntax of the new passes in the documentation?

mikeurbach commented 3 years ago

I haven't updated my fork to include the latest main since those landed, but I can do that in a bit.

The new pass is documented here: https://circt.llvm.org/docs/Passes/#-firrtl-blackbox-memory-replace-all-firrtl-memories-with-an-external-module-blackbox

JuanEsco063 commented 3 years ago

That would be great! Thanks @mikeurbach. Memory operations seem to be shaping up. With the blackbox support and the non-zero read latency, plus the buffers, I think we can have a very flexible and working framework.

mikeurbach commented 3 years ago

@JuanEsco063 just updated my fork with the latest from the main branch. I've been on some performance issues, but I'm going to resume testing this flow.

JuanEsco063 commented 3 years ago

Great! Thank you @mikeurbach. With #585 and #674, everything now seems to be in place and working for the most part (I could be doing something silly).

@seldridge It seems something breaks when changing the read latency from 0 to 1 as supported in #585 (maybe because I am using the blackbox option added by @youngar in #493 AND the buffer insertion pass from #602).

See the attached zip for the files. I am working with a simple matmul example (matmul_std_lin.mlir). I first lower the code to the FIRRTL dialect by running:

circt-opt matmul_std_lin.mlir -create-dataflow -handshake-insert-buffer -lower-handshake-to-firrtl > matmul_firrtl_1.mlir

Then I edit matmul_firrtl_1.mlir to change the read latency to 1 and add the black box module as supported in #493 by running:

circt-opt matmul_firrtl_1.mlir -firrtl-blackbox-memory > matmul_firrtl_2.mlir

Last but not least, I emit Verilog:

firtool -format=mlir matmul_firrtl_2.mlir -lower-to-rtl -enable-lower-types -verilog > new_matmul_2.v

I then manually expose all the relevant control signals for memories (addr, data, en) all the way to the top level module as seen in blackboxsource.v

The issue is that the write enable signal is asserted one clock cycle too early, so the wrong value is stored. This one-cycle mismatch is probably due to the one-cycle delay of the read operation (instead of the original 0 latency). If I delay the WE and WAddr by one clock cycle, then everything works fine. See the figure below as reference.

image

We see the read signals work correctly: first we read back the value of the element in matrix C, and one clock cycle later (most likely to account for the read latency) we read from A and B (all these reads could happen in parallel, but that is a topic for another day). However, the write enable for C (WEC) is asserted one clock cycle before the right value is computed and reflected in dataCout. Delaying WEC (shown as WECr) and the corresponding address by one clock cycle ensures the right value is stored back in C and the computation finishes correctly.
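
The manual fix is just one register stage on the write-side control, something like the following (WEC and WECr match the waveform labels; WAddrC/WAddrCr are stand-ins for the write address, and the widths are illustrative):

reg       WECr;    // write enable for C, delayed one cycle
reg [9:0] WAddrCr; // write address for C, delayed one cycle
always @(posedge clock) begin
  WECr    <= WEC;
  WAddrCr <= WAddrC;
end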

seldridge commented 3 years ago

@JuanEsco063: I think there may be some zip files missing here.

I'm still processing this, but a couple of notes:

  1. The memory lowering and memory blackboxing are supposed to be separate things. A FIRRTL memory is either blackboxed (and eventually filled in with an FPGA-specific template or the output of a memory generator) or it's lowered to Verilog using some combination of what lower types does (breaking up aggregates and, with https://github.com/llvm/circt/pull/691, breaking up readwrite ports into separate read and write ports) and what lower to RTL does (adding read/write delays based on the read and write latency). These should compose, since a blackboxed memory won't be lowered.
  2. I'm surprised that changing the read latency has an effect on write behavior. A read latency of zero means that there's a combinational read. A read latency of one buffers the address/enable by one cycle. However, the write path is (supposed to be) unchanged here.
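
To illustrate the difference in point 2 with a quick sketch (made-up names, not the lowering's actual output):

// Read latency 0: a combinational read; data is visible in the same cycle
// the address is presented.
assign rdata_lat0 = mem[raddr];

// Read latency 1: the address (and enable) are registered, so data shows up
// one cycle after the address. The write path is unchanged in both cases.
reg [9:0] raddr_q; // width illustrative
always @(posedge clock)
  if (ren)
    raddr_q <= raddr;
assign rdata_lat1 = mem[raddr_q];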

If it looks like there's a problem with the memory lowering, it would probably be easiest to try to find a minimal test case showing what those passes are doing wrong. @drom is looking at fuzzing memories soon, and that may help shed some light on this.

mikeurbach commented 3 years ago

FYI, there are a couple items in flight, and I'm wondering if maybe they will address the behavior @JuanEsco063 reported above. No promises :) but they seemed related to memories and latencies, so perhaps we can test again when those changes are in.

https://github.com/llvm/circt/pull/717 https://github.com/llvm/circt/issues/746 https://github.com/llvm/circt/issues/750 https://github.com/llvm/circt/pull/753

JuanEsco063 commented 3 years ago

Hey @mikeurbach, sorry I have been MIA. Thank you for the updates! It seems there have been a lot of improvements, which is great. #717 seems to be the only one still not merged. I look forward to tomorrow's meeting.

mikeurbach commented 3 years ago

Yep, a lot has been going on. I have been on other tasks myself, but I do want to loop back to these tests in light of all the recent changes.

From my end, I am going to verify that we still need https://github.com/llvm/circt/pull/602 to prevent deadlocking. I'm pretty sure we still need something like that, in which case I'll merge the latest main into that branch for further experimentation.

I want to see if we can trim that pass down to do something a little less heavy handed with the buffer insertions. I am not making any promises, but I have a feeling we can address the situation without inserting buffers everywhere.

output

I've also added some canonicalizations that were pretty low-hanging fruit, so the generated Handshake graph is considerably simpler now. For example, if there is one loop, there should only be one ControlMergeOp. These simplifications should help when we are trying to decide where to place buffers; e.g. we might be able to just buffer the control output of each control merge, or something along those lines.

I chatted with @stephenneuendorffer about this, and it seems like the long term solution would be to have some sort of interfaces that allow operations to express their firing behavior, and guide this pass towards a more optimal buffer placement along the lines of this work: https://dl.acm.org/doi/10.1145/3373087.3375314. That would be really exciting stuff, but a lot of work. I'm hoping we can get there incrementally, starting with an improvement in #602.

JuanEsco063 commented 3 years ago

@mikeurbach I am revisiting old examples using the latest version of your branch #602 on the same code from my comment of Feb 26th (matmul_std_lin.mlir). Following the same steps, I am getting an error stating custom op 'alloc' is unknown when running:

circt-opt matmul_std_lin.mlir -create-dataflow -handshake-insert-buffer -lower-handshake-to-firrtl > matmul_firrtl_1.mlir

Did something change?

mikeurbach commented 3 years ago

Hmm, apparently. I was planning to revisit those same examples shortly and experiment with #602. I suspect it just needs to be rebased on master, but I will check.

JuanEsco063 commented 3 years ago

@mikeurbach thank you!

mikeurbach commented 3 years ago

I fixed one bug here 8801f6c758a62d12c2721051b993a8e1d149b063, which was causing canonicalization to break the dataflow graph. I'm also working on setting up an integration test using Vivado. Before going to the matmul example, I tried starting with just a simple loop. That program will not run without the change in #602, but it does complete with it. So, I am back to testing #602, and generally thinking about where to put buffers in this situation.

mikeurbach commented 3 years ago

One more update from my experiments: I updated the handshake-insert-buffers pass to always put a single slot buffer on the input to each control merge operation, and the circuit does not deadlock. This is in addition to the buffers necessary to break combinational cycles.

This points to the lowering of control merge to hardware as a culprit. It already has internal buffers, but the fact that I had to buffer the inputs to prevent deadlock seems to indicate the generated circuit is not actually latency insensitive. I'm going to ponder this and review the design of Figure 7 here, which I attempted to follow exactly in our implementation. I also have a copy of the FHW compiler that Richard Townsend kindly provided, so I can cross reference that.

I'm hopeful that we are one small tweak away from reliably generating hardware for circuits with loops.

stephenneuendorffer commented 3 years ago

Latency insensitivity and deadlock are separate issues. From a formal perspective, latency insensitivity is a property of designs with infinite-size fifos on each channel. If a design deadlocks when fifo sizes get smaller, then this is a proper approximation of the correct execution. It is very possible that we need to insert buffers in strategic places to avoid deadlock and to increase performance beyond that. One approach is here, for instance: https://www.cs.upc.edu/~jordicf/gavina/BIB/files/FPGA2020.pdf

mikeurbach commented 3 years ago

Got it, thanks for the correction. I guess the latency insensitivity of the circuit is not in question, we are just in a situation where the design deadlocks. Is that the correct way to phrase it?

It is very possible that we need to insert buffers in strategic places to avoid deadlock

This is what I'm hoping to achieve in the short term, before going into approaches like that paper. The way we are generating (and canonicalizing) Handshake, we end up with a ControlMergeOp for each basic block that has multiple predecessors (I think of this as a phi node, but I'm not sure if that abstraction leaks). Anyway, it sort of makes intuitive sense to me that we need to buffer the inputs to the ControlMergeOp: each input token that arrives may need to wait while the other execution path(s) occur. This is just me waving my hands; is there a more formal definition of this?

I'm not sure how this approach will generalize, but at least in the case of this circuit, it prevents deadlock. I will push the code I've been testing to #602 to better illustrate what I mean.

teqdruid commented 3 years ago

I should have mentioned earlier that the LI-BDN paper lays out a set of rules which guarantee the individual components won't cause deadlock: https://dspace.mit.edu/bitstream/handle/1721.1/58834/Vijayaraghavan-2009-Bounded%20Dataflow%20Networks%20and%20Latency-Insensitive%20Circuits.pdf?sequence=1&isAllowed=y I found it very helpful in a past life while working on LLPM.


mikeurbach commented 3 years ago

Thanks for the more formal reference @teqdruid. I think I've come across that before but I'm reading it for real now. I think we have been pretty careful to preserve the NED and SC properties within the implementation of each component, but more tests/formal reasoning about them would be great.

I think the issue I'm grappling with is related to the composition of components. In that paper, Lemma 3 mentions deadlock freedom "given infinite sinks for outputs and infinite source for inputs". But that is always the rub: there are no infinite FIFOs in real life.

I'll try to confirm the ControlMergeOp isn't violating the required properties. If it isn't, I think we just need to make sure the "infinite fifos" feeding ControlMergeOp are at least of size 1 in reality, like what I put in #602. I haven't worn a formal hat since undergrad, but maybe this is something I could prove...

mikeurbach commented 3 years ago

Did something change?

@JuanEsco063 yes, looks like MLIR has changed since you wrote that example. The alloc, load, and store ops now live in the memref dialect: https://llvm.discourse.group/t/rfc-split-the-memref-dialect-from-std/2667. At first glance, I think that just means you need to add a memref. prefix to all of them.

JuanEsco063 commented 3 years ago

@mikeurbach I don't know how I missed this reference. Sorry for the delay. You were correct! Adding memref. to all memory-related operations solved the issue. However, I have a few comments: 1) Perhaps a bit off-topic: I don't see any information at https://circt.llvm.org/docs/Dialects/ about the memref dialect. Am I looking at a wrong/outdated reference? I assume eventually there will be a pass or lowering step to convert regular alloc operations in Standard to the memref.alloc operation.

2) I am still having issues with your branch #602 (see image below). I am currently not initializing the memories, but as you can see, the circuit is not generating the correct memories or triggering the read/write enable signals. However, it is doing the addition (the C matrix always outputs 3, and A and B always output 2, so one accumulation is 2*2+3=7, which is the result we see in din for C).

image

3) Similarly, if I switch to the main branch now, after the merge from #864, the memories are still not getting the right control signals/memory addresses (see below). Again, the accumulation is correct.

image

4) Interestingly, in either branch, I have removed all initialization of the matrices in the std code, leaving a single read operation on matrices A and B and a single read and a single write operation on C, and yet the RTL for the C matrix has 1 read port and 2 write ports, which fails to elaborate since it would mean a 3-port memory. The other memories are generated as expected.

5) Last but not least, I am catching up on a few other threads such as #745. Having something like the pass idea @JianyiCheng proposed and showed in the example code, allowing both memory interfaces (for now we can imagine simple address, data, wen signals) coming from a higher-level circuit and internally defined memory elements, would be great.

mikeurbach commented 3 years ago

Hi @JuanEsco063 thanks as always for taking a look and the detailed traces. Let me try to address some of these points.

  1. The memref dialect does not live in CIRCT; it is part of MLIR itself: https://mlir.llvm.org/docs/Dialects/MemRef/. There is no longer such a thing as "regular alloc operations in standard"; that exact operation is now "memref.alloc". This is part of an ongoing effort in MLIR to break up the "standard" dialect into smaller, more domain-specific dialects.
  2. I have been testing your matmul example, and similar examples, and I see the same thing. The current addition in #602 is enough to allow correctly functioning circuits in some cases. I wrote test cases that use Vivado that have a loop or talk to memory, but I am still experimenting with more complex cases like yours. I think I need to test a) multiple nested loops, and b) loops (single or nested) with memory. I think I can write a test to show a single nested loop with memory works... that's what I'm trying now. I have a feeling there are some funny interactions with multiple loops around a memory.
  3. I haven't taken a look on the latest main, but the memory test I put above should work on it. If it doesn't, I'll fix it before that PR ever lands. The loop test will require #602 for sure.
  4. I'm not sure exactly what happened in this particular case, but this is something I still think about. With the way Handshake generates control signals, it should be possible to share a single port to each memory. This would be an interesting area to research, and we could potentially develop a separate pass to enable such a transformation. Practically speaking, if the goal was to run all memory accesses through a single port, that would open up the option to expose the second port on an FPGA BRAM, which I know you were interested in.
  5. Definitely! This is an interesting topic, and came up again today in the LATTE workshop. Having the right abstractions for memories within CIRCT would be really great. In case you missed it, the discussion in #745 spilled out into the forum: https://llvm.discourse.group/t/rfc-handshake-top-level-memory-handling/2994.

Anyway, the bottom line is I am still homing in on the right places to put buffers in Handshake circuits. Putting them on each cycle in the dataflow graph is enough to prevent physical feedback loops, but not enough to allow a program with a loop. The current approach in #602 enables a simple loop, but the circuit still appears to deadlock for more complex cases. The original approach in #602 was enough to support the more complex matrix multiplications, but it was quite heavy-handed. At some point, I think we will just need to find a decent approach that works, and consider more interesting buffer placement (like the work Stephen linked above) an open area for future work.

JuanEsco063 commented 3 years ago

@mikeurbach I apologize for the long reply times. I have been doing some basic tests again, and I feel I must be missing something because things don't seem to be working on either main or handshake-buffer-all. Going point by point:

1) Oh I see. So the framework I have been using to generate Standard needs to accommodate this, and we need to add the extra pass/lowering to the memref dialect.

2,3,4) I think I must be doing something wrong on my end. Something seems to be very broken, because I can't even get a simple MAC operation to work correctly. The MAC module in question has 2 inputs; it multiplies them and adds the result to the previous value, which is the output. In my testbench I am just setting both inputs to 1, effectively making it a counter, but there is no internal loop. Starting from the Standard code in mac_std.txt (attached), I run:

circt-opt mac_std.mlir -create-dataflow -handshake-insert-buffer -lower-handshake-to-firrtl > mac_f.mlir

I change the read latency to 1 (in my case to match the behavior of a BRAM) and then to generate Verilog run:

firtool -format=mlir mac_f.mlir -lower-to-rtl -enable-lower-types -verilog > mac.v

I have attached the generated file as mac.txt too.

I started digging, and the problem seems to be in the generation of the control signals for the memory module. As you can see in the figure below, the arg0 and arg1 valid signals, corresponding to data_in valid and addr valid, are asserted on a falling edge of the clock (a) and last only until the next rising edge (b), not even a whole clock cycle. THEN, on that rising edge (b), we get the result of the multiplication at the input (albeit calculated correctly), which is lost by the next falling edge (c), and since the memory is updated on every rising edge, the wrong value is stored.

Original_MAC

After trying a few things, the only way I could get it to work is by doing the following: a) Register the _T_8 signal in handshake_memory_3ins_3outs_ui32_id0 internally at the positive edge:

// Capture the store enable (_T_8) on the rising edge; the delayed write below uses this registered copy.
reg r_T_8;
always @(posedge clock)
  r_T_8 <= _T_8;

b) Split the always block of handshake_memory_3ins_3outs_ui32_id0 at line 77 in two. The first of the new always blocks pipelines the address and enable signals for the read at the negative edge, i.e.:

  always @(negedge clock) begin // MAC_f.mlir:20:35
    mem0_load0_en_pipe[_T_6] <= arg2_valid; // MAC_f.mlir:20:35
    if (arg2_valid) // MAC_f.mlir:20:35
      mem0_load0_addr_pipe[_T_6] <= _T_5;   // MAC_f.mlir:20:35
  end

c) The second always block stores the data using the registered _T_8 signal at the negative edge, i.e.:

  always @(negedge clock) begin
    if (r_T_8)  // MAC_f.mlir:20:35
      mem0[_T_4] <= arg0_data;  // MAC_f.mlir:20:35
  end

I am not sure if this is logically equivalent to another (simpler) operation. However, after making the above changes, I get the system to work, as seen below.

Modified_MAC

I just tried the Counter code that I posted before, and indeed it is not working. I tried applying the same fix, but it seems the control signals only trigger once.

Also, thank you for CCing me on those other issues. They have been really interesting, and the one about XMR seems very promising!

mac_std.txt mac.txt