rdaly525 / coreir

BSD 3-Clause "New" or "Revised" License
100 stars 24 forks source link

Collapse inlined operators #626

Open rsetaluri opened 6 years ago

rsetaluri commented 6 years ago

Even with inlining, we have bloat of wires in output verilog. See a simple add:

magma:

Top = m.DefineCircuit("Top", "I0", m.In(m.UInt(8)), "I1", m.In(m.UInt(8)), "O", m.Out(m.UInt(8)))
sum_ = Top.I0 + Top.I1
m.wire(sum_, Top.O)
m.EndCircuit()

verilog:

module global_Top (
  input [7:0] I0,
  input [7:0] I1,
  output [7:0] O
);

  // Inlined from inst0(width:8)() : coreir.add
  wire [7:0] inst0__in0;
  wire [7:0] inst0__in1;
  wire [7:0] inst0__out;
  assign inst0__out = inst0__in0 + inst0__in1;

  assign inst0__in0[7:0] = I0[7:0];

  assign inst0__in1[7:0] = I1[7:0];

  assign O[7:0] = inst0__out[7:0];

endmodule  // global_Top

The ideal verilog is

module global_Top (
  input [7:0] I0,
  input [7:0] I1,
  output [7:0] O
);

  assign O[7:0] = I0[7:0] + I1[7:0];

endmodule  // global_Top

This problem is exaggerated if we have bigger expressions, e.g. res = ~(a + b) + c or something. One way to do this is to explore the magma -> verilog path directly and generate these statements there, but I'm a strong proponent of always going through CoreIR so we should fix it there.

I believe it should be tractable to figure out these paths, and flatten them into fairly simple verilog statements. Basically, any contiguous chains of inlined expressions can be flattened. For each output we can trace it back to a set of inputs, between each internal module instance, all logic is inlined and can be flattened. We do lose some of the debuggability, namely the comments saying where we inline from or tracing back to magma lines (if the logic is implemented across several lines of python). So perhaps we can have inlining levels, i.e. --inline=1 does the thing we have now, --inline=2 does the thing I'm proposing, etc.

rdaly525 commented 5 years ago

Definitely seems possible. I think one possible good "generatory" way to do this is for CoreIR to add something like a comb_AST generator.

I am imagining this generator where you pass in an AST of combinational CoreIR operators (using JSON). When outputting to verilog, CoreIR can just generate better verilog directly from this AST (either assign or always_comb blocks.).

CoreIR can then have a pass to fuse all comb ops into this AST op, so front-ends can either pass down CoreIR normally, or directly instance the comb_AST generator. @rsetaluri @leonardt thoughts?

rsetaluri commented 5 years ago

I think this is a good strategy. So to be clear, you want to implement 2 things:

This way people can pass normal CoreIR and the first pass will flatten those combinational paths, or this can be handled in the front end as well.

leonardt commented 5 years ago

FYI I have an initial patch that supports this by extending the existing inlining logic. It works for the basic case in this issue, but I discovered some subtle issues with a more complex circuit. Working on that.

I think this is possible without having to add the separate pass for fusion and having to construct an AST representation, but doesn't preclude that implementation if we decide it's a more general solution. I may change my mind as I dig deeper into the issues I'm facing with the more complex circuits (mainly ports that are not a bulk connection, so wiring 1 bit to an adder input low bit and 6 bits to the same adder input hight bits).

Here's a pseudo algorithm:

queue = output ports of current definition
while queue is not empty
    output = queue.pop()
    if output is driven by instance that can be inlined
        recursively inline any inputs to instance if possible, otherwise use standard input wire
        inline the instance and have it drive the output
        add any non-inlined inputs of the instance to the queue
    else
       emit code normally for the output and instance
       add the inputs of the instance to the queue
rsetaluri commented 5 years ago

Very nice!! Can you briefly explain what the complications are?

leonardt commented 5 years ago

Actually, I thought about it more and I think my proposed algorithm might actually be the basis for the fusion pass, it just implements the inlining logic immediately rather than decoupling it into two phases, so it should be possible to factor it out into a separate phase if we find that useful

leonardt commented 5 years ago

Right now, the logic takes the input wire and regexes it into the definition, so if we had

module corebit_and (
  input in0,
  input in1,
  output out
);
  assign out = in0 & in1;

endmodule //corebit_and

and

  wire  inst0__in0;
  wire  inst0__in1;
  wire  inst0__out;
  corebit_and inst0(
    .in0(inst0__in0),
    .in1(inst0__in1),
    .out(inst0__out)
  );
  assign inst0__in0 = I[0];
  assign inst0__in1 = I[1];
  assign O = inst0__out;

we'd get something along the lines of

  assign inst0__out = inst0__in0 & inst0__in1
  assign inst0__in0 = I[0];
  assign inst0__in1 = I[1];
  assign O = inst0__out;

My first pass attempt at this was to take the other end of the connections, and inline that into the definition rather than the wire name (which you can think of the instance end of the connection), which results in something like

  assign O = I[0] & I[1]

The problem arises when inst0_in0 isn't just a single element that we can inline, for example if instead it was driven by partial connections. Suppose we had some connections like this (and the add module changes to be width 2)

  assign inst0__in0[0] = I[0];
  assign inst0__in0[1] = I[1];
  assign inst0__in1[0] = I[0];
  assign inst0__in1[1] = I[1];

Then, we would need to create an inlining that was along the lines of

  assign O = {I[0], I[1]} & {I[0], I[1]}

The issue is that when we were inlining the wire names, we could do a simple search and replace in the definition. Now that we're inlining connections, we have to do some logic to do packing where necessary for these non-bulk connections.