llvm / circt

Circuit IR Compilers and Tools
https://circt.org
Other
1.62k stars 281 forks source link

[FIRRTL][HW] Feature Request to emit FIRRTL Memories Inline #5681

Open seldridge opened 1 year ago

seldridge commented 1 year ago

FIRRTL memories that do not go through the --repl-seq-mem path to replace them with blackboxes always produce a wrapper module around the memory. However, this can result in poorer performance for Vivado tooling for designs which rely on Vivado optimizations to optimize certain memory-to-memory paths.

For more information, see the discussion here: https://matrix.to/#/!jSbgpvvpsPbLpsLZOC:matrix.org/$NRsJEcgzZH0ghUHWnRG00OIqkHOLIIutn9WShxiNK6E?via=matrix.org&via=gitter.im&via=austin-harris.com

Add an option that enables inline emission of memories.

Example

Input FIRRTL:

FIRRTL version 3.0.0
circuit Foo:
  module Foo:
    input r: {addr: UInt<3>, en: UInt<1>, clk: Clock, flip data: UInt<32>}
    input w: {addr: UInt<3>, en: UInt<1>, clk: Clock, data: UInt<32>, mask: UInt<1>}

    mem memory :
      data-type => UInt<32>
      depth => 8
      read-latency => 1
      write-latency => 1
      reader => r
      writer => w
      read-under-write => undefined

    connect memory.r, r
    connect memory.w, w

Current output Verilog:

module memory_8x32(
  input  [2:0]  R0_addr,
  input         R0_en,
                R0_clk,
  input  [2:0]  W0_addr,
  input         W0_en,
                W0_clk,
  input  [31:0] W0_data,
  output [31:0] R0_data
);

  reg [31:0] Memory[0:7];
  reg        _GEN;
  reg [2:0]  _GEN_0;
  always @(posedge R0_clk) begin
    _GEN <= R0_en;
    _GEN_0 <= R0_addr;
  end // always @(posedge)
  always @(posedge W0_clk) begin
    if (W0_en)
      Memory[W0_addr] <= W0_data;
  end // always @(posedge)
  assign R0_data = _GEN ? Memory[_GEN_0] : 32'bx;
endmodule

module Foo(
  input  [2:0]  r_addr,
  input         r_en,
                r_clk,
  input  [2:0]  w_addr,
  input         w_en,
                w_clk,
  input  [31:0] w_data,
  input         w_mask,
  output [31:0] r_data
);

  memory_8x32 memory_ext (
    .R0_addr (r_addr),
    .R0_en   (r_en),
    .R0_clk  (r_clk),
    .W0_addr (w_addr),
    .W0_en   (w_en & w_mask),
    .W0_clk  (w_clk),
    .W0_data (w_data),
    .R0_data (r_data)
  );
endmodule

Desired output Verilog:

module Foo(
  input  [2:0]  r_addr,
  input         r_en,
                r_clk,
  input  [2:0]  w_addr,
  input         w_en,
                w_clk,
  input  [31:0] w_data,
  input         w_mask,
  output [31:0] r_data
);

  reg [31:0] Memory[0:7];
  reg        _GEN;
  reg [2:0]  _GEN_0;
  always @(posedge r_clk) begin
    _GEN <= r_en;
    _GEN_0 <= r_addr;
  end // always @(posedge)
  always @(posedge w_clk) begin
    if (w_en)
      Memory[w_addr] <= w_data;
  end // always @(posedge)
  assign r_data = _GEN ? Memory[_GEN_0] : 32'bx;

endmodule
seldridge commented 9 months ago

I missed this when filing the original issue. There is a mechanism to do this today using a documented annotation that has no explicit Chisel API. This converts memories to registers of vectors in the FIRRTL pipeline:

FIRRTL version 3.0.0
circuit Foo: %[[
  {"class": "sifive.enterprise.firrtl.ConvertMemToRegOfVecAnnotation$"}
]]
  module Foo:
    input r: {addr: UInt<3>, en: UInt<1>, clk: Clock, flip data: UInt<32>}
    input w: {addr: UInt<3>, en: UInt<1>, clk: Clock, data: UInt<32>, mask: UInt<1>}

    mem memory :
      data-type => UInt<32>
      depth => 8
      read-latency => 1
      write-latency => 1
      reader => r
      writer => w
      read-under-write => undefined

    connect memory.r, r
    connect memory.w, w

This produces the following Verilog:

// Generated by CIRCT firtool-1.59.0-92-g496930cf1
module Foo(
  input  [2:0]  r_addr,
  input         r_en,
                r_clk,
  output [31:0] r_data,
  input  [2:0]  w_addr,
  input         w_en,
                w_clk,
  input  [31:0] w_data,
  input         w_mask
);

  reg  [31:0]      memory_0;
  reg  [31:0]      memory_1;
  reg  [31:0]      memory_2;
  reg  [31:0]      memory_3;
  reg  [31:0]      memory_4;
  reg  [31:0]      memory_5;
  reg  [31:0]      memory_6;
  reg  [31:0]      memory_7;
  reg  [2:0]       addr;
  wire [7:0][31:0] _GEN =
    {{memory_7},
     {memory_6},
     {memory_5},
     {memory_4},
     {memory_3},
     {memory_2},
     {memory_1},
     {memory_0}};
  always @(posedge r_clk) begin
    if (w_en & w_mask & w_addr == 3'h0)
      memory_0 <= w_data;
    if (w_en & w_mask & w_addr == 3'h1)
      memory_1 <= w_data;
    if (w_en & w_mask & w_addr == 3'h2)
      memory_2 <= w_data;
    if (w_en & w_mask & w_addr == 3'h3)
      memory_3 <= w_data;
    if (w_en & w_mask & w_addr == 3'h4)
      memory_4 <= w_data;
    if (w_en & w_mask & w_addr == 3'h5)
      memory_5 <= w_data;
    if (w_en & w_mask & w_addr == 3'h6)
      memory_6 <= w_data;
    if (w_en & w_mask & (&w_addr))
      memory_7 <= w_data;
    addr <= r_addr;
  end // always @(posedge)
  assign r_data = _GEN[addr];
endmodule

It may be alternatively reasonable to convert this from an annotation to a command line option or to just add a Chisel API for it.

sterin commented 9 months ago

Vivado does not infer BRAMs from this Verilog style, even though register reads and writes are all mutually exclusive. Vivado is very picky about the style of Verilog it is able to infer memories from. Even the slightest deviation from the recommended style tends to prevent BRAMs from being inferred.

For small memories it is fine to use registers, but, but for large ones it is critical that Vivado is able to infer a cascade of BRAMs, and merge it with a shift register at the output.

sterin commented 9 months ago

For future reference, here are some relevant parts from the Vivado design methodology guide

Using an output register is required for designs operating at higher clock frequency, and is recommended for all designs to ease timing closure. This improves the clock to output timing of the block RAM. Additionally, a second output register is beneficial, as slice output registers have faster clock to out timing than a block RAM register. Having both registers has a total read latency of 3. When inferring these registers, they should be in the same level of hierarchy as the RAM array. This allows the tools to merge the block RAM output register into the primitive.

https://docs.xilinx.com/r/2023.1-English/ug949-vivado-design-methodology/Performance-Considerations-When-Implementing-RAM

AMD recommends that the memory and the output registers are all inferred in a single level of hierarchy, because this is the easiest method to ensure inference is as intended. There are two scenarios that will infer a block RAM output register. The first one is when an extra register exists on the output, and the second is when the read address register is retimed across the memory array. This can only happen using single port RAM. This is illustrated below:

https://docs.xilinx.com/r/2023.1-English/ug949-vivado-design-methodology/Scenarios-Preventing-Block-RAM-Output-Register-Inference

seldridge commented 9 months ago

Fair point. Thanks for checking that!