google / xls

XLS: Accelerated HW Synthesis
http://google.github.io/xls/
Apache License 2.0

Codegen: bug: RAM: inconsistent behavior with different RAM latency settings #1047

Open rdob-ant opened 1 year ago

rdob-ant commented 1 year ago

I wrote a small test proc to experiment with 1R1W RAMs since they're required for #995.

TL;DR: Using it, I've observed that the relative sequencing of read and write transactions produced by the XLS-generated Verilog module over the RAM interface is sometimes inconsistent; that the codegen RAM latency setting seems to have little correlation with the actual delay between the rd_en strobe and the sampling of rd_data; and that a RAM latency setting of 0 appears to break the module entirely.

Setup

The proc behaves as follows, where mem is a 1R1W (semi-dual-port) RAM and o_data is an extra output channel used for debugging and for stalling the pipeline:

addr = state;
rd_byte = mem[addr];
send(o_data, (addr,rd_byte) );
mem[addr+1] = rd_byte;
next_state = addr+1;

The Verilog module is generated from the above proc with the ram_configurations codegen option. It is then instantiated from a hand-written Verilog test module. The test is run using iverilog, and the generated .vcd waveform file is manually analyzed in GTKWave. The test module's behavior can be summarized as follows:

* Generate free-running clock `clk`.
* o_data_rdy generator:
  1. init to 0
  2. after 14 clock cycles set to 1
* ram_rd_data generator:
  1. init to 0
  2. increment by 1 on each clock cycle
* test flow:
  1. assert `rst` for 5 clock cycles, deassert `rst`
  2. run for 1000 clock cycles
  3. finish simulation

After running the test, I look at the ram_rd, ram_wr and o_data signals and check the following:

Case 1 (RAM latency = 0)

Observations:

The XLS-generated module does not do anything: RAM write transactions and o_data sends never happen, even though the DSLX proc must at least send some data to o_data. It looks as if the module's pipeline is permanently stalled, while it should not be.

(waveform screenshot)

Case 2 (RAM latency = 1)

Observations:

  1. ram_rd_data is sampled on the second clock edge following assertion of ram_rd_en. This is as expected.
  2. A write to address k+1 (from the N-th iteration of next()) and a read from address k+1 (from the (N+1)-th iteration of next()) happen in the same clock cycle when the pipeline is not stalled, and the read is delayed when the pipeline is stalled. The intention of the DSLX code is that the read at the beginning of next() should happen only after the write at the end of the previous next() 'iteration' has completed - not simultaneously with it. This is also visible in the opt-IR: next() is 'called' with the token returned from recv(ram_wr_resp), meaning that send(ram_rd_req) should not happen before recv(ram_wr_resp) completes. The fact that the read sometimes coincides with the write and sometimes is delayed w.r.t. the write forces one to assume write-before-read RAM semantics (see the RAM model sketch below this case's waveform) and to use a corresponding RAM IP block, which is not always available. However, see the cases below - the situation gets worse for higher-latency RAMs.
  3. Writes to addresses k+1 and even k+2 happen before the o_data send transaction for address k concludes. This is not a real problem and can be tolerated as long as the RAM writes do not interact with the o_data channel in circuitry external to the module. Nevertheless, my understanding of the DSLX code is that RAM writes should execute only after the send(o_data) transaction completes.
  4. Performance note: irrespective of the pipeline depth (I've tried 3 and 20), RAM reads are not issued back-to-back: ram_rd_en is asserted for at most one clock cycle at a time and is re-asserted only after ram_rd_data has been sampled. This happens even when the module performs only RAM reads, i.e. when the RAM write and o_data channels are disabled (using send_if/recv_if with the predicate set to false). Although a RAM with 1-cycle latency should be the fastest synchronous RAM available, the throughput of the XLS-generated module with such a RAM turns out to be limited to one read per 3 clock cycles, while the RAM itself is capable of one read per cycle.

(waveform screenshot)
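
For reference, here is a minimal sketch of the 1R1W RAM model that observation 2 forces one to use: 1-cycle read latency with write-before-read semantics, so a read and a write that collide on the same address in the same cycle return the newly written data. The module name, 8-bit widths, and 256-entry depth are illustrative assumptions; only the port roles follow the DUT signals above.

```
// 1R1W RAM, 1-cycle read latency, write-before-read (write-first) semantics.
module ram_1r1w_wbr (
    input  wire       clk,
    input  wire       rd_en,
    input  wire [7:0] rd_addr,
    output reg  [7:0] rd_data,
    input  wire       wr_en,
    input  wire [7:0] wr_addr,
    input  wire [7:0] wr_data
);
    reg [7:0] mem [0:255];

    always @(posedge clk) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
        if (rd_en) begin
            // Write-before-read: forward a colliding write so the read
            // returns the data being written in this same cycle.
            if (wr_en && wr_addr == rd_addr)
                rd_data <= wr_data;
            else
                rd_data <= mem[rd_addr];
        end
    end
endmodule
```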

Case 3 (RAM latency = 2)

Observations:

The behavior is mostly the same as in Case 2, except for the following:

  1. ram_rd_data is sampled at the same time as before - on the second clock edge after the assertion of ram_rd_en. This means XLS still assumes a RAM latency of 1, the same as in Case 2, while our intention was a latency of 2, so the sampling should happen one cycle later (see the latency-2 RAM sketch below this case's waveform).
  2. (possibly a consequence of point 1) Reads from address k+1 happen before writes to k+1. This breaks the intended behavior of the DSLX code, which assumes that writes happen before reads.
  3. Performance: the generated module now performs 2 RAM reads per 3 clock cycles.

(waveform screenshot)
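
For comparison, a sketch of what a latency-2 read port typically looks like: the write-before-read model above extended with one extra output register, so rd_data becomes valid one cycle later than in the latency-1 sketch. With the latency set to 2 in ram_configurations, XLS would need to delay its sampling of ram_rd_data by this extra cycle - which, per observation 1, it does not. Widths and depth are the same illustrative assumptions as before.

```
// 1R1W RAM with a 2-cycle read latency: array read in stage 1, output
// register in stage 2.
module ram_1r1w_lat2 (
    input  wire       clk,
    input  wire       rd_en,
    input  wire [7:0] rd_addr,
    output reg  [7:0] rd_data,
    input  wire       wr_en,
    input  wire [7:0] wr_addr,
    input  wire [7:0] wr_data
);
    reg [7:0] mem [0:255];
    reg [7:0] rd_stage1;

    always @(posedge clk) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
        if (rd_en)
            rd_stage1 <= mem[rd_addr];  // stage 1: array read
        rd_data <= rd_stage1;           // stage 2: output register
    end
endmodule
```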

Case 4 (RAM latency = 3)

Observations:

  1. The sampling of ram_rd_data is unpredictable and seems to be broken. IMO, this makes it impossible to build a deterministic Verilog RAM module that will interoperate with the XLS-generated RAM interface. See the data sent to o_data:
    • 0x2002 - read @ 0x20 reads 0x02 - the sampling is on the second clock edge (same as with RAM latency 1)
    • 0x2103 - read @ 0x21 reads 0x03 - the same
    • 0x2204 - read @ 0x22 reads 0x04 - the same
    • 0x230D - read @ 0x23 reads 0x0D - the sampling is on the 10th clock edge after ram_rd_en!
    • 0x240E - read @ 0x24 reads 0x0E - the sampling is on the 3rd clock edge.
  2. Same as before: reads from address k+1 and even k+2 happen before writes to k+1.
  3. Performance: it seems that in this case XLS can schedule a RAM read on every cycle, potentially utilizing the full RAM bandwidth (assuming the above issues can be fixed).

(waveform screenshot)

Effects of the above observations

Code

DSLX (ram_test.x)

```
// NOTE: the generic parameters on the ram:: types and the chan<> declarations
// were mangled by the issue formatting; they are reconstructed here assuming
// an 8-bit address, 8-bit data, single-partition RAM.
import std
import xls.examples.ram as ram

type RamWriteReq = ram::WriteReq<u32:8, u32:8, u32:1>;
type RamWriteResp = ram::WriteResp;
type RamReadReq = ram::ReadReq<u32:8, u32:1>;
type RamReadResp = ram::ReadResp<u32:8>;

struct OutData {
    addr: u8,
    data: u8,
}

proc ram_test {
    o_data: chan<OutData> out;
    o_ram_write_req: chan<RamWriteReq> out;
    i_ram_write_resp: chan<RamWriteResp> in;
    o_ram_read_req: chan<RamReadReq> out;
    i_ram_read_resp: chan<RamReadResp> in;

    init { u8:0x20 }

    config(
        o_data: chan<OutData> out,
        o_ram_write_req: chan<RamWriteReq> out,
        i_ram_write_resp: chan<RamWriteResp> in,
        o_ram_read_req: chan<RamReadReq> out,
        i_ram_read_resp: chan<RamReadResp> in,
    ) {
        (o_data, o_ram_write_req, i_ram_write_resp, o_ram_read_req, i_ram_read_resp)
    }

    next(tok: token, addr: u8) {
        let tok = send(tok, o_ram_read_req, ram::ReadWordReq<u32:1>(addr));
        let (tok, rd) = recv(tok, i_ram_read_resp);
        let tok = send(tok, o_data, OutData { addr: addr, data: rd.data });
        let tok = send(tok, o_ram_write_req, ram::WriteWordReq<u32:1>(addr + u8:1, rd.data));
        let (tok, _) = recv(tok, i_ram_write_resp);
        addr + u8:1
    }
}
```
Verilog (ram_test_test.v)

```
`timescale 1ns/1ns

module test;
    reg clk = 0;
    reg rst = 0;

    wire [15:0] o_data_data;
    wire        o_data_vld;
    reg         o_data_rdy = 0;

    reg  [7:0] ram_rd_data;
    wire [7:0] ram_rd_addr;
    wire       ram_rd_en;
    wire [7:0] ram_wr_addr;
    wire [7:0] ram_wr_data;
    wire       ram_wr_en;

    ram_test dut (
        .clk(clk),
        .rst(rst),
        .ram_test__o_data_rdy(o_data_rdy),
        .ram_test__o_data_data(o_data_data),
        .ram_test__o_data_vld(o_data_vld),
        .ram_rd_data(ram_rd_data),
        .ram_rd_addr(ram_rd_addr),
        .ram_rd_en(ram_rd_en),
        .ram_wr_addr(ram_wr_addr),
        .ram_wr_data(ram_wr_data),
        .ram_wr_en(ram_wr_en)
    );

    // clk gen
    always #1 clk = ~clk;

    // o_data_rdy gen
    initial begin
        #28 o_data_rdy = 1;
    end

    // Dummy RAM - ignore writes, return reads from clock counter
    always @(posedge clk) begin
        if (rst)
            ram_rd_data <= 8'd0;
        else
            ram_rd_data <= ram_rd_data + 8'd1;
    end

    initial begin
        $dumpfile("ram_test.vcd");
        $dumpvars(0, test);
        // Reset
        @(posedge clk) rst = 1;
        #5 @(posedge clk) rst = 0;
        // Run
        #1000 $finish;
    end
endmodule
```
Bazel

```
load(
    "//xls/build_rules:xls_build_defs.bzl",
    "xls_dslx_library",
    "xls_dslx_ir",
    "xls_ir_opt_ir",
    "xls_ir_verilog",
    "xls_dslx_test",
)

package(
    default_applicable_licenses = ["//:license"],
    default_visibility = ["//xls:xls_users"],
    licenses = ["notice"],
)

xls_dslx_library(
    name = "ram_test_dslx",
    srcs = ["ram_test.x"],
    deps = ["//xls/examples:ram_dslx"],
)

xls_dslx_ir(
    name = "ram_test_ir",
    dslx_top = "ram_test",
    library = "ram_test_dslx",
    ir_file = "ram_test_ir.ir",
)

xls_ir_opt_ir(
    name = "ram_test_opt_ir",
    src = "ram_test_ir.ir",
    top = "__ram_test__ram_test_0_next",
)

xls_ir_verilog(
    name = "ram_test_verilog",
    src = "ram_test_opt_ir.opt.ir",
    verilog_file = "ram_test.v",
    codegen_args = {
        "module_name": "ram_test",
        "delay_model": "unit",
        "pipeline_stages": "7",
        "reset": "rst",
        "use_system_verilog": "false",
        "streaming_channel_data_suffix": "_data",
        "ram_configurations": ",".join([
            "ram:1R1W:{rd_req}:{rd_resp}:{wr_req}:{wr_resp}:3".format(
                rd_req = "ram_test__o_ram_read_req",
                rd_resp = "ram_test__i_ram_read_resp",
                wr_req = "ram_test__o_ram_write_req",
                wr_resp = "ram_test__i_ram_write_resp",
            ),
        ]),
    },
)
```
Running iverilog

```
% bazel build --test_output=streamed -s -c opt -- //xls/modules/test:ram_test_verilog
% iverilog -o iv_test -l bazel-bin/xls/modules/test/ram_test.v xls/modules/test/ram_test_test.v && vvp -v iv_test
```
grebe commented 1 year ago

Sorry for the slow reply, between taking time to digest this and time off for July 4th, it's been a while.

For latency=0, it is (at least for now) an expected failure that it will deadlock. Setting latency=0 schedules a send and a receive in the same cycle, and our current codegen performs send/recvs atomically in such a way that this deadlocks (if I remember, this is because receives must complete before …). I think it's pretty substantial surgery to make this work the way it needs to, and we don't have a pressing need to support flop arrays via the memory interface (I think synthesis is good enough for our use cases now). Setting latency=0 is a useful debugging tool to figure out what's going on with the scheduler, but the codegen'd output is not expected to work right now.
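
For concreteness, a latency-0 read port is purely combinational - the kind of interface a flop array presents - so the response to a read request is available in the same cycle the request is made, forcing the send and the receive into one cycle. A minimal sketch, with widths assumed to match the example above:

```
// Flop-array-style 1R1W storage with a latency-0 (combinational) read port.
module flop_array_1r1w (
    input  wire       clk,
    input  wire [7:0] rd_addr,
    output wire [7:0] rd_data,
    input  wire       wr_en,
    input  wire [7:0] wr_addr,
    input  wire [7:0] wr_data
);
    reg [7:0] mem [0:255];

    // Zero-latency read: rd_data reflects rd_addr within the same cycle.
    assign rd_data = mem[rd_addr];

    always @(posedge clk) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
    end
endmodule
```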

For latency>1, I think some of the weirdness (especially the reads running ahead) comes from the codegen option not really supporting latency != 1. There's some internal rewriting of control signals that currently uses a 1-element FIFO. I think what's likely happening is that valid signals arrive earlier than expected, or out of phase, and don't backpressure correctly. I've added #1053 to track this.

For latency=1, I'm a bit puzzled by what's going on, as I definitely have tests showing simple examples with memories running at full throughput. Those tests are of the "manually run the simulator and inspect the VCD" variety, so I've added #1054 to add tests that fail if full throughput isn't met. Your example here seems like a good test case to include there. Maybe I've missed a regression, or maybe your example differs in some way that triggers the bad behavior.

proppy commented 1 year ago

Keeping this open until #1053 and #1054 are closed, as this is blocking #995.