Open rdob-ant opened 1 year ago
Sorry for the slow reply, between taking time to digest this and time off for July 4th, it's been a while.
For latency=0, (at least for now) it is an expected failure that it will deadlock. Setting latency=0 schedules a send and a receive in the same cycle, and our current codegen has send/recvs performed atomically in such a way that this will deadlock (if I remember, this is because receives must complete before . I think it's pretty substantial surgery to make this work the way it needs to and we don't have a pressing need to support flop arrays via the memory interface (I think synthesis is good enough for our use-cases now). Setting latency=0 is a useful debugging tool to figure out what's going on with the scheduler, but the codegen'd output is not expected to work right now.
For latency>1, I think some of the weirdness (especially the reads running ahead) come from the codegen option not really supporting latency =/= 1. There's some internal rewriting of control flow signals that currently uses a 1-element fifo. I think what's likely happening is that valid signals are coming in earlier than expected or out of phase and not backpressuring correctly. I've added #1053 to track this.
For latency=1 I'm a bit puzzled by what's going on as I definitely have some tests that show simple examples with memories running at full throughput. Those tests are of the "manually run the simulator and inspect VCD" variety, so I've added #1054 to add some tests that fail if full throughput isn't met. Your example here seems like a good test case to include in that test. Maybe I've missed a regression or maybe your example is somehow different in a way that triggers bad behavior.
keeping this open until #1053 and #1054 are closed as this is blocking #995
I wrote a small test proc to experiment with 1R1W RAMs since they're required for #995 .
TLDR: Using it, I've observed that relative sequencing of read w.r.t. write transactions produced by XLS-generated Verilog module over RAM interface is sometimes inconsistent, codegen RAM latency setting seems to have little correlation with the actual latency (delay) between
rd_en
strobe andrd_data
sampling, and RAM latency setting of0
appears to break the module.Setup
The proc behavior is like this, where
mem
is a 1R1W semi-dual port RAM,o_data
is an extra output channel used for debugging and stalling the pipeline:Verilog module is generated from the above proc with
ram_configurations
codegen option. It is then instantiated from a hand-written Verilog test module. The test is run using iverilog, and generated .vcd waveform file is manually analyzed in GtkWave. The test module behavior can be summarized like this:After running the test, I look at
ram_rd
,ram_wr
ando_data
signals and check the following:ram_rd_en
pulses.addr+1
in the current iteration ofnext()
is followed by read from the same address in the next iteration).o_data
contains address & data read from RAM, which allows to precisely determine the moment ofram_rd_data
bus sampling.o_data_rdy
initially set to LOW allows observing module's behavior with stalled and unstalled pipeline in the single simulation run.Case 1
Observations:
XLS-generated module does not do anything, RAM write transactions and
o_data
sends do not happen. DSLX proc must at least send some data too_data
. In this case it looks like module's pipeline is always stalled - while it should not be.Case 2
Observations:
ram_rd_data
is sampled on the second clock edge following assertion ofram_rd_en
. This is as expected.k+1
(from N-th iteration ofnext()
) and read from addressk+1
(from N+1-th iteration ofnext()
) happen in the same clock cycle when the pipeline is not stalled, and read is delayed when the pipeline is stalled. The intention of DSLX code is that the read in the beginning ofnext()
should happen only when the write from the end of previousnext()
'iteration' has been completed, not that it will happen simultaneously. This is also seen in opt-IR code:next()
is 'called' with the token returned fromrecv(ram_wr_resp)
, meaning thatsend(ram_rd_req)
should not happen prior to the completion ofrecv(ram_wr_resp)
. The fact that read sometimes coincides with write, and sometimes is delayed w.r.t write forces one to assumewrite-before-read
RAM semantics, and use a corresponding RAM IP block that is not always available. However, see cases below, as the the situation gets worse for higher-latency RAMs.k+1
and evenk+2
happen beforeo_data
send transaction for addressk
is concluded. This is not a real problem and can be tolerated as long as RAM writes do not have any interaction witho_data
channel in the circuitry external to the module. Nevertheless, my understanding of DSLX code is that RAM writes should be executed only after completion ofsend(o_data)
transactions.ram_rd_en
is asserted for at most 1 clock cycle at a time and is re-asserted afterram_rd_data
has been sampled). This happens even when the module performs only RAM reads, that is RAM WR ando_data
channels are disabled (using send_if/recv_if with predicate set tofalse
). Although RAM with 1-cycle latency is expected to be the fastest synchronous RAM available, the throughput of XLS-generated module with such RAM turns out to be limited to 1 read per 3 clock cycles, while RAM itself is capable of 1 read per cycle.Case 3
Observations:
The behavior is mostly the same as in case 2, expect for the following:
ram_rd_data
is sampled at the same time as before - on the second clock cycle after assertion ofram_rd_en
. This means that XLS assumes RAM latency to be1
- the same as in case 2, while our intention was to have latency of2
, so sampling should happen later.k+1
happen before writes tok+1
. This breaks the intended behavior of DSLX code, which assumes that writes will happen before reads.Case 4
Observations:
ram_rd_data
is unpredictable and seems to be broken. IMO, this makes it impossible to build a deterministic Verilog RAM module that will interwork with XLS-generated RAM interface. See the data sent too_data
:ram_rd_en
!k+1
and evenk+2
happen before writes tok+1
.Effects of above observations
Code
DSLX (ram_test.x)
``` import std import xls.examples.ram as ram type RamWriteReq = ram::WriteReq; type RamWriteResp = ram::WriteResp; type RamReadReq = ram::ReadReq; type RamReadResp = ram::ReadResp; struct OutData { addr: u8, data: u8, } proc ram_test { o_data: chanVerilog (ram_test_test.v)
``` `timescale 1ns/1ns module test; reg clk = 0; reg rst = 0; wire [15:0] o_data_data; wire o_data_vld; reg o_data_rdy = 0; reg [7:0] ram_rd_data; wire [7:0] ram_rd_addr; wire ram_rd_en; wire [7:0] ram_wr_addr; wire [7:0] ram_wr_data; wire ram_wr_en; ram_test dut ( .clk(clk), .rst(rst), .ram_test__o_data_rdy(o_data_rdy), .ram_test__o_data_data(o_data_data), .ram_test__o_data_vld(o_data_vld), .ram_rd_data(ram_rd_data), .ram_rd_addr(ram_rd_addr), .ram_rd_en(ram_rd_en), .ram_wr_addr(ram_wr_addr), .ram_wr_data(ram_wr_data), .ram_wr_en(ram_wr_en) ); // clk gen always #1 clk = ~clk; // o_data_rdy gen initial begin # 28 o_data_rdy = 1; end // Dummy RAM - ignore writes, return reads from clock counter always @(posedge clk) begin if (rst) ram_rd_data <= 8'd0; else ram_rd_data <= ram_rd_data + 8'd1; end initial begin $dumpfile("ram_test.vcd"); $dumpvars(0,test); // Reset @(posedge clk) rst = 1; # 5 @(posedge clk) rst = 0; // Run # 1000 $finish; end endmodule ```Bazel
``` load( "//xls/build_rules:xls_build_defs.bzl", "xls_dslx_library", "xls_dslx_ir", "xls_ir_opt_ir", "xls_ir_verilog", "xls_dslx_test", ) package( default_applicable_licenses = ["//:license"], default_visibility = ["//xls:xls_users"], licenses = ["notice"], ) xls_dslx_library( name = "ram_test_dslx", srcs = [ "ram_test.x" ], deps = [ "//xls/examples:ram_dslx", ], ) xls_dslx_ir( name = "ram_test_ir", dslx_top = "ram_test", library = "ram_test_dslx", ir_file = "ram_test_ir.ir", ) xls_ir_opt_ir( name = "ram_test_opt_ir", src = "ram_test_ir.ir", top = "__ram_test__ram_test_0_next", ) xls_ir_verilog( name = "ram_test_verilog", src = "ram_test_opt_ir.opt.ir", verilog_file = "ram_test.v", codegen_args = { "module_name": "ram_test", "delay_model": "unit", "pipeline_stages": "7", "reset": "rst", "use_system_verilog": "false", "streaming_channel_data_suffix": "_data", "ram_configurations": ",".join([ "ram:1R1W:{rd_req}:{rd_resp}:{wr_req}:{wr_resp}:3".format( rd_req = "ram_test__o_ram_read_req", rd_resp = "ram_test__i_ram_read_resp", wr_req = "ram_test__o_ram_write_req", wr_resp = "ram_test__i_ram_write_resp", ), ]), }, ) ```Running iverilog
``` % bazel build --test_output=streamed -s -c opt -- //xls/modules/test:ram_test_verilog % iverilog -o iv_test -l bazel-bin/xls/modules/test/ram_test.v xls/modules/test/ram_test_test.v && vvp -v iv_test ```