google / xls

XLS: Accelerated HW Synthesis
http://google.github.io/xls/
Apache License 2.0
1.18k stars 172 forks source link

Codegen: bug: incorrect scheduling of send_if/recv_if operations #1051

Closed rdob-ant closed 1 year ago

rdob-ant commented 1 year ago

In a primitive testcase codegen does not schedule send_if/recv_if channel transactions in the order defined by the token graph.

Test proc:

proc test2 {
    req1: chan<u8> out;
    resp1: chan<u8> in;
    req2: chan<u8> out;
    resp2: chan<u8> in;
    dbg: chan<u8> out;

    init{u8:0x20}

    config(
        req1: chan<u8> out,
        resp1: chan<u8> in,
        req2: chan<u8> out,
        resp2: chan<u8> in,
        dbg: chan<u8> out,
    ) {
        (req1, resp1, req2, resp2, dbg)
    }

    next(tok: token, state: u8) {
        let pred1 = state > u8:0x10;
        let pred2 = state < u8:0x80;

        let tok = send_if(tok, req1, pred1, state);
        let (tok, v1) = recv_if(tok, resp1, pred1, u8:0x0A);

        let tok = send_if(tok, req2, pred2, v1);
        let (tok, v2) = recv_if(tok, resp2, pred2, u8:0xB0);

        let tok = send(tok, dbg, v2);

        state + u8:1
    }
}
Verilog generation ``` xls_ir_verilog( name = "test2_verilog", src = "test2_opt_ir.opt.ir", verilog_file = "test2.v", codegen_args = { "module_name": "test2", "delay_model": "unit", "pipeline_stages": "10", "reset": "rst", "use_system_verilog": "false", "streaming_channel_data_suffix": "_data", }, ) ```
Verilog testbench ``` `timescale 1ns/1ns module test; reg clk = 0; reg rst = 0; // dbg wire [7:0] dbg_data; wire dbg_vld; reg dbg_rdy = 0; // req1 wire [7:0] req1_data; wire req1_vld; wire req1_rdy; // resp1 reg [7:0] resp1_data; reg resp1_vld = 1; wire resp1_rdy; // req2 wire [7:0] req2_data; wire req2_vld; wire req2_rdy; // resp2 reg [7:0] resp2_data; reg resp2_vld = 1; wire resp2_rdy; test2 dut ( .clk(clk), .rst(rst), // dbg .test2__dbg_data(dbg_data), .test2__dbg_vld(dbg_vld), .test2__dbg_rdy(dbg_rdy), // req1 .test2__req1_data(req1_data), .test2__req1_vld(req1_vld), .test2__req1_rdy(req1_rdy), // resp1 .test2__resp1_data(resp1_data), .test2__resp1_vld(resp1_vld), .test2__resp1_rdy(resp1_rdy), // req2 .test2__req2_data(req2_data), .test2__req2_vld(req2_vld), .test2__req2_rdy(req2_rdy), // resp2 .test2__resp2_data(resp2_data), .test2__resp2_vld(resp2_vld), .test2__resp2_rdy(resp2_rdy) ); // // REQ-RESP FSM 1 assign req1_rdy = !resp1_vld || (resp1_vld && resp1_rdy); always @(posedge clk) begin if (rst) begin resp1_vld <= 1'b0; end else begin if (resp1_vld && resp1_rdy) begin resp1_vld <= 1'b0; end if (req1_vld && req1_rdy) begin resp1_vld <= 1'b1; end end end // // REQ-RESP FSM 2 assign req2_rdy = !resp2_vld || (resp2_vld && resp2_rdy); always @(posedge clk) begin if (rst) begin resp2_vld <= 1'b0; end else begin if (resp2_vld && resp2_rdy) begin resp2_vld <= 1'b0; end if (req2_vld && req2_rdy) begin resp2_vld <= 1'b1; end end end // clk gen always #1 clk = ~clk; // dbg_rdy gen initial begin # 28 dbg_rdy = 1; end // response data generator always @(posedge clk) begin if (rst) begin resp1_data <= 8'hAA; resp2_data <= 8'hBB; end else begin resp1_data <= (resp1_data + 8'h01) & 8'h0F; resp2_data <= (resp2_data + 8'h10) & 8'hF0; end end initial begin $dumpfile("test2.vcd"); $dumpvars(0,test); // Reset rst = 1; # 6 rst = 0; // Run # 10000 $finish; end endmodule ```

Expected: whenever one of the predicates is true, the corresponding send_if(req) / recv_if(resp) channel transactions should be executed in the specified order, that is, first req_vld = 1 should be set, and when req_rdy && req_vld == 1 is observed, the module must block till it observes resp_vld == 1. The module must not block waiting for resp_vld == 1 unless it has already observed acknowledgment of the request's transaction (req_rdy && req_vld == 1).

Actual: generated Verilog module is stuck and does nothing, despite the external circuit having set req1_rdy == 1; req2_rdy == 1; dbg_rdy == 1;.

image

Modification 1

If pred1 and pred2 predicates are hardcoded to e.g. pred1 = pred2 = true, the module gets going and executes transactions in correct order.

image

Modification 2

With the original code, if the external circuit briefly sets resp1_vld = 1; resp2_vld = 1; during the reset, the module gets going but executes transactions in a wrong order. As one can see from the value of req2_data = 0xAA at the cursor (red vertical line), at some time in the beginning the module receives value from resp1 channel with resp1_data (v1) == 0xAA - and that value is only present on the bus before the first clock pulse after rst signal de-assertion. This indicates that the first thing the module does upon coming out of reset is to execute recv(resp1) - which is wrong. The first operation executed by the module upon coming out of reset must be send(req1).

image

Impact

Many procs using 1R1W RAM interface will contain logic akin to the one used by this proc (imagine that req1/resp1 is a RAM write bus, req2/resp2 is a RAM read bus, and reads and writes are executed conditionally depending on the internal state of the module, with all operations expected to be executed in a strict order). This issue demonstrates that the behavior of Verilog code generated for such procs can be incorrect. This blocks #995. @proppy

P.S. I've previously reported issues related to RAM interface rewriting. Can it be that e.g. #1047 is at least partially caused by the findings reported here?

grebe commented 1 year ago

I believe this is similar to the latency=0 case discussed in #1047. For send/recv operations that are scheduled in the same stage, all receives must complete before any sends can begin, which is why you see resp_valid unblocks the req transactions.

Although somewhat counterintuitive, token semantics are that operations taking a token input must occur in the same stage or later than whatever produced the token. If the two operations end up in the same stage in the pipeline, there is no ordering between them except that receives must fire before sends.

If you want to enforce that requests fire before sends, you should add a scheduling constraint that keeps them out of the same cycle. For memories, this is done by setting latency>0, and for normal channel operations you can do this with a scheduling constraint

proppy commented 1 year ago

@rdob-ant can you confirm if https://google.github.io/xls/tutorials/intro_to_procs/#scheduling-constraints is an acceptable workaround for this issue? (as @grebe commented on #1047, it seems that this is WAI per the current token and scheduling semantics)

rdob-ant commented 1 year ago

@grebe

Although somewhat counterintuitive, token semantics are that operations taking a token input must occur in the same stage or later than whatever produced the token. If the two operations end up in the same stage in the pipeline, there is no ordering between them except that receives must fire before sends.

That's a great piece of info. Thanks! I haven't realized that the semantics were like that, but with this explanation it all begins to make sense :+1:

@proppy As for scheduling constraints, the effect of this is two-fold.

First, it allows me to schedule sends before receives, like it was intended :heavy_check_mark: . So the problem with a stalled Verilog module is fixed.

However, it still does not allow me to enforce that req1 from the N+1'th iteration should only happen when resp2 from the N'th iteration has been completed (IO constraints can't handle "backedges"?.. not sure how to call this cross-iteration relationship). So, I can't set constraints like this:

        "io_constraints": ",".join([
            "test2__req1:send:test2__resp1:recv:1:none",
            "test2__req2:send:test2__resp2:recv:1:none",
            "test2__resp2:recv:test2__req1:send:1:none",
        ])

The last constraint only works if you ask it for latency of -2 cycles or lower (more negative). The codegen fails for values of -1 and higher.

This results in the following waveforms (note that the second req1_vld assertion happens before the first resp2_rdy): image

When req1/req2 channels are independent, and we only care whether the data is sent to them, but we don't care about data ordering between the channels, this presents no problem. But if those are RD/WR ports of the same 1R1W RAM block, it can be problematic, because instead of the intended "WR-RD-WR-RD" the RAM can sometimes see "WR-WR-RD-RD" access pattern, which may break some code.

As the gist of this issue (stalled Verilog module that was first trying to receive instead of sending) is mostly fixed by using io_constraints, perhaps it's ok to close it.

rdob-ant commented 1 year ago

IO constraints resolved the core problem reported in the issue. The inability to specify IO constraints between iterations is something that's best to be discussed in a separate issue.

Thanks for the inputs everyone! Closing.