ucb-bar / riscv-sodor

educational microarchitectures for risc-v isa
Other
688 stars 155 forks source link

race condition in SimDTM.v under Verilator; related to "taddr-4" in fesvr/dtm.cc, write_chunk #55

Open smcpeak opened 4 years ago

smcpeak commented 4 years ago

vsrc/SimDTM.v contains this call to debug_tick:

      __exit = debug_tick(
        __debug_req_valid,
        __debug_req_ready,
        __debug_req_bits_addr,
        __debug_req_bits_op,
        __debug_req_bits_data,
        __debug_resp_valid,
        __debug_resp_ready,
        __debug_resp_bits_resp,
        __debug_resp_bits_data
      );

This reads inputs, such as __debug_req_valid, that may depend on its own outputs, such as __debug_resp_valid. This makes the circuit behavior undefined.

There are existing code lines that attempt to solve this by introducing delays, for example:

  assign #0.1 debug_req_valid = __debug_req_valid;

but the delay specifier ("#0.1") has no effect under Verilator.

One way to make this race condition visible is to add a diagnostic printout to debug.scala:

diff --git a/src/common/debug.scala b/src/common/debug.scala
index ead55bc..b21d391 100644
--- a/src/common/debug.scala
+++ b/src/common/debug.scala
@@ -169,6 +169,32 @@ class DebugModule(implicit val conf: SodorConfiguration) extends Module {
   dmstatus.allhalted := dmcontrol.haltreq
   dmstatus.allrunning := dmcontrol.resumereq
   io.dcpath.halt := dmstatus.allhalted && !dmstatus.allrunning
+
+  // Stronger condition.
+  val outercond =
+    ( (io.dmi.req.bits.op === DMConsts.dmi_OP_WRITE) &&
+      (decoded_addr(DMI_RegAddrs.DMI_COMMAND)) &&
+      io.dmi.req.valid );
+
+  // Weaker condition.  I call this "inner" because its two elements are
+  // syntactically contained inside the "outer" condition.
+  val innercond = 
+    ( (decoded_addr(DMI_RegAddrs.DMI_COMMAND)) &&
+      io.dmi.req.valid );
+
+  // This creates an extra use of 'innercond' that changes the way
+  // Verilator generates code.  It is needed to expose the bug.  The
+  // value also has to be used somewhere, which is why it appears in
+  // the "bug happened" printout below.
+  val anothercond = innercond && dmcontrol.haltreq;
+
+  // Test for the impossible combination.
+  when (outercond) {
+    when (!innercond) {
+      printf("bug happened! anothercond=%d\n", anothercond);
+    }
+  }
+
   when (io.dmi.req.bits.op === DMConsts.dmi_OP_WRITE){
     when((decoded_addr(DMI_RegAddrs.DMI_ABSTRACTCS)) && io.dmi.req.valid) {
       val tempabstractcs = wdata.asTypeOf(new ABSTRACTCSFields())

With the above change, we see printouts indicating the impossible condition happens:

emulator/rv32_1stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-simple
Instantiated DTM.
bug happened! anothercond=0
bug happened! anothercond=0
bug happened! anothercond=0
bug happened! anothercond=0
Cyc=          0 [1] pc=[80000000] W[r 0=80000004][1] Op1=[r 0][00000000] Op2=[r12][00000000] inst=[04c0006f]  J  DASM(04c0006f)
[...]

At first I thought this was a bug in Verilator and filed https://github.com/verilator/verilator/issues/2177 for it, but was informed about the race condition in SimDTM.v, so I'm reporting it here.

To fix the bug, I propose (following the advice in the linked issue):

diff --git a/vsrc/SimDTM.v b/vsrc/SimDTM.v
index a5f8c23..1d968d6 100644
--- a/vsrc/SimDTM.v
+++ b/vsrc/SimDTM.v
@@ -46,12 +46,18 @@ module SimDTM(
   bit __debug_resp_ready;
   int __exit;

-  assign #0.1 debug_req_valid = __debug_req_valid;
-  assign #0.1 debug_req_bits_addr = __debug_req_bits_addr[6:0];
-  assign #0.1 debug_req_bits_op = __debug_req_bits_op[1:0];
-  assign #0.1 debug_req_bits_data = __debug_req_bits_data[31:0];
-  assign #0.1 debug_resp_ready = __debug_resp_ready;
-  assign #0.1 exit = __exit;
+  // Copy data from 'debug_tick' on the negative clock edge so it does
+  // not happen at the same time as a positive clock edge, which would
+  // create a dependency cycle.
+  always @(negedge clk)
+  begin
+    debug_req_valid     <= __debug_req_valid;
+    debug_req_bits_addr <= __debug_req_bits_addr[6:0];
+    debug_req_bits_op   <= __debug_req_bits_op[1:0];
+    debug_req_bits_data <= __debug_req_bits_data[31:0];
+    debug_resp_ready    <= __debug_resp_ready;
+    exit                <= __exit;
+  end

   always @(posedge clk)
   begin

That change eliminates the observed misbehavior in debug.scala. However, tests then fail. For example:

emulator/rv32_1stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-simple
Instantiated DTM.
Cyc=          0 [1] pc=[80000000] W[r30=00000002][1] Op1=[r 0][00000000] Op2=[r 2][00000000] inst=[34202f73]     DASM(34202f73)
[...]
Cyc=      29714 [1] pc=[00000060] W[r 5=00000060][1] Op1=[r 0][00000000] Op2=[r 0][00000060] inst=[00000297]     DASM(00000297)
*** FAILED *** (timeout) after 30000 cycles

The problem now is that the program does not get loaded correctly. The first instruction should be 0x04c0006f, not 0x34202f73, which is the second instruction. I tracked this down to riscv-isa-sim/fesvr/dtm.cc, write_chunk, which has this line:

  write(DMI_SBADDRESS0, taddr-4);

The problem is the mysterious -4. There is nothing in the RISCV debug specification to suggest that should be needed (on the contrary, it should not), and there isn't anything similar in read_chunk nor in the non-Sodor implementation of write_chunk. I speculate the -4 was added because it compensated for an effect of the race condition I'm describing. If I remove it, and also fix the race as described above, then programs work again.

Unfortunately, even then not all of the tests pass. The first to fail is rv32_3stage, rv32ui-p-xori:

emulator/rv32_3stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-xori
[...]
Cyc=        280 [1] pc=[80000048] W[r 0=8000004c][1] Op1=[r31][0000000b] Op2=[r25][00000000] inst=[ff9ff06f]     DASM(ff9ff06f)
terminate called after throwing an instance of 'std::runtime_error'
  what():  bad syscall #342274048501219439
Exit 255

I have not attempted to debug that failure yet. I figured I'd file what I have and get some feedback before going further.