google / CFU-Playground

Want a faster ML processor? Do it yourself! -- A framework for playing with custom opcodes to accelerate TensorFlow Lite for Microcontrollers (TFLM). . . . . . Online tutorial: https://google.github.io/CFU-Playground/ For reference docs, see the link below.
http://cfu-playground.rtfd.io/
Apache License 2.0
455 stars 116 forks source link

VexRiscV - SlimoptCfu does not appear to be working #374

Open alanvgreen opened 2 years ago

alanvgreen commented 2 years ago

I have a minimal Cfu and with SlimOptCfu on HPS Proto2, the result appears to be always wrong:

/* Generated by Yosys 0.12+13 (git sha1 21fbdb663, clang 11.1.0-4+build2 -fPIC -Os) */

(* \nmigen.hierarchy  = "Cfu" *)
(* top =  1  *)
(* generator = "nMigen" *)
module Cfu(cmd_ready, cmd_payload_function_id, cmd_payload_inputs_0, cmd_payload_inputs_1, rsp_valid, rsp_ready, rsp_payload_outputs_0, reset, port0_addr, port1_addr, port2_addr, port3_addr, port0_din, port1_din, port2_din, port3_din, clk, rst, cmd_valid);
  reg \$auto$verilog_backend.cc:2083:dump_module$2  = 0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/ir.py:524" *)
  input clk;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:182" *)
  input [9:0] cmd_payload_function_id;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:184" *)
  input [31:0] cmd_payload_inputs_0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:185" *)
  input [31:0] cmd_payload_inputs_1;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:181" *)
  output cmd_ready;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:180" *)
  input cmd_valid;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/proj/hps_accel/gateware/gen2/hps_cfu.py:32" *)
  reg f = 1'h0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/proj/hps_accel/gateware/gen2/hps_cfu.py:32" *)
  reg \f$next ;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port0_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port0_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port1_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port1_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port2_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port2_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:190" *)
  input [31:0] port3_addr;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:191" *)
  input [31:0] port3_din;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:189" *)
  input reset;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:188" *)
  output [31:0] rsp_payload_outputs_0;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:187" *)
  input rsp_ready;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/python/nmigen_cfu/cfu.py:186" *)
  output rsp_valid;
  (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/ir.py:524" *)
  input rst;
  always @(posedge clk)
    f <= \f$next ;
  always @* begin
    if (\$auto$verilog_backend.cc:2083:dump_module$2 ) begin end
    \f$next  = 1'h1;
    (* src = "/usr/local/google/home/avg/src/CFU-Playground/third_party/python/nmigen/nmigen/hdl/xfrm.py:519" *)
    casez (rst)
      1'h1:
          \f$next  = 1'h0;
    endcase
  end
  assign rsp_valid = 1'h1;
  assign rsp_payload_outputs_0 = 32'd305419896;
  assign cmd_ready = 1'h1;
endmodule

Every result received by the CPU is 0x13355779

This was the nmigen used to generate the Cfu.

from nmigen import Signal
from nmigen_cfu import CfuBase

from .constants import Constants

class Cfu(CfuBase):
    """Gen2 accelerator CFU.

    Assumes working with a slimopt+cfu VexRiscV, which rsp_ready is always true.
    """
    def elab(self, m):
        # for now, everything completes in a single cycle, so can be always ready
        m.d.comb += self.cmd_ready.eq(1)

        m.d.comb += self.rsp_out.eq(0x12345678)
        m.d.comb += self.rsp_valid.eq(1)

        f = Signal()
        m.d.sync += f.eq(1)

def make_cfu():
    return Cfu()

The 'f' signal exists to ensure that a clk signal is generated.

I haven't had time to isolate the bug better yet, but the above failing code is available at https://github.com/alanvgreen/CFU-Playground/blob/slimopt-bug/proj/hps_accel/gateware/gen2/hps_cfu.py

alanvgreen commented 2 years ago

I have verified that the above Cfu works as expected when the default VexRiscV CPU is used.

tcal-x commented 2 years ago

Hi @alanvgreen , did it meet timing?

tcal-x commented 2 years ago

The CFU w/ slimopt+cfu worked correctly for me on Arty. Build command:

make  EXTRA_LITEX_ARGS="--cpu-variant=slimopt+cfu" clean prog

I'll try proto2 now.

tcal-x commented 2 years ago

On proto2, I met timing by a ways (100MHz+), and I do see a symptom similar to yours: result is 0x12345778.

tcal-x commented 2 years ago

And on proto2, with slim+cfu and your CFU, I get the correct 0x12345678.

alanvgreen commented 2 years ago

I believe it did meet timing, yes.

On Sun, Dec 12, 2021 at 3:26 PM TCal @.***> wrote:

Hi @alanvgreen https://github.com/alanvgreen , did it meet timing?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/CFU-Playground/issues/374#issuecomment-991833053, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJ4WATO2UVPXKJARWCY3WLUQQP6DANCNFSM5J3PFE3A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

tcal-x commented 2 years ago

When I build slimopt+cfu using Radiant, I see the correct answer 0x12345678.

alanvgreen commented 2 years ago

Ran again - confirmed it meets timing (102.87 MHz). Still producing incorrect output - see uart output below.

Hello, World!

CFU Playground
==============
 1: TfLM Models menu
 2: Functional CFU Tests
 3: Project menu
 4: Performance Counter Tests
 5: TFLite Unit Tests
 6: Benchmarks
 7: Util Tests
 8: SPI Flash Debugging
main> 3

Running Project menu

Project Menu
============
 p: Ping CFU
 v: Verify register tests
 L: test layer 20
 x: eXit to previous menu
project> v

Running Verify register tests
input = c57b29a7, output = 13355779 - FAIL
input = 2b3d0954, output = 13355779 - FAIL
input = 02354ca2, output = 13355779 - FAIL
input = 3727d2b9, output = 13355779 - FAIL
input = 7963b9c5, output = 13355779 - FAIL
input = e3c5de35, output = 13355779 - FAIL
input = f2886d5b, output = 13355779 - FAIL
input = 46f99ee6, output = 13355779 - FAIL
input = b1257a48, output = 13355779 - FAIL
input = d8184f02, output = 13355779 - FAIL
FAIL: 10 failures out of 10 tests
---
tcal-x commented 2 years ago

I ran with slimopt+cfu, default toolchain, on proto2, with the only change that I specified the 7- grade part LIFCL-17-7UWG72C.

With this run, I matched @alanvgreen 's incorrect result exactly, 0x13355779.

tcal-x commented 2 years ago

Some other experiments I ran --

Results -- always a similar error, an LSB of a hex digit being 1'b1 when it should be 1'b0.

Another experiment -- I tried to reproduce on Fomu with the same CPU/CFU, and a similar tool chain -- I did not see the error with Fomu.

tcal-x commented 2 years ago

I did another experiment with Yosys synthesis and Radiant PnR. The results are even stranger.

make PLATFORM=hps EXTRA_LITEX_ARGS="--cpu-variant=slimopt+cfu --toolchain=radiant --synth-mode=yosys" clean bitstream

Output for interactive test:

Running Run interactive tests
CFU Interactive Test:
  First operand value   > 0
  Second operand value  > 0
cfu_op0(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op1(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op2(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op3(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op4(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op5(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)
cfu_op6(00000000, 00000000) = 13355009 (hex), 322261001 (signed), 322261001 (unsigned)
cfu_op7(00000000, 00000000) = 13355679 (hex), 322262649 (signed), 322262649 (unsigned)

So with the CFU outputting constant 0x12345678, the output as read by the CPU alternates between two bad values -- and now we sometime see bits incorrectly forced to 1'b0 (before we only saw bits being corrupted to 1'b1). The alternation is not just random -- in all the tests, every successive result flips to the other one.

tcal-x commented 2 years ago

I ran Radiant again full flow (2.2.1), and saw correct behavior.

tcal-x commented 2 years ago

Another clue: with the full Oxide flow (yosys/nextpnr-nexus) on this design, I get the correct behavior if I add -noflatten to the arguments to synth_nexus in the Yosys script hps_proto2_platform.ys.

So I suspect that constant propagation / folding from the CFU constant value into the VexRiscv receiving FIFO, particularly across flop boundaries, triggers the issue. This receiving FIFO only exists in the slimopt variant.