RI5CY Core Gate Level ASIC Synthesis and Simulation

ranaya123 commented 4 years ago

Hi All,

Right now I only want to perform the gate-level simulation on the RI5CY core (4 stage) of Pulpissimo (master branch) platform. I have the RTL simulations successfully up and running and I want to run Coremark on the synthesized netlist. I use my own test bench for the core which exactly represents the Coremark intput/output data to the core. I got its functionality verified by running an RTL simulation on the isolated riscv_core.v (without soc_domains and other top-level entities).

However this testbench does not work with the synthesized netlist of the riscv_core.v. I traced back the erroneous transitions and the issue was found in the riscv_fetch_fifo.sv that comes under the pre-fetch buffer. The input data to this module in pre-fetch buffer are perfectly synchronized and are inline with original Coremark simulation.

i.e. I expect to see following :

But what I see is this :

Moreover, during the synthesis, the design compiler had removed several registers from riscv_fetch_fifo.sv. For instance, addr_n and addr_Q are supposed to be arrays of 4-32 Bit registers, but after the synthesis, they had shrinked to 2-32 Bit registers. Is riscv_fetch_fifo.sv fully synthesizable ? Can someone shed a light on this ? This is the source file for FIFO module I used in the build : https://pastebin.com/KcxLgGMZ

Thanks Anuradha

FrancescoConti commented 4 years ago

Hi Anuradha, gate-level simulation is tricky. You can probably ask more details on the implementation choices in CV32E40P (formerly RI5CY) here https://github.com/openhwgroup/cv32e40p ; this question is not really PULPissimo related anyways. But that said:

the full core is synthesizable and silicon-checked many times now; of course, there could still be bugs hidden. What commit of RI5CY are you using?
double check your timings; you can clearly see the effect to see gate delays in transitions in your waveform, if this is intentional check that the delays are properly annotated, that the clock is aligned with your synthesis constraints, and that there are no hold violations. You should take particular care in timing of inputs to the core generated by the testbench.
if timing effects are not intentional, you should probably disable all delays (including those of your standard cell models) in order to get a working simulation. Testing the gate-level netlist without any timing annotation might also help you understand if your problem is related to timing.
check the reset states, there might be some uninitialized flip-flops in the design, which then cause havoc in post-synthesis simulation due to X propagation (it doesn't seem your case though). Many simulators have options to set all uninitialized flops to a random value, which could help.

Hope that helps!

ranaya123 commented 4 years ago

Hello Francesco, Thanks for your inputs.

The last commit of the git looks like below : commit be1fdf506facb6468a41ed325dcbfaa88a4e0c51 Merge: fae8a38 c1b7d49 Author: bluew bluewww@users.noreply.github.com Date: Mon Nov 18 14:29:13 2019 +0100 Merge pull request #193 from pulp-platform/add-verilator-dm Add verilator support to tb/dm

2-3. Mm.. it seems the delays from the .sdf are properly annotated. However when the -nospecify switch is enabled, I see correct transitions. So this means that the synthesized netlist is functionally correct.

With these observations, I now clearly see where the issue is in. In fetch_fifo, at every positive edge of the clock, addr_Q is updated by addr_n as shown below (which is with -nospecify) :

With the annotated delays, it looks like this:

Still, the timing margins are sufficient to capture the right data from addr_n (to addr_Q) with the annotated delays. This is what I don't understand.....

Update The issue is in the fetch_fifo local clock gating cell. addr_Q is updated at the gated clock signal which is the delayed version of the original clock. Since the addr_n is updated in an always_comb block, in which the inputs are triggered to the original non-gated clock, following clock process always results wrong values in addr_Q.

always_ff @(posedge clk, negedge rst_n) begin if(rst_n == 1'b0) begin

end
else
begin
  // on a clear signal from outside we invalidate the content of the FIFO
  // completely and start from an empty state
  if (clear_i) begin
    valid_Q    <= '0;
    is_hwlp_Q  <= '0;
  end else begin
    **addr_Q    <= addr_n;**
    rdata_Q   <= rdata_n;
    valid_Q   <= valid_n;
    is_hwlp_Q <= is_hwlp_n;
  end
end

end

Anuradha

gautschimi commented 4 years ago

are you using a proper clock gating cell or a behavioural model?

gautschimi commented 4 years ago

maybe add a figure where we can see the clock of the FF in question and all input/output signals of that FF. it looks like your addr_n toggels very late, but probably still before the clock edge. typical setup violation, do you not get any error messages?

or your clock comes too late for some reason, hard to say without more info

ranaya123 commented 4 years ago

@gautschimi Hi, No I am using a proper clock gating cell from the std_lib. A clear picture of the signal transitions with original and gated clocks goes here :

But the late toggling of addr_n makes sense here as it depends on out_ready_i input as well. The thing is, for any type of clock gate cell, the gated clock signal cannot capture the stable addr_n during this time. And it is obvious since the addr_n is computed inside an always_comb block which is only sensitive to the inputs of the fetch_fifo. For ideal behavior, clock gating cell should have 0 delay.

Anuradha

gautschimi commented 4 years ago

I think everything is correct in this picture! addr_n changes before the clock edge and addr_q is therefore updated to the this value.

you need to find out why addr_n is updated so late. is it influenced by a tb signal?

please also make sure that the the design is timing clean and that you don't clock it too fast

ranaya123 commented 4 years ago

@gautschimi Thanks for your reply.

I think everything is correct in this picture! addr_n changes before the clock edge and addr_q is therefore updated to the this value.

Yes, the outcome is correct. But the late update of addr_n happens here due to its (combinatorial) sensitivity to the late arrival of the in_addr_i[31:0] and out_ready_i signals ? So it makes sense, don't you think ? Even if you make these two input signals perfectly align to the original clock, addr_n bounds to change (from 1A000084 to 1A000088) for a short time due to imperfect comb delays. Or else we need another register for addr_n inside clock process to make it stable.

Anuradha

gautschimi commented 4 years ago

if the addr_n toggels with this clock edge than this is a timing violation. if you see the design timing clean then your constraints are wrong. maybe the clock does not propagate through your clock gate or something like this.

you can report the timing in your synthesis tool to this specific endpoint and see if the path is constrained properly e.g. report_timing -through in_addr_i -to this_flip_flop

another likely source is that you do not annotate the right delays, or some delays don't get annotated properly. I'd also check the log file of the sdf annotation and make sure that all delays are annotated

ranaya123 commented 4 years ago

@gautschimi Mmm.... to answer your questions :

Yes, the SDF delays are properly annotated as evident from the .log files.
I don't see any timing violations in the design.

if the addr_n toggels with this clock edge than this is a timing violation.

See, here the point is addr_n is updated through addr_int which is not synchronized to the clock signal (you may also want to check the code here https://pastebin.com/KcxLgGMZ) :

addr_n update :

addr_int update :

So transitions after the clock edge in addr_n is quite possible even if you set proper constraints to the design. It is not a timing violation.

gautschimi commented 4 years ago

yes of course this transition is there. but it should only be captured at the next rising clock edge. your addr_n comes too early. synthesis tools are for sure checking this. this path must be unconstrained, otherwise you would see a violation. how do you define your clocks?

ranaya123 commented 4 years ago

but it should only be captured at the next rising clock edge

Indeed, so that the gated clock should transit before the change of the addr_n. This means clock -> gated_clock delay should be lower than the shortest comb delay of this block. But how to define it during the synthesis ? I have only specified the clock constraints for the riscv_core as follows :

set clk_pr 4 create_clock ${CLK} -period ${clk_pr} set_input_delay -max 0.3 -clock ${CLK} [remove_from_collection [all_inputs] [get_ports ${CLK}]] set_output_delay -max 0.3 -clock ${CLK} [all_outputs] set_clock_uncertainty 0.5 ${CLK}

Thanks

gautschimi commented 4 years ago

maybe add a figure with all clock gate signals (input clk, output clk, enable). + all signals of the FF (D/Q and CLK)

ranaya123 commented 4 years ago

@gautschimi : Here we go -> The CG shown here is the one gates addr_Q registers. Btw I re-synthesized the design at lower frequency (30 ns, earlier it was 8ns) expecting the addr_n combinatorial delay to get longer. But the result is same. Thanks

gautschimi commented 4 years ago

well the comb delay of addr_n obviously remains the same.

I would report the timing to this FF in synopsys genus or whatever you are using to see if it is constrained or not. otherwise I don't know. maybe you annotate wrong delays, e.g. fast corner for some stdcells and slow for others.

you can also check if the annotated delays are in line with what your synthesis tool reports

FrancescoConti commented 4 years ago

I agree with @gautschimi: there must be some problem with the annotated delays. Specifically, to me it looks a lot like the CG cell delay you are using in simulation is not the same the synthesis tool is considering. If the two edges (CG CLK and ENCLK) were nearer to one another, setup and hold would be met correctly: if the synthesis tool does not report any violation, that's likely what it is considering.

ranaya123 commented 4 years ago

@gautschimi & @gautschimi

Timing constraint : report_timing -through in_addr_i -to this_flip_flop (through in_addr_i to addr_Q register D input which is addr_n) https://pastebin.com/Ah7pwc2E
I annotate TYPICAL delays from the sdf file. i.e. And I see the exact CLK to ENCLK delay in the simulation :
However I should mention here that, I use automatic clock gating insertion during the synthesis.
What I do not understand is this. Change of addr_n will definitely occur withing a short time of period as it is sensitive to the inputs. On the other hand here, a CG cell is nearly driving larger cap-load for 32 addr_Q register. So inevitably CK->ENCLK becomes larger and given this situation, to capture a stable addr_n, in my opinion, in_addr_i should be assigned to addr_int (there by to addr_n) inside the clocked process. This satisfy the capturing of right addr_n at right clock edge. Correct me if I am wrong ?

FrancescoConti commented 4 years ago

From your timing report, the clock is indeed considered ideal, so it's not surprising that an edge delayed by 640ps captures the wrong stuff. You might want to add some clock uncertainty (>640 ps) in synthesis to avoid this and other problems...

See e.g. https://vlsi.pro/set_clock_uncertainty/

gautschimi commented 4 years ago

@FrancescoConti yes thats it. that will solve the problem

in synthesis you can also specify what clock gating cells the tool can use. the tool might use one with a weak driver. but later during p&r it will swap it anyway

ranaya123 commented 4 years ago

@FrancescoConti @gautschimi
Hi Guyz, yes I solved the issue by the explicit use of a CG cells (of higher driving strength) from the library. The reason that I had not seen the violations was, the clock was not in the "propagated mode" during the synthesis. With this mode enabled, I see the hold violations in the report. Anyway as @gautschimi pointed out, things will be swapped in PnR stage, so instead of higher CG cells, probably tool will insert delay elements for hold fixing. This is good enough for the gate level simulation.

Thanks a lot for all the inputs.

ranaya123 commented 4 years ago

please also make sure that the the design is timing clean and that you don't clock it too fast

@gautschimi Hi, is there a particular reason for "not clocking too fast" ? I have done PnR for several different frequencies and it seems the core is only functional for frequencies lower than 40 MHz ! Even with larger timing margins, I see XXXs for moderately higher frequencies (100MHz). However I don't see any setup/hold violations (this is after PnR, clock is always in propagated mode) in these frequencies.

Does the design have multi-cycle paths ?

Thanks

phuong2403 commented 3 years ago

Hi Anuradha, gate-level simulation is tricky. You can probably ask more details on the implementation choices in CV32E40P (formerly RI5CY) here https://github.com/openhwgroup/cv32e40p ; this question is not really PULPissimo related anyways. But that said:

the full core is synthesizable and silicon-checked many times now; of course, there could still be bugs hidden. What commit of RI5CY are you using?

double check your timings; you can clearly see the effect to see gate delays in transitions in your waveform, if this is intentional check that the delays are properly annotated, that the clock is aligned with your synthesis constraints, and that there are no hold violations. You should take particular care in timing of inputs to the core generated by the testbench.

if timing effects are not intentional, you should probably disable all delays (including those of your standard cell models) in order to get a working simulation. Testing the gate-level netlist without any timing annotation might also help you understand if your problem is related to timing.

check the reset states, there might be some uninitialized flip-flops in the design, which then cause havoc in post-synthesis simulation due to X propagation (it doesn't seem your case though). Many simulators have options to set all uninitialized flops to a random value, which could help.

Hope that helps!

Hi @FrancescoConti , @ranaya123 , I am running netlist simulation for RISCY core only. I also replace cluster_clock_gating cell in netlist file by RTL behavioral implementation (with latching). I also check reset, sleeping, fetch_enable states, they work as RTL simulation. But most of other waves are in red (XXXX). It may be as @FrancescoConti said that it is related to X propagation. Do you have any recommendations for me? Thank you so much.

ranaya123 commented 3 years ago

havoc

Hi @FrancescoConti , @ranaya123 , I am running netlist simulation for RISCY core only. I also replace cluster_clock_gating cell in netlist file by RTL behavioral implementation (with latching).

Hi, have you also removed the SDF information related to the original CG cells from .sdf file ? If yes, then the gate level simulation should work. But you have to replace all the CG cells in the design, related to every interface.

phuong2403 commented 3 years ago

havoc

Hi @FrancescoConti , @ranaya123 , I am running netlist simulation for RISCY core only. I also replace cluster_clock_gating cell in netlist file by RTL behavioral implementation (with latching).

Hi, have you also removed the SDF information related to the original CG cells from .sdf file ? If yes, then the gate level simulation should work. But you have to replace all the CG cells in the design, related to every interface.

Hi @ranaya123 , I have not considered to .sdf file yet. I just run with synthesized file (.v) and my own library file. In my synthezised file, I only found one module:

 `module cluster_clock_gating ( clk_i, en_i, test_en_i, clk_o );
   input clk_i, en_i, test_en_i;
    output clk_o;
    wire   N0, clk_en;

    DLL_X1 clk_en_reg ( .D(N0), .GN(clk_i), .Q(clk_en) );
   AND2_X2 U2 ( .A1(clk_en), .A2(clk_i), .ZN(clk_o) );
 OR2_X1 U3 ( .A1(en_i), .A2(test_en_i), .ZN(N0) );
  endmodule`

then I replace it by behavioral model :

> `module cluster_clock_gating ( clk_i, en_i, test_en_i, clk_o );
  input clk_i, en_i, test_en_i;
  output clk_o;
 reg clk_en;

  always @*
 begin
   if (clk_i == 1'b0)
     clk_en <= en_i | test_en_i;
  end

 assign clk_o = clk_i & clk_en;

endmodule`

Is that enough? Can you explain more what you do mean by CG cells related to the interface (e.g peripherals.)? Because I think I am running synthesized RISCY core only, not the whole platform. Are there any points I missed? Thank you!

phuong2403 commented 3 years ago

Hi @ranaya123 , Do you have any helps? I see a module name pulp_clock_gating.sv, but it seems that I do not see it is used anywhere on the platform. Can you indicate and help me to figure out?

Thank you so much.

ranaya123 commented 3 years ago

Hi @ranaya123 , Do you have any helps? I see a module name pulp_clock_gating.sv, but it seems that I do not see it is used anywhere on the platform. Can you indicate and help me to figure out?

Thank you so much.

Hi, you have to look at following issue post for gate level simulation. https://github.com/pulp-platform/pulpissimo/issues/77

There I've provided a synthesis script. In my case I had to remove all the CG cells used in many interfaces in the core. This means the register_file, core_cg etc. Since the generated .sdf also has the instances of these CGs, to get the .sdf annotation properly work, you have to remove the CG instances in the .sdf as well. So basically there are no CG delays in the design, leading to an ideal clock propagation. This was good enough for my simulation.

phuong2403 commented 3 years ago

Hi @ranaya123 , Thank you so much for your reply. I am sorry because I have some following basic following questions. I really appreciate if you can take the time to help me.

I also saw your synthesis script before, then I used it set_dont_touch core_clock_gate_i, and also includes analyze -format sverilog ../cluster_clock_gating.sv , but I will get: (1)

module cluster_clock_gating ( clk_i, en_i, test_en_i, clk_o );
input clk_i, en_i, test_en_i;
output clk_o;
wire   N0, clk_en, N1;
GTECH_NOT I_0 ( .A(clk_i), .Z(N1) );
GTECH_OR2 C18 ( .A(en_i), .B(test_en_i), .Z(N0) );
GTECH_AND2 C19 ( .A(clk_i), .B(clk_en), .Z(clk_o) )

for the CG cell in the synthesized file which I could not make a simulation.

Besides, Can I ask if I need to synthesize all files (riscv_tracer.sv, and in include/ such as apu_core_package.sv...) And if not, If I should compile them when I run gate-level simulation with riscv core?

+> In my case I had to remove all the CG cells used in many interfaces in the core. This means the register_file, core_cg etc

For this statement, Am I right what you mean is that delete completely (not replace) all CG cells from interfaces in the SYNTHESIZED file, or replace them with RTL models? OR remove (ore replace) CG RTL models in other parts outside the core of the platform with CG cell (1) in SYNTHESIZED file? Could you please confirm that?

Hope to see your reply again.

pulp-platform / pulpissimo

RI5CY Core Gate Level ASIC Synthesis and Simulation #135

always_ff @(posedge clk, negedge rst_n) begin if(rst_n == 1'b0) begin