pulp-platform / pulpissimo

This is the top-level project for the PULPissimo Platform. It instantiates a PULPissimo open-source system with a PULP SoC domain, but no cluster.

Pulpissimo Post Synthesis Simulation - Hart Halt ! #77

Open ranaya123 opened 5 years ago

ranaya123 commented 5 years ago

Hi All,

This question concerns gate-level simulation on the PULPissimo platform. For now I only want to run this simulation at core level, so I took the synthesized netlist of the RISC-V core, added it to the build tree together with the model file of the standard cells, and successfully built the system.

When I run the simulation in ModelSim, I can see that the netlist of the RISC-V core appears in the design hierarchy and that the SDF file was annotated to the right scope:

**tb_pulp.i_dut.soc_domain_i.pulp_soc_i.fc_subsystem_i.FC_CORE.lFC_CORE**

Until JTAG propagates the debug request signal, everything is fine : 
-----------------------------------------------------------------
 [TB]       1ns - Using FLL
 [TB]       1ns - Not using CAM SDVT
 Loading default stimuli
[JTAG] SoftReset Done(    701ns)
[JTAG] Bypass Test Passed (  33301ns)
[JTAG] Tap ID: 249511c3 (  43701ns)
[JTAG] Tap ID Test PASSED (  43701ns)
[test_mode_if]   50301ns - Init
[TB]   50301ns - Enabling clock out via jtag
[test_mode_if]   51801ns - Setting confreg to value 003.
[TB]   51801ns - jtag_conf_reg set to 003
[TB]   51801ns - Releasing hard reset
[TB]   53401ns - Init PULP TAP
[pulp_tap_if] WRITE32 burst @1c008080 for           4 bytes.
[TB]   67501ns - Write32 PULP TAP
[JTAG] R/W test of L2 succeeded
[TB]  177701ns - Halting the Core
-----------------------------------------------------------------

I've also added the riscv_tracer instance, riscv_tracer_i(), to the synthesized netlist. What could be going wrong here?

Thanks in advance

FrancescoConti commented 5 years ago

Hi @ranaya123, post-synthesis simulation is often tricky. It is difficult to guess what could be happening here without knowing more details. A few possibilities that come to mind (not a complete list):

A good way to check these things is to perform an identical simulation using the RTL code and directly compare what happens at the core interface level.
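
One way to make that comparison mechanical is to dump the core interface signals once per clock cycle from both simulations and look for the first mismatch. The snippet below is only a sketch under assumptions: the `cycle,signal,value` CSV layout, the helper names, and the signal names in the sample data are illustrative, not something the testbench produces out of the box. Sampling per cycle (rather than per timestamp) keeps the annotated gate delays from showing up as false differences.

```python
import csv
import io

def load_samples(text):
    """Parse 'cycle,signal,value' rows into a {(cycle, signal): value} dict."""
    reader = csv.DictReader(io.StringIO(text))
    return {(int(row["cycle"]), row["signal"]): row["value"] for row in reader}

def first_divergence(rtl, gate):
    """Return (cycle, signal, rtl_value, gate_value) for the earliest mismatch, or None."""
    for key in sorted(set(rtl) | set(gate)):
        rtl_value, gate_value = rtl.get(key), gate.get(key)
        if rtl_value != gate_value:
            return key[0], key[1], rtl_value, gate_value
    return None

# Hypothetical per-cycle dumps of two core interface signals.
RTL_DUMP = """cycle,signal,value
10,debug_req_i,1
11,debug_req_i,0
11,instr_req_o,1
"""
GATE_DUMP = """cycle,signal,value
10,debug_req_i,1
11,debug_req_i,1
11,instr_req_o,1
"""

mismatch = first_divergence(load_samples(RTL_DUMP), load_samples(GATE_DUMP))
print(mismatch)
```

The first reported mismatch is usually the place to open the waveform; everything after it tends to be a consequence.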

ranaya123 commented 5 years ago

Hi, thanks for your input. The issue seems to be related to the debug_req_i port of riscv_core, which goes directly to the riscv_controller. In the RTL simulation (hello world), it is asserted only for a very short time, while in the post-synthesis simulation it stays high throughout the simulation, as shown in the following figure. So this is very likely a hang in the controller FSM.

image

Specifically, the debug_mode signal has been removed by the synthesizer, as it is not connected to the top-level (fc_subsystem) design or to cs_register. The synthesizer also removed the "debug_ebreaku" signal from cs_register, which is strange! It is a direct connection from the riscv_controller to the cs_register, so I expected it to remain between the modules.

By the way, to verify this: do you provide a sample ASIC synthesis script for the entire PULPissimo platform? At least, would it be possible to get appropriate and realistic constraints (with uncertainties) for each block of the design?

Thanks

FrancescoConti commented 5 years ago

I think you should focus on interface signals, the ones that "disappeared" seem to be internal ones. Whatever is going wrong, with >90% prob it's happening at the interface between netlist and RTL.

ranaya123 commented 5 years ago

> I think you should focus on interface signals, the ones that "disappeared" seem to be internal ones. Whatever is going wrong, with >90% prob it's happening at the interface between netlist and RTL.

By the way, to verify this: do you provide a sample ASIC synthesis script for the entire PULPissimo platform? At least, would it be possible to get appropriate and realistic constraints (with uncertainties) for each block of the design?

Is there any documentation on how the RTL should be made synthesizable (i.e. dc_shell directives)?

FrancescoConti commented 5 years ago

We don't have a full "clean" script that I can share (if so we would have put it in the repo!), but in general it's quite mundane, something like this (I put only the key commands):

source -echo -verbose ./scripts/analyze_auto/ips_add_files.tcl > ip_errors.rpt
source -echo -verbose ./scripts/analyze_auto/rtl_add_files.tcl > rtl_errors.rpt

elaborate pulpissimo -work work

write -format ddc -hier -o ./unmapped/pulpissimo_unmapped.ddc pulpissimo

link
after 10000
set uniquify_naming_style "soc_%s_%d"
uniquify -force

source -echo -verbose ./scripts/constraints.tcl
compile_ultra -no_autoungroup -no_boundary_optimization -timing -gate_clock

Proper constraints (especially I/O) depend a lot on your setup, however one thing that I can say is that some of the blocks (standard-cell memory based register files) will require exceptions, otherwise you will over-constrain them:

set_multicycle_path 2 -setup -through [get_pins soc_domain_i/pulp_soc_i/fc_subsystem_i/lFC_CORE/id_stage_i/registers_i/riscv_register_file_i/mem_reg*/Q]
set_multicycle_path 1 -hold  -through [get_pins soc_domain_i/pulp_soc_i/fc_subsystem_i/lFC_CORE/id_stage_i/registers_i/riscv_register_file_i/mem_reg*/Q]
set_multicycle_path 2 -setup -through [get_pins soc_domain_i/pulp_soc_i/fc_subsystem_i/lFC_CORE/id_stage_i/registers_i/riscv_register_file_i/mem_fp_reg*/Q]
set_multicycle_path 1 -hold  -through [get_pins soc_domain_i/pulp_soc_i/fc_subsystem_i/lFC_CORE/id_stage_i/registers_i/riscv_register_file_i/mem_fp_reg*/Q]
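
To see what these exceptions change, here is a minimal sketch (in Python, just arithmetic) of where the setup and hold checks land relative to the launch edge. It assumes launch and capture on the same clock and the usual SDC convention that the hold check sits one cycle before the setup check, moved back further by the hold multiplier. The 10 ns period is a made-up value and check_edges is a hypothetical helper, not a tool command.

```python
def check_edges(period_ns, setup_mult=1, hold_mult=0):
    """Return (setup_check_ns, hold_check_ns) relative to the launch edge.

    Assumes launch and capture flops share one clock.
    """
    setup_edge = setup_mult * period_ns
    hold_edge = (setup_mult - 1 - hold_mult) * period_ns
    return setup_edge, hold_edge

PERIOD_NS = 10.0  # illustrative clock period, not taken from any real constraint file

print(check_edges(PERIOD_NS))        # default single-cycle path
print(check_edges(PERIOD_NS, 2, 0))  # setup multiplier 2 only
print(check_edges(PERIOD_NS, 2, 1))  # setup 2 and hold 1, as in the exceptions above
```

With only the setup multiplier, the hold check would move out to one full period after launch, which is why the matching "-hold 1" line is there: it brings the hold check back to the launch edge while setup still gets two cycles.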

ranaya123 commented 5 years ago

@FrancescoConti: OK, I was able to pinpoint the issue. The main reason for the "halt" seems to be a wrongly issued instruction address caused by the FSM in prefetch_buffer_i. Take a look at the following figure for the RTL simulation: Proof_RTL

At the time instant marked by the white arrow, branch_i goes to zero exactly at the rising edge of the clock. Therefore, according to the riscv_prefetch_buffer FSM, instr_addr_o = fetch_addr when CS = WAIT_RVALID. But in the synthesized design, with realistic delays, branch_i is still not 0 at this time instant, as shown in the following figure: Proof_Syn

The combinational delay of the FSM slightly stretches branch_i and, as a consequence, instr_addr_o = addr_i instead of fetch_addr, as shown. This affects CS and NS as well. From this point onwards, instr_addr_o is stuck at 1a110804, which is not the intended behavior.

branch_i is produced in riscv_if_stage with respect to the rising edge of clk, so checking its value again in prefetch_buffer within the same clock cycle does not reproduce the RTL simulation results.

ranaya123 commented 5 years ago

So I solved the issue. The trick is to replace the clock gating cells (inserted by the synthesizer) with their behavioral models to avoid hold violations. Otherwise, data is sampled at two different clock edges on the peripheral and core sides.

vikramjain236 commented 5 years ago

Hi @ranaya123, how did you replace the clock gating cells with behavioral models? Also, have you tried to synthesize a bigger design, for example the soc_domain? I have been trying to do this but am running into a lot of issues. One thing I saw is that the clock_en_i pin of riscv_core remains unconnected, which might be a problem.

ranaya123 commented 5 years ago

@vikramjain236 I haven't had a chance to synthesize the bigger system. I will look into that in the coming weeks. Regarding the clock gating cells, you can first synthesize the design with clock gating enabled and then replace those cells in the synthesized netlist with their behavioral model (with latching).

To annotate the SDF properly, you also have to skip those cells in the .sdf file, so that data sampling becomes synchronized.
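
For the SDF part, a rough sketch of what "skipping those cells" could look like as a script: it drops every (CELL ...) block whose CELLTYPE is in a given set and leaves the rest of the file untouched. This is an illustration only. The cell type names and the tiny SDF sample are made up, strip_cells is a hypothetical helper, and a real multi-megabyte SDF would want a faster, streaming approach.

```python
import re

def _block_end(text, start):
    """Return the index just past the parenthesis block that opens at text[start]."""
    depth = 0
    in_string = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if ch == '"':
            in_string = not in_string  # ignore parentheses inside quoted strings
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    return pos + 1
    raise ValueError("unbalanced parentheses in SDF text")

def strip_cells(sdf_text, skip_types):
    """Return sdf_text without the (CELL ...) blocks whose CELLTYPE is in skip_types."""
    out = []
    i = 0
    while i < len(sdf_text):
        is_cell = sdf_text.startswith("(CELL", i) and not sdf_text.startswith("(CELLTYPE", i)
        if is_cell:
            end = _block_end(sdf_text, i)
            block = sdf_text[i:end]
            match = re.search(r'\(CELLTYPE\s+"([^"]+)"', block)
            if not (match and match.group(1) in skip_types):
                out.append(block)  # keep every cell that is not on the skip list
            i = end
        else:
            out.append(sdf_text[i])
            i += 1
    return "".join(out)

# Made-up SDF fragment; CKGATE_X1 and NAND2_X1 are hypothetical library cell names.
SDF = '''(DELAYFILE
  (SDFVERSION "3.0")
  (CELL
    (CELLTYPE "CKGATE_X1")
    (INSTANCE id_stage_i/clk_gate_0)
    (DELAY (ABSOLUTE (IOPATH CK GCK (0.1:0.1:0.1) (0.1:0.1:0.1))))
  )
  (CELL
    (CELLTYPE "NAND2_X1")
    (INSTANCE id_stage_i/U42)
    (DELAY (ABSOLUTE (IOPATH A Z (0.05:0.05:0.05) (0.05:0.05:0.05))))
  )
)
'''

filtered = strip_cells(SDF, {"CKGATE_X1"})
print(filtered)
```

The set passed to strip_cells should list whatever gating cells your library and clock gating style actually produced in the netlist.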

Anuradha

renzoandri commented 5 years ago

Hi Vikram, if you are looking for the behavioural model of the cluster clock gate, you can find it here: https://github.com/pulp-platform/tech_cells_generic/blob/master/src/cluster_clock_gating.sv

Regards, Renzo

vikramjain236 commented 5 years ago

I get this "Access to register" error when I try to do post-synthesis simulation. Does anyone know what the problem might be? (I synthesize only soc_domain.sv and all its submodules.)

# [TB]  177701ns - Halting the Core
# [TB]  236501ns - Writing the boot address into dpc
# ** Error: Access to register 07b1 failed with error X
#    Time: 280601 ns  Scope: jtag_pkg.debug_mode_if_t.wait_command File: /volume1/users/vjain/pulpissimo/sim/../rtl/tb/jtag_pkg.sv Line: 770
# [TB]  280601ns - Loading L2
# [JTAG] Loading L2 with pulp tap jtag interface
# [pulp_tap_if] WRITE32 burst @1c000000 for        1024 bytes.
# [pulp_tap_if] WRITE32 burst @1c000400 for        1024 bytes.

ranaya123 commented 5 years ago

@vikramjain236 Hi, a few things to check before gate-level simulation:

  1. Are the interfaces between submodules preserved during synthesis? For example, you may disable boundary optimization. Take a look at the following synthesis script for riscv_core (only): https://pastebin.com/Lm4TfGkD

  2. It wasn't only the global cluster clock gating: the clock gating local to the submodules had to be replaced too. So the behavioural model of the clock gating cells has to be adapted to the clock gating style you used during synthesis. If it's the default style, then the cluster clock gating RTL model should work.

Anuradha

vikramjain236 commented 5 years ago

@FrancescoConti @renzoandri @ranaya123

I have been trying to run a post synthesis simulation on the pulpissimo environment.

  1. I started by synthesizing the soc_domain module and all its submodules. Attached is the list of files that is read into the synthesis tool (rtl_files.tcl).
  2. boundary_optimization and auto ungroup are set to false, and clock gating is set to true.
  3. I have also included SRAM macros in the design, for the interleaved banks and private banks (I replaced these with generic_memory later, because the mask of the SRAM macro is 32 bits wide while that of the generic memory is 4 bits, and when I try to replicate and concatenate the 4 bits to 32 bits, the tool removes the other bit signals).
  4. After synthesis I replaced all the cluster_clock_gating cells and pulp_clock_gating cells with their respective behavioral models.
  5. I then replaced soc_domain.sv in rtl/pulpissimo/src_files.yml with the synthesized netlist (and also included the technology behavioral models and SRAM models).
  6. I ran ./generate_scripts and make clean build in the root folder of pulpissimo, and then ran the helloworld example from pulp-rt-examples.
  7. The post-synthesis simulation has a few problems that I can recognize:
     a. The instr_rdata_i coming from the instruction cache initially shows badacce5 values, which does not happen in the normal simulation.
     b. The illegal_instr_i signal is set high many times in the simulation.

You can also find the vcd dump in my google drive: https://drive.google.com/open?id=102YbQ7HJ0prg3M4Es373vznWqamDxc03

rtl_files.tcl.zip

Hope to get some help from you! Thanks!

renzoandri commented 5 years ago

I haven't done post-synthesis simulation on the entire PULPissimo, and I have also never used the behavioural cells. badacce5 means bad access (I don't know when this error appears). One thing you could check is whether the simulation is timed or untimed. You might have hold violations because some parts of your design are timed and others are untimed (e.g. the behavioural clock gating cells are for sure untimed). The memory models are also a common source of problems and should be checked.
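
To illustrate why a timed/untimed mix can produce hold violations, here is a back-of-the-envelope sketch using the textbook hold-slack relation (data arrival minus data required). All numbers are made-up picosecond values and hold_slack_ps is a hypothetical helper; the point is only that a delay on the capture clock with no matching delay on the data path pushes the slack negative.

```python
def hold_slack_ps(launch_clk_delay, clk_to_q, data_delay, capture_clk_delay, hold_time):
    """Hold slack in picoseconds; a negative value means a hold violation."""
    data_arrival = launch_clk_delay + clk_to_q + data_delay
    data_required = capture_clk_delay + hold_time
    return data_arrival - data_required

# Both sides untimed: zero-delay launch, zero-delay behavioural gate on the capture clock.
untimed = hold_slack_ps(0, 0, 0, capture_clk_delay=0, hold_time=0)

# Mixed: zero-delay (RTL) launch, but the capture clock passes through a gating cell
# with an annotated 200 ps delay, and the capture flop has a 50 ps hold requirement.
mixed = hold_slack_ps(0, 0, 0, capture_clk_delay=200, hold_time=50)

print(untimed, mixed)
```

In the mixed case the data changes before the delayed capture clock edge arrives, so the flop can sample the new value one cycle early.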

FrancescoConti commented 5 years ago

Hi @vikramjain236, I am a bit perplexed by the swapping in and out of behavioral models (maybe I misunderstood what you wrote). Just to be 100% explicit, the "correct" flow should be:

  1. Replace the "generic tech" cells with those of your own technology, typically the clock gating cells and SRAM models. Your tech library will have its own clock gating cells (you have to choose one and wrap it so that it replaces the functionality of cluster_clock_gating). The SRAM behavioral model generated by your memory compiler will have to replace all generic_mem instances. If your memory is different from what we assume in the generic one, you will have to use a different SRAM, combine multiple memory cuts, or otherwise add logic so that the interface with the rest of the system is maintained.
  2. Perform your simulation at RTL level with the correct tech libraries (gating cells and SRAM cuts). It must work at this level of abstraction before you perform synthesis. You will have to add an IP or otherwise link the behavioral models of your tech libraries (much like you do for post-synthesis simulation).
  3. Synthesize the soc_domain. Make sure that the correct modules are linked, especially the SRAMs.
  4. Perform post-synthesis simulation. Here, as we know, things get complicated, especially concerning timing (which is not necessarily fully fixed at the post-synthesis stage). In both the timed and the untimed case, check that the default cell delays are not used (in my experience they are consistently wrong): you should have either delay annotation from an SDF file (timed) or no delay at all (untimed).

On your specific error: badacce5 is generated by the SoC interconnect when you try to access an address that is not mapped. However, I think this is more a symptom than the origin of your problem. I will try to have a look at the VCD; in the meantime, let us know whether you followed the procedure above.
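
If it helps narrow things down, a small sketch like the one below can list which fetch addresses came back with the bad-access marker, assuming you can export (instr_addr, instr_rdata) pairs from the waveform. The pairs shown are hypothetical placeholder values and unmapped_fetches is just an illustrative helper name.

```python
BAD_ACCESS = 0xBADACCE5  # marker value reported on instr_rdata_i in this thread

def unmapped_fetches(fetches):
    """Return, in order, the addresses whose read data equals the bad-access marker."""
    return [addr for addr, rdata in fetches if rdata == BAD_ACCESS]

# Hypothetical (instr_addr, instr_rdata) pairs copied by hand from a waveform.
fetches = [
    (0x1C008080, 0x00000013),
    (0xFFFF0000, 0xBADACCE5),
    (0x1C008084, 0x00000013),
]

for addr in unmapped_fetches(fetches):
    print(f"fetch returned bad access at 0x{addr:08x}")
```

The interesting question is then why the core issued those addresses in the first place, which points back at the interface signals upstream of the fetch.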