verilator / verilator

Verilator open-source SystemVerilog simulator and lint system
https://verilator.org
GNU Lesser General Public License v3.0
2.54k stars 607 forks source link

Verilator performance issue #4858

Closed davidharrishmc closed 4 weeks ago

davidharrishmc commented 9 months ago

Verilator is running my RISC-V processor testbench almost two orders of magnitude more slowly than commercial simulators. I've been told that Verilator is as fast or faster than commercial simulators on other Verilog designs. I presume there is something about my design that is hard for Verilator. How do I go about troubleshooting this performance issue?

The code of interest is https://github.com/openhwgroup/cvw

Unfortunately, it relies on the RISC-V riscof to build the tests cases, which involves an unreasonable amount of tool installation for most people to care to reproduce. I can share the compiled test cases separately if it is helpful.

Here is a transcript. Verilating takes about a minute, but I might not have to do that every time. Vtestbench takes 9 minutes to run. In contrast, both commercial tools I've tried take less than 20 seconds to build and run.

harris@chips:~/cvw/sim$ time verilator -GTEST="\"arch64i\"" --timescale "1ns/1ns" --timing --binary --top-module testbench "-I../config/shared" "-I../config/rv64gc" ../src/cvw.sv ../testbench/testbench.sv ../testbench/common/*.sv ../src/*/*.sv ../src/*/*/*.sv --relative-includes
make: Leaving directory '/home/harris/cvw/sim/obj_dir'

real    0m58.226s
user    0m55.120s
sys     0m3.018s

harris@chips:~/cvw/sim$ time ./obj_dir/Vtestbench
TEST is arch64i
rv64i_m/I/src/add-01.S succeeded.  Brilliant!!!
rv64i_m/I/src/addi-01.S succeeded.  Brilliant!!!
...
rv64i_m/I/src/xori-01.S succeeded.  Brilliant!!!
SUCCESS! All tests ran without failures.
%Error: ../testbench/testbench.sv:332: Verilog $stop
Aborting...
Aborted (core dumped)

real    9m1.833s
user    9m1.705s
sys     0m0.071s
wsnyder commented 9 months ago

Mileage always varies, but that much difference means there's something likely broken in either coding style (e.g. #delay on every assignment), or Verilator missing an optimization, or most likely both.

Please use the benchmark tools described in https://veripool.org/guide/latest/simulating.html#code-profiling to see what's slow and the related Verilog code.

udif commented 9 months ago

Adding my 2 cents, my experience was ~4x slowdown against the current commercial simulator used here, that did this with x10-x100 faster compilation (depending on verilator/gcc/llvm flags) and did not sacrifice 4-state simulation to get that speed . I still wonder what's the reason. Is there any tutorial or example on profiling Verilator?

wsnyder commented 9 months ago

If the code is standard synthesizable code without #delays or other ugly items, we would expect equal or better performance (say 0.5x to 2x), so 4x is concerning and worth benchmarking.

There may be user tutorials on profiling (if someone reads this and has one please post a pointer). The documentation noted above describes the basics, we'd certainly welcome pull requests to make this documentation clearer and/or more a "how to": https://veripool.org/guide/latest/simulating.html#code-profiling

wsnyder commented 9 months ago

Note above assumes comparing single threaded execution, there's a lot more variables once multithreading is involved.

udif commented 9 months ago

I forgot to mention the commercial simulator used a single core license and for Verilator I used anything I could (multi threaded).

wsnyder commented 9 months ago

Using threads on a small model may cause large slowdowns. The Gantt report should give guidance.

davidharrishmc commented 9 months ago

This is the profiler output. It’s running single-threaded with most of the time in funct nba. I’m not sure how to tie that to specific parts of Verilog code. The commercial simulator is also running single-threaded.

The flip-flops have #1 delays to help visualize the output. The testbench also contains some # delays to generate the input clock. I’ve experimented with removing all the #delays from the non-testbench code and the sim time drops from 9:01 to 7:32. Also, the profiler jumps up to 99.96% of the time in funct nba. There is still something very slow. Any hints on how to isolate the troublesome Verilog?

***@***.***:~/cvw/sim$ verilator_gantt profile_exec.dat
Verilator Gantt report

Argument settings:
  +verilator+prof+exec+start+1
  +verilator+prof+exec+window+2

Summary:
  Total elapsed time = 78750 rdtsc ticks
  Parallelized code  = 0.00% of elapsed time
  Total threads      = 1
  Total CPUs used    = 1
  Total mtasks       = 0
  Total yields       = 0

CPU info:
   Id | Time spent executing MTask | Socket | Core | Model
      | % of elapsed ticks / ticks |        |      |
  ====|============================|========|======|======

Section profile:
 Total    | Self    | Total    | Relative   | Section
 time     | time    | entries  | entries    |  name  
==========|=========|==========|============|========
  100.00% |   0.46% |        1 |       1.00 | *TOTAL*
   99.54% |   0.23% |        2 |       2.00 |   eval
   99.31% |   0.57% |        2 |       1.00 |     loop nba
   68.80% |  68.80% |        2 |       1.00 |       func nba
   29.94% |  17.37% |        4 |       2.00 |       loop act
    7.54% |   7.54% |        2 |       0.50 |         func act
    5.03% |   5.03% |        6 |       1.50 |         trig act

Writing profile_exec.vcd

On Jan 24, 2024, at 9:40 AM, Wilson Snyder @.***> wrote:

Note above assumes comparing single threaded execution, there's a lot more variables once multithreading is involved.

— Reply to this email directly, view it on GitHub https://github.com/verilator/verilator/issues/4858#issuecomment-1908620585, or unsubscribe https://github.com/notifications/unsubscribe-auth/AR4AA34ATQA446DV5JOGEBTYQFBR7AVCNFSM6AAAAABCI52L5KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSMBYGYZDANJYGU. You are receiving this because you authored the thread.

wsnyder commented 9 months ago

Please repost with using --prof-cfuncs

davidharrishmc commented 9 months ago

Took a while, but this is working now. I had been ending simulation with $stop rather than $finish (to keep the GUI open at the end of a Questa sim), and there was no gmon.out generated without $finish. Also, I had to figure out that the command is "prof ./obj_dir/Vtestbench gmon.out > gmon.log." and then "verilator_profcfunc gmon.log > gmon.log2". Probably obvious to developers, less obvious to hardware person. The might be helpful hints in the profiling instructions.

The output looks fishy, with 503% of the time in Vtestbench less -415% rounding error. The profiling seems to indicate that most of the time is in testbench at line 271, listed below. This is part of an always block that calls a task to manipulate an associative array. I'd like to try disabling that testbench code to see if it speeds things up. It's woven into a number of places, so the experiment will take a bit of work.

Overall summary by type:
  % time  type
   11.36  C++
   28.64  Common code under Vtestbench
    0.15  VLib
  475.10  Verilog Blocks under Vtestbench
  -415.25  Unaccounted for/rounding error

Overall summary by design:
  % time  design
   11.36  C++
    0.15  VLib
  503.74  Vtestbench
  -415.25  Unaccounted for/rounding error

...
  %   cumulative   self              
 time   seconds   seconds      calls   design      type      filename and line number
  0.48 1304066656.00 1304066656.00          1   -           C++       VlCoroutineHandle::~VlCoroutineHandle() <cycle 1> [32]
  0.99 1835250306.00 531183650.00          1   -           C++       VlCoroutineHandle::resume() <cycle 1> [15]
  3.23 2232031526.70 396781220.70  476946756   Vtestbench  VBlock    fctrl:253
...
452.65 3898073681.10 451513.13     451139   Vtestbench  VBlock    testbench:271

270:  always @(posedge clk) begin
271:    if(SelectTest) begin
      if (riscofTest) memfilename = {pathname, tests[test], "/ref/ref.elf.memfile"};
      else            memfilename = {pathname, tests[test], ".elf.memfile"};
      if (riscofTest) begin
        ProgramAddrMapFile = {pathname, tests[test], "/ref/ref.elf.objdump.addr"};
        ProgramLabelMapFile = {pathname, tests[test], "/ref/ref.elf.objdump.lab"};
      end else begin
        ProgramAddrMapFile = {pathname, tests[test], ".elf.objdump.addr"};
        ProgramLabelMapFile = {pathname, tests[test], ".elf.objdump.lab"};
      end
      // declare memory labels that interest us, the updateProgramAddrLabelArray task will find 
      // the addr of each label and fill the array. To expand, add more elements to this array 
      // and initialize them to zero (also initilaize them to zero at the start of the next test)
      updateProgramAddrLabelArray(ProgramAddrMapFile, ProgramLabelMapFile, ProgramAddrLabelArray);
    end
wsnyder commented 9 months ago

Thanks for the docs suggestions, I'll fold those in tonight. The rounding is likely #3122 which was reported to the gprof team; if you want to rebuild gprof that may fix it. There appears to be a gcc bug with profiling coroutines, it's missing functions. I tried clang instead and it did much better, so try that (./configure CXX=clang++) From the data you sent it does seem like you still have a huge number of #delays. Try to remove them or ideally use --no-timing to strip them all out.

solomatnikov commented 9 months ago

@davidharrishmc did you try something like this?

`ifdef VERILATOR
  `define CLK2Q
`else
  `define CLK2Q #1
`endif

This was used to speed up simulation for other simulators.

wsnyder commented 4 weeks ago

I think this was answered earlier and resulting sub-issues resolved, if there's still something left for us to do please feel free to reopen.

davidharrishmc commented 4 weeks ago

Verilator is running as part of our regular regression and performance is satisfactory. Thank you!