Closed davidharrishmc closed 4 weeks ago
Mileage always varies, but that much difference means something is likely broken in either the coding style (e.g. a #delay on every assignment), or Verilator is missing an optimization, or, most likely, both.
Please use the benchmark tools described in https://veripool.org/guide/latest/simulating.html#code-profiling to see what's slow and the related Verilog code.
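For readers finding this later, the execution-profiling flow from that documentation looks roughly like the following (a sketch; the file names `testbench.v` and `sim_main.cpp` are illustrative, and the flags are as documented in the Verilator guide):

```shell
# Verilate with execution profiling instrumentation enabled
verilator --prof-exec --cc --exe --build testbench.v sim_main.cpp

# Run the model; runtime plusargs control when profiling starts
# (here, after the first eval) and write profile_exec.dat
./obj_dir/Vtestbench +verilator+prof+exec+start+1

# Convert the trace into a summary report and a VCD timeline
verilator_gantt profile_exec.dat
```

The Gantt report shows where eval time goes (NBA region, active region, per-mtask when threaded), which is the first step in tying slowness back to Verilog code.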
Adding my 2 cents: my experience was a ~4x slowdown against the commercial simulator used here, which achieved that speed with 10x-100x faster compilation (depending on verilator/gcc/llvm flags) and without sacrificing 4-state simulation. I still wonder what the reason is. Is there any tutorial or example on profiling Verilator?
If the code is standard synthesizable code without #delays or other ugly items, we would expect equal or better performance (say 0.5x to 2x), so 4x is concerning and worth benchmarking.
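As a concrete illustration of the "ugly items" mentioned above (generic example, not code from the design in question), an intra-assignment #delay on a flop forces timing-event bookkeeping on every clock edge, whereas the plain synthesizable form compiles to straight-line code:

```verilog
// Slow under Verilator: a #delay on every nonblocking assignment
always @(posedge clk)
  q <= #1 d;

// Fast: plain synthesizable flop, no timing events
always @(posedge clk)
  q <= d;
```

The two are equivalent for synthesis; the #1 only shifts when waveform viewers show the output changing.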
There may be user tutorials on profiling (if someone reads this and has one, please post a pointer). The documentation noted above describes the basics; we'd certainly welcome pull requests to make this documentation clearer and/or more of a "how to": https://veripool.org/guide/latest/simulating.html#code-profiling
Note the above assumes comparing single-threaded execution; there are a lot more variables once multithreading is involved.
I forgot to mention that the commercial simulator used a single-core license, while for Verilator I used everything I could (multi-threaded).
Using threads on a small model may cause large slowdowns. The Gantt report should give guidance.
This is the profiler output. It's running single-threaded with most of the time in func nba. I'm not sure how to tie that to specific parts of the Verilog code. The commercial simulator is also running single-threaded.
The flip-flops have #1 delays to help visualize the output. The testbench also contains some #delays to generate the input clock. I've experimented with removing all the #delays from the non-testbench code, and the sim time drops from 9:01 to 7:32. Also, the profiler jumps up to 99.96% of the time in func nba. Something is still very slow. Any hints on how to isolate the troublesome Verilog?
***@***.***:~/cvw/sim$ verilator_gantt profile_exec.dat
Verilator Gantt report
Argument settings:
+verilator+prof+exec+start+1
+verilator+prof+exec+window+2
Summary:
Total elapsed time = 78750 rdtsc ticks
Parallelized code = 0.00% of elapsed time
Total threads = 1
Total CPUs used = 1
Total mtasks = 0
Total yields = 0
CPU info:
Id | Time spent executing MTask | Socket | Core | Model
| % of elapsed ticks / ticks | | |
====|============================|========|======|======
Section profile:
Total | Self | Total | Relative | Section
time | time | entries | entries | name
==========|=========|==========|============|========
100.00% | 0.46% | 1 | 1.00 | *TOTAL*
99.54% | 0.23% | 2 | 2.00 | eval
99.31% | 0.57% | 2 | 1.00 | loop nba
68.80% | 68.80% | 2 | 1.00 | func nba
29.94% | 17.37% | 4 | 2.00 | loop act
7.54% | 7.54% | 2 | 0.50 | func act
5.03% | 5.03% | 6 | 1.50 | trig act
Writing profile_exec.vcd
Please repost with using --prof-cfuncs
Took a while, but this is working now. I had been ending simulation with $stop rather than $finish (to keep the GUI open at the end of a Questa sim), and no gmon.out was generated without $finish. Also, I had to figure out that the command is "gprof ./obj_dir/Vtestbench gmon.out > gmon.log" followed by "verilator_profcfunc gmon.log > gmon.log2". Probably obvious to developers, less obvious to a hardware person. These might be helpful hints for the profiling instructions.
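Putting the steps above together, the whole --prof-cfuncs flow might look like this (a sketch; `testbench.v` and `sim_main.cpp` are illustrative names for the design and C++ harness):

```shell
# Verilate with per-C-function profiling instrumentation (adds -pg)
verilator --prof-cfuncs --cc --exe --build testbench.v sim_main.cpp

# Run to completion; the testbench must reach $finish (not $stop),
# or gprof's gmon.out is never written
./obj_dir/Vtestbench

# Standard gprof pass, then map mangled C++ names back to Verilog
# module/line numbers
gprof ./obj_dir/Vtestbench gmon.out > gmon.log
verilator_profcfunc gmon.log > gmon.log2
```

The final gmon.log2 attributes time to design blocks (e.g. "Vtestbench VBlock testbench:271" below), which is what makes the hot Verilog findable.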
The output looks fishy, with 503% of the time in Vtestbench less a -415% rounding error. The profiling seems to indicate that most of the time is spent in the testbench at line 271, listed below. This is part of an always block that calls a task to manipulate an associative array. I'd like to try disabling that testbench code to see if it speeds things up. It's woven into a number of places, so the experiment will take a bit of work.
Overall summary by type:
% time type
11.36 C++
28.64 Common code under Vtestbench
0.15 VLib
475.10 Verilog Blocks under Vtestbench
-415.25 Unaccounted for/rounding error
Overall summary by design:
% time design
11.36 C++
0.15 VLib
503.74 Vtestbench
-415.25 Unaccounted for/rounding error
...
% cumulative self
time seconds seconds calls design type filename and line number
0.48 1304066656.00 1304066656.00 1 - C++ VlCoroutineHandle::~VlCoroutineHandle() <cycle 1> [32]
0.99 1835250306.00 531183650.00 1 - C++ VlCoroutineHandle::resume() <cycle 1> [15]
3.23 2232031526.70 396781220.70 476946756 Vtestbench VBlock fctrl:253
...
452.65 3898073681.10 451513.13 451139 Vtestbench VBlock testbench:271
270: always @(posedge clk) begin
271: if(SelectTest) begin
if (riscofTest) memfilename = {pathname, tests[test], "/ref/ref.elf.memfile"};
else memfilename = {pathname, tests[test], ".elf.memfile"};
if (riscofTest) begin
ProgramAddrMapFile = {pathname, tests[test], "/ref/ref.elf.objdump.addr"};
ProgramLabelMapFile = {pathname, tests[test], "/ref/ref.elf.objdump.lab"};
end else begin
ProgramAddrMapFile = {pathname, tests[test], ".elf.objdump.addr"};
ProgramLabelMapFile = {pathname, tests[test], ".elf.objdump.lab"};
end
// declare memory labels that interest us, the updateProgramAddrLabelArray task will find
// the addr of each label and fill the array. To expand, add more elements to this array
// and initialize them to zero (also initialize them to zero at the start of the next test)
updateProgramAddrLabelArray(ProgramAddrMapFile, ProgramLabelMapFile, ProgramAddrLabelArray);
end
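One way to run that experiment without unweaving the code entirely might be to gate the heavy call on a rising edge of SelectTest, so the task runs once per test rather than on every clock while SelectTest is high (hypothetical sketch, not code from the repository; `SelectTestPrev` is an invented name):

```verilog
logic SelectTestPrev = 0;
always @(posedge clk) begin
  SelectTestPrev <= SelectTest;
  if (SelectTest & ~SelectTestPrev) begin  // fire on rising edge only
    // ... filename setup as in the original block ...
    updateProgramAddrLabelArray(ProgramAddrMapFile, ProgramLabelMapFile, ProgramAddrLabelArray);
  end
end
```

If the profile time drops sharply, the per-cycle task call (and its associative-array work) is the culprit rather than the design itself.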
Thanks for the docs suggestions, I'll fold those in tonight.
The rounding is likely #3122 which was reported to the gprof team; if you want to rebuild gprof that may fix it.
There appears to be a gcc bug with profiling coroutines; it's missing functions. I tried clang instead and it did much better, so try that (./configure CXX=clang++).
From the data you sent it does seem like you still have a huge number of #delays. Try to remove them, or ideally use --no-timing to strip them all out.
@davidharrishmc did you try something like this?
`ifdef VERILATOR
`define CLK2Q
`else
`define CLK2Q #1
`endif
This was used to speed up simulation for other simulators.
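For reference, the macro above is then used at each nonblocking assignment, so the delay exists only for simulators that want it:

```verilog
// Under Verilator `CLK2Q expands to nothing; elsewhere it expands to #1,
// preserving the clock-to-Q delay for waveform readability
always @(posedge clk)
  q <= `CLK2Q d;
```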
I think this was answered earlier and the resulting sub-issues resolved; if there's still something left for us to do, please feel free to reopen.
Verilator is running as part of our regular regression and performance is satisfactory. Thank you!
Verilator is running my RISC-V processor testbench almost two orders of magnitude more slowly than commercial simulators. I've been told that Verilator is as fast or faster than commercial simulators on other Verilog designs. I presume there is something about my design that is hard for Verilator. How do I go about troubleshooting this performance issue?
The code of interest is https://github.com/openhwgroup/cvw
Unfortunately, it relies on the RISC-V riscof framework to build the test cases, which involves more tool installation than most people will care to do to reproduce it. I can share the compiled test cases separately if that is helpful.
Here is a transcript. Verilating takes about a minute, but I might not have to do that every time. Vtestbench takes 9 minutes to run. In contrast, both commercial tools I've tried take less than 20 seconds to build and run.