Question: Understanding verilator performance metrics - Githubissues

verilator / verilator

Verilator open-source SystemVerilog simulator and lint system

https://verilator.org

GNU Lesser General Public License v3.0

2.43k stars 582 forks source link

Question: Understanding verilator performance metrics #1827

Closed veripoolbot closed 10 years ago

veripoolbot commented 10 years ago

Author Name: Jonathon Donaldson Original Redmine Message: 1404 from https://www.veripool.org

I have a simple video driver design written in SystemVerilog that I'm compiling with verilator. Everything is working very well (as usual) but the performance is significantly slower than I expected. So now I'm trying to figure out if it's my expectations that were too high or if I'm doing something silly that is causing the design to run very slow. My testbench is written in SystemC and it's the first time that I've ever used the language so it's possible that I'm doing something I shouldn't be that is causing the design to run slowly.

I am calling verilator as follows:

  -Wno-lint \
  --profile-cfuncs \
  -sc \
  -CFLAGS "-pg" \
  -LDFLAGS "-pg" \
  ...etc...

I immediately noticed a couple things that seem strange to me (but maybe they're not?) in the output of gprof and verilator_profcfunc. In the gprof output I see:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 48.52     13.91    13.91                             VL_POW_III(int, int, int, unsigned int, unsigned int)
 23.96     20.78     6.87                             _mcount_private
  5.72     22.42     1.64                             VL_EXTENDSIGN_Q(int, unsigned long long)
  3.80     23.51     1.09                             VL_EXTENDS_QQ(int, int, unsigned long long)
  2.27     24.16     0.65                             VL_POWSS_III(int, int, int, unsigned int, unsigned int, bool, bool)
  2.23     24.80     0.64                             VL_LTS_III(int, int, int, unsigned int, unsigned int)
...blah blah blah...

For the gprof output I understand that _mcount_private is just part of the profiling code that was inserted and that there's nothing I can do about it. But what is VL_POW_III(int, int, int, unsigned int, unsigned int) and why is it taking 50% of the execution time? I assume it's something to do with calculating the power of something? I don't perform any mathematical 'N-to-power-of-M' operations in my SystemVerilog design. The only thing I'm doing is a bunch of adding and bit-shifting.

And in the verilator_profcfunc output (which I call vprof) I see:

Overall summary by type:
  % time  type
     1.08  C++
     0.87  Common code under Vvideo
     5.82  Verilog Blocks under Vvideo
    92.23  Unaccounted for/rounding error
...blah blah blah...

Is there any way to figure out what the Unaccounted for/rounding error is? Having a metric of 92% unknown is...ummm...not very useful. Haha. Or maybe I'm doing something wrong in the profiling generation? The rest of the vprof report simply says that all "known" parts of my verilog design are taking 1.5% or less of the execution time - which also isn't very useful because really those percentages are all just being shrunk way down due to this 92% of "unknown" code.

I did find a post saying that the "Unaccounted for/rounding error" could be due to very wide ports in my SystemC testbench. But I don't have anything like that. All of my ports are either bool or uint32_t. In fact, the widest port in my actual verilog code is only 16-bit with the rest being either 8-bit or 1-bit. For some reason verilator likes to make any vectored ports uint32_t regardless of their size. Not sure why that is but I doubt it has anything to do with the performance problems I'm having.

Any help would be much appreciated! THANKS! :)

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-08-29T01:31:03Z

This is saying 50% of your time is in the power operator, read "verilated.h" to see what these functions do. That's quite suspicious unless your code really is only doing power all over the place. Note "constant" powers, e.g. 2**n, should not matter as they are strength reduced. A more indepth gprof trace will tell you where it's being called from, then look at the code to see if it is right and where the code came from based on the comments.

This is also suspicious because the VL_EXTENDS etc routines should be inlined. I suspect you aren't using -O2 or at least -O with your code (maybe just to profile you removed -O2)?

"Unaccounted for" is probably counting the VL_* routines, that is probably a bug. A fix would be appreciated. (I would suggest also binning all _mcount_private into "profiler overhead".)

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-08-29T02:12:00Z

If you pull from git I improved the verilator_profcfunc time attribution. It now uses "VLib" for common libraries.

0.00 0.00 0.00 106 VLib VL_RAND_RESET_I(int)

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-29T03:43:01Z

Wilson Snyder wrote:

This is saying 50% of your time is in the power operator, read "verilated.h" to see what these functions do. That's quite suspicious unless your code really is only doing power all over the place. Note "constant" powers, e.g. 2n, should not matter as they are strength reduced. The only place I use 2n in the SV design is in a 'function' for determining the width of a port vector at elaboration time (when calculating ceiling of log-base-2 of a value to get number of bits required). Which I think is common to about 99.9% of HDL designs on planet earth. Haha.

I seriously hope that verilator isn't calculating these in real time over and over?? :( I would think you would perform the calculation at compilation time, replace that calculation with a numeric literal, and then never execute it again. Right?

A more indepth gprof trace will tell you where it's being called from, then look at the code to see if it is right and where the code came from based on the comments. What switches should I specify in gprof to get a more detailed trace? THere is a -l switch but it's only supported in much older version of gcc which I'm not using.

There are only two other places in the gprof output that VL_POW is mentioned and they don't tell you anything useful. I have attached the gprof output for you to see. Also, VL_POW isn't mentioned anywhere in the verilator vprof output at all (also attached).

This is also suspicious because the VL_EXTENDS etc routines should be inlined. I suspect you aren't using -O2 or at least -O with your code (maybe just to profile you removed -O2)? Correct, I don't have any optimizations on in order to be able to get better visibility into the design for the profiling. If you want me to turn a particular optimization on and run again I would be more than happy to. Just let me know.

"Unaccounted for" is probably counting the VL_* routines, that is probably a bug. A fix would be appreciated. (I would suggest also binning all _mcount_private into "profiler overhead".) Unfortunately, I wouldn't have a clue how to do that at the moment. But I can look at the code and see if I can figure something out.

If you pull from git I improved the verilator_profcfunc time attribution. It now uses "VLib" for common libraries.

Coolness! I'll give that a shot and report the results.

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-29T04:44:18Z

Ok, so I turned on -O3 and am getting entirely different profiling results. Now VL_POW_III isn't even mentioned anywhere in any of the reports. So somehow -O3 optimized VL_POW_III away completely? Don't understand that. But here are the new results:

  %   cumulative   self              
 time   seconds   seconds    calls   type      filename and line number
 42.52      6.89     6.89 11172145   VBlock    oscope:126
 40.48     13.45     6.56 11172145   VBlock    oscope:122

The above is even more confusing to me. Here is what exists on those lines in the oscope module (both lines are performing the exact same operation just on two different signals).

wire [clogb2(C_OSCOPE_H_PIX)-1:0] sig1 = {~sig1_winc[clogb2(C_OSCOPE_H_PIX)-1],sig1_winc[clogb2(C_OSCOPE_H_PIX)-2:0]};
wire [clogb2(C_OSCOPE_H_PIX)-1:0] sig2 = {~sig2_winc[clogb2(C_OSCOPE_H_PIX)-1],sig2_winc[clogb2(C_OSCOPE_H_PIX)-2:0]};

The 'clogb2' function is the function I mentioned in my previous post for determining the number of bits required for a vector. It is a constant function that I use in all of my synthesizable designs. The value of clogb2(C_OSCOPE_H_PIX) is 7 since the value of parameter C_OSCOPE_H_PIX is 128. So the above lines should reduce to the following with numeric literals at compilation time:

wire [7-1:0] sig1 = {~sig1_winc[7-1],sig1_winc[7-2:0]};
wire [7-1:0] sig2 = {~sig2_winc[7-1],sig2_winc[7-2:0]};

Which further reduces to:

wire [6:0] sig1 = {~sig1_winc[6],sig1_winc[5:0]};
wire [6:0] sig2 = {~sig2_winc[6],sig2_winc[5:0]};

So gprof/vprof is telling me that inverting the most significant bit of two vectors is taking 82% of the execution time? That can't be right.......can it?

I have significantly more complex bit twiddling that I perform in other parts of the design. It makes no sense to me that inverting a single bit is eating up all my CPU time. Forgive my ignorance if there is some obvious reason for this that I'm not considering...

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-29T05:41:27Z

Out of curiosity I replaced all occurrences of "clogb2(C_OSCOPE_H_PIX)" with the numeric literal '7' and the design is running at least 20x faster! :-o Why is this?? This clearly tells me that the clogb2 function is indeed getting calculated over and over and over again in real time! Yikes!

Can't verilator take the value of any constant function and simply replace it with a numeric literal at compilation time just like any other synthesis/simulator tool does? If there is simply some switch I need to turn on then just let me know (I didn't see anything in the manual). If verilator simply doesn't do this yet then implementing this feature would clearly increase verilator's performance for most designs by HUGE amounts!!!

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-08-29T11:22:47Z

As mentioned above Verilator does constant propagation. However it is suspicious that even GCC could not figure out constant propagation, so chances are there's something about your clogb2 function that makes it very difficult to figure out and both Verilator and GCC are falling back to runtime. If you file a bug showing the VL_POW poping out I'll see what I can do.

BTW the proper way to write this code is:

localparam LOGWIDTH = $clog2(C_OSCOPE_H_PIX); wire [LOGWIDTH-1:0] sig1 = {~sig1_winc[LOGWIDTH-1],sig1_winc[LOGWIDTH-2:0]};

This will double fix it, as $clog2 will surely constify, and also having a localparam by the standard requires it to be reduced to a constant (without this Verilator is actually breaking the rules if it did constant propagate, as I suspect you didn't declare clogb2 as a pure function). Either/both of these will likely make your code faster on other simulators too.

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-29T17:46:56Z

Okey-doke. I will try switching to specifying a localparam first and let you know what happens. In the meantime, here is my clogb2 function which is defined in a 'package' and 'import'ed at the top of the module:

function automatic int clogb2 (input int value);
begin
  int i;

  if (value < 2)
     clogb2 = 1;
  else
     for (i = 0; 2**i < value; i++) 
       clogb2 = i + 1;

end
endfunction : clogb2

If you see something wrong with the above function I'd be happy to fix it. I've been using the above code for nearly 10 years in both verilog and systemverilog and I've never had any error/warnings from any synthesis/simulation tools. But maybe they were all just being nice to me? Haha. :)

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-08-29T18:59:36Z

I've been using the above code for nearly 10 years in both verilog and systemverilog and I've never had any error/warnings from any synthesis/simulation tools. But maybe they were all just being nice to me?

They were. It's just a performance bug, most tools don't warn on performance issues so how would you know?

I benchmarked VCS. It doesn't constant propagate either. Changing clogb2 to $clog2 makes the below about 40 times faster.

  for (times = 0; times < 100*1000*1000; times=times+1) begin
      i = clogb2(0);
      total += i;
      i = clogb2(1);
      total += i;
      i = clogb2(2);
      total += i;
   end

Using a localparam makes it faster still.

Again I agree Verilator should do better, but it's not unusual in missing this optimization.

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-29T19:27:56Z

Very interesting stuff!! I have definitely learned something very valuable today.

So just to be clear, will I get the maximum speed up benefit by simply creating the localparam? Or is it the combination of using $clog2 and assigning to a localparam that gives the most benefit?

Just FYI, I never use $clog2 for a number of reasons, including the fact that it wasn't supported until verilog-2005 so is not backwards compatible with verilog-2001, and because some synthesis/simulation tools actually implement the function completely incorrectly (e.g. calculating the natural log instead of the log base 2). So I've always found it better to just use my own. So I'm hoping that the answer is that simply assigning the value of my clogb2() function to a localparam will provide the maximum benefit so I don't have to start using $clog2.

Thanks again!! I'm looking forward to much faster simulations in the future! :-D

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-08-30T22:30:05Z

Will using my clogb2 function have the same detrimental performance effect when it's used in a port listing?

For example:

module my_module #(
  parameter C_SCRN_SIZE_X = 640,
  parameter C_SCRN_SIZE_Y = 480
)(
  input [clogb2(C_SCRN_SIZE_X)-1:0] xcoord,
  input [clogb2(C_SCRN_SIZE_Y)-1:0] ycoord
);
// blah blah
endmodule

Or does it only have a negative impact when used in a statement which needs to be constantly evaluated every clock cycle?

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-09-01T01:17:28Z

Using clogb2 as a param/localparam is sufficient, as the standard says it must be converted to a constant at elaboration time. Using it to determine a data type width, as in "input [cblogb2(x):0]" is also fine, as they are also sized at elaboration time per the standard.

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-09-01T05:07:31Z

This is such awesome and useful information!! Thanks so much for your help!!

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-09-01T07:03:12Z

Hmm...another related question that just came to mind. One systemverilog feature that I love to use is the $bits() function for sizing vectors. For example:

always_ff @(posedge clk)
  if (something_happens)
     myvector <= {$bits(myvector){1'b0}};
  else
     myvector <= {$bits(myvector){1'b1}};

Will this also likely result in a performance decrease over creating another localparam that holds the values of $bits()?

localparam VEC_WIDTH = $bits(myvector);
always_ff @(posedge clk)
  if (something_happens)
     myvector <= {VEC_WIDTH{1'b0}};
  else
     myvector <= {VEC_WIDTH{1'b1}};

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Wilson Snyder (@wsnyder) Original Date: 2014-09-01T11:29:45Z

$bits in your example needs to be elaboration time determined in all simulators because it determines the size of an expression. If you said "x = $bits(y)" then it is likely done in all simulators at compile time but it doesn't have to be.

$bits and most all $ functions (except e.g. $scanf which needs input) are constant propagated in verilator. Your function isn't because it contains a while loop that is hard to figure out as it depends on the values inside the loop versus having a constant iterator.

veripoolbot commented 10 years ago

Original Redmine Comment Author Name: Jonathon Donaldson Original Date: 2014-09-01T19:36:04Z

Excellent! Thanks!