stellar2012wxg / gperftools

Automatically exported from code.google.com/p/gperftools
BSD 3-Clause "New" or "Revised" License

Inconsistent output from CPU Profiler (main() has 20-30%) #384

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. It's really hard to reproduce with a small test case, and I can't share the
real application.

What is the expected output? What do you see instead?
I would expect 100%, or something close to it, for the main() function, but I see
20-30% instead.
Also, there are functions with ~140%.

What version of the product are you using? On what operating system?

Google perftools 1.6
OS:   RHEL 4.8 and RHEL 5.5

Please provide any additional information below.
Checked with different pprof output modes (text, callgrind, pdf); all gave the
same results.

Original issue reported on code.google.com by pavl...@gmail.com on 9 Dec 2011 at 5:00

GoogleCodeExporter commented 9 years ago
I'm afraid this will be impossible to debug without a testcase to look at.  I 
don't know offhand what might be causing this, though perhaps other people do.

I'm closing this as CannotReproduce, but feel free to reopen if you can come up
with a test case that demonstrates the problem.

Original comment by csilv...@gmail.com on 9 Dec 2011 at 7:12

GoogleCodeExporter commented 9 years ago
I understand that it's impossible to reproduce as-is. I will try to create a test
case; it looks like I have some clues which might help me understand better how
the profiler works internally and build one.

One thing I realized is that it's connected with the recursive functions I have in
my application. It looks like recursive calls inflate the number of samples
artificially, which eventually produces such results. Can you please confirm that
my speculation below is correct?

If we have recursive calls, then a single stack sample will contain more than one
appearance of the same function. How will the profiler handle this case?
For example, in call stack sample #3 below, will it count 3 for function c(), or
just 1?

Is the table below correct for this scenario?

-----------------------------------------------------

                      c()
           c()        c()
c()        c()        c()
b()        b()        b()
a()        a()        a()
main()     main()     main()
------     -------    -------   
sample 1   sample 2   sample 3    ...

function  | # of appearance in the stack
---------------------------------------
main      |   3
a         |   3
b         |   3
c         |   6
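
For illustration only, here is a minimal sketch of the two counting policies in question, applied to the three samples from the table above. The sample representation is made up for the example and is not the profiler's actual data structure:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

int main() {
    // The three stack samples from the table above, listed leaf-first.
    std::vector<std::vector<std::string>> samples = {
        {"c", "b", "a", "main"},            // sample 1
        {"c", "c", "b", "a", "main"},       // sample 2
        {"c", "c", "c", "b", "a", "main"},  // sample 3
    };

    std::map<std::string, int> per_frame;   // count every frame occurrence
    std::map<std::string, int> per_sample;  // count each function at most once per sample

    for (const auto& stack : samples) {
        std::set<std::string> seen;
        for (const auto& fn : stack) {
            ++per_frame[fn];                               // c ends up with 6
            if (seen.insert(fn).second) ++per_sample[fn];  // c ends up with 3
        }
    }

    for (const auto& [fn, n] : per_frame)
        std::cout << fn << ": per-frame=" << n
                  << " per-sample=" << per_sample[fn] << "\n";
}
```

Under per-frame counting, c is credited 6 times against only 3 samples, which is how a function's reported share can exceed 100%; per-sample counting caps every function at the total number of samples.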

Thanks.

Original comment by pavl...@gmail.com on 9 Dec 2011 at 9:51

GoogleCodeExporter commented 9 years ago
Yes, it's likely recursive calls are involved, now that you mention it.

I'm not exactly sure how the profiler handles recursive calls actually, but if 
I were to give my best guess, it would be the same as the scenario you describe.

Original comment by csilv...@gmail.com on 9 Dec 2011 at 10:01

GoogleCodeExporter commented 9 years ago
Yes, I reproduced the issue with a small recursive test-case. 
So is this a bug or a feature?
As a user, I'd prefer to get 100% for main().

I will post the test case over the weekend.

Original comment by pavl...@gmail.com on 9 Dec 2011 at 11:08

GoogleCodeExporter commented 9 years ago
I would say it's unfortunate but there may be no better solution.  Suppose you 
have mutually recursive functions, and a stacktrace that looks like
   A -> B -> C -> D -> E -> C -> D -> E -> C -> D

How much time do you attribute to each of the functions here?  How do you 
capture the fact that the code is "in" C and D more than it's "in" E?
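
For concreteness, in that single sample the per-frame counts would be A=1, B=1, C=3, D=3, E=2, so a scheme that counts every frame charges C and D three times for one sample, while a scheme that counts each function at most once per sample charges all five functions equally and hides the fact that the time is concentrated in the C/D/E cycle.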

Original comment by csilv...@gmail.com on 9 Dec 2011 at 11:31

GoogleCodeExporter commented 9 years ago
This is really hard to represent, and there may be no easy solution, but I would
approach it from a timeline point of view: the first function that starts the
cycle is C, so whatever starts from C is attributed to C.

I tried my test case with gprof, which reported what I expected: 100% for main().
I know they use different profiling techniques, but the results can be represented
in a similar way.
For mutually recursive functions gprof uses an abstraction called <cycle # as a
whole>:
http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html

Another possible solution might be adjusting (+1) the number of samples for the
rest of the functions in the stacktrace. I think this approach is more logical,
because all those functions are on the stacktrace when we make a recursive call,
so in some sense they are spending CPU time during the recursive calls.

Test case also included.

Original comment by pavl...@gmail.com on 12 Dec 2011 at 11:25

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by csilv...@gmail.com on 13 Dec 2011 at 12:41

GoogleCodeExporter commented 9 years ago
Can you clarify a bit what the test case is showing?  The pprof profile you 
show in the attached file looks very reasonable to me: triangle taking up 98+% 
of the cumulative CPU, and main taking 100%.  Is it supposed to be showing a 
problem in some way?

I just want to make sure there's a way to reproduce the problem, to give 
something concrete to talk about.  Once that's done, I'll ask the pprof authors 
what they think.

Original comment by csilv...@gmail.com on 13 Dec 2011 at 12:44

GoogleCodeExporter commented 9 years ago
Sure, I will try to clarify.
I attached two kcachegrind screenshots; both screenshots and the text output (in
the test case) were generated from the same profile.
One screenshot shows relative time and the other absolute time. As you can see in
the screenshots, the inclusive time for main() is 0.15% and the self time is 0.00%.

If I understand inclusive time correctly, it is supposed to be 100% for main().

Also, how do the kcachegrind results correlate with the text output? There is no
0.15% in the text output at all.

Original comment by pavl...@gmail.com on 13 Dec 2011 at 5:44

Attachments:

GoogleCodeExporter commented 9 years ago
Yes, I'm confused by this too.  Can you run ./pprof --raw <your binary> <your 
profile file> and attach the result here?  I'll see if I can reproduce what 
you're seeing.

Original comment by csilv...@gmail.com on 14 Dec 2011 at 2:01

GoogleCodeExporter commented 9 years ago
Raw output from pprof attached.

Original comment by pavl...@gmail.com on 14 Dec 2011 at 7:42

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks!  That's very helpful.  I've passed this along to the pprof authors to 
get their feedback on what we might do about this.

Original comment by csilv...@gmail.com on 14 Dec 2011 at 11:14

GoogleCodeExporter commented 9 years ago
I asked and they said that this is a known issue, but doesn't bother them 
enough to do anything about it.  "If there was a suitably non-invasive change 
to fix it, we'd be willing to take a look," was their general feeling.  So, 
patches accepted!  Though no promises on them...

Original comment by csilv...@gmail.com on 21 Dec 2011 at 7:10

GoogleCodeExporter commented 9 years ago
I think I found the source of the issue: the size of the stacktrace. Currently
it's defined as 64 in the source code. In the test case the stack depth is much
more than 64; for non-recursive functions that might be OK (we cut the stacktrace
and lose detail), but with recursive functions we get this weird result.
I changed it to 1024 and the results are as expected (100% for main()), except for
the callgrind output, which shows 53% for main().
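
To make the truncation effect concrete, here is a small simulation (not profiler code; frame names are made up, loosely based on the triangle test case): a deep recursive stack is cut at 64 frames the way a fixed-size, leaf-first sample buffer would cut it, and main() never makes it into the sample:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    const size_t kMaxDepth = 64;  // the fixed sample-buffer size being discussed

    // A made-up sample: main() -> triangle() recursing 200 levels deep.
    std::vector<std::string> stack = {"triangle_leaf"};
    for (int i = 0; i < 200; ++i) stack.push_back("triangle");
    stack.push_back("main");  // main is the outermost (last) frame

    // Keep only the first kMaxDepth frames, counted from the leaf.
    std::vector<std::string> sampled(
        stack.begin(), stack.begin() + std::min(stack.size(), kMaxDepth));

    bool has_main =
        std::find(sampled.begin(), sampled.end(), "main") != sampled.end();
    std::cout << "frames kept: " << sampled.size()
              << ", main() present: " << (has_main ? "yes" : "no") << "\n";
    // With 200+ recursive frames, main() is never among the 64 kept frames,
    // so its cumulative share in the report falls far below 100%.
}
```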

1. Could you please confirm that my guess is correct, and that we can fix the
callgrind output to give valid results?

2. Also, the text output doesn't have a header. It would be useful to add one.

New results are attached (with stacktrace = 1024).

Original comment by pavl...@gmail.com on 6 Jan 2012 at 10:58

Attachments:

GoogleCodeExporter commented 9 years ago
I don't know anything about callgrind -- I think that entire bit of code was 
provided by a user -- so I can't say why it's still at 53% rather than 100%.  
I'm glad the text output is at 100% now!  Your theory about the stacktrace 
being cut off is very reasonable.

One reason we limit the stacktrace size is that we need to store that array on the
stack, and stack size can be limited (especially in threaded apps), so we have to
limit how large we make the array. 64 was the sweet spot we found between taking
up space and catching full stack information in most cases. I'm not sure we can
safely change it. Of course, you're welcome to change it for your own use, and I'm
glad that helps you.

Original comment by csilv...@gmail.com on 7 Jan 2012 at 2:38

GoogleCodeExporter commented 9 years ago
Is this something that you would like to see addressed eventually, or can this
issue be closed off now? It seems to be a pretty pathological case.

Original comment by chapp...@gmail.com on 7 Feb 2012 at 3:21

GoogleCodeExporter commented 9 years ago
I think the main issue is solved by increasing the stacktrace size. But the
outputs are not consistent; --callgrind is different from the others (54% vs 100%
for main). Ideally I would expect consistent outputs for all pprof output modes,
and a header for the pprof --text output.

Original comment by pavl...@gmail.com on 7 Feb 2012 at 6:46

GoogleCodeExporter commented 9 years ago
1. I guess my main concern is that there will potentially be another user in the
future for whom a stack size of 100 isn't sufficient, etc. Plus, there seem to be
performance and memory implications to tuning the stacktrace size that would need
careful analysis. Perhaps this would be best implemented as a tunable build
setting, where the default stays as it is now but those encountering this issue
could tweak it (see the sketch after this list)?

2. Is the --callgrind issue a general issue, or is it isolated to this particular
case?

3. I would be happy to take a patch for the header issue if you want to draft one
up.
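
For what it's worth, a build-time knob along these lines might be enough. This is only a sketch; PROFILER_MAX_STACK_DEPTH is a hypothetical name, not an existing gperftools flag, and the surrounding function is a stand-in for the real sampling code:

```cpp
// Hypothetical build-time override for the sample depth (sketch only).
#ifndef PROFILER_MAX_STACK_DEPTH
#define PROFILER_MAX_STACK_DEPTH 64  // current behaviour stays the default
#endif

static const int kMaxStackDepth = PROFILER_MAX_STACK_DEPTH;

void TakeSample() {
  // The frame buffer still lives on the (signal-handler) stack, so anyone
  // building with e.g. -DPROFILER_MAX_STACK_DEPTH=1024 accepts the extra
  // stack usage described above.
  void* frames[kMaxStackDepth];
  // ... capture up to kMaxStackDepth return addresses into frames ...
  (void)frames;
}
```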

Original comment by chapp...@gmail.com on 2 Mar 2012 at 4:43

GoogleCodeExporter commented 9 years ago
Just pinging for further update/comment. I am actively working on patch work 
over the next few days and would like to clear up as many issues as possible.

Original comment by chapp...@gmail.com on 21 Apr 2012 at 6:17

GoogleCodeExporter commented 9 years ago
1. Agreed, a larger stacktrace size will add extra overhead; 64 is probably the
best default value. Making it a tunable build setting is a good idea.

2. The --callgrind issue is general to any recursive case where the stack depth
exceeds 64, which is pretty common for recursive algorithms. In that case you get
weird, confusing results, like an inclusive time of 30% for main().

3. I will give you a patch for the header.

Original comment by pavl...@gmail.com on 23 Apr 2012 at 4:03

GoogleCodeExporter commented 9 years ago
Any luck with that patch?

Original comment by chapp...@gmail.com on 11 Mar 2013 at 2:21

GoogleCodeExporter commented 9 years ago

Original comment by chapp...@gmail.com on 11 Mar 2013 at 2:21

GoogleCodeExporter commented 9 years ago
I've been doing some profiling using files rather than functions, where the
recursion problem is felt even more (it's common for functions in one file to call
functions in other files that call functions back in the first file). When this
happens, the numbers on the arcs become confusing and provide little useful
information.
To work around this, I experimented with changing the way pprof counts the values
on the arcs, so that if the same element appears again further along a callstack,
the arc from that element to the next is not increased. With this, looking at the
arcs exiting the boxes in the call graph, you can see how many samples finish
somewhere else.
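
As an illustration of that rule, here is a sketch of the counting logic only, written in C++ rather than as the actual pprof change, with made-up element names:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Outermost-first stack: file A calls into B, which calls back into A, etc.
    std::vector<std::string> stack = {"A", "B", "A", "B", "C"};

    // Count an arc caller->callee only if the caller does not appear again
    // deeper in the stack, so recursive re-entries don't inflate arc weights.
    std::map<std::pair<std::string, std::string>, int> arcs;
    for (size_t i = 0; i + 1 < stack.size(); ++i) {
        bool reappears =
            std::find(stack.begin() + i + 1, stack.end(), stack[i]) != stack.end();
        if (!reappears) ++arcs[{stack[i], stack[i + 1]}];
    }

    for (const auto& [arc, n] : arcs)
        std::cout << arc.first << " -> " << arc.second << ": " << n << "\n";
    // Only the last A -> B and the B -> C arcs are counted; the earlier
    // A -> B and B -> A arcs are skipped because A and B reappear later.
}
```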

Original comment by vitor.s....@gmail.com on 23 Dec 2014 at 5:50

GoogleCodeExporter commented 9 years ago
"confusing graphs" is not same as inconsistent output. I think that recent 
changes to cpu profiler might actually fixed this ticket's problem.

Regarding your improvements, consider posting them somewhere (but not on this
ticket, please).

Original comment by alkondratenko on 23 Dec 2014 at 6:44

GoogleCodeExporter commented 9 years ago
I posted here because some old messages on this issue touched on something I was
dealing with, but indeed, confusing is not the same as inconsistent.
I do see inconsistent counting of samples in some cases with CPU profiling. After
figuring out whether it's a real issue, and not some mistake of mine, I'll post it
as a new issue.

Original comment by vitor.s....@gmail.com on 23 Dec 2014 at 7:02