Automatically exported from code.google.com/p/jrfonseca

[gprof2dot] Calculate true total time with perf #77

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
For perf, the input includes a complete call chain for every single sample, so 
it is not necessary to estimate the total times heuristically.  Instead, a 
function's total time is simply the proportion of samples in which it was 
either executing or found on the stack.  This is the approach suggested by 
Mike Dunlavey at stackoverflow.com/questions/1777556/alternatives-to-gprof.
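
A minimal sketch of that counting (illustrative only, not the patch itself; it
assumes each sample's call chain has already been parsed into a list of
function names, innermost frame first):

  # Count each function at most once per sample, so recursive cycles
  # do not inflate its total time beyond 100%.
  def total_times(samples):
      totals, self_times = {}, {}
      for chain in samples:
          for function in set(chain):
              totals[function] = totals.get(function, 0) + 1
          executing = chain[0]
          self_times[executing] = self_times.get(executing, 0) + 1
      n = float(len(samples))
      return ({f: t / n for f, t in totals.items()},
              {f: s / n for f, s in self_times.items()})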

The patch is against the 2012-11-25 rev of gprof2dot.py.  Also attached is a 
small test program to demonstrate the impact of the change.  Before the change, 
cycfn1 shows only 1/3 total time and the outbound arrow has no label.  After 
the change, cycfn1 shows 2/3 total time and the outbound arrow shows 1/3.  The 
latter was the expected result as clarified by the header comments of the test 
program.

Original issue reported on code.google.com by dfla...@nist.gov on 21 Dec 2012 at 2:41

Attachments:

GoogleCodeExporter commented 9 years ago
David,

I've been too busy lately, but I'm finally looking at this now.

Thanks for your patch.

This patch seems targeted at improving gprof parsing; however, I was not able 
to reproduce your results with the val3 example and gprof.

I did

  gcc -g -fno-omit-frame-pointer -pg -o val3 -std=gnu99 val3.c
  ./val3
  gprof val3 > val3.prof

and obtained the output at
https://code.google.com/p/jrfonseca/source/browse/tests/val3.prof?repo=gprof2dot

With or without your change I still get cycfn1 showing 2/3.

Original comment by Jose.R.F...@gmail.com on 8 Apr 2013 at 1:31

GoogleCodeExporter commented 9 years ago
The patch affects only the perf input format of Gprof2Dot.  I've rerun the
before and after with both perf and gprof and reproduced the original
complaint in the attached tarball.

Now using val3 as it appears in the Linux spin of Version 1.1 of the very small
validation suite at http://www.nist.gov/itl/ssd/cs/software-performance.cfm,
plus:
  Linux kernel / perf version 3.8.5
  gcc version 4.7.2
  GNU gprof (Linux/GNU Binutils) 2.22.52.0.2.20120424

For perf:

gcc -std=gnu99 -Wall -Wextra -pedantic -O2  -o val3 val3.c
perf record -e cpu-cycles -c 1000000 -o val3_data -g dwarf ./val3
17440530503365195776 5654866789.972589
[ perf record: Woken up 477 times to write data ]
[ perf record: Captured and wrote 119.058 MB val3_data (~5201703 samples) ]
perf report -i val3_data > val3_perf_report.txt
perf script -i val3_data > val3_script.txt
gprof2dot.py      -w -f perf -o val3_perf_after.dot  < val3_script.txt
gprof2dot.py-orig -w -f perf -o val3_perf_before.dot < val3_script.txt

For gprof:

gcc -std=gnu99 -Wall -Wextra -pedantic -O1 -pg  -o val3 val3.c
./val3
17440530503365195776 5654866789.972589
mv gmon.out val3_data
gprof val3 val3_data > val3_gprof_report.txt
gprof2dot.py      -w -f prof -o val3_gprof_after.dot  < val3_gprof_report.txt
gprof2dot.py-orig -w -f prof -o val3_gprof_before.dot < val3_gprof_report.txt

What I'm seeing:  All four results correctly report the self time as 2/3 for
cycfn2 and 1/3 for cycfn1, as well as the 100% total time for cycfn2, but only
perf-after reports the expected 2/3 total time for cycfn1, as determined from
actual call stacks rather than heuristics.

Building the gprof one with -O0 -fno-omit-frame-pointer makes no difference
in the result, except to make the run time longer.

Explanation of the expected results from val3.c header comments:

   With default values of FN1LOOP and FN2LOOP, the expected distribution
   of time is approximately
     main    total 100 % self  0 %
     cycfn2  total 100 % self 67 %
     cycfn1  total  67 % self 33 %
   resulting from the following call chains and proportions of samples:
     2  main->cycfn2
     1  main->cycfn2->cycfn1
     2  main->cycfn2->cycfn1->cycfn2
     1  main->cycfn2->cycfn1->cycfn2->cycfn1
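
Worked out from those weights: of 6 samples, cycfn1 is on the stack in
1 + 2 + 1 = 4 (67% total) and is the executing frame in 1 + 1 = 2 (33% self);
cycfn2 is on the stack in all 6 (100% total) and executing in 2 + 2 = 4
(67% self); main is on the stack in all 6 (100% total) but never executing
(0% self).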

HTH,
DWF

Original comment by dfla...@nist.gov on 8 Apr 2013 at 2:22

Attachments:

GoogleCodeExporter commented 9 years ago
My bad. I mixed up "perf" with "prof"...

Thanks for the update

Original comment by Jose.R.F...@gmail.com on 8 Apr 2013 at 2:29

GoogleCodeExporter commented 9 years ago
David,

Does perf guarantee that the whole stack trace is always captured?

If not, what will happen when the eldest ancestor is not captured?  Won't this 
start giving bogus results (e.g., the `main` function showing less than 100%)?

Original comment by Jose.R.F...@gmail.com on 8 Apr 2013 at 4:12

GoogleCodeExporter commented 9 years ago
This can be seen with tests/xwax.perf: the top node goes from 99.96% to 
96.26%, because at several points the call stack is incomplete.  Ditto for 
glxgears.perf.

I'm afraid that simply taking total time as callstack ratio is not very robust. 
I can't take the patch as is.

There must be a way of reconciling both approaches: a) that totals are 
proportional to call stack samples, and b) that the integration of the graph 
along the edges is obeyed.

But a combined approach would be quite a complex endeavor.

A simpler compromise would be a sanity check: derive totals from call stacks 
only if all stack traces share a single ancestor (for each thread).  I could 
accept your patch like that, as it would cause no regressions.
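
A rough sketch of that check (hypothetical helper, not part of the patch; call
chains given innermost frame first):

  # Hypothetical helper: stack-derived totals are only safe when every
  # sample's outermost frame is the same function (per thread).
  def stacks_share_single_root(samples):
      roots = set(chain[-1] for chain in samples)
      return len(roots) == 1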

Another, simpler approach would be a command line option that selects between 
the two behaviors.

Original comment by Jose.R.F...@gmail.com on 8 Apr 2013 at 4:34

GoogleCodeExporter commented 9 years ago
With kernel/perf 3.6.11 I was able to reproduce a behavior with PEBS (-e
cpu-cycles:pp) where some stack traces would go to [unknown] and as a result
main would get less than 100%.  I attach an example output with the patch
applied.  IMO, the behavior on this input is correct:  the tree has multiple
roots that together sum to 100% total time (modulo any insignificant ones
that were below the cutoff).

As of kernel/perf 3.8.5 I can no longer reproduce that behavior even with
cpu-cycles:pp.

Nevertheless, it should be easy to enable switching between the two methods
of calculating total time.

Original comment by dfla...@nist.gov on 8 Apr 2013 at 4:41

Attachments:

GoogleCodeExporter commented 9 years ago
Correction, that was kernel/perf 3.7.1 that demonstrated the problem.  For a 
minute there I thought it was related to the use of frame pointers in 3.6.x.

Original comment by dfla...@nist.gov on 8 Apr 2013 at 5:01

GoogleCodeExporter commented 9 years ago

Original comment by Jose.R.F...@gmail.com on 8 Apr 2013 at 5:35

GoogleCodeExporter commented 9 years ago
Please disregard comments about the PEBS problem not reproducing under 3.8.5.
I was forgetting a define that was critical to reproducing the problem.

The attached patch against git as of a few hours ago adds a switch to select
between different methods of calculating total time.  Although it is currently
effective only for the perf format, this general switch seemed wiser than
starting down the road of having several versions of a format that behave
differently.  Some unrelated typos are fixed as well.
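
Selecting the stack-based method would then look something like this (the
option name and values here are illustrative; see the patch for the actual
spelling):

  gprof2dot.py -f perf --total=callstacks -o val3.dot < val3_script.txt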

Having done that, however, I discovered that the behavior of giving less than
100% to main is not a regression after all.  The same thing happens with the
original implementation of the perf format.  Comparable results are included
in compare.tgz (the total time for val1 main is 93.6% either way).

Original comment by dfla...@nist.gov on 8 Apr 2013 at 6:33

Attachments:

GoogleCodeExporter commented 9 years ago
Thanks for the patch.

Linux 3.2.x's perf definitely has a limit on the call stack depth of 255 
entries:

  http://lxr.linux.no/linux+v3.2.42/include/linux/perf_event.h#L518

The latest kernel has a limit of 127,

  http://lxr.linux.no/linux+v3.8.6/include/uapi/linux/perf_event.h#L597

but it also allows one to pass raw regs and stack blobs to userspace, so it 
might behave quite differently.

I believe it would be useful to have a test case with a recursion depth of 256. 
IIUC, the VTune parser invariably relies on the stack frequencies, and I'd be 
surprised if VTune didn't have limitations on stack depth similar to perf's.

Original comment by Jose.R.F...@gmail.com on 9 Apr 2013 at 8:14

GoogleCodeExporter commented 9 years ago
> IIUC, the VTune parser invariably relies on the stack frequencies. Just like 
perf, I'd be surprised if VTune didn't have similar limitations on stack depth.

Never mind.  AXEParser uses total times as reported by VTune, so it's VTune's 
problem.

Original comment by Jose.R.F...@gmail.com on 9 Apr 2013 at 8:26

GoogleCodeExporter commented 9 years ago
Attached is a modified val5.c to stress the recursion limit, and the output 
one gets with total times from stacks.

I'd rather have results which are approximate but robust than results which 
are very accurate but totally bogus when the call depth exceeds a magic number.

So I'm going to apply your patch, but retain the current heuristic behavior 
by default.

Of course, we can improve upon this.  For example, we could have a default 
"--total=automatic" option, which would detect whether it was safe to derive 
total times from call stacks or not.
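
A sketch of that selection, reusing the single-root check sketched earlier in
this thread (the method names are illustrative):

  # Hypothetical: fall back to the heuristic whenever the stacks do not
  # all share a single root.
  def choose_total_method(samples):
      if stacks_share_single_root(samples):
          return 'callstacks'
      return 'callratios'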

Original comment by Jose.R.F...@gmail.com on 9 Apr 2013 at 1:20

Attachments:

GoogleCodeExporter commented 9 years ago
Ack.  Based on script output, the effective limit for kernel 3.8.5 is 
apparently 512 on the stack plus one for the currently executing function, 
although issues begin to appear when the nominal depth is only 503 (not 
counting interrupt handlers).
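
For reference, a throwaway sketch of how that depth can be measured from perf 
script output (assuming the default format, where each stack frame is one 
indented line and samples are separated by blank lines):

  # Hypothetical measurement of the deepest recorded call chain.
  # Usage (script name hypothetical): python chain_depth.py < val3_script.txt
  import sys

  depth = max_depth = 0
  for line in sys.stdin:
      if not line.strip():
          max_depth = max(max_depth, depth)
          depth = 0
      elif line[0] in ' \t':
          depth += 1
  print(max(max_depth, depth))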

Original comment by dfla...@nist.gov on 9 Apr 2013 at 1:53

GoogleCodeExporter commented 9 years ago
Committed now.

Original comment by Jose.R.F...@gmail.com on 9 Apr 2013 at 2:50