David,
I've been too busy lately, but I'm finally looking at this now.
Thanks for your patch.
This patch seems targeted at improving gprof parsing; however, I was not able
to reproduce your results with the val3 example and gprof.
I did
gcc -g -fno-omit-frame-pointer -pg -o val3 -std=gnu99 val3.c
./val3
gprof val3 > val3.prof
and obtained the output at
https://code.google.com/p/jrfonseca/source/browse/tests/val3.prof?repo=gprof2dot
With or without your change I still get cycfn1 showing 2/3.
Original comment by Jose.R.F...@gmail.com
on 8 Apr 2013 at 1:31
The patch affects only the perf input format of Gprof2Dot. I've rerun the
before and after with both perf and gprof and reproduced the original
complaint in the attached tarball.
Now using val3 as it appears in the Linux spin of Version 1.1 of the very small
validation suite at http://www.nist.gov/itl/ssd/cs/software-performance.cfm,
plus
Linux kernel / perf version 3.8.5
gcc version 4.7.2
GNU gprof (Linux/GNU Binutils) 2.22.52.0.2.20120424
For perf:
gcc -std=gnu99 -Wall -Wextra -pedantic -O2 -o val3 val3.c
perf record -e cpu-cycles -c 1000000 -o val3_data -g dwarf ./val3
17440530503365195776 5654866789.972589
[ perf record: Woken up 477 times to write data ]
[ perf record: Captured and wrote 119.058 MB val3_data (~5201703 samples) ]
perf report -i val3_data > val3_perf_report.txt
perf script -i val3_data > val3_script.txt
gprof2dot.py -w -f perf -o val3_perf_after.dot < val3_script.txt
gprof2dot.py-orig -w -f perf -o val3_perf_before.dot < val3_script.txt
For gprof:
gcc -std=gnu99 -Wall -Wextra -pedantic -O1 -pg -o val3 val3.c
./val3
17440530503365195776 5654866789.972589
mv gmon.out val3_data
gprof val3 val3_data > val3_gprof_report.txt
gprof2dot.py -w -f prof -o val3_gprof_after.dot < val3_gprof_report.txt
gprof2dot.py-orig -w -f prof -o val3_gprof_before.dot < val3_gprof_report.txt
What I'm seeing: all four results correctly report the self time as 2/3 for
cycfn2 and 1/3 for cycfn1, and the 100% total time for cycfn2, but only
perf-after reports the expected 2/3 total time for cycfn1, as determined from
actual call stacks rather than heuristics.
Building the gprof one with -O0 -fno-omit-frame-pointer makes no difference
in the result, except to make the run time longer.
Explanation of the expected results from val3.c header comments:
With default values of FN1LOOP and FN2LOOP, the expected distribution
of time is approximately
main total 100 % self 0 %
cycfn2 total 100 % self 67 %
cycfn1 total 67 % self 33 %
resulting from the following call chains and proportions of samples:
2 main->cycfn2
1 main->cycfn2->cycfn1
2 main->cycfn2->cycfn1->cycfn2
1 main->cycfn2->cycfn1->cycfn2->cycfn1
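The expected percentages above follow directly from those weighted chains. A
small illustrative Python sketch (not gprof2dot's actual code) of the
stack-based accounting: self time goes to the leaf of each chain, and total
time credits each function at most once per chain, so the recursive
reappearance of cycfn2 is not double-counted:

```python
# Weighted call chains from the val3.c header comments.
chains = [
    (2, ["main", "cycfn2"]),
    (1, ["main", "cycfn2", "cycfn1"]),
    (2, ["main", "cycfn2", "cycfn1", "cycfn2"]),
    (1, ["main", "cycfn2", "cycfn1", "cycfn2", "cycfn1"]),
]

total_samples = sum(w for w, _ in chains)
self_time, total_time = {}, {}
for weight, chain in chains:
    # Self time is attributed to the leaf (currently executing) function.
    leaf = chain[-1]
    self_time[leaf] = self_time.get(leaf, 0) + weight
    # Total time counts each function once per chain, so recursion
    # (cycfn2 appearing twice in one chain) is not double-counted.
    for func in set(chain):
        total_time[func] = total_time.get(func, 0) + weight

for func in ("main", "cycfn2", "cycfn1"):
    print(func,
          "total %.0f%%" % (100.0 * total_time.get(func, 0) / total_samples),
          "self %.0f%%" % (100.0 * self_time.get(func, 0) / total_samples))
# main total 100% self 0%
# cycfn2 total 100% self 67%
# cycfn1 total 67% self 33%
```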
HTH,
DWF
Original comment by dfla...@nist.gov
on 8 Apr 2013 at 2:22
Attachments:
My bad. I mixed up "perf" with "prof"...
Thanks for the update
Original comment by Jose.R.F...@gmail.com
on 8 Apr 2013 at 2:29
David,
Does perf guarantee that the whole stack trace is always captured?
If not, what will happen when the eldest ancestor is not captured? Won't this
start giving bogus results (e.g., the `main` function showing less than 100%)?
Original comment by Jose.R.F...@gmail.com
on 8 Apr 2013 at 4:12
This can be seen with tests/xwax.perf: the top node goes from 99.96% to
96.26%, because at several points the call stack is incomplete. Ditto for
glxgears.perf.
I'm afraid that simply taking the total time from the call-stack ratio is not
very robust. I can't take the patch as-is.
There must be a way of reconciling both approaches: a) that totals are
proportional to call stack samples, and b) that the integration of the graph
along the edges is obeyed.
But a combined approach would be quite a complex endeavor.
A simpler compromise would be a sanity check: only take totals from stacks if
all stack traces share a single eldest ancestor (for each thread). I could
accept your patch like that, as it would cause no regressions.
Another, simpler approach would be a command line option that selects between
the two behaviors.
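The sanity check proposed above could look roughly like this (a hypothetical
sketch; `common_root` and the sample stacks are made up for illustration):

```python
# Only trust stack-derived totals when every captured stack shares one
# eldest ancestor; otherwise fall back to the existing heuristic.
def common_root(stacks):
    """Return the shared eldest ancestor, or None if the stacks disagree."""
    roots = {stack[0] for stack in stacks if stack}
    return roots.pop() if len(roots) == 1 else None

complete = [["main", "f", "g"], ["main", "f"], ["main"]]
truncated = [["main", "f", "g"], ["f", "g"], ["[unknown]", "h"]]

print(common_root(complete))   # main -> safe to derive totals from stacks
print(common_root(truncated))  # None -> fall back to heuristics
```

In a real parser this would be tracked per thread, as suggested above, since
each thread legitimately has its own root.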
Original comment by Jose.R.F...@gmail.com
on 8 Apr 2013 at 4:34
With kernel/perf 3.6.11 I was able to reproduce a behavior with PEBS (-e
cpu-cycles:pp) where some stack traces would go to [unknown] and as a result
main would get less than 100%. I attach an example output with the patch
applied. IMO, the behavior on this input is correct: the tree has multiple
roots that together sum to 100% total time (modulo any insignificant ones
that were below the cutoff).
As of kernel/perf 3.8.5 I can no longer reproduce that behavior even with
cpu-cycles:pp.
Nevertheless, it should be easy to enable switching between the two methods
of calculating total time.
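Such a switch might be exposed roughly as follows (a hypothetical sketch; the
option and value names here are my own and may not match the actual patch):

```python
import argparse

# Command-line switch selecting how total time is computed:
# heuristic call ratios vs. direct sums over captured call stacks.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--total", choices=("callratios", "callstacks"), default="callratios",
    help="method of calculating total time: heuristic call ratios "
         "(default) or sums over captured call stacks (perf format only)")

args = parser.parse_args(["--total=callstacks"])
print(args.total)  # callstacks
```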
Original comment by dfla...@nist.gov
on 8 Apr 2013 at 4:41
Attachments:
Correction, that was kernel/perf 3.7.1 that demonstrated the problem. For a
minute there I thought it was related to the use of frame pointers in 3.6.x.
Original comment by dfla...@nist.gov
on 8 Apr 2013 at 5:01
Original comment by Jose.R.F...@gmail.com
on 8 Apr 2013 at 5:35
Please disregard comments about the PEBS problem not reproducing under 3.8.5.
I was forgetting a define that was critical to reproducing the problem.
The attached patch against git as of a few hours ago adds a switch to select
different methods of calculating total time. Although it currently is
effective only for perf format, this general switch seemed wiser than
starting down the road of having several versions of a format that behave
differently. Some unrelated typos are fixed as well.
Having done that, however, I discovered that the behavior of giving less than
100% to main is not a regression after all. The same thing happens with the
original implementation of the perf format. Comparable results are included
in compare.tgz (the total time for val1 main is 93.6% either way).
Original comment by dfla...@nist.gov
on 8 Apr 2013 at 6:33
Attachments:
Thanks for the patch.
Linux 3.2.x's perf definitely has a limit on the call stack depth of 255
entries:
http://lxr.linux.no/linux+v3.2.42/include/linux/perf_event.h#L518
The latest kernel has a limit of 127:
http://lxr.linux.no/linux+v3.8.6/include/uapi/linux/perf_event.h#L597
but it also allows one to pass raw regs and stack blobs to userspace, so it
might behave quite differently.
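The effect of such a depth cap on stack-derived totals can be simulated in a
few lines (illustrative only; the names, depths, and the cap of 4 are made up
to keep the example small):

```python
MAX_DEPTH = 4  # stand-in for perf's 127- or 255-entry cap

def capture(stack, limit=MAX_DEPTH):
    # perf unwinds from the leaf upward, so when the limit is hit it is
    # the eldest ancestors ('main') that are lost from the sample.
    return stack[-limit:]

samples = [
    capture(["main", "f"]),
    capture(["main", "f", "g"]),
    capture(["main", "f", "g", "h"]),
    capture(["main", "r1", "r2", "r3", "r4", "leaf"]),  # too deep: truncated
]

# One of four samples no longer starts at main, so a stack-derived total
# credits main with only 75% instead of the true 100%.
root_weight = sum(1 for s in samples if s[0] == "main")
print("main total: %d%%" % (100 * root_weight // len(samples)))  # 75%
```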
I believe it would be useful to have a test case with recursion depth of 256.
IIUC, the VTune parser invariably relies on the stack frequencies. Just as
with perf, I'd be surprised if VTune didn't have similar limitations on stack
depth.
Original comment by Jose.R.F...@gmail.com
on 9 Apr 2013 at 8:14
> IIUC, the VTune parser invariably relies on the stack frequencies. Just as
> with perf, I'd be surprised if VTune didn't have similar limitations on
> stack depth.
Nevermind. AXEParser uses total times as reported by VTune, so it's VTune's
problem.
Original comment by Jose.R.F...@gmail.com
on 9 Apr 2013 at 8:26
Attached is a modified val5.c to stress the recursion limit, and the output
one gets with total times from stacks.
I'd rather have results which are approximate but robust than results which
are very accurate but totally bogus when the call depth exceeds a magic
number.
So I'm going to apply your patch, but retain the current behavior of heuristics
by default.
Of course, we can improve upon this. For example, we could have a default
"--total=automatic" option, which would detect whether it was safe to derive
total times from call stacks or not.
Original comment by Jose.R.F...@gmail.com
on 9 Apr 2013 at 1:20
Attachments:
Ack. Based on script output, the effective limit for kernel 3.8.5 is
apparently 512 on the stack plus one for the currently executing function,
although issues begin to appear when the nominal depth is only 503 (not
counting interrupt handlers).
Original comment by dfla...@nist.gov
on 9 Apr 2013 at 1:53
Committed now.
Original comment by Jose.R.F...@gmail.com
on 9 Apr 2013 at 2:50
Original issue reported on code.google.com by
dfla...@nist.gov
on 21 Dec 2012 at 2:41
Attachments: