Closed GoogleCodeExporter closed 9 years ago
Thank you for the patches and the clear bug report! (3) makes sense to me, and
it
looks like (1) is a problem for you because you're emitting heap_v2-style
profiles,
which is not what pprof tests against (it probably should, and in fact the heap
profiler should really be rewritten to use v2 sampling).
As for (2), the reason we prefer addr2line is that it can properly handle
inlining
(when used with -i), which nm can't do, so even in --functions mode it's worth
running. Is there a problem using addr2line, that you want to remove the call
to it?
Also, can you sign the CLA at
http://code.google.com/legal/individual-cla-v1.0.html?
This will allow us to accept your patches as-is.
Original comment by csilv...@gmail.com
on 12 Apr 2010 at 6:29
(2) (the addr2line change) was primarily intended as a performance improvement.
addr2line takes about 15 minutes for some of the binaries I'm working with, so
this
makes a pretty big difference in usability for me. Is there perhaps another
approach
to shutting off addr2line that you would find more acceptable?
I signed the CLA.
Original comment by jas...@canonware.com
on 12 Apr 2010 at 11:15
} addr2line takes about 15 minutes for some of the binaries I'm
} working with, so this makes a pretty big difference in usability for
} me.
Wow, that's a surprisingly long time. I take it you don't need the
functionality that correctly attributes memory use to inlined
functions?
There are several options I can think of off the top of my head. One
is to add a flag saying not to use addr2line, in the same way this
line does it already for OS X:
$obj_tool_map{"addr2line"} = "false"; # no addr2line
Or perhaps we could add a flag that lets folks specify values for all
of obj_tool_map on the commandline.
You can also get the same result right now, without any code changes
at all, though it requires a bit of setup on your end. You have to
set up a directory with symlinks to objdump, nm, and (perhaps) c++filt
in it, but *not* addr2line. You then pass in this directory to the
--tools flag of pprof (or optional set the env var PPROF_TOOLS).
Would this work for you? That's my favorite solution, I think, if
it's practicable.
Original comment by csilv...@gmail.com
on 13 Apr 2010 at 12:10
Any more word on this? How do these proposed solutions work for you?
Original comment by csilv...@gmail.com
on 7 Jun 2010 at 11:24
The --tools and PPROF_TOOLS options aren't ideal, because they require setting
things up in advance. Adding a way to specify $obj_tool_map on the command
line would be slightly better, but all-or-nothing tools configuration isn't
very convenient for my use case. I'd prefer to add support for something like
--tool=addr2line:false so that individual tools can be specified. This seems
sufficiently general to meet other use cases as well. What do you think?
Original comment by jas...@canonware.com
on 8 Jun 2010 at 12:11
Thanks for the feedback. I think there's a feeling, generally, that the pprof
commandline is already too complicated, so adding a new flag like that may not
go over so well. But I'll suggest it to other pprof developers and see what
they think.
Original comment by csilv...@gmail.com
on 8 Jun 2010 at 6:05
Original comment by csilv...@gmail.com
on 8 Jun 2010 at 6:49
This should be fixed in perftools 1.6, just released (by augmenting the syntax
of the --tools flag).
Original comment by csilv...@gmail.com
on 5 Aug 2010 at 8:50
Thanks for augmenting the --tools syntax!
Unfortunately, it looks like the divide-by-0 error and the performance
improvement for ExtractSymbols in the patch attached to this issue were not
integrated.
Original comment by jas...@canonware.com
on 6 Aug 2010 at 12:26
You are right, I totally forgot about those. :-( I am leaving on vacation
shortly, but will add those in when I return, and push them to svn. They'll
also be part of the next release.
Original comment by csilv...@gmail.com
on 7 Aug 2010 at 2:41
OK, I've applied the divide-by-0 fix -- though I don't understand why it's
necessary to protect against n1 being 0 but not n2.
But I'm confused by the performance improvement. As I understand it, you're
just sorting the libraries and the symbols, and going through them that way.
But I don't understand the point of @pcs2. Once you see $pc > $finish, you
know it's not going to be in any library (since your current library is one
with the greatest finish). So why keep these values in @pcs2, and add them
back onto @pcs? When would this a pc in @pcs2 ever actually be used?
Also, why sort the libraries in reverse order?
If I were implementing this, I'd do something like this:
@pcs = sort ...;
@libs = sort ...;
for $lib (@libs) {
$start = ...;
$finish = ...;
$start_pc_index = <first index into @pcs >= $start>;
$finish_pc_index = <first index into @pcs < $finish>;
@contained = splice(@pcs, $start_pc_index, $finish_pc_index - $start_pc_index);
Map...();
}
Does this seem like it would work, to you?
Original comment by csilv...@gmail.com
on 31 Aug 2010 at 1:06
n1 is the current object count, which can be 0, but n2 is a cumulative count,
which is strictly greater than 0.
The old version of the pc-->lib mapping code was effectively an O(L*P)
algorithm (for each lib, iterate over pc's and find all that are contained by
lib), whereas the new version is O(L+P) (plus the cost of sorting, O(L*lg L) +
O(P*lg P)). The new version uses sorted pc's so that only a subset of pc's has
to be examined for each lib. My notes indicate that for the binaries I was
testing with, ExtractSymbols accounted for ~24% of execution time before this
change, and something less than 2.5% afterwards (it dropped out of the top 15
functions in the CPU profile).
My memory is fuzzy now on the reason for reverse sorting (and for @pcs2), but
it had something to do with the address ranges for libs overlapping with that
reported for the main binary. I tested the changes on 32-bit Ubuntu and 64-bit
CentOS, and for one of those platforms pc's were preferentially associated with
the binary rather than the libraries they came from (and symbol lookups would
then fail) unless libraries were processed from high to low addresses. Without
@pcs2, it was possible for pc's to be dropped without mapping them to symbols,
even though symbols were available from the main binary.
Original comment by jas...@canonware.com
on 1 Sep 2010 at 5:56
OK, that makes some sense (overlapping libraries). Does the splice
implementation seem reasonable to you?
I don't think your algorithm is really O(L+P), because you could in theory do
|P| work for each library addings pcs back onto pcs2. But I think in practice
pcs2 will usually be empty, so I can believe it would be fast in practice.
I've written another version of the patch which I think does the same thing but
I found a bit easier to follow. Can you take a look and see if you agree?
Also, it would be great if you could test it on the same test suite you used
for your own patch, and verify it doesn't cause any problems (I tested it a
bit, but it looks like your test suite is more thorough).
Original comment by csilv...@gmail.com
on 1 Sep 2010 at 11:49
Attachments:
Another thought: what if you sorted the libraries by $finish (a[2]) rather than
$start (a[1])? Would that solve the problem you were seeing with overlapping
binaries? Then we really could just do a 'merge' sort, with one last cleanup
step at the end for the final binary.
Original comment by csilv...@gmail.com
on 1 Sep 2010 at 11:52
The splice implementation seems reasonable, but it potentially passes the same
pc into multiple MapToSymbols calls. That may actually be the right thing to
do (and if so, the sorting order can be changed as you suggest), but I don't
know enough about the results of the nm and addr2line processing to know what
the result is. At a glance, it looks to me like it is possible for $symbols to
contain multiple resolutions for a single pc, but a comment ("Prepend to
accumulated symbols for pcstr") implies that getting the order wrong will
result in incorrect results.
If it is *not* okay to pass the same pc to MapToSymobls multiple times, then I
don't think it's safe to change the way sorting is done. I vaguely remember
trying the sorting strategy you suggested, and some pc's were not resolved to
symbols.
Original comment by jas...@canonware.com
on 2 Sep 2010 at 12:08
I'm missing something I think. How can splice pass the same pc into multiple
MapToSymbol calls? splice() removes the items from the input list that it
returns; were you taking that into account?
Original comment by csilv...@gmail.com
on 2 Sep 2010 at 12:29
Nope, I had incorrect splice semantics in mind.
In your patch, you determine $start_pc_index by iterating forward from the
beginning of @pcs. This will tend to be less efficient than determining
$finish_pc_index first, then iterating backward from there to determine
$start_pc_index.
Original comment by jas...@canonware.com
on 2 Sep 2010 at 12:58
True. OK, rewritten and attached. Can you test this on your various OS
environments and make sure it's behaving as expected?
Original comment by csilv...@gmail.com
on 2 Sep 2010 at 8:54
Attachments:
I had to make a fix to your patch, but with that change in place, I got correct
results for my tests, with similar performance to the version of ExtractSymbols
that I submitted.
Original comment by jas...@canonware.com
on 8 Sep 2010 at 10:28
Attachments:
Ok, I think this is all in perftools 1.7, just released.
Original comment by csilv...@gmail.com
on 5 Feb 2011 at 12:21
Original issue reported on code.google.com by
jas...@canonware.com
on 12 Apr 2010 at 5:54Attachments: