Original comment by csilv...@gmail.com
on 16 Oct 2008 at 11:31
This probably fixes at least the symptom of Issue 50:
http://code.google.com/p/google-perftools/issues/detail?id=50
but it's hard to know whether it is for the right reason.
Original comment by dvi...@gmail.com
on 16 Oct 2008 at 11:32
This is a big patch, so I'm going to try to deal with it one file at a time. I
decided to start with pprof. :-) Most of it is straightforward enough. I have
a few
questions though:
1) You added this check for cygwin:
+ # On Windows, the map portion of the profile includes a region
+ # for the executable, so it's already taken care of.
I'd like to do this check more programmatically. My plan is to see what regions
are
read from the profile, and figure out if any is a whole-program region. Can
you send
some output from the cygwin ldd or equivalent, so I can see what this
already-included region looks like?
Alternately, here is the code I want to put in, in the main ParseLibraries()
loop:
push(@{$result}, [$lib, $start, $finish, $offset]);
if ($start le $zero_offset && $finish ge $max_pc &&
$offset eq $zero_offset) {
# The entry for the main program will encompass all possible pc's
$seen_entry_for_main_program = 1;
}
Will this work?
2) I'm confused about the necessity for the new addr2line_pdb and nm_pdb
entries in
objtools. It seems to me that either the system will come with addr2line and
nm, in
which case we should use the system version, or they won't, in which case we
should
use the version you supply in your patch (that is, require the user to install
those
binaries in the same place they install pprof). In the first case,
$obj_tool_map{"addr2line"} (and "nm") will already be set up correctly. In the
second case, shouldn't we just change $obj_tool_map{"addr2line"} to point to
$0/addr2line? Why the need for a separate $obj_tool_map{"addr2line_pdb"} entry?
If the need is just because the commandline interface is different: it seems to
me we
control the commandline interface, since we're providing the source for this
binary.
Can't we just make it take a compatible set of commandline flags?
Original comment by csilv...@gmail.com
on 17 Oct 2008 at 1:34
Ack, sorry about replying to the group instead of the issue :).
I thought of a potential issue this morning with naming the nm_pdb binary nm.exe (and same
with
addr2line): users might get it confused with the cygwin nm, especially if it
gets in
their PATH. I've attached a patch to pprof that expects the executables to be
called
nm_pdb.exe and addr2line_pdb.exe.
Original comment by dvi...@gmail.com
on 17 Oct 2008 at 2:16
Attachments:
Ok, based on your comments in the newsgroup (sorry, I accidentally redirected
discussion there instead of the bug report), I will take out the cygwin check:
it
looks like the code works fine if that region gets inserted twice.
I will also change the binary-to-look-for to be nm_pdb and addr2line_pdb. I may
change the logic around there a little, but not in any meaningful way. One
question
I have is this:
} if ($file_type =~ /MS Windows/) {
[file_type is what is returned by the 'file' command.] I'm guessing "MS
Windows" ==
PDB format? Is there another string we should be looking for to make sure it's
PDB
in particular, or is "MS Windows" enough?
Original comment by csilv...@gmail.com
on 17 Oct 2008 at 9:56
Sorry, one more question:
You added the second condition on this "if":
} if (exists $symbols->{$address[$i]} &&
} exists $symbols->{$address[$i-1]}) {
I'm worried about the change in behavior (we leave out the arc rather than, as
before, assuming addresses[$i-1] is 0). Was there a particular problem you were
seeing, that motivated this change? If so, do you happen to have an example
pprof
file that executes the new logic you added, so we can understand better what
effects
this change will have?
Original comment by csilv...@gmail.com
on 17 Oct 2008 at 10:12
On the first question:
I think `file' returns that for all executables that work on windows. The file
output does not in any way reflect the format of the debugging information.
On the second question:
perl was producing a large number of uninitialized variable warnings. I
assumed this
was a bug independent of my changes (I think I've seen the warnings on linux
too). I
assumed using the uninitialized value was unintentional, but if it is
intentional,
then perhaps we should initialize the slot to zero if it does not exist (there
were
several secondary warnings down the line that were also fixed by this change).
In
addition to uninitialized variable warnings, I think this fixed an issue where
--text
--lines would report a line with a completely blank source location (not even
??:0).
I admit I did not take the time to understand the root of this problem, and I
should
have. Thoughts?
Original comment by dvi...@gmail.com
on 18 Oct 2008 at 2:01
1. Is there any way to tell, on windows, what the file format is? Or maybe it
doesn't matter that much? I'd just feel more comfortable if we ran a
PDB-specific
nm/addr2line on stuff we knew was actually PDB (if that even makes sense; I
don't
know about PDB at all).
2. It would be nice to get to the root cause here. We don't see uninitialized
var
warnings using pprof at google, but maybe it has to do with the nature of the
executable? Any chance you can find an executable + profile that causes this
warning? Maybe we can figure out what's going on.
I don't know what the "right" thing to do here is, either. I don't really
understand
much about what might cause anything here to be uninitialized, in practice, so I
don't know the best way to recover when it happens.
Original comment by csilv...@gmail.com
on 18 Oct 2008 at 2:23
Even if we knew whether the executable used PDB info, we would still need to
check
the dlls. As it stands, cygwin's nm fails for PDB files and we fallback on the
other
one.
I'll see about tracking down the uninitialized variable. I think what's
happening
here is the $address[$i-1] entry of $symbols doesn't exist. So this concats
uninit
strings:
my $destination = $symbols->{$address[$i-1]}->[1] . ":" .
$symbols->{$address[$i-1]}->[0];
Original comment by dvi...@gmail.com
on 18 Oct 2008 at 1:55
I agree that $address[$i-1] doesn't exist. But I'm not sure why, or what the
right
thing to do when it doesn't exist.
Original comment by csilv...@gmail.com
on 20 Oct 2008 at 5:46
OK, while we're figuring out the $address[$i-1] thing, I've been starting on the
other files. I'll ask the high-level questions I noticed right away first;
there may
be more detailed questions later. Sorry these jump around so much:
1) heap-profiler:
Why did you need to change FDWrite to use handles and windows functions instead
of
file descriptors and libc functions? I'd like to avoid all these extra ifdefs
if
possible. Also, having the windows kernel allocate memory in CreateFile is
indeed
potentially dangerous -- that's why we had to write FDWrite to begin with
rather than
using fwrite -- so I'd be happy to avoid the whole thing if I can, and just
leave
FDWrite as it is now. What was wrong with the way it was before?
2) tcmalloc.cc:
Why the changes to pad_? Does pad_[0] cause MSVC to barf? It's been working
so far,
so I'm confused by this. What problem are you trying to solve here?
} /* dvitek: You can't do this unless you can assure me that this is
} * the very last destructor called in the process.
} */
} //UnpatchWindowsFunctions();
In fact, we can assure this, because destructors run in the opposite order of
construction, and we go through a lot of effort to make sure this constructor is
called first in the program. If it's not, lots of trouble happens (which maybe
you've been seeing).
I'm not comfortable taking this out, and with the other, related changes you
make to
leak memory in tcmalloc, and to silently drop attempts to free memory that was
allocated by libc but freed by tcmalloc. The last particularly worries me. It
could
mask serious bugs and be almost impossible for a user to debug.
So I'd rather address this issue by figuring out how to make sure this *is* the
last
destructor called in a process. In unix-land, we do that by requiring that
tcmalloc
be the last library linked into an application. (We already add relevant
dependencies -- including some code you've added -- to ensure that within
libtcmalloc, this is the first constructor run.) Is there something equivalent
we
can do in windows-land? I'm happy to document that requirement.
Anyway, it would be great if you could try to do that, and uncomment this
change and
the other related ones (the !span check near tcmalloc.cc:2775, the SpinLock
destructor in port.h, maybe others?) to see if you can get the code working.
} /* This is dangerous on windows if printf isn't working for one
} * reason or another.
Can you elaborate on this? Is it that printf needs to call malloc? What
problems
were you seeing here? I have no problem turning off large-malloc reporting,
btw. I
may want to do it for everyone in any case. It's kinda distracting.
Thanks for adding _msize and _expand, btw! I didn't know about these
functions. I'm
thinking of moving their definitions to patch_functions.cc, though, since they
are
windows-specific. Will that work ok, do you think, or will there be other
problems?
3) addr2line.c:
Instead of downloading to c:\websymbols, should we download to
TMPDIR\websymbols?
Doesn't windows have some tmpdir function?
Also, whenever we create a new file, I worry about security concerns. I don't
know
how this works in windows-land, but what I would normally do is have something
like
char* dir = mkdtemp("TMPDIR\websymbols_XXXXXX") || fail;
and then download to dir\websymbols
I think mkstemp() would work as well. tmpfile() would be best of all, though
not
usable in this context it looks like. I don't know if there are windows
equivalents
for any of these.
This avoids attack vectors where an attacker makes c:\websymbols a symlink to
some
location (like command.com), and then when you run this binary, it overwrites
the
pointed-to location with your permissions.
4) port.cc:
Some of the stacktrace code reminds me of code in stacktrace_x86-inl.h. Does
it make
sense to try to merge them somehow? Maybe it's too much work.
In general, I'm not a windows expert enough to evaluate this code. I'll try to
find
some folks around here who are.
5) port.h:
} #if defined(_M_IX86)
} /* Without this, the profile dumper will try to use the 64-bit
} * directive, which causes junk to be written to the profile.
} */
} #ifndef PRIxPTR
} #define PRIxPTR "lx"
} #endif
} #endif
Can you explain this a bit more? What do you mean by "try to use the 64-bit
directive?" (And where do they get that directive from, whatever it is?) And
why do
you do this just for _M_IX86? What do we do in other situations (what other
situations might there be?)
6) patch_functions.cc:
It seems like you've totally rewritten this. I'm sure it's much better! But
again,
I don't know enough to evaluate it. I see that you spend a lot of effort to
deal
with having multiple libc's in a single executable. Does that really happen?
Gosh.
One thing that occurs to me is that instead of having lots of arrays that look
like
foo[MAX_LIBCS]
it would make more sense to have a struct that included all the pointers, and
have an
array of that struct. That would get rid of the need for the macros you have, I
think for 0-8, and also let you extract out some common code into functions
easier.
Are there any problems you foresee with that rewrite?
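Roughly, I'm picturing something like this -- purely illustrative, with made-up names
rather than the real patch_functions.cc identifiers:
---
struct LibcFunctions {                  // one instance per libc seen in the process
  void* (*windows_malloc)(size_t);      // original, unpatched entry points
  void  (*windows_free)(void*);
  void* (*windows_realloc)(void*, size_t);
  void* (*origstub_malloc)(size_t);     // trampoline back to the original code
  bool  still_loaded;
};
static const int kMaxLibcs = 8;         // stand-in for MAX_LIBCS
static LibcFunctions libcs[kMaxLibcs];
---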
As a more general note, the code you write is in its own, consistent style, but
a
style that's quite different from the rest of perftools. My plan, as I
incorporate
this functionality in, is to rewrite most of it to existing perftools style.
Does
that sound ok to you?
Original comment by csilv...@gmail.com
on 22 Oct 2008 at 7:14
(1)
Microsoft has slightly different symbols for everything, even using libc:
http://msdn.microsoft.com/en-us/library/z0kc8e3z(VS.71).aspx
As you can see, you basically need underscores before every identifier. When I
saw
the old code I thought you were probably using file descriptors (instead of
something
like iostream) to avoid acquiring locks and allocating memory. If you look at
the MS
CRT implementation (source comes with the non-free versions of vc++) it
acquires
locks inside _open at the very least. It seemed safer to use something closer
to a
direct syscall given that it needed to be changed anyway.
(2)
MS VC does barf on [0]. This may have been happening when I had added a field
to
spinlock (which I later removed). It seems more maintainable in any case if it
can
withstand these sorts of changes without compilation errors.
I tried to find a way to get the tcmalloc constructors to go first, but I gave
up.
The closest thing I could find to a solution was:
http://blogs.msdn.com/vcblog/archive/2006/10/20/crt-initialization.aspx
However, looking at the MS libc source (which comes with VC++ pro), I can see
that
environ (and probably other things too) are malloced before any of those
initialization functions are called. See Vc7/crt/src/crt0.c for the important
bits.
putenv crashing was the first issue I ran into before I altered do_free and
realloc
to be forgiving. There are other problems -- I suspect DLLMain events can
happen
before static constructors and after destructors.
I have an alternate proposal to fix the unsavoriness of ignoring strange
pointers.
We can hook delete and delete[] in the same fashion that we hook free, which
would
allow us to fall back on the CRT implementation of free/realloc/delete/delete[]
inside do_free when we see that we don't know anything about the pointer in
question.
This is only slightly unsavory because I would be hardcoding the mangled
operator
delete identifier for VC into the source. do_free would take an extra
parameter
(similar to realloc) in order to achieve this.
Moving _msize and _expand to port.cc is fine so long as all the classes and
globals
they need are in scope there (and not privately declared in tcmalloc.cc).
There are processes that set stdout to NULL, which will cause printf attempts
to null
pointer dereference. It can also lead to the windows equivalent of SIGPIPE
under
some circumstances, if your process has been disconnected for some reason.
(3)
You want to make sure you use the same directory every time you run --
downloading
symbols is not always a quick process. NTFS only supports directory to
directory
links (it calls them junctions). They are probably closer to hardlinks than
symlinks. The dbghelp library will fail to download any pdb file if a file
with the
same name already exists (it just tries to read the existing file as if it is a
pdb
file). If there is something preventing the dbghelp library from creating one
of the
directories it wants to create, then it will silently create a 'sym' directory
as a
sibling to symsrv.dll and put things there (surprising, I know).
(4)
Some of it is similar, but the heuristics (and knowledge of stack top/bottom)
do make
it different enough that I'm not sure it's worth it.
(5)
basictypes.h has this, which is not correct on 32-bit platforms:
#ifndef PRIxPTR
#define PRIxPTR PRIx64
#endif
My addition should probably check all arch symbols that indicate 32-bitness,
_M_IX86
is the one I know about. The other two windows platforms are 64-bit (x86_64
and
ia64).
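Concretely, I mean something in this direction (just a sketch; _M_IX86/_M_X64/_M_IA64
are the MSVC architecture macros, and the PRIx64 fallback mirrors what basictypes.h
does today):
---
#ifndef PRIxPTR
# if defined(_M_IX86)        /* 32-bit x86: pointers print as longs */
#  define PRIxPTR "lx"
# else                       /* _M_X64 and _M_IA64: 64-bit pointers */
#  define PRIxPTR PRIx64
# endif
#endif
---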
(6)
Multiple libcs don't happen with a single executable, but they do happen with a
single process. Executables and DLLs are bound to a specific libc at
build-time.
Many DLLs shipped by MS are bound to msvcrt.dll -- for instance, if you link
with
ADVAPI32.DLL (exports functions for working with the windows registry such as
RegCreateKey), it links with WINTRUST.DLL, which links with msvcrt.dll.
msvcrt.dll
is the VC6 CRT.
Using a struct of the libc functions sounds like a good idea to me. I'm not
entirely
certain that you can completely eliminate the need for 8 invocations of some
macro
without using more machine-code-generating code, but maybe I'm not seeing it.
Change
the style at will.
Let me know if I neglected to respond to any of your questions.
Original comment by dvi...@gmail.com
on 22 Oct 2008 at 4:42
Thanks for your responses. I'm afraid some of these may take a bit of back and
forth, but I think we're getting closer to resolving many of them.
} It seemed safer to use something closer to a
} direct syscall given that it needed to be changed anyway.
Why did it need to be changed anyway? Were you seeing any problems that
prompted
this change, or was it just prophylactic?
It's fine for the OS to acquire locks in this code -- what we're worried about
is
allocating memory (since that might cause a recursive call into tcmalloc). As
long
as _open/_read/_etc don't allocate memory, we're fine (and possibly even if
they do,
depending on how they do it).
You shouldn't need to worry about changing the open to _open and the read to
_read;
that should all be done for you in port.h. So ideally this code shouldn't
require
any changes at all. If it does, hopefully you can make those changes in port.h
rather than in this file. Can you give that a try?
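The kind of mapping I mean port.h can carry is roughly this (a sketch only; I haven't
checked it against the exact defines already in the tree):
---
#ifdef _MSC_VER
# include <io.h>       /* declares _open, _read, _write, _close */
# define open  _open
# define read  _read
# define write _write
# define close _close
#endif
---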
} MS VC does barf on [0]
This surprises me greatly, because even before your patch, we were running
tcmalloc
on windows, and this code compiled fine. Why would it suddenly start to break?
Can
you try changing this code back and seeing if it works now? If not, can you
report
the error?
} I tried to find a way to get the tcmalloc constructors to go first, but I
gave up.
Hmm, windows is *so* annoying here. :-(
libc doesn't use new, does it? So we only need to patch free, not delete? As
long as
we can guarantee tcmalloc runs before any other user libraries, we should be
fine. I
think with the crt.$init stuff we can at least guarantee that. (Though maybe
not, if
other libraries are also using the same trick?)
We're already using the crt-init in port.cc to set up locking, I think, so it
might
not be so bad to hook in there.
Anyway, I like your alternate idea of hacking free/delete to pass back to msvc's
free/delete if they don't know the pointer. We still have to modify tcmalloc to
not-die if we see free() on a pointer we don't recognize. I'm worried that
will slow
down free in the common case that we *do* recognize the pointer. We're doing an
extra comparison now in every free call, right? I'm not sure if that will be
tough
to get buy-in for, or not. I agree we'll want a separate do_free function that
returns a bool if the free succeeded or not, for windows.
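As a toy illustration of the dispatch I have in mind (stand-alone demo code, not
tcmalloc's actual implementation -- the real ownership test would be tcmalloc's span
lookup, not a set):
---
#include <cstdlib>
#include <set>

static std::set<void*> our_pointers;     // stand-in for tcmalloc's ownership check

static void* our_malloc(size_t n) {
  void* p = std::malloc(n);              // stand-in for tcmalloc's allocator
  our_pointers.insert(p);
  return p;
}

// Returns true if we owned (and freed) the pointer, false if we had to
// delegate it back to the allocator that really owns it.
static bool our_free(void* p, void (*crt_free)(void*)) {
  if (p == NULL) return true;
  if (our_pointers.erase(p) == 0) {      // never saw this block: not ours
    crt_free(p);
    return false;
  }
  std::free(p);
  return true;
}
---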
} There are processes that set stdout to NULL, which will cause printf attempts
to
} null pointer dereference.
Is that the only problem? If so, can we change the printf statements to be
if (stdout) printf(...)
?
} You want to make sure you use the same directory every time you run --
downloading
} symbols is not always a quick process.
OK, maybe we don't need to (or can't) do fancy permissions stuff here. But I
still
would feel better downloading this to TEMPDIR than to the root directory. Does
that
not make sense?
} basictypes.h has this, which is not correct on 32-bit platforms:
Right. These basictypes definitions should hopefully never be used, because the
system should define all these things for us. But windows doesn't define any of
these for us. Hmm, I'm surprised PRIxPTR is all you need. Don't you have
trouble
with say, PRId64 as well, which I see is used in heap-profiler.cc? Or is that
already defined for us somewhere?
I think the right fix is to make basictypes.h do something better. Is there a
__WORDSIZE macro or something equivalent in MSVC header files, that you know
of?
It's safer to use that than to try to switch on architecture versions.
} Executables and DLLs are bound to a specific libc at build-time.
} Many DLLs shipped by MS are bound to msvcrt.dll
In practice, does anyone link to a different libc than msvcrt.dll? That is,
how much
do we need to worry about this in real life?
} Let me know if I neglected to respond to any of your questions.
That's certainly enough to start on, thanks!
Original comment by csilv...@gmail.com
on 23 Oct 2008 at 12:13
For one reason or another the $address[$i-1] thing isn't manifesting on the
small
tests I've tried today. I'm having it run some more expensive tests, but I
won't
have a definitive answer until next week on that, assuming nothing happens.
Maybe
the warnings were ultimately caused by the carriage returns, which are now
fixed.
) _open and CreateFile
So I had not gone into the _open source to confirm that MS is doing something
dirty
before, but sure enough, they are. Two problems with using _open:
1. _open calls _alloc_osfhnd which calls malloc (while holding a global lock!!).
2. _open calls CreateFile which calls HeapAlloc.
Here is a potential deadlock:
Thread 1 enters _open and acquires an _open lock
Thread 2 enters RecordAlloc and acquires heap_lock
Thread 2 enters _open on behalf of tcmalloc and waits on the _open lock
Thread 1 enters RecordAlloc on behalf of _open and deadlocks
Using CreateFile takes the _open locks out of the picture, but we will still
see the
recursive call to RecordAlloc (there is no documented API that will avoid
this). If
this is a problem, then I suggest introducing:
static int recursive_depth = 0;
into the RecordAlloc function, in addition to continued use of CreateFile.
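Concretely, the guard I'm imagining looks something like this (the signature and body
of RecordAlloc are paraphrased; only the recursive_depth bookkeeping is the new part):
---
static void RecordAlloc(const void* ptr, size_t bytes, int skip_count) {
  static int recursive_depth = 0;   // only touched while heap_lock is held
  if (recursive_depth > 0)
    return;          // CreateFile et al. allocated on our behalf; ignore it
  ++recursive_depth;
  // ... the existing bookkeeping, which may call CreateFile/HeapAlloc ...
  --recursive_depth;
}
---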
) MS VC does barf on [0]
The result of the modulus changed because I had added a field to spinlock (I
believe
I said this in my previous response?). The modulus result probably did not
work out
to zero before, nor does it right now. Isn't it a good thing that the code
will
compile even if someone modifies one of the structures in the future?
) As long as we can guarantee tcmalloc runs before any other user libraries, we
should be fine.
More problems with unpatching:
1. At least one libc may have been unloaded via FreeLibrary, in which case the
attempt to unpatch will segv
2. DLLMain exit messages come in after the destructors are done.
3. Other libraries may use these magic CRT sections.
Now, many users can probably live with these issues. But what is the practical
benefit of unpatching things? It seems like a risk without any upside.
) I like your alternate idea of hacking free/delete....
The extra comparison would be the cost, but only on windows. Beats crashing
though.
Even with the magic crt sections, I think we need this to work in the presence
of
DLLMain messages and calls to putenv.
) if (stdout) printf(...)
I can think of some other issues:
1. SIGPIPE could happen
2. the buffer could get full and block forever
3. the output could get interleaved with other stderr output
Maybe we could make the fprintf calls optional (I know we would want them off
by
default here)? It isn't really a platform specific problem.
) OK, maybe we don't need to (or can't) do fancy permissions stuff here.
I can think of a few options:
1. Drink the MS kool-aid and use C:\Documents and Settings\dvitek\Local
Settings\Temp\pprof_symbols. This may be a problem because of path length
limitations on NTFS and spaces in the path though. It is entirely trivial to
query
the OS for that location (need to read the registry).
2. Check if $TEMP or $TMP is set. I don't think a clean windows install sets
any of
these. We would need to fall back on something else.
3. Use c:\temp or c:\windows\temp
4. Stick with c:\websymbols, which is what Microsoft's documentation suggests
using.
The advantage of this is that many developers already have a symbol cache here,
and
we can share it. A user can already override c:\websymbols by setting up
_NT_SYMBOL_PATH (or whatever the variable is) properly.
) PRIxPTR
PRId64 probably is broken too. Although for some reason it isn't actually
causing
noticeable problems. Probably just getting lucky with the positioning on the
stack of
the sprintfs.
I don't know of any WORDSIZE macro. We always use configure scripts
internally. It
is possible that one may exist, but I'm not sure how to find it in the
documentation.
) does anyone link to a different libc than msvcrt.dll
I can think of several:
1. msvcrtd.dll - VC6 debug version of libc
2. msvcr70.dll - VC7
3. msvcr71.dll - VC7.1
4. msvcr80.dll - VC8
5. msvcr80.dll - VC8 with service pack 1 (yes, they have the same name)
I almost always use msvcr80.dll. VC6/msvcrt.dll is pretty ancient.
Original comment by dvi...@gmail.com
on 23 Oct 2008 at 2:37
Ahem, where I said:
It is entirely trivial to query
the OS for that location (need to read the registry).
I meant to say that it is not trivial :).
Original comment by dvi...@gmail.com
on 23 Oct 2008 at 2:41
} For one reason or another the $address[$i-1] thing isn't manifesting on the
small
} tests I've tried today.
No rush. It'll be a while before we're all done with this process anyway. :-}
Just
let us know when you have something to report.
} Using CreateFile takes the _open locks out of the picture
OK, that makes sense. I'll comment that when applying this part of the patch.
} we will still see the recursive call to RecordAlloc
That's fine -- we already have the 'dumping' variable to prevent recursive
calls to
DumpProfileLocked.
} The result of the modulus changed because I had added a field to spinlock
Ah right, you did say that before. I should be able to fix this a different way
(using templates to get rid of pad_ entirely in the case size is a multiple of
64).
} But what is the practical
} benefit of unpatching things? It seems like a risk without any upside.
Interesting question. If we have code that delegates free() to crt-free() when
appropriate, maybe there's no upside. Right now, we need to unpatch so
microsoft
library destructors can call the right free() when they destroy.
} The extra comparison would be the cost, but only on windows.
Ok, fair enough. Let's go that route.
} Maybe we could make the fprintf calls optional
I'm looking at the documentation for this, and it looks like we should only be
reporting allocs larger than 1G. That is, it probably shouldn't trigger much,
if at
all, in real life. We could make the default even larger, but I'm tempted to
just
not worry about this. I don't think it should be a problem in practice. Have
you
been seeing it?
The same printf problem presumably exists when the heap-profiler prints to
stdout
(stderr?) every time it writes another heap file. Would we need to solve the
same
problem there?
} Stick with c:\websymbols, which is what Microsoft's documentation suggests
using.
Ah, I didn't realize this was the official recommended microsoft location. That
makes sense to use, then.
btw, for getting a temporary directory, I was thinking you'd call GetTempPathA.
Does
that not work?
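i.e., something like this (windows-only sketch; the "websymbols" subdirectory name is
just an example):
---
#include <windows.h>
#include <stdio.h>

int main(void) {
  char tmpdir[MAX_PATH + 1];
  char symdir[MAX_PATH + 32];
  DWORD len = GetTempPathA(sizeof(tmpdir), tmpdir);   /* e.g. "C:\DOCUME~1\...\Temp\" */
  if (len == 0 || len >= sizeof(tmpdir))
    return 1;                  /* no temp dir? fall back to c:\websymbols */
  _snprintf(symdir, sizeof(symdir), "%swebsymbols", tmpdir);
  printf("would cache symbols in: %s\n", symdir);
  return 0;
}
---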
} PRId64 probably is broken too. Although for some reason it isn't actually
causing
} noticeable problems.
Actually, it looks like the definition we give for PRId64 is right for windows.
I'll
just change PRIxPTR to be right for windows as well. Nobody else should be
using
this part of basictypes.h anyway, since they get these values via inttypes.h.
Original comment by csilv...@gmail.com
on 23 Oct 2008 at 7:09
I believe open issues currently:
1) Still waiting to figure out if $address[$i-1] is an issue
2) Rewriting free (and delete?) to delegate to the CRT if need be
3) Deciding if we really want to quiet the big-alloc calls, given they only
trigger
when an alloc is >1G
Everything else that has been brought up so far, I'm currently working on and
plan to
fix. (1) is just waiting, and (3) I've decided I'll probably just up the
default
from 1G to like 8G. Then it should never(!) trigger by default, but you can
set it
lower with an environment flag. For (2), I think I'm waiting for a patch from
you
that tries this? No rush; just making sure we're on the same page.
I'm now working on addr2line-pdb and nm-pdb. A few questions:
In addr2line-pdb, you use SymFromAddr() to get the first line of output you
print.
The msdn documentation says this gives the "name" of the symbol. Is that the
function name this address is found in? (I hope it is!)
In nm-pdb, you have a comment "Everything leaked". What does that mean,
exactly?
Overall, these were pretty easy to follow. I'm making progress...
craig
Original comment by csilv...@gmail.com
on 25 Oct 2008 at 6:05
Ah, one more thing about nm-pdb. I tried compiling on VC 7.1 (which I'd like
to be
able to support if possible), and it didn't compile because many fields of
IMAGEHLP_MODULE64 are missing in VC 7.1 (such as SourceIndexed).
I looked at what output you actually use in the pprof patch, and it's only
image name
and loaded image name, which we *do* have in VC 7.1. So I've commented the
rest of
them out. Is that ok?
If there's some way to distinguish VC 8 and up from VC 7.1 and below, I'm happy
to
put that in so we only omit those fields on older compilers.
Original comment by csilv...@gmail.com
on 25 Oct 2008 at 6:40
We're on the same page on 1-3.
On SymFromAddr: Yes, I'm pretty sure it gets the symbol containing the code
address
you hand it.
"Everything leaked" refers to the fact that the utility is not calling free on
some
things it malloced. So if one were turning this code into a library or
something,
one would want to deal with that.
About VC7.1: Feel free to ditch any parts of that function that report basic
information about the module. It was for debugging. I am showing the user the
PDB
we use in case the dbghelp library chooses to use the wrong PDB (or finds none
at all).
Original comment by dvi...@gmail.com
on 25 Oct 2008 at 1:45
The $address[$i-1] thing appears to be unnecessary -- I'm guessing the carriage
returns were causing it since that's the only thing that changed since its
introduction.
Original comment by dvi...@gmail.com
on 28 Oct 2008 at 3:28
Great! I'll check that one off the list then.
I'm making good progress getting the patch reviewed by local windows experts.
Of
course, I've left the most challenging bits until last... One question about
nm-pdb.c: you have some of the printf's #if 0'ed out. What is the reason for
that?
Should I leave it #if 0'ed out (with a comment as to why), re-enable it, or
take it
out entirely?
Original comment by csilv...@gmail.com
on 29 Oct 2008 at 11:25
Just remove them
Original comment by dvi...@gmail.com
on 29 Oct 2008 at 11:27
OK, I'm puzzling over AdhocStackTrace now. I know it's commented out, but I'd
still
like to make sure the code is good.
Can you add comments to GetStackTrace (the one used by adhoc-stack-trace)? In
particular, I don't really understand what t and b are (are they like top and
bottom?
but different?), and what the for loop in the !sp case is all about.
You have this code:
} if (!sp) {
} uintptr_t diff = (uintptr_t)t - (uintptr_t)sp;
But isn't the whole point sp is 0? So diff will always be == t? What am I
missing?
You have a label looks_good, and code that reads:
} if (sp) {
} goto looks_good;
But puzzling through the logic, it looks like it would be equivalent to have a
'break' here. Is that true? If not, what am I missing?
It also looks like, in the PROBABLY_CODE case, you're adding to result twice:
once in
the PROBABLY_CODE case itself, and then once again after you break out of the
for
loop. Is that intentional? Maybe another comment here would be helpful.
Also, in RtlCaptureBackTrace, why are you calling DebugBreak if we can't find
the
right function to call? Shouldn't we just fall back to one of the other
options, or
else return NULL? I don't think breaking is a good idea in the case of
tcmalloc.
Maybe this code is cut and paste from some other context?
craig
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 1:38
I've attached some comments and a fix to the uintptr_t diff thing, which was
introduced during some refactoring I guess. The diff check can potentially
speed
things up a bit, but certainly isn't crucial.
break; at the goto would break out of the for, hitting the other "break;" right
after
the for, which would leave the outer loop. (We need to break out of the
if(!sp) block)
The PROBABLY_CODE case only breaks out if we've accumulated enough frames, in
which
case we break out of the inner loop, and there is the second break out of the
outer
loop immediately succeeding the inner loop, so it never hits the other addition
to
result. See the comments I added.
DebugBreak is just a poor man's abort here. Feel free to use RAW_CHECK or
something
else that is more appropriate, but safe to use in this context (i.e., won't
recurse
into malloc).
Original comment by dvi...@gmail.com
on 30 Oct 2008 at 2:14
Attachments:
OK, I understand the control flow a little better now. I may try to refactor
it to
be less confusing.
I still don't know what t and b are. Is it possible to use more descriptive
names?
} DebugBreak is just a poor man's abort here. Feel free to use RAW_CHECK or
something
} else that is more appropriate, but safe to use in this context (i.e., won't
recurse
} into malloc).
I don't think we should abort here. Typically when we can't get the stacktrace
for
some reason, we just return 0. I'll make that change unless there's a good
reason
not to.
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 2:32
Hmm, I take it back: I'm still confused by the control flow. If I'm reading
this
right, then in the PROBABLY_CODE case, we'll add ptr to the result array, and
then
continue on with the for loop. That means if we have four pointers in a row
that
look like the point to code, all four will be added to result. Is that right?
The
comment you added says:
} // Start scanning memory for a code pointer until we either find
} // one or hit the top of the stack.
which leads me to believe we should be breaking out of the for loop after the
first
PROBABLY_CODE pointer. But the code seems to act otherwise.
Also, consider the case where traversing the stack looks like this:
<ptr into code><ptr into code><something that looks like an sp><pointer into code>
The way the code works now, the first two will get put into result, then we'll
see sp
and follow that like a normal stack pointer. Is that right?
I guess the rationale is that in the above case, the first <ptr into code>
comes at
the end of the top stack frame, the next comes at the next stack frame, and
then the
sp comes at the next stack frame. The first two functions are using FPO and
don't
have an sp, but the third function is not and does have an sp. Is that
accurate?
Finally, I have concluded that t and b stand for top_of_stack and
bottom_of_stack
respectively. :-) But I only see you setting this on the i386 case. In the
x86_64
case, you only seem to set sp. Is that right?
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 2:45
OK, based on my understanding of what is going on, here is my proposed version
of
GetStackTrace for the adhoc method. Can you take a look and see if it is
correct?
Mostly I just changed the control flow by making us go back to the outer loop
every
time we see an FPO return value, even though we could, correctly, stay in the
inner
loop. I think by leaving the inner loop every time we assign to result, the
control
flow is a bit clearer.
My comments may be totally off-base; please let me know if they are, and how to
make
them comprehensible and correct.
I'll just put the code inline rather than attaching:
---
int GetStackTrace(void** result, int max_depth, int skip_count) {
void **sp, **top_of_stack, **bottom_of_stack;
#if defined(_MSC_VER) && defined(_M_IX86)
// Stack frame format:
// sp[0] pointer to previous frame
// sp[1] caller address
// sp[2] first argument
// ...
sp = (void **)&result - 2;
// Get the top and bottom of the stack.
__asm {
push eax
mov eax, FS:[0x04]
mov [top_of_stack], eax
mov eax, FS:[0x08]
mov [bottom_of_stack], eax
pop eax
};
#elif defined(_MSC_VER) && defined(_M_X64)
DWORD c_rbp;
// Move the value of the register %rbp into the local variable rbp.
__asm mov [c_rbp], rbp
// Arguments are passed in registers on x86-64, so we can't just
// offset from &result
sp = (void **) c_rbp;
// TODO: fill top_of_stack and bottom_of_stack as well
#else
#error "Figure out how to implement stack end computation for this architecture"
#endif
MaybeInitializeCode();
int n = 0;
void **prev = sp;
while (n < max_depth) {
void** return_address_ptr = NULL;
// If the current frame has a stack pointer, things are easy: sp+1
// is the return address, that we want to store, and *sp is the
// next stack frame. If the current frame does *not* have a stack
// pointer, then probably it's for a function that was compiled
// with FPO. In that case, we take any old pointer we see that
// looks like it's pointing into code, and take that as the return
// address. Then we assume that ptr+1 is the next stack frame (what
// else can we do?)
if (!sp) {
uintptr_t diff = (uintptr_t)top_of_stack - (uintptr_t)prev;
if (diff < 0x30 && !*(prev+1)) {
// Probably really is the oldest frame if we are this close to
// the top end of the stack. Stop now to avoid using the code
// scanning heuristic, below.
return n;
}
// Scan memory starting at the top of the current stack frame
// until we a) run out of stack, b) find something that looks
// like a stack pointer, or c) find a pointer-into-code. In
// case (c), we assume this is a return address that ends the
// current frame, which was compiled with FPO.
for (void** ptr = prev + 1; ptr < top_of_stack; ptr++) {
sp = NextStackFrame(ptr, top_of_stack, bottom_of_stack);
if (sp) {
break; // great, found something that looks like a stack frame!
} else if (PROBABLY_CODE(*ptr)) {
return_address_ptr = ptr;
break;
}
}
}
// At this point we should either have an sp or an explicit return_address
if (sp) {
if (skip_count > 0) {
skip_count--;
} else {
result[n++] = *(sp+1);
}
prev = sp;
sp = NextStackFrame(sp, top_of_stack, bottom_of_stack);
} else if (return_address_ptr) {
if (skip_count > 0) {
skip_count--;
} else {
result[n++] = *return_address_ptr;
}
prev = return_address_ptr;
sp = NULL;
} else {
break; // didn't find anything? time to give up
}
}
return n;
}
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 2:57
) which leads me to believe we should be breaking out of the for loop after the
first
) PROBABLY_CODE pointer. But the code seems to act otherwise.
The comment isn't correct -- it just keeps going on the assumption that we just
have
multiple FPO frames in sequence until it hits something like a non-fpo frame.
Your
understanding of what's going on sounds solid.
For x86_64, I left a "#error" in about implementing stack end detection (I
think?).
The modifications look OK, although I would certainly test them :).
Original comment by dvi...@gmail.com
on 30 Oct 2008 at 3:12
} For x86_64, I left a "#error" in about implementing stack end detection (I
think?).
Maybe I'm not understanding windows preprocessor defines? You have this:
#elif defined(_MSC_VER) && defined(_M_X64)
DWORD c_rbp;
// Move the value of the register %rbp into the local variable rbp.
__asm mov [c_rbp], rbp
// Arguments are passed in registers on x86-64, so we can't just
// offset from &result
sp = (void **) c_rbp;
#else
#error ...
Don't we need to fill b and t the same time we set sp here?
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 5:58
The original diff at
http://google-perftools.googlecode.com/issues/attachment?aid=753171857806843247&name=google-perftools-0.99.2.msvc8_i386_hprof.patch
has:
+#elif defined(_MSC_VER) && defined(_M_X64)
+ // __builtin_frame_address(0) can return the wrong address on gcc-4.1.0-k8
+ DWORD c_rbp;
+ // Move the value of the register %rbp into the local variable rbp.
+ __asm mov [c_rbp], rbp
+ // Arguments are passed in registers on x86-64, so we can't just
+ // offset from &result
+ sp = (void **) c_rbp;
+
+#error "Figure out how to implement stack end computation"
+#endif
Did I post another one later? At any rate, it needs a #error above the #else.
Original comment by dvi...@gmail.com
on 30 Oct 2008 at 6:02
Ah, I understand. I think I moved the #error when I was editing, not realizing
your
intent. OK, i've changed it back.
Original comment by csilv...@gmail.com
on 30 Oct 2008 at 6:25
Here is a patch (should be composed with my previous patch) that makes free and
company fall back on the appropriate libc instance.
I added some handling of more kinds of jump instructions that I encountered to
the
patcher. There are also some bug fixes that improve things when using
statically
linked libc and a libc dll in the same process.
It may be desirable to move some prototypes for functions from tcmalloc.cc into
a
header file that patch_functions.cc can see.
Original comment by dvi...@gmail.com
on 31 Oct 2008 at 12:34
Attachments:
Thanks for the new patch! I will look through it when I have a chance. I am
still
trying to work through the stacktrace code, but I may put that aside since I
think
this part of the system is more useful to fix.
Original comment by csilv...@gmail.com
on 31 Oct 2008 at 1:08
OK, I've finally had some time to look at the windows-patching code. One
question I
have is why you are patching new, new[], delete, and delete[]. I didn't have
to do
that before -- I think because windows lets you override them (but not malloc
and
friends) just by defining a new version. So why patch? Is it because you are
worried about multiple libc's? Is new defined in mscvrt?
I ask because you don't seem to completely patch new, which is an overloaded
operator. We need to support overloading the nothrow version of new as well.
There
may be other versions as well (though I'm happy not to worry about them because
tcmalloc doesn't). It's easier if we can just avoid the whole deal. Can we?
Original comment by csilv...@gmail.com
on 8 Nov 2008 at 4:14
Here is an example that demonstrates why just defining the operators won't work:
/* killit_dll.cpp -- compile me into killit.dll */
extern "C" {
__declspec(dllexport) char *killit( char *p )
{
  delete p;
  return new char;
}
}
/* test.cpp -- compile me into test.exe */
#include <stdio.h>
#include <assert.h>
#include <windows.h>
typedef char *(*killit_fn_t)(char *p);
void* operator new(size_t s)
{
  printf( "New %u!\n", s );
  return (void*)0x12345678;
}
void operator delete(void *p)
{
  assert( p == (void*)0x12345678 );
  printf( "Delete %p!\n", p );
}
int main(int argc, char *argv[])
{
  delete new char;
  HMODULE libby = LoadLibrary( "killit.dll" );
  assert( libby );
  killit_fn_t killit = (killit_fn_t)GetProcAddress( libby, "killit" );
  assert( killit );
  delete killit(new char);
  return 0;
}
The invocations of new and delete from inside the DLL will not be intercepted
by the
user defined operators.
As for the nothrow versions, no version of libc seems to implement them on
windows.
The only exported symbols for new are "operator new" and "operator new[]". You
can
use the depends.exe utility on any libc dll to see what symbols it exports.
Original comment by dvi...@gmail.com
on 9 Nov 2008 at 4:20
Right, I believe that if you overload new/delete in an .exe, it won't be
exported to
the dll. But what matters for perftools is the other way around: when you
overload
new/delete in a .dll and try to use it in the .exe (or in another dll). In unix
those two situations are different, so I wouldn't be surprised if that was true
for
windows too.
The other point that if windows new/new[] just calls malloc(), and delete just
calls
free(), then we don't need to override new/delete, because we already override
malloc
and free.
tcmalloc_minimal_unittest tests that tcmalloc correctly intercepts new and
delete
calls. Right now that test is passing on windows, even with the current
version of
tcmalloc that doesn't try to patch new/delete. So I suspect it is not
necessary to
patch them. Have you seen a situation where it's actually necessary -- that is,
where the current tcmalloc isn't intercepting new/delete?
The way we're going to do some of the windows_patch.cc changes, life will be a
lot
easier if we don't have to worry about new and delete, so I'd like to make sure
it's
really necessary to patch them.
Original comment by csilv...@gmail.com
on 9 Nov 2008 at 6:52
) ... overload new/delete in a .dll and try to use it in the .exe (or in
another dll)
Unless the user has control of the link lines for every dll in their process
that
uses new or delete, then it is not possible to intercept all calls to new/delete
without patching them. This can be a big problem if you use any third-party
components, especially with inline c++ classes.
Even if we are willing to rely on the fact that libc happens to implement new by
calling malloc, the FPO optimized frames introduced by libc's new frame(s) may
screw
up call stacks.
Original comment by dvi...@gmail.com
on 9 Nov 2008 at 7:34
But FPO optimized frames are going to screw up call stacks no matter what,
right? Or
maybe this kind of screw-up is worse somehow? I don't know much about how FPO
affects stack frames.
I'm really pushing back on overloading new because right now patch-functions.cc
reimplements a lot of the code in tcmalloc.cc, and I want to move away from
that.
The more we patch, the harder the task gets. Re-implementing the C++ operators
will
present its own challenges, I suspect, and I really would be much happier if we
could
get away without doing it.
How bad do you think it will be if we don't patch new and delete?
Original comment by csilv...@gmail.com
on 10 Nov 2008 at 11:04
) FPO optimized frames are going to screw up call stacks no matter what, right?
If we patch new and delete, then they are only screwed up when the callee of
new or
delete is FPO optimized. The other way, the stack will just end somewhere
inside
libc even with debug code -- you won't even be able to identify the module
immediately calling new.
) reimplements a lot of the code in tcmalloc.cc
The bits of code for dealing with new and delete in patch_functions.cc are just
1 or
2 line wrapper functions dispatching to tcmalloc.cc. Example:
static void *Perftools_new(size_t s) {
  void *p = cpp_alloc(s, false);
  MallocHook::InvokeNewHook(p, s);
  return p;
}
It isn't duplicating any code from tcmalloc.cc, it is just calling it.
) How bad do you think it will be if we don't patch new and delete?
I do not think I would adopt the change here. It might be OK for a restricted
audience working on smaller applications, but for large applications with lots
of
dlls I wouldn't make the compromise. The thing about windows is, it's easy to
suck a
lot of dlls into your process without realizing it.
There is another issue with needing to link all dlls with the perftools dll: It
means
that I cannot create a DLL that will function correctly in both processes that
do and
do not use tcmalloc. This can be a usage problem since it means we need to
link two
copies of many things and remember which one to use where. I prefer to avoid
needing
to make everyone potentially using tcmalloc here understand this complication.
As
is, a lot of processes are transparently getting tcmalloc and it "just works"
without
any extra work from our developers.
Original comment by dvi...@gmail.com
on 10 Nov 2008 at 11:42
The code "cpp_alloc" + InvokeMallocHook is a copy of the code for operator new
in
tcmalloc.cc. Sure it's simple code, but it's still very possible the code will
evolve over time, and now we'll have to remember to keep such changes in sync
between
tcmalloc.cc and patch_functions.cc. This will be especially tricky since the
two
implementations are used on different platforms, so we can't write a test or
anything
to warn us if they get out of sync.
That said, it sounds like patching new/delete is important. I'll see what we
can
figure out to make it happen.
Original comment by csilv...@gmail.com
on 11 Nov 2008 at 12:01
OK, I'm making progress on patch_functions again.
Can you clarify a bit why we need to define _expand? Here is what I've written
now:
---
void* Perftools__expand(void *ptr, size_t size) __THROW {
  // We need to define this because internal windows functions like to
  // call into it(?). _expand() is like realloc but doesn't move the
  // pointer. We punt, which I guess the internal callers can handle.
  return NULL;
}
---
Is this correct? Is there any more info I can add? Do we have any reason to
believe
clients can handle a NULL return ok (as opposed to, say, just dying or throwing
an
exception), or is it just that it seems to work in practice?
Original comment by csilv...@gmail.com
on 13 Nov 2008 at 3:26
Sorry, one more question: I see you patch LoadLibraryExW. Do we also need to
patch
at library-unload time? If not, can you explain briefly why not, so I can
understand
this stuff better? Thanks. :-)
Original comment by csilv...@gmail.com
on 13 Nov 2008 at 3:31
_expand is identical to realloc except that it cannot relocate the memory
block. It
returns NULL if the size of the block cannot be altered without moving it. I
do not
see how any client could assume that this function will succeed unless it knows
the
new size is exactly identical to the old size, in which case there would be no
reason
for it to call this function in the first place. All uses I've encountered so
far
fall back on realloc or malloc when it fails.
See
http://msdn.microsoft.com/en-us/library/wfzt8b7y.aspx
There is a danger in not hooking FreeLibrary if either of these things happen:
(1) We exhaust MAX_LIBCS because the user repeatedly loads and unloads libc
(2) Libc gets unloaded and then reloaded at the same address it used to be
located
at. Because it is at the same spot, we will think it is already patched when
it is
not, which means the crt heap gets used.
There is an assertion that will detect LoadLibrary calls after FreeLibrary of
libc
unless (2) is happening:
CHECK_EQ(libcs[i].still_loaded,true);
We will never call into a freed library since our little Perftools_* thunks will
never get called once the corresponding libc has been freed.
Original comment by dvi...@gmail.com
on 13 Nov 2008 at 4:29
Ok, thanks for these clarifications. It's all making sense now. One more
question
before I go to bed:
You have this comment in update_libc():
---
/* Count down so that the global namespace functions get hooked
* last. We do this because as we count, we may discover that the
* global namespace functions are duplicates. The global namespace
* functions are always in slot 0.
*/
---
Can you talk a bit more about the problems you experienced, that prompted this
comment? I ask because you're not hooking the global namespace functions last;
you're just hooking them last out of all the modules that exist at program
start-time. When a new LoadLibraryExW event is triggered, it might load in new
modules with new libc's, but by that point the main routines have already been
patched.
It seems to me that's a problem situation the way you have the code now: at
LoadLibraryExW time, we'll call update_libc(), which will go through the new
modules and patch them. If one of them uses the same addresses as the global
namespace (for malloc, new, etc), then this code will trigger:
libcs[0].Windows_ ## f = NULL; \
libcs[0].origstub_ ## f = \
libcs[i].origstub_ ## f; \
But is that safe? Especially given that libcs[0] has already been patched.
My guess is in practice this doesn't come up, because modules loaded via
LoadLibraryExW aren't going to use the same addresses for their functions that
the
global namespace does. But that's just a guess, which is why I'm asking for
your
experience, and what prompted this code segment.
One practical question that comes out of this: would it make sense/be more
correct to
only do this check against libcs[0] when orig_num_libcs is 0?
Original comment by csilv...@gmail.com
on 13 Nov 2008 at 10:46
Your guess is correct: A module loaded by LoadLibraryEx isn't going to alias
the
global namespace functions. This is because the library that implements the
global
namespace functions will always be loaded by the OS before the program starts
executing, so the very first call to update_libcs() will see them both. Unless
there
is a second patcher at work (some other piece of software similar to tcmalloc),
I
don't think you will ever see a jump from one library into another.
) One practical question that comes out of this: would it make sense/be more
correct
to only do this check against libcs[0] when orig_num_libcs is 0?
This sounds OK.
Original comment by dvi...@gmail.com
on 13 Nov 2008 at 3:31
Original comment by csilv...@gmail.com
on 14 Nov 2008 at 2:05
OK, one more question about overriding new and delete. In the file
src/windows/vc7and8.def, there's a comment that says it's possible to override
them
just by defining our own version dllexport (via the .def file). I guess that's
not
accurate? Given that, and given that we're patching these things in
patch_functions.cc now, does this mean we can get rid of vc7and8.def, as no
longer
being needed?
Original comment by csilv...@gmail.com
on 14 Nov 2008 at 7:02
Yeah, I think so.
Original comment by dvi...@gmail.com
on 14 Nov 2008 at 11:09
ok, I've got a version of patch_functions.cc working. It ended up being quite
different from the version that you submitted in your patch, though it's based
on
that. (For instance, I use templates rather than macros, and do more in static
initializers.) I wanted to give you a chance to take a look at it, and see if
it
looks like it should work for you.
Unfortunately, it's not a drop-in replacement for the current
patch-functions.cc,
because it depends on some changes to tcmalloc.cc (the new
do_realloc_with_callback
function, for instance, which lets you pass in the function to call when the
realloc
fails). Unfortunately, those changes are built on top of other changes, some of
which aren't quite working right, so I can't include the whole megillah.
Instead,
I'm just attaching this one file to look at. What do you think?
This reminds me: it would be great to have a test program, if we can, that
exercises
the code that patches different libc's from different shared libraries. Is that
plausible to do?
Original comment by csilv...@gmail.com
on 14 Nov 2008 at 11:40
Attachments:
While doing some further testing in release mode, I found some problems with the
patch_functions.cc I attached in the last comment. I'm attaching a newer
version
here; it replaces the last one.
Original comment by csilv...@gmail.com
on 15 Nov 2008 at 3:57
Attachments:
Original issue reported on code.google.com by dvi...@gmail.com
on 16 Oct 2008 at 11:27
Attachments: