Closed GoogleCodeExporter closed 9 years ago
In our experience, (4) is often the reason for discrepancies with top.
Occasionally, you'll also see discrepancies due to internal fragmentation
within tcmalloc.
The way to diagnose these is to emit the output of
MallocExtension::instance()->GetStats() (from malloc_extension.h). This will
return a lot of details about memory use, and should cover all the cases
you're suggesting counters for. In any case, give it a try and see how well
it works for you.
Original comment by csilv...@gmail.com
on 3 Nov 2009 at 3:30
Reviving this thread, as this is still happening and we're trying to
collect more insight. Version being used is 1.4 release.
Line from 'top':
  PID USER  PR NI  VIRT  RES S %CPU %MEM   TIME+ P nFLT COMMAND
 5681 root  15  0 8938m 7.7g S  233 12.2 2008:09 7    0 adindexer
Final part of the summary from 'GetStats()' follows; note that the
full 'GetStats()' output is attached.
>255 large * 52 spans ~ 869.2 MB; 1322.9 MB cum; unmapped: 863.6 MB; 875.4 MB cum
------------------------------------------------
DevMemSysAllocator: failed_=0
SbrkSysAllocator: failed_=1
MmapSysAllocator: failed_=0
------------------------------------------------
MALLOC: 6282067968 ( 5991.0 MB) Heap size
MALLOC: 3685432760 ( 3514.7 MB) Bytes in use by application
MALLOC: 1387171840 ( 1322.9 MB) Bytes free in page heap
MALLOC: 1195182432 ( 1139.8 MB) Bytes free in central cache
MALLOC: 514816 ( 0.5 MB) Bytes free in transfer cache
MALLOC: 13766120 ( 13.1 MB) Bytes free in thread caches
MALLOC: 1085167 Spans in use
MALLOC: 20 Thread heaps in use
MALLOC: 75366400 ( 71.9 MB) Metadata allocated
------------------------------------------------
As you can see, about 2GB of RSS are completely "missing". Together
with the page heap and the central cache, we're using about 7.9GB of
RSS for a 3.5GB total heap -- a little wasteful :)
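The arithmetic behind that estimate can be checked mechanically. Below is a minimal sketch, assuming the figures from the GetStats() summary above and assuming that top's RES of "7.7g" means 7.7 GiB (about 7885 MiB). The struct, constants, and function names are illustrative only and are not part of tcmalloc.

```cpp
#include <cstdint>

// Figures copied from the GetStats() summary above, in bytes.
// HeapStats and the helpers below are illustrative names, not tcmalloc APIs.
struct HeapStats {
  std::uint64_t heap_size;
  std::uint64_t in_use;
  std::uint64_t page_heap_free;
  std::uint64_t central_cache_free;
  std::uint64_t transfer_cache_free;
  std::uint64_t thread_cache_free;
  std::uint64_t metadata;
};

constexpr HeapStats kReported = {
    6282067968ULL,  // Heap size
    3685432760ULL,  // Bytes in use by application
    1387171840ULL,  // Bytes free in page heap
    1195182432ULL,  // Bytes free in central cache
    514816ULL,      // Bytes free in transfer cache
    13766120ULL,    // Bytes free in thread caches
    75366400ULL};   // Metadata allocated

// Sum of the in-use and free-byte lines of the report.
constexpr std::uint64_t ComponentSum(const HeapStats& s) {
  return s.in_use + s.page_heap_free + s.central_cache_free +
         s.transfer_cache_free + s.thread_cache_free;
}

// RSS not explained by heap size plus metadata, in whole MiB.
constexpr std::uint64_t UnaccountedMiB(const HeapStats& s,
                                       std::uint64_t rss_bytes) {
  return (rss_bytes - s.heap_size - s.metadata) >> 20;
}

// Assumption: "7.7g" in top is 7.7 GiB, rounded here to 7885 MiB.
constexpr std::uint64_t kRssBytes = 7885ULL << 20;

// In this report the component lines add up exactly to the heap size.
static_assert(ComponentSum(kReported) == kReported.heap_size,
              "component lines should add up to the reported heap size");
```

With these inputs, UnaccountedMiB(kReported, kRssBytes) comes to 1822 MiB, roughly 1.8 GiB of RSS that none of the reported lines explains, which is in line with the "about 2GB" estimate.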
I'll be collecting a lot more data and updating the thread, but I'd
love any pointers or guesses.
Current suspicions:
- I'm curious if the 'failed_ = 1' for the SbrkSysAllocator is normal
- The PageHeap summary line is also suspicious:
>255 large * 52 spans ~ 869.2 MB; 1322.9 MB cum;
unmapped: 863.6 MB; 875.4 MB cum
Perhaps the 'unmapped' stuff isn't showing up in the final heap
size, and didn't actually get unmapped properly? I suspect
that the 'madvise()' call by TCMalloc isn't unmapping properly on our system.
Original comment by mrab...@gmail.com
on 22 Jan 2010 at 1:07
Attachments:
FYI, this is preventing the usage of TCMalloc in production in 90% of our
cases (almost every long-running server is affected by this same issue). We
really want to use it though :)
Original comment by mrab...@gmail.com
on 22 Jan 2010 at 1:11
I'll ask some experts here if they have time to take a look at this data and
see if they can figure out what's going on.
Original comment by csilv...@gmail.com
on 3 Feb 2010 at 11:20
I'm also doing more experimentation and am having trouble reproducing it in a
tiny binary that I can ship over, but it happens in every single one of our
complex servers.
I'm going to do a test with tcmalloc_minimal to see if it has something to do
with collecting samples and stacks. Out of curiosity, where is the heap
sample data stored? Does it show up anywhere in the
MallocExtension::instance()->GetStats() report? If so, in what field?
Original comment by mrab...@gmail.com
on 3 Feb 2010 at 11:42
The heap sample data does not show up in GetStats(), as far as I know. And I
believe (don't quote me on this) it's stored in memory.
Original comment by csilv...@gmail.com
on 4 Feb 2010 at 12:24
Any update from the experts? Please let me know if there's any other data or
statistics I can capture and provide, because we can easily reproduce this
situation every single time on our servers.
Original comment by mrab...@gmail.com
on 8 Feb 2010 at 7:19
No, no word from anyone (I was hoping they'd respond directly to this bug
report).
What happens if you call MallocExtension::instance()->ReleaseFreeMemory()
right before (and/or after) dumping the stats? Do the numbers change? Is it
possible that the issue is with your kernel not recognizing MADV_DONTNEED?
Original comment by csilv...@gmail.com
on 8 Feb 2010 at 11:24
OK, we recently discovered a problem with accounting that may explain some of
the results you're seeing here. Basically, some parts of the accounting
system are using the bytes requested, rather than the actual bytes allocated.
I don't understand when this happens well enough to know whether it explains
the problem you're seeing. But it's easy to check: at the top of do_malloc
(in tcmalloc.cc), add this code:
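---
// Replace the requested size with the size tcmalloc actually allocates.
if (size <= kMaxSize) {
  // Small request: use the byte size of its size class.
  size = Static::sizemap()->ByteSizeForClass(Static::sizemap()->SizeClass(size));
} else {
  // Large request: round up to a whole number of pages.
  size = (size + kPageSize - 1) & ~(kPageSize - 1);
}
---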
Does this affect the accuracy of the stats you see?
We should have a more principled fix for this problem in the next release.
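To illustrate how far the requested and allocated sizes can differ on the large-allocation branch, here is a minimal standalone sketch of the page rounding used above. It assumes a 4 KiB page; tcmalloc's real kPageSize is defined in its own sources and may differ between versions. The helper names are hypothetical, and the size-class branch is not reproduced because it depends on tcmalloc's internal size map.

```cpp
#include <cstddef>

// Assumption: 4 KiB pages. tcmalloc's real kPageSize may differ.
constexpr std::size_t kPageSize = 4096;

// Round a request up to a whole number of pages.
// The bit trick requires kPageSize to be a power of two.
constexpr std::size_t RoundUpToPage(std::size_t size) {
  return (size + kPageSize - 1) & ~(kPageSize - 1);
}

// Bytes allocated beyond what was requested (hypothetical helper).
constexpr std::size_t PageSlack(std::size_t size) {
  return RoundUpToPage(size) - size;
}

// A request that is already page-aligned is left unchanged.
static_assert(RoundUpToPage(4096) == 4096, "aligned sizes are unchanged");
```

For a hypothetical 40000-byte request, RoundUpToPage gives 40960 bytes, so accounting based on the requested size would undercount that allocation by 960 bytes.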
Original comment by csilv...@gmail.com
on 7 Apr 2010 at 6:32
I've submitted our fix for sampling error to svn. Feel free to compile
perftools from svn-head and see if that helps with the problems you've been
seeing. If you do that, let us know how it works out!
Original comment by csilv...@gmail.com
on 12 Apr 2010 at 9:20
Any more word on this?
Original comment by csilv...@gmail.com
on 7 Jun 2010 at 10:47
Unfortunately, this has since dropped off my radar. You can close for now.
Original comment by mrab...@gmail.com
on 8 Jun 2010 at 12:47
OK, hopefully the sampling fix was the issue. If you ever get back to this,
feel free to reopen the bug.
Original comment by csilv...@gmail.com
on 8 Jun 2010 at 6:49
I see that the fix above is now in tcmalloc.cc.
Is anyone else having this problem? I see a huge discrepancy between top and
the results from MallocExtension::instance()->GetStats().
Original comment by kjell.he...@gmail.com
on 28 Sep 2013 at 3:24
My understanding is that this ticket is about memory profiling, not the
GetStats() method of MallocExtension.
I suggest you open a new ticket with all the details you have.
Original comment by alkondratenko
on 29 Sep 2013 at 2:39
Original issue reported on code.google.com by
pet...@gmail.com
on 29 Oct 2009 at 5:01