Status: Closed (closed by GoogleCodeExporter 9 years ago)
x86_64 OS
Original comment by srikanth...@gmail.com
on 4 Jul 2012 at 4:15
Hi
This issue is pretty serious for us; it will have a major impact on our
product, which is used in production. We started using perftools because it
gave us a phenomenal performance improvement, but this issue has hit us badly.
Any help is greatly appreciated.
Regards
Sundari
Original comment by sunda...@gmail.com
on 6 Jul 2012 at 1:08
Some questions:
* Which lock is SlowLock called on? Is it some CentralFreeList's lock or
Static::pageheap_lock?
* Is your product application also using std::map? Are the map key and value small?
* Is your architecture NUMA?
* Do you clear the maps completely and at the same time (is there a single
moment, common to all threads, when all maps are empty/deleted)?
* What happens if you call MallocExtension::ReleaseFreeMemory before refilling
the memory? Is the phenomenon still observable?
Original comment by pafi...@gmail.com
on 2 Nov 2012 at 8:36
[deleted comment]
Answers to your questions:
1) SlowLock is from CentralFreeList. Here is a stack trace part for a thread:
#0 0x0000000000abb23d in SpinLock::SlowLock ()
#1 0x0000000000ab669b in tcmalloc::CentralFreeList::Populate ()
#2 0x0000000000ab66f8 in tcmalloc::CentralFreeList::FetchFromSpansSafe ()
#3 0x0000000000ab6789 in tcmalloc::CentralFreeList::RemoveRange ()
#4 0x0000000000ab91b3 in tcmalloc::ThreadCache::FetchFromCentralCache ()
#5 0x0000000000ab4833 in (anonymous namespace)::cpp_alloc ()
#6 0x0000000000b40838 in tc_new ()
2) Yes, our product is using std::map. The key is a long and the value is a
struct (which could be large).
However, we feel small/large may not have an impact, as we see the same
slowness in our product (where the map value is LARGE) and in the sample
application (where the map value is SMALL).
3) We use openSUSE 11.2 (x86_64). SUSE supports NUMA.
Node info details from our server:
CPU model: x86_64
CPU(s): 24
CPU frequency: 2659 MHz
CPU socket(s): 4
Core(s) per socket: 6
Thread(s) per core: 1
NUMA cell(s): 1
Memory size: 99191764 kB
Do you want us to check any other specific details?
4) Yes, we clear the maps. We don't start the next iteration until all map data
is cleared, and map clearing time is MEASURED separately.
The time we measure between iterations covers JUST inserting data into the map,
for comparison: the first time it runs perfectly in parallel (confirmed with
%CPU used), while subsequent runs of the same insertion take considerably
longer.
5) The behavior is the same with or without ReleaseFreeMemory().
Original comment by srikanth...@gmail.com
on 5 Nov 2012 at 1:19
* I just wanted to know if the map is small enough to be handled by the
ThreadCache and CentralFreeList (<= 32KB). It seems so, as can be seen in the
stack trace.
* You clear the maps completely and at the same time, so all mem pages are
fully cleared and moved from CentralFreeList to PageHeap. You also destroy the
threads, so ThreadCaches are flushed as well. At the end, the free pages in
PageHeap form a single large span.
* Each loop performs exactly the same operations on the CentralFreeList, and
the CentralFreeList lock does the same thing each time.
* ReleaseFreeMemory leaves the single large span zeroed (madvise(MADV_DONTNEED)).
* During the re-fill no sbrk/mmap syscalls are needed, and the page table is
big from the beginning, but that should not have a visible impact on
performance. If there were no ReleaseFreeMemory call, minor page faults
(copy-on-write) would not be needed and performance could be visibly higher
(but this is probably hidden under your std::map::insert cost without -O3).
* I was unable to reproduce your observations (>18 cores, but smaller amount of
memory)
* I see no reason for re-using memory to be slower. I don't believe it's a
TCMalloc bug. I bet there is something in your
environment/application/kernel/hardware that you are not aware of.
(BTW: your application should get better allocation performance with jemalloc;
the contention on its lock equivalent to CentralFreeList is lower.)
Original comment by pafi...@gmail.com
on 9 Nov 2012 at 9:46
1) We are actually filling ~33 GB of memory using 18 threads in our product.
Each thread allocates at least 1.8 GB, so I hope the map is considered big.
The sample application is of course handling a smaller map; however, we
observed the slowness in both cases.
When we run the same tests using the standard gcc allocator, the first
iteration's performance is bad but subsequent iterations are better. Do you
suggest we try enabling/disabling any tcmalloc flags?
2) If we don't kill and re-create the threads, do you think performance could
be better?
3) You were asking about NUMA. Do you suggest we check any other
hardware/kernel/environment settings on our side?
Original comment by srikanth...@gmail.com
on 10 Nov 2012 at 8:15
* Correction to comment #6: "wanted to know if map is small enough" -> "wanted
to know if map entry is small enough".
* Re comment #5: CentralFreeList::Populate uses two locks. It is critical to
know which of them is contended. Are you able to check that (maybe by
profiling, by adding a per-lock SlowLock call counter, or at least by wrapping
the lock access in a non-inlined function so that you can recognize it in the
stack trace)?
* Re comment #7: I'm quite sure thread recreation will not help you. I do not
see any TCMalloc flag that you could try. If ReleaseFreeMemory did not help,
then I have no clue about the root cause. Please try jemalloc and share the
results.
Original comment by pafi...@gmail.com
on 13 Nov 2012 at 6:55
We already tried jemalloc.
jemalloc gave the best possible results with both the sample application and
the product application (meaning the same good performance for all iterations).
Our product is already integrated with TCMalloc, and we want to keep using
TCMalloc if we can solve this issue.
Original comment by srikanth...@gmail.com
on 15 Nov 2012 at 9:36
I got access to a suitable machine and reproduced your problem (with an older
gperftools).
The first loop was fast, with ~850% CPU utilization;
the second loop was 5-6x slower, with ~105% CPU utilization.
I still have no clue about the root cause :-(
Original comment by pafi...@gmail.com
on 20 Nov 2012 at 7:16
I think I know the root cause. If I'm right, your problem should disappear if
you do
"free(malloc(20ULL*1024*1024*1024));"
before the first loop.
Original comment by pafi...@gmail.com
on 24 Nov 2012 at 10:46
At the end of the first loop of the test program, the free memory was divided
into >2500 large spans.
Some additional coalescing occurred after the ReleaseFreeMemory call, but the
number of large spans remained high (probably ~2000), because they were
fragmented by metadata preallocations (there is a 128 KB preallocation of span
objects once per 10 MB, and also the page map, though less frequently).
There is a linear search through spans bigger than 1 MB (PageHeap::AllocLarge),
and this was the source of the performance problems.
The single huge alloc proposed in comment #11 works well for the test program
(it separates metadata from user memory), but it will not work for more complex
(long-lived?) programs, in which memory fragmentation may prevent large spans
from coalescing.
I see the following candidate solutions:
A: do first fit instead of best fit when doing [small] allocations from large
spans (may increase memory fragmentation)
B: keep large spans in an rb-tree instead of a doubly linked list (significant
coding effort)
C: unmap large spans (the glibc allocator used to work that way?). We would
need to trust the OS that the number of mappings will not grow too much, and we
would need to remove the address range from the PageMap (or make sure that the
same address space is reused, but that boils down to solution A).
I think B is the most reasonable, but I do not volunteer to do it now.
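To illustrate the difference between the current scan and option B, here is a
toy sketch. Span is a stand-in for tcmalloc's real span type, and std::multimap
stands in for the proposed rb-tree (both give O(log n) ordered lookup):

```cpp
#include <cstddef>
#include <list>
#include <map>

struct Span { std::size_t pages; };  // simplified stand-in

// Current scheme, roughly: large spans sit on a doubly linked list, and
// AllocLarge walks it linearly looking for the best fit -- O(n) per
// allocation, which hurts with ~2000 large spans on the list.
const Span* alloc_linear(const std::list<Span>& spans, std::size_t want) {
  const Span* best = nullptr;
  for (const Span& s : spans)
    if (s.pages >= want && (!best || s.pages < best->pages)) best = &s;
  return best;
}

// Option B, roughly: keep large spans ordered by length; lower_bound finds
// the smallest span that fits (still best fit) in O(log n).
const Span* alloc_ordered(const std::multimap<std::size_t, Span>& spans,
                          std::size_t want) {
  auto it = spans.lower_bound(want);
  return it == spans.end() ? nullptr : &it->second;
}
```

Both functions return the same span for the same input; only the lookup cost
differs, which is why B helps once the span count grows into the thousands.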
Original comment by pafi...@gmail.com
on 26 Nov 2012 at 10:19
Thanks for looking into the issue. We hope you will find time to implement B.
Meanwhile, we will move to jemalloc until this is resolved in our product.
Appreciate your time and help.
Original comment by srikanth...@gmail.com
on 27 Nov 2012 at 10:28
I observed this issue while trying to tune tcmalloc for the 447.dealII
component of SPECcpu2006. Although single-threaded, the benchmark uses a
std::map with a lot of small allocations, putting a lot of pressure on the
central cache to thread cache object movement.
I ended up solving it by removing the hard limit of 32 objects in
SizeMap::NumMoveSize.
I also tested the example in a PPC64 environment and got an average of 60
seconds per iteration without the patch and 14 s with it. The patched version
is also faster than the system glibc.
Based on this, I think tuning the object limits via an environment variable
might be the way to get better performance. I plan to work on a patch for it.
Original comment by zatr...@gmail.com
on 10 Feb 2013 at 4:13
Interestingly enough, I got different values depending on the internal page size.
On a 32-core X86_64 box with default internal page size (4k) I got:
# ./MemoryReUse_TCMalloc-glibc
Loop : 0 --- Time taken in seconds : 61
Loop : 1 --- Time taken in seconds : 14
Loop : 2 --- Time taken in seconds : 14
Loop : 3 --- Time taken in seconds : 15
While with svn r-190 gperftools:
# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 49
Loop : 1 --- Time taken in seconds : 105
Loop : 2 --- Time taken in seconds : 106
Loop : 3 --- Time taken in seconds : 106
By tuning the number of objects to move from 32 to 2048, I got an
improvement:
# ./MemoryReUse_TCMalloc-tcmalloc-2
Loop : 0 --- Time taken in seconds : 15
Loop : 1 --- Time taken in seconds : 34
Loop : 2 --- Time taken in seconds : 34
Loop : 3 --- Time taken in seconds : 35
Building with a 32K page size (TCMALLOC_LARGE_PAGES), it shows:
# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 35
Loop : 1 --- Time taken in seconds : 35
Loop : 2 --- Time taken in seconds : 34
Loop : 3 --- Time taken in seconds : 34
And by tuning the number of objects to move from 32 to 1024, I got an
improvement:
# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 12
Loop : 1 --- Time taken in seconds : 11
Loop : 2 --- Time taken in seconds : 11
Loop : 3 --- Time taken in seconds : 11
I am still investigating why large pages are such a big improvement in this
test case in particular. But based on this, I'd recommend using internal large
pages (TCMALLOC_LARGE_PAGES) plus tuning the internal object move size in
SizeMap::NumMoveSize.
Original comment by zatr...@gmail.com
on 11 Feb 2013 at 10:46
Based on my previous comments, I propose a patch to increase the number of
objects transferred between the thread cache and the central free list from 32
to 32768. On single-threaded benchmarks (mostly SPECcpu2006) I noticed a
performance improvement compared to the default value of 32. I also added a
mechanism to change the value via an environment variable, and I adjusted
tcmalloc_unittest, changing it to a script so it can test various sizes.
The patch passes the test case without any regression on X86_64, X86 and PPC.
Original comment by zatr...@gmail.com
on 14 Feb 2013 at 5:33
Attachments:
Original comment by chapp...@gmail.com
on 10 Mar 2013 at 6:43
------------------------------------------------------------------------
r193 | chappedm@gmail.com | 2013-03-10 15:44:43 -0400 (Sun, 10 Mar 2013) | 8
lines
issue-443: Increase the number of objects transferred between thread cache and
central free list

This fix is a result of a performance degradation observed in multi-threaded
programs where large amounts of memory (30GB) are consumed and released by a
pool of threads in a cyclic manner. This was mainly due to the amount of time
we were spending in the slow path consolidating memory between the thread
cache and central free list. The default has been bumped up to 32768 and is
now also controllable through the TCMALLOC_TRANSFER_NUM_OBJ environment
setting.
------------------------------------------------------------------------
Original comment by chapp...@gmail.com
on 10 Mar 2013 at 7:46
Original issue reported on code.google.com by
srikanth...@gmail.com
on 4 Jul 2012 at 9:19
Attachments: