
Automatically exported from code.google.com/p/gperftools

tcmalloc is slower when re-using memory in multi-threaded application #443

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. In our application we create 18 threads (we have 24+ CPUs), and together the threads 
allocate 30+ GB of memory. The application uses that memory, and whenever new data 
has to be loaded we free the existing 30+ GB and fill it again. 

OUR OBSERVATION is as follows:
The first time around, tcmalloc performs FASTER than malloc (tcmalloc takes 22 seconds 
where malloc takes 54 seconds).
However, when we clear the memory and start refilling it, we see a huge performance hit 
(the second fill takes ~200 seconds), and the same holds for every subsequent cache fill. 
From stack traces we observed that all threads are waiting in tcmalloc::SlowLock(), and 
our 18-thread application behaves as if it had only a few (2 or so) threads.

2. We wrote a sample application in which 18 threads each simply fill an STL map; we then 
deallocate the memory and start refilling the maps, and we observe the same slowness. 
The sample we wrote is attached (a minimal sketch of the same idea is shown below).
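The attached sample is not included in this export. For illustration only, here is a minimal sketch of the kind of reproducer described in step 2; the thread count matches the report, but the per-thread element count and value size are made-up placeholders:

// Hypothetical reproducer sketch (not the attached sample): N threads each
// fill a std::map, all maps are cleared, and the fill is repeated while the
// fill time of each iteration is measured separately.
#include <chrono>
#include <cstdio>
#include <map>
#include <thread>
#include <vector>

namespace {
constexpr int kThreads = 18;                  // matches the report
constexpr long kEntriesPerThread = 1000000;   // placeholder size
struct Value { char payload[64]; };           // placeholder value type

void FillMap(std::map<long, Value>* m) {
  for (long i = 0; i < kEntriesPerThread; ++i) (*m)[i] = Value();
}
}  // namespace

int main() {
  std::vector<std::map<long, Value>> maps(kThreads);
  for (int loop = 0; loop < 4; ++loop) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
      threads.emplace_back(FillMap, &maps[t]);
    for (auto& th : threads) th.join();
    auto secs = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("Loop : %d --- Time taken in seconds : %lld\n",
                loop, static_cast<long long>(secs));
    for (auto& m : maps) m.clear();           // clearing time is not measured
  }
  return 0;
}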

What is the expected output? What do you see instead?
Even when memory is reused, we expect tcmalloc to perform fast; malloc actually does.

What version of the product are you using? On what operating system?
tcmalloc 1.9, and we also tried 2.0 - openSUSE 11.2

Please provide any additional information below.
This could be wrong, but we observed that up to a certain number of threads (around 6) 
and/or below roughly 10 GB of allocation, reusing/refilling memory stays fast. It seems 
that more threads and a larger memory footprint are what hurt performance.

Original issue reported on code.google.com by srikanth...@gmail.com on 4 Jul 2012 at 9:19

Attachments:

GoogleCodeExporter commented 9 years ago
x86_64 OS

Original comment by srikanth...@gmail.com on 4 Jul 2012 at 4:15

GoogleCodeExporter commented 9 years ago
Hi

This issue is pretty serious for us; it will have a major impact on our product, which 
is used in production. We started using perftools because it gave us a phenomenal 
performance improvement, but this issue has hit us badly.

Any help is greatly appreciated.

Regards
Sundari 

Original comment by sunda...@gmail.com on 6 Jul 2012 at 1:08

GoogleCodeExporter commented 9 years ago
Some questions:
* On which lock is SlowLock called? Is it a CentralFreeList lock or 
Static::pageheap_lock?
* Is your product application also using std::map? Are the map key and value small?
* Is your architecture NUMA?
* Do you clear the maps completely and at the same time (is there a single moment, 
common to all threads, when all maps are empty/deleted)?
* What happens if you call MallocExtension::ReleaseFreeMemory before refilling the 
memory? Is the phenomenon still observable?
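For reference, the suggested call is part of gperftools' public MallocExtension API; a minimal sketch of the experiment (the placement of the call is illustrative):

// Sketch of the suggested experiment: after the maps are cleared and before
// the next refill iteration, ask tcmalloc to return its free pages to the OS.
#include <gperftools/malloc_extension.h>  // in 1.x releases the path is <google/malloc_extension.h>

void ReleaseBetweenIterations() {
  MallocExtension::instance()->ReleaseFreeMemory();
}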

Original comment by pafi...@gmail.com on 2 Nov 2012 at 8:36

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Answers to your questions:
1) SlowLock is from CentralFreeList. Here is part of a stack trace for one thread:
#0  0x0000000000abb23d in SpinLock::SlowLock ()
#1  0x0000000000ab669b in tcmalloc::CentralFreeList::Populate ()
#2  0x0000000000ab66f8 in tcmalloc::CentralFreeList::FetchFromSpansSafe ()
#3  0x0000000000ab6789 in tcmalloc::CentralFreeList::RemoveRange ()
#4  0x0000000000ab91b3 in tcmalloc::ThreadCache::FetchFromCentralCache ()
#5  0x0000000000ab4833 in (anonymous namespace)::cpp_alloc ()
#6  0x0000000000b40838 in tc_new ()

2) Yes, our product is using std::map. The key is a long and the value is a struct 
(which could be large).
However, we feel the size may not matter, since we see the same SLOWNESS both in our 
product, where the map value is LARGE, and in the sample application, where the map 
value is SMALL.

3) We use openSUSE 11.2 (x86_64). SUSE supports NUMA.
Node info from our server:
CPU model:           x86_64
CPU(s):              24
CPU frequency:       2659 MHz
CPU socket(s):       4
Core(s) per socket:  6
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         99191764 kB
Do you want us to check any other specific details?

4) Yes, we clear the maps. We do not start the next iteration until all map data has 
been cleared, and the map-clearing time is MEASURED separately. 

The time we measure for each iteration covers JUST the map insertion ALONE, for 
comparison. The first time, the insertion runs perfectly in parallel (confirmed via 
%CPU used); in subsequent iterations the same insertion takes considerably longer.

5) With or without ReleaseFreeMemory(), the behavior is the same.

Original comment by srikanth...@gmail.com on 5 Nov 2012 at 1:19

GoogleCodeExporter commented 9 years ago
* I just wanted to know whether the map is small enough to be handled by ThreadCache 
and CentralFreeList (<= 32KB). It seems so, as can be seen in the stack trace.
* You clear the maps completely and at the same time, so all memory pages are fully 
freed and moved from the CentralFreeList to the PageHeap. You also destroy the threads, 
so the ThreadCaches are flushed as well. At the end, the free pages in the PageHeap 
form a single large span.
* Each loop performs exactly the same operations on the CentralFreeList, and the 
CentralFreeList lock does the same work each time.
* ReleaseFreeMemory discards the pages of the single large span (madvise(MADV_DONTNEED)).
* During the refill no sbrk/mmap syscalls are needed, and the page table is large from 
the beginning - but that should not have a visible impact on performance. If there were 
no ReleaseFreeMemory call, the minor page faults (copy-on-write) would not be needed 
and performance could be visibly higher (but this is probably hidden under your 
std::map::insert cost when built without -O3).
* I was unable to reproduce your observations (>18 cores, but a smaller amount of 
memory).
* I see no reason for reusing memory to be slower. I don't believe it's a TcMalloc bug. 
I bet there is something in your environment/application/kernel/hardware that you are 
not aware of.
* I was unable to reproduce your observations (>18 cores, but smaller amount of 
memory)
* I see no reason for re-using memory to be slower. I don't believe it's 
TcMalloc bug. I bet that there is something in your 
environment/application/kernel/hardware that you are not aware of.

(BTW: your application should see better allocation performance with jemalloc - 
contention on its equivalent of the CentralFreeList lock is lower there.)

Original comment by pafi...@gmail.com on 9 Nov 2012 at 9:46

GoogleCodeExporter commented 9 years ago

1) In our product we are actually filling ~33 GB of memory using 18 threads; each 
thread allocates at least 1.8 GB, so I hope the map counts as big. The sample 
application of course handles a smaller map, but we observed the slowness in both 
cases. 

When we run the same tests with the standard gcc allocator, the first-time performance 
is bad, but subsequent runs perform better. Do you suggest we try enabling/disabling 
any tcmalloc flags?

2) If we don't kill and re-create the threads, do you think the performance could be 
better? 

3) You were asking about NUMA. Do you suggest we check any other hardware/kernel/
environment settings on our side?

Original comment by srikanth...@gmail.com on 10 Nov 2012 at 8:15

GoogleCodeExporter commented 9 years ago
* Correction to comment #6: "wanted to know if map is small enough" -> "wanted to know 
if a map entry is small enough".
* Re comment #5: CentralFreeList::Populate uses two locks. It is critical to know which 
of them is contended. Are you able to check that (perhaps by profiling, by adding a 
per-lock counter of SlowLock calls, or at least by wrapping each lock acquisition in a 
non-inlined function so you can tell them apart in a stack trace - a rough sketch 
follows this list)?
* Re comment #7: I'm quite sure thread re-creation will not help you. I do not see any 
TcMalloc flag you could try. If ReleaseFreeMemory did not help, then I have no clue 
about the root cause. Please try Jemalloc and share the results.
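A rough sketch of the wrapping idea, assuming one is willing to patch the tcmalloc sources at the two places the locks are taken; the function names and counters below are made up for illustration:

// Illustrative only: non-inlined wrappers around each lock acquisition so the
// contended lock can be told apart in a stack trace, plus simple acquisition
// counters. Requires patching inside the gperftools tree (GCC-specific attribute).
#include <atomic>
#include "base/spinlock.h"   // gperftools-internal header

std::atomic<long> central_lock_acquires(0);
std::atomic<long> pageheap_lock_acquires(0);

__attribute__((noinline)) void LockCentralFreeList(SpinLock* lock) {
  central_lock_acquires.fetch_add(1, std::memory_order_relaxed);
  lock->Lock();
}

__attribute__((noinline)) void LockPageHeap(SpinLock* lock) {
  pageheap_lock_acquires.fetch_add(1, std::memory_order_relaxed);
  lock->Lock();
}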

Original comment by pafi...@gmail.com on 13 Nov 2012 at 6:55

GoogleCodeExporter commented 9 years ago
We already tried JEMalloc. 

JEMalloc gave the best possible results with both the sample application and the 
product application (meaning the same good performance across all iterations).

Our product is already integrated with TCMalloc, and we want to keep using TCMalloc 
if we can solve this issue.

Original comment by srikanth...@gmail.com on 15 Nov 2012 at 9:36

GoogleCodeExporter commented 9 years ago
I had access to a suitable machine for a while and reproduced your problem (with an 
older gperftools).
The first loop was fast, with ~850% CPU utilization;
the second loop was 5-6x slower, with ~105% CPU utilization.
I still have no clue about the root cause :-(

Original comment by pafi...@gmail.com on 20 Nov 2012 at 7:16

GoogleCodeExporter commented 9 years ago
I think I know the root cause. If I'm right, your problem should disappear if you do
"free(malloc(20ULL*1024*1024*1024));"
before the first loop.
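A sketch of that warm-up as it might look in the test program; note the unsigned 64-bit literal, since 20*1024*1024*1024 written with plain int literals would overflow:

// Warm-up suggested above: allocate and immediately free one huge block
// before the first fill loop, so metadata ends up separated from user memory.
#include <cstdlib>

void WarmUpHeap() {
  void* p = std::malloc(20ULL * 1024 * 1024 * 1024);  // ~20 GB
  std::free(p);
}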

Original comment by pafi...@gmail.com on 24 Nov 2012 at 10:46

GoogleCodeExporter commented 9 years ago
At the end of the first loop of the test program the free memory was divided into 
>2500 large spans.
Some additional coalescing occurred after the ReleaseFreeMemory call, but the number 
of large spans remained high (probably ~2000) because they were fragmented by metadata 
preallocations (there is a 128 kB preallocation of span objects once per ~10 MB, plus 
page-map allocations, which are less frequent).

There is a linear search through the spans bigger than 1 MB (PageHeap::AllocLarge), 
and this was the source of the performance problems.

The single huge alloc proposed in comment #11 works well for the test program (it 
separates metadata from user memory), but it will not work for more complex 
(long-lived?) programs in which memory fragmentation may prevent large spans from 
coalescing.

I see the following candidate solutions:
A: do first fit instead of best fit when doing a [small] allocation from large spans 
(may increase memory fragmentation; illustrated in the sketch after this list)
B: keep large spans in a red-black tree instead of a doubly linked list (significant 
coding effort)
C: unmap large spans (the glibc allocator used to work that way?). We would need to 
trust the OS that the number of mappings will not grow too much, and we would need to 
remove the address range from the PageMap (or make sure the same address space is 
reused - but that boils down to solution A).

I think B is the most reasonable, but I do not volunteer to do it now.
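For illustration only (this is not the actual PageHeap code), the difference between a best-fit scan over a linked list of large spans, which must walk every span, and the first-fit scan of solution A, which can stop early:

// Illustration of why the linear best-fit search is expensive when there are
// thousands of large spans, and how first fit (solution A) shortens the scan.
#include <cstddef>

struct Span { std::size_t pages; Span* next; };

Span* BestFit(Span* head, std::size_t want) {
  Span* best = nullptr;
  for (Span* s = head; s != nullptr; s = s->next)      // always walks all spans
    if (s->pages >= want && (best == nullptr || s->pages < best->pages))
      best = s;
  return best;
}

Span* FirstFit(Span* head, std::size_t want) {
  for (Span* s = head; s != nullptr; s = s->next)      // may stop early
    if (s->pages >= want)
      return s;
  return nullptr;
}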

Original comment by pafi...@gmail.com on 26 Nov 2012 at 10:19

GoogleCodeExporter commented 9 years ago
Thanks for looking into the issue. I hope you will find time to implement B. 
Meanwhile we will be moving to jemalloc until this is resolved in our product.

Appreciate your time and help.

Original comment by srikanth...@gmail.com on 27 Nov 2012 at 10:28

GoogleCodeExporter commented 9 years ago
I observed this issue while trying to tune tcmalloc for the 447.dealII component of 
SPECcpu2006. Although single-threaded, the benchmark uses a std::map with a lot of 
small allocations, putting a lot of pressure on the movement of objects from the 
central cache to the thread caches.

I ended up solving it by removing the hard limit of 32 objects in 
SizeMap::NumMoveSize.
I also tested the example in a PPC64 environment and got an average of 60 seconds per 
iteration without the patch and 14 s with it. The patched version is also faster than 
the system glibc.

Based on this, I think making the object-transfer limit tunable through an environment 
variable might be the way to get better performance. I plan to work on a patch for it. 

Original comment by zatr...@gmail.com on 10 Feb 2013 at 4:13

GoogleCodeExporter commented 9 years ago
Interestingly enough, I got different numbers depending on the internal page size. 
On a 32-core x86_64 box with the default internal page size (4 kB) I got:

# ./MemoryReUse_TCMalloc-glibc
Loop : 0 --- Time taken in seconds : 61
Loop : 1 --- Time taken in seconds : 14
Loop : 2 --- Time taken in seconds : 14
Loop : 3 --- Time taken in seconds : 15

While with gperftools svn r190:
# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 49
Loop : 1 --- Time taken in seconds : 105
Loop : 2 --- Time taken in seconds : 106
Loop : 3 --- Time taken in seconds : 106

By tuning the number of objects to move from 32 to 2048 I got an improvement:
# ./MemoryReUse_TCMalloc-tcmalloc-2
Loop : 0 --- Time taken in seconds : 15
Loop : 1 --- Time taken in seconds : 34
Loop : 2 --- Time taken in seconds : 34
Loop : 3 --- Time taken in seconds : 35

Building with a 32 kB internal page size (TCMALLOC_LARGE_PAGES) shows:

# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 35
Loop : 1 --- Time taken in seconds : 35
Loop : 2 --- Time taken in seconds : 34
Loop : 3 --- Time taken in seconds : 34

And by tuning the number of objects to move from 32 to 1024 I got an improvement:
# ./MemoryReUse_TCMalloc-tcmalloc
Loop : 0 --- Time taken in seconds : 12
Loop : 1 --- Time taken in seconds : 11
Loop : 2 --- Time taken in seconds : 11
Loop : 3 --- Time taken in seconds : 11

I am still investigating why large pages make such a big improvement in this test case 
in particular. But based on this, I'd recommend using large internal pages 
(TCMALLOC_LARGE_PAGES) plus a larger object-move count in SizeMap::NumMoveSize.

Original comment by zatr...@gmail.com on 11 Feb 2013 at 10:46

GoogleCodeExporter commented 9 years ago
Based on my previous comments, I propose a patch to increase the number of objects 
transferred between the thread cache and the central free list from 32 to 32768. On 
single-threaded benchmarks (mostly SPECcpu2006) I noticed a performance improvement 
compared to the default value of 32. I also added a mechanism to change the value 
through an environment variable, and I adjusted tcmalloc_unittest, turning it into a 
script so it can test various sizes.

The patch passes the test case without any regressions on x86_64, x86, and PPC.

Original comment by zatr...@gmail.com on 14 Feb 2013 at 5:33

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by chapp...@gmail.com on 10 Mar 2013 at 6:43

GoogleCodeExporter commented 9 years ago
------------------------------------------------------------------------
r193 | chappedm@gmail.com | 2013-03-10 15:44:43 -0400 (Sun, 10 Mar 2013) | 8 lines

issue-443: Increase the number of objects transferred between thread cache and 
central free list

This fix is the result of a performance degradation observed in multi-threaded 
programs where large amounts of memory (30 GB) are consumed and released by a pool of 
threads in a cyclic manner. This was mainly due to the amount of time spent in the 
slow path consolidating memory between the thread cache and the central free list. 
The default has been bumped up to 32768 and is now also controllable through the 
TCMALLOC_TRANSFER_NUM_OBJ environment setting.

------------------------------------------------------------------------

Original comment by chapp...@gmail.com on 10 Mar 2013 at 7:46