Note that https://github.com/jamesgolick/gperftools/commit/1a87924c05ce33eb9b2b5ba78c38ba9d8653838b is needed to run the test program against trunk (to display the length of the combined large lists).
Original comment by jamesgol...@gmail.com
on 22 May 2013 at 4:49
I was thinking about an rbtree, i.e. either from libbsd or from the STL (which would require some carefulness with its allocators).
Original comment by alkondratenko
on 6 Jul 2013 at 7:57
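To illustrate the allocator caveat: an STL rb-tree used inside the page heap cannot call malloc itself, so the container needs an allocator backed by the allocator's own metadata arena. A minimal sketch with every name invented; the bump arena below merely stands in for tcmalloc's internal metadata allocator:

```cpp
// Sketch only: an std::set of free spans whose nodes come from a
// private arena, never from malloc (which we are implementing).
#include <cstddef>
#include <set>
#include <utility>

static char g_arena[1 << 20];
static std::size_t g_used = 0;

static void* ArenaAlloc(std::size_t bytes) {  // never calls malloc
  bytes = (bytes + 15) & ~static_cast<std::size_t>(15);
  if (g_used + bytes > sizeof(g_arena)) return 0;  // sketch: no overflow handling
  void* p = g_arena + g_used;
  g_used += bytes;
  return p;
}

template <class T>
struct ArenaAllocator {
  typedef T value_type;
  ArenaAllocator() {}
  template <class U> ArenaAllocator(const ArenaAllocator<U>&) {}
  template <class U> struct rebind { typedef ArenaAllocator<U> other; };
  T* allocate(std::size_t n) {
    return static_cast<T*>(ArenaAlloc(n * sizeof(T)));
  }
  void deallocate(T*, std::size_t) {}  // sketch: arena is never reclaimed
};
template <class T, class U>
bool operator==(const ArenaAllocator<T>&, const ArenaAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const ArenaAllocator<T>&, const ArenaAllocator<U>&) { return false; }

// Free spans keyed by (pages, start address); set.lower_bound() then
// returns the smallest span that fits, lowest address first.
typedef std::pair<std::size_t, void*> SpanKey;
typedef std::set<SpanKey, std::less<SpanKey>, ArenaAllocator<SpanKey> > SpanSet;
```

This leans on C++11's allocator_traits to fill in the allocator boilerplate; a C++03 build would need the full allocator interface (pointer, reference, construct, destroy, and so on), which is part of the "carefulness" involved.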
Original comment by alkondratenko
on 6 Jul 2013 at 11:12
Forgot to say: I'll take a look at your stuff soon, this coming Saturday at the latest.
Thanks a lot for taking the time to do something about this issue.
Original comment by alkondratenko
on 8 Jul 2013 at 5:21
I think this is good enough. And I think it's quite a strong candidate for the 2.2 release.
I was thinking that maybe we should look at the entire page span allocator as a whole, and implement an efficient and scalable data structure for it. Our competition is using rb-trees and b-trees. We could also consider a simpler, dlmalloc-like organization of the page span size class bins (sketched below).
And btw, independently from tcmalloc and for a couchbase server experiment, I've implemented a simple rb-tree based allocator here:
https://github.com/alk/minimalloc
Anyway, I think dealing with large_ spans is a good start and we can always continue improving later.
Having said all that, I think we should also double-check whether there are other things we should consider improving instead of, or in addition to, scalable large_ span searching.
In particular, I'm wondering how allocating chunks larger than 1 MB can be a noticeable bottleneck even after your optimization. My understanding is that, given that using such a big chunk probably takes at least hundreds of thousands of cycles, optimized malloc-ing of those chunks should not even be noticeable on a profile. Unless I'm missing something (like big chunks being allocated but only a fraction of each chunk actually used), or there is some other reason for the slowness.
Original comment by alkondratenko
on 13 Jul 2013 at 10:15
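For the dlmalloc-like organization mentioned above, here is a hedged sketch of binned page-span free lists plus a bitmap; bin geometry and all names are invented for illustration and this is not tcmalloc code:

```cpp
// One free list per span size bin plus a bitmap, so "first bin that
// can satisfy the request" is a couple of bit operations instead of
// an O(n) walk of a single large list.
#include <stddef.h>
#include <stdint.h>

static const int kBins = 64;  // e.g. one bin per span size up to 64 pages

struct FreeSpan {
  FreeSpan* next;
  size_t pages;
};

struct SpanBins {
  FreeSpan* bins[kBins];
  uint64_t nonempty;  // bit i set <=> bins[i] != NULL

  void Push(FreeSpan* s, int bin) {
    s->next = bins[bin];
    bins[bin] = s;
    nonempty |= uint64_t(1) << bin;
  }

  // Pop a span from the first non-empty bin >= 'bin' (best fit by bin).
  FreeSpan* PopAtLeast(int bin) {
    uint64_t candidates = nonempty & ~((uint64_t(1) << bin) - 1);
    if (candidates == 0) return NULL;
    int b = __builtin_ctzll(candidates);  // lowest qualifying bin
    FreeSpan* s = bins[b];
    bins[b] = s->next;
    if (bins[b] == NULL) nonempty &= ~(uint64_t(1) << b);
    return s;
  }
};
```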
It's hard to know the answer to that question without seeing the test program
you're using.
I have some other ideas, like adding some size classes above 1MB, but this is still important and needs solving either way.
I have some patches that take a running sample of large allocations and attempt to determine, from that data, which additional size classes are worth adding to the thread caches (and then add them; a sketch follows this comment). I think there's something there, but a lot more thought needs to go into it.
I do think that this is a good solution for now, and we can think about adding
more stuff later, unless you've got a better idea that we can implement
immediately?
Incidentally, I'd like to get a lot more involved in this project. Is there
anything you need help with?
Original comment by jamesgol...@gmail.com
on 15 Jul 2013 at 11:16
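A hedged sketch of the sampling idea above. Thresholds, table size, and names are all invented; the actual patches referenced in the thread may work quite differently:

```cpp
// Keep a small histogram of large allocation sizes; sizes that get
// hot become candidates for new thread-cache size classes.
#include <stddef.h>

static const size_t kMinLarge = 1 << 20;  // sample allocations above 1MB
static const int kSlots = 64;

struct SizeSample { size_t size; size_t hits; };
static SizeSample g_samples[kSlots];

static void RecordLargeAlloc(size_t size) {
  if (size < kMinLarge) return;
  size_t rounded = (size + 4095) & ~static_cast<size_t>(4095);  // page granularity
  for (int i = 0; i < kSlots; i++) {
    if (g_samples[i].size == rounded) { g_samples[i].hits++; return; }
    if (g_samples[i].size == 0) {
      g_samples[i].size = rounded;
      g_samples[i].hits = 1;
      return;
    }
  }
  // Table full: a real sampler would evict the coldest entry here.
}

// A sampled size becomes a candidate size class once it is hot enough.
static bool WorthCaching(const SizeSample& s) { return s.hits >= 1000; }
```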
I was wondering why, in your reported case, the amount of time spent allocating spans out of large_ dropped only to 0.4%.
It's good that you have a tool to record page span allocations. That would be really valuable: we could capture some particular allocation pattern and then explore it offline using potentially different allocator implementations.
So with that tool maybe we can dig more into why it's 0.4% and not something far smaller.
Regarding help: take a look at the tickets. There are a few OS X tickets that may need some help. Otherwise, consider reviewing the things in the following threads:
https://groups.google.com/forum/#!topic/google-perftools/8AZz3GNxH-o and
https://groups.google.com/forum/#!topic/google-perftools/PQR4zQ6CQYs
Also, given that we're nearly at RC, any further testing on any platforms you care about is nice help as well.
Original comment by alkondratenko
on 16 Jul 2013 at 3:10
I don't think it's particularly a bottleneck anymore, actually; it's just that MySQL allocates a lot of big chunks.
For example, tcmalloc::SLL_Next(void*) is currently at 0.5% in perf top, and
PageHeap::AllocLarge isn't on the first page at all.
Original comment by jamesgol...@gmail.com
on 19 Jul 2013 at 6:35
Looks like we're having some misunderstanding.
I'm simply wondering why you saw such a high percentage of time spent in this particular code without the patch. I.e., after allocating a big buffer I'd expect the application to do something reasonably heavy with all of it. And why, with the patch, it only went down to 0.4%.
I'm wondering because who knows whether there's something else asking for attention in this particular case.
Original comment by alkondratenko
on 24 Jul 2013 at 2:50
Been on vacation for a while, so sorry for the delay, but...
As for why GetBestFit can still be a bottleneck, perf top shows that most of the time in that function is being spent in this comparison:
https://github.com/jamesgolick/gperftools/commit/011657b62e85102346f4dc2219bfa2d22fa8a89d#L3R123
about 25% addressing the Span object:
27.05 │ mov 0x8(%rax),%rax
and 50% mispredicting the branch:
44.08 │ cmp -0x30(%rbp),%rax
Did my addition of a pointer to the Span objects, or the size of the skip list nodes, cause some regression with respect to them fitting in cache or something? I'm not sure. Ideas?
Original comment by jamesgol...@gmail.com
on 28 Sep 2013 at 6:26
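For context, a hedged sketch of the shape of loop that produces exactly this profile. The node layout is invented and this is not the actual skip-list patch:

```cpp
// Each step chases a cold pointer to load the span's length (the
// 27% "mov 0x8(%rax),%rax") and then branches on a data-dependent
// comparison the CPU cannot predict (the 44% "cmp" against the
// requested size held on the stack).
#include <stddef.h>

struct Span { size_t length; };

struct SkipNode {
  Span* span;      // extra indirection: node -> span -> length
  SkipNode* next;  // bottom-level list, walked in size order
};

Span* GetBestFit(SkipNode* head, size_t n) {
  for (SkipNode* node = head; node != NULL; node = node->next) {
    if (node->span->length >= n)  // cold load + unpredictable branch
      return node->span;
  }
  return NULL;
}
```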
Interesting result. Is there any reasonably simple test I can run myself to dig deeper?
Original comment by alkondratenko
on 29 Sep 2013 at 2:12
I was using something like this: https://gist.github.com/jamesgolick/9a02b796693b044d21e2 to create some fragmentation and inflate the size of the large lists. Perhaps just modify the thread_loops to run forever?
Original comment by jamesgol...@gmail.com
on 29 Sep 2013 at 5:42
Been thinking about this a lot, and I'm really not sure how to write a test to
expose this issue.
Is there any data that I can gather from my running mysqld that you think might
give us some clues?
Original comment by jamesgol...@gmail.com
on 30 Sep 2013 at 5:04
Perhaps you can try to record allocation/free requests via hooks somehow. The
challenge is obviously going to be to efficiently and correctly record temporal
relations between different threads.
Original comment by alkondratenko
on 30 Sep 2013 at 6:19
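A rough sketch of recording allocation/free requests with gperftools' hook API (<gperftools/malloc_hook.h>, which is real). The fixed-size global buffer and single atomic counter are illustrative: they give a total order across threads, but would be exactly the kind of contention point the comment above is worried about:

```cpp
#include <gperftools/malloc_hook.h>
#include <stdio.h>
#include <stdlib.h>

struct Event { const void* ptr; size_t size; };  // size == 0 means free
static const size_t kMaxEvents = 1 << 20;
static Event g_events[kMaxEvents];
static size_t g_pos = 0;

// Hooks must not allocate themselves, or they would recurse.
static void NewHook(const void* ptr, size_t size) {
  size_t i = __sync_fetch_and_add(&g_pos, 1);
  if (i < kMaxEvents) { g_events[i].ptr = ptr; g_events[i].size = size; }
}

static void DeleteHook(const void* ptr) {
  size_t i = __sync_fetch_and_add(&g_pos, 1);
  if (i < kMaxEvents) { g_events[i].ptr = ptr; g_events[i].size = 0; }
}

int main() {
  MallocHook::AddNewHook(&NewHook);
  MallocHook::AddDeleteHook(&DeleteHook);
  void* p = malloc(123);  // both calls land in g_events
  free(p);
  MallocHook::RemoveNewHook(&NewHook);
  MallocHook::RemoveDeleteHook(&DeleteHook);
  fprintf(stderr, "recorded %zu events\n", g_pos);
  return 0;
}
```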
Any other ideas for how we can design a test to expose this? This ticket addresses a major scalability issue and it would be great to find a way to get the patch ready for acceptance.
Original comment by jamesgol...@gmail.com
on 3 Oct 2013 at 1:35
Maybe you have some particular MySQL benchmark that exposes this condition?
Original comment by alkondratenko
on 3 Oct 2013 at 2:44
The following demonstrates the O(n) behavior very well too:
https://gist.github.com/alk/6956552
I've tried your patch (rebased against master). And:
*) it fails to compile (due to a missing stdio.h include for fprintf)
*) make check and the program above crash with a segmentation fault
Let's get your code into commit-ready shape (no fprintfs please, and no crashes) and try to exercise it as much as we can.
Original comment by alkondratenko
on 13 Oct 2013 at 12:31
Ok - I'm on it.
Original comment by jamesgol...@gmail.com
on 16 Oct 2013 at 10:00
Okay, fixed. New patch is here:
https://github.com/jamesgolick/gperftools/compare/trunk...skiplist-trunk
I'm also working on implementing a left-leaning red-black tree to compare performance.
Original comment by jamesgol...@gmail.com
on 30 Oct 2013 at 4:22
Note that you can take one from jemalloc, or from libbsd (as I did for my minimalloc experiment).
Unless, of course, you actually want to implement it yourself.
BTW, the jemalloc author has a blog post somewhere on the performance of various rbtree implementation options. Might be worth checking out.
Original comment by alkondratenko
on 30 Oct 2013 at 4:36
So, I used the jemalloc code and got a 20% improvement over my skip list, using your random_mallocer as a benchmark. I would imagine that the reduced space complexity would make an even bigger difference in a more realistic test, because there would be so much less cache contention walking the tree.
https://github.com/jamesgolick/gperftools/compare/trunk...llrb
The only thing is that we probably don't want to use the jemalloc code, since it's implemented as an 800-line set of macros... If we want to go in this direction, we can: 1) expand the jemalloc macros and clean up the code, 2) find another rbtree to use, or 3) actually implement our own.
Original comment by jamesgol...@gmail.com
on 1 Nov 2013 at 4:33
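For reference, this is roughly how jemalloc's rb.h macros get instantiated (the "800-line set of macros"); node names and the comparator are illustrative, and exact macro arguments may differ between jemalloc versions:

```cpp
// rb_gen() expands into a full set of static functions
// (spantree_new/insert/remove/search/nsearch/...) specialized for
// the node type. Generating everything statically is also a likely
// source of the unused-function warnings discussed later.
#include <stddef.h>
#include <stdint.h>
#include "rb.h"  // jemalloc's macro-based red-black tree

typedef struct span_node_s span_node_t;
struct span_node_s {
  rb_node(span_node_t) link;  // embedded left/right/color fields
  size_t pages;               // span size
  uintptr_t start;            // span start address
};

// Order by size, then address: breaking ties toward lower addresses
// gives the address-ordered best fit mentioned later in this thread.
static int span_cmp(span_node_t* a, span_node_t* b) {
  if (a->pages != b->pages) return a->pages < b->pages ? -1 : 1;
  if (a->start != b->start) return a->start < b->start ? -1 : 1;
  return 0;
}

typedef rb_tree(span_node_t) span_tree_t;
rb_gen(static, spantree_, span_tree_t, span_node_t, link, span_cmp)

// Best fit for a request of n pages is then:
//   span_node_t key; key.pages = n; key.start = 0;
//   span_node_t* best = spantree_nsearch(&tree, &key);
// (nsearch returns the least node >= key, or NULL if nothing fits.)
```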
I'll need clean patches without traces of "internal" fixes (like fprintf).
Also please do rebase against latest master:
git rebase --onto googlecode/master jamesgolick/trunk
Original comment by alkondratenko
on 17 Nov 2013 at 3:05
The llrb functions, as well as rb.h, should be included in the .cc, not the .h.
Original comment by alkondratenko
on 17 Nov 2013 at 3:33
There are also fixable warnings:
```
src/page_heap.h: In constructor 'tcmalloc::PageHeap::PageHeap()':
src/page_heap.h:303:11: warning: 'tcmalloc::PageHeap::scavenge_counter_' will be initialized after [-Wreorder]
   int64_t scavenge_counter_;
           ^
src/page_heap.h:243:10: warning:   'size_t tcmalloc::PageHeap::large_lists_size_' [-Wreorder]
   size_t large_lists_size_;
          ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^
In file included from src/static_vars.h:43:0,
                 from src/page_heap.cc:43:
src/page_heap.h:306:7: warning: 'tcmalloc::PageHeap::release_index_' will be initialized after [-Wreorder]
   int release_index_;
       ^
src/page_heap.h:246:8: warning:   'bool tcmalloc::PageHeap::using_large_llrb_' [-Wreorder]
   bool using_large_llrb_;
        ^
src/page_heap.cc:64:1: warning:   when initialized here [-Wreorder]
 PageHeap::PageHeap()
 ^
```
Original comment by alkondratenko
on 17 Nov 2013 at 3:35
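For completeness, the usual shape of the -Wreorder fix, with illustrative fields only (not the full PageHeap):

```cpp
// C++ always initializes non-static members in declaration order,
// regardless of the order of the mem-initializer list; -Wreorder
// fires when the two disagree. Matching the list to the declaration
// order silences it.
#include <stddef.h>
#include <stdint.h>

class PageHeapSketch {
  size_t large_lists_size_;  // declared first...
  bool using_large_llrb_;
  int64_t scavenge_counter_;
  int release_index_;

 public:
  PageHeapSketch()
      : large_lists_size_(0),  // ...so it is initialized first
        using_large_llrb_(false),
        scavenge_counter_(0),
        release_index_(0) {}
};
```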
In general I do like the code. Just please make it super clean, without unrelated changes etc. I believe the C-macro-based red-black tree code from jemalloc is fine.
BTW, what does LLRB mean?
Also, I'd not bother with the linked list and kLargeLLRBThreshold. Consider making everything simpler by always keeping large spans in the red-black tree.
I think we're close to mergeable code. Thanks a lot for your effort here.
Original comment by alkondratenko
on 17 Nov 2013 at 3:46
There's a compilation error:
```
In file included from src/page_heap.cc:41:0:
src/page_heap.cc: In member function 'tcmalloc::Span* tcmalloc::PageHeap::AllocLarge(Length)':
src/page_heap.cc:219:10: error: 'bestNormal' was not declared in this scope
   ASSERT(bestNormal == NULL);
          ^
src/internal_logging.h:112:9: note: in definition of macro 'CHECK_CONDITION'
   if (!(cond)) { \
         ^
src/page_heap.cc:219:3: note: in expansion of macro 'ASSERT'
   ASSERT(bestNormal == NULL);
   ^
In file included from src/llrb.h:41:0,
                 from src/page_heap.h:45,
                 from src/static_vars.h:43,
                 from src/page_heap.cc:43:
```
This happens when -DNDEBUG is not passed to the preprocessor.
Original comment by alkondratenko
on 17 Nov 2013 at 3:52
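A guess at the failure mode, sketched with a made-up variable: ASSERT() compiles to nothing under -DNDEBUG, so a variable referenced only inside it breaks exactly (and only) in debug builds if its definition lives on a path that got removed. The usual fix is to define it unconditionally:

```cpp
#include <assert.h>
#include <stddef.h>

int main() {
  void* bestNormal = NULL;     // define unconditionally, not under #ifdef
  assert(bestNormal == NULL);  // now compiles with and without -DNDEBUG
  return 0;
}
```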
Sorry to take so long on this - been busy with other stuff.
To address your feedback, I've fixed that assertion issue and the init-order warnings in PageHeap's constructor.
The warning in src/internal_logging.h wasn't related to my branch and seems to
have been fixed since then.
It's currently impossible to include rb.h in llrb.cc because we need some of
the macros for typedefs that need to be in llrb.h. If you want me to try to
hack up the rb.h stuff, I can.
Let me know if there's anything else you want changed.
I've rebased against master.
https://github.com/jamesgolick/gperftools/compare/googlecode-master...llrb
Original comment by ja...@packagecloud.io
on 17 Jun 2014 at 4:01
Will take a look. BTW, what do you think about TLSF? It might solve (almost) the same problem with possibly less complexity.
Original comment by alkondratenko
on 17 Jun 2014 at 7:00
Just skimmed through the TLSF paper. That's a really interesting model. I think it's definitely worth investigating, but it seems like it would take a substantial restructuring of tcmalloc to do it properly.
Original comment by ja...@packagecloud.io
on 17 Jun 2014 at 8:03
I was thinking about applying it only to page spans, i.e. instead of the tree. But given that you're done with your code, I'm not sure anymore.
Original comment by alkondratenko
on 17 Jun 2014 at 8:05
Ya - I think we should try to get my code merged in and then investigate this
for a future release. I can probably take it on sometime in the next month or
so. It's a cool data structure.
Original comment by ja...@packagecloud.io
on 17 Jun 2014 at 8:15
But let me note that the tree approach has the additional advantage of being able to implement address-ordered best-fit instead of just "almost-best-fit".
Original comment by alkondratenko
on 17 Jun 2014 at 8:15
Indeed.
Original comment by ja...@packagecloud.io
on 17 Jun 2014 at 8:20
Hey - did you have any time to look at this? I've got some more time to work on
it tomorrow.
Original comment by ja...@packagecloud.io
on 19 Jun 2014 at 7:59
Haven't had time to look at it yet. Saturday is when I'll be able to handle it for sure; I can't promise anything earlier. Main work and other life constraints.
Original comment by alkondratenko
on 19 Jun 2014 at 8:03
Cool - no prob. Thanks!
Original comment by ja...@packagecloud.io
on 19 Jun 2014 at 8:25
Indeed, testing it is a big problem. It's likely that the llrb code path is not exercised even once during make check.
So I've produced another test program to try to test it (in fact I forgot about my previous program). And with that program I'm still able to crash it. The program can be found here: https://gist.github.com/alk/6d8a907fccc8f18b414d. Note that it doesn't fit into a 32-bit address space and requires a 64-bit system.
Here's some other feedback.
I think it's not useful to have the chain of commits reflect the evolution of this work. It'll just confuse folks reading the git history. One combined commit looks fine. Or, if you can split it in a way that improves both reviewing your work and _reading_ the history, feel free to do that as well (but it usually requires a bit of git-fu that most people seem uncomfortable with).
I'm also seeing quite a few compilation warnings introduced by this patch. They seem to come from static functions introduced by the llrb macros, so they are harmless, but they're still warnings.
There are also a few places where trailing whitespace is introduced. Surely that'll be trivial to fix.
I also think that switching the llrb path on only at a certain free list size just adds complexity and greatly complicates testing. Are you sure it is really needed? Why not just use the llrb always?
Original comment by alkondratenko
on 21 Jun 2014 at 9:49
I fixed the compiler warnings using __attribute__((unused)).
When you are ready to accept the patch, I will squash all the commits into one, or you can do it when you merge.
The trailing whitespace is not really there; it just looks like that in the GitHub compare view.
I pushed a commit that removes the threshold switching to the llrb path. PageHeap uses the llrb all the time now.
I'm not able to repro your crash with your test program. I tried.
https://github.com/jamesgolick/gperftools/compare/googlecode-master...llrb
Original comment by ja...@packagecloud.io
on 26 Jun 2014 at 8:20
btw - I spent some time digging into TLSF and even implemented a basic version of the data structure in a branch, and my conclusion is that it doesn't really fit into the existing structure of tcmalloc.
The biggest thing is that it would be a pretty substantial tradeoff in terms of fragmentation, because of the necessary rounding up of allocation sizes and because we can't address-order our best-fit algorithm. Conceivably, we could use some kind of log(n) data structure instead of linked lists for each "size class" in the TLSF, but I'm not sure that would actually be worth the additional complexity, unless we can prove that log(n) is still too slow.
That said, I think it might be fun to build an allocator around the tlsf code I
wrote, so I might do that if I can find some time.
Original comment by ja...@packagecloud.io
on 6 Jul 2014 at 7:27
I'm still seeing the crash on my boxes.
It is very consistent once I disable address space randomization (sysctl
kernel.randomize_va_space=0).
Original comment by alkondratenko
on 19 Jul 2014 at 8:36
Note that to run it on my 8 GB laptop I had to lower the history size to 256k.
Original comment by alkondratenko
on 19 Jul 2014 at 8:47
K - just gonna spin up a big VM to try to repeat it.
Original comment by jamesgol...@gmail.com
on 19 Jul 2014 at 8:48
Regarding TLSF: I was actually thinking of generalizing PageHeap's span list into TLSF, and making sure to give it more bits of precision. At least 8, so that spans of 1-256 pages are located precisely.
You might still be right, of course.
Original comment by alkondratenko
on 19 Jul 2014 at 8:50
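A hedged sketch of the TLSF index being discussed: a two-level mapping where the first level is the position of the highest set bit and the second level picks one of 2^kSliLog2 linear sub-bins below it. kSliLog2 = 8 corresponds to the "8 bits of precision" above; all names and the bin layout are illustrative:

```cpp
// Two-level TLSF mapping for page spans, illustrative only.
#include <stddef.h>

static const int kSliLog2 = 8;  // 2^8 sub-bins: 1-256 page spans map exactly

static inline void MappingInsert(size_t pages, int* fl, int* sl) {
  if (pages < (size_t(1) << kSliLog2)) {  // small spans: exact bins
    *fl = 0;
    *sl = static_cast<int>(pages);
    return;
  }
  int msb = 63 - __builtin_clzll(pages);  // highest set bit
  *fl = msb - kSliLog2 + 1;
  *sl = static_cast<int>(pages >> (msb - kSliLog2)) - (1 << kSliLog2);
}

// For allocation, round the request up so that *any* span in the
// chosen bin fits ("good fit"). This rounding is the fragmentation
// tradeoff mentioned earlier in the thread.
static inline size_t MappingSearch(size_t pages, int* fl, int* sl) {
  if (pages >= (size_t(1) << kSliLog2)) {
    int msb = 63 - __builtin_clzll(pages);
    pages += (size_t(1) << (msb - kSliLog2)) - 1;
  }
  MappingInsert(pages, fl, sl);
  return pages;
}
```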
Actually, I've been thinking that using tlsf for thread caches might be
interesting.
Original comment by jamesgol...@gmail.com
on 19 Jul 2014 at 8:54
[deleted comment]
```
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b8c210 in tcmalloc::DLL_Remove (span=span@entry=0xa02210) at src/span.cc:80
80        span->next->prev = span->prev;
(gdb) bt
#0  0x00007ffff7b8c210 in tcmalloc::DLL_Remove (span=span@entry=0xa02210) at src/span.cc:80
#1  0x00007ffff7b8ab14 in tcmalloc::PageHeap::RemoveFromFreeList (this=this@entry=0xa82000, span=span@entry=0xa02210) at src/page_heap.cc:408
#2  0x00007ffff7b8ab4c in tcmalloc::PageHeap::Carve (this=0xa82000, span=0xa02210, n=1964) at src/page_heap.cc:231
#3  0x00007ffff7b8b921 in tcmalloc::PageHeap::New (this=0xa82000, n=n@entry=1964) at src/page_heap.cc:121
#4  0x00007ffff7b9b8c4 in do_malloc_pages (size=16089088, heap=0xb0d0c8) at src/tcmalloc.cc:1081
#5  do_malloc_no_errno (size=16087463) at src/tcmalloc.cc:1114
#6  do_malloc (size=16087463) at src/tcmalloc.cc:1119
#7  do_malloc_or_cpp_alloc (size=16087463) at src/tcmalloc.cc:1039
#8  tc_malloc (size=16087463) at src/tcmalloc.cc:1578
#9  0x000000000040080a in main () at /root/alk-malloc-test.c:55
(gdb)
```
Original comment by alkondratenko
on 19 Jul 2014 at 9:10
Yup - I got it reproducing. I'll get it fixed.
Original comment by jamesgol...@gmail.com
on 19 Jul 2014 at 9:10
The bug was in the test program. It crashes exactly the same way on master as well, because it double-frees pointers. The fix is here:
https://gist.github.com/jamesgolick/91673ddc31ce6d7b371c/revisions
Original comment by jamesgol...@gmail.com
on 20 Jul 2014 at 1:40
Ah, it is subtle indeed. My fault, sorry. Thanks for figuring this out.
Original comment by alkondratenko
on 21 Jul 2014 at 2:34
No prob. There's still one bug right now, where the test program attempts to free an invalid pointer. I'm trying to figure out whether or not it's another bug in the test program. Running this test program against my llrb branch takes a minute or two; running against master takes 30+ hours :-), so I'll let you know what the result is soon.
Original comment by jamesgol...@gmail.com
on 21 Jul 2014 at 2:36
Original issue reported on code.google.com by
jamesgol...@gmail.com
on 21 May 2013 at 11:38