ProcessBackgroundActions and max cache sizes tuning questions

lano1106 commented 3 months ago

Hi,

is the thread calling tcmalloc::MallocExtension::ProcessBackgroundActions() must be detached, or joinable?

If joinable is ok, can you do the following at program teardown:
```
tcmalloc::MallocExtension::SetBackgroundProcessActionsEnabled(false);
bkgThread.join();
```

by quickly scanning MallocExtension_Internal_ProcessBackgroundActions(), I have concluded that it was doing more than just release memory...

is this correct? because I am planning keep BackgroundReleaseRate to its zero default value to avoid generating madvise() system calls...

tcmalloc::MallocExtension::SetBackgroundProcessSleepInterval()

default value is 1 second.

What is the reasonable range for that parameter?

1 second - 1 hour ?

What are the tradeoffs of a long interval vs short interval?

I had in mind 30 seconds for my app.

Does the background housekeeping tasks induce lock contention to the threads using the allocator?

Which stats to look at to determine that increasing the max cache sizes could have a positive effect?

overflows / underflows ratio? anything else?

lano1106 commented 3 months ago

I have been running the gperftools benchmark to compare glibc, gperftools and this project... https://github.com/gperftools/gperftools/blob/master/benchmark/malloc_bench.cc

unless tweaking the params that I have requested assistance with change anything, it seems like there is an obvious winner...

there must be something that your project do that is better in some way and I need your help to appreciate what makes your lib shine because so far, I am unconvinced...

ckennelly commented 2 months ago

Assuming your timings are in nanoseconds, these all seem high for our fastpath on modern hardware in a microbenchmark. The main issue that likely springs to mind is glibc managing rseq registration and preventing TCMalloc's fastpath from being active.

All but two of our defaults in the Github repo reflect our production configuration defaults, as noted here.

lano1106 commented 2 months ago

Hi Chris,

yes the times are in nanoseconds

I have noted that your prod config differ. The attempt to mimick what you do is the motivation behind opening this issue.

As you know, I am perfectly aware of the possible rseq conflict with glibc. I did launch the tcmalloc bench with:

GLIBC_TUNABLES=glibc.pthread.rseq=0 ./malloc_bench

but I can be even more explicit by testing tcmalloc::MallocExtension::PerCpuCachesActive() from main() before executing the tests...

I'll do that and report back:

int main(void)
{
  printf("Trying to randomize freelists..."); fflush(stdout);
  randomize_size_classes();
  printf("done. PerCpuCachesActive:%s\n",
         tcmalloc::MallocExtension::PerCpuCachesActive()?"true":"false");

Trying to randomize freelists...done. PerCpuCachesActive:true

the provided tcmalloc execution time results are with the rseq perCpu cache enabled...

on a sapphirerapids CPU running kernel 6.10.4... binaries have been compiled with: bazel build --features=-default_compile_flags --copt "-march=sapphirerapids" --copt "-O3" --copt "-flto" --copt "-std=c++26" --copt "-fno-rtti" //benchmark:malloc_bench

benchmark/BUILD:

cc_binary(
    name = "malloc_bench",
    srcs = [
        "malloc_bench.cc", "run_benchmark.cc", "run_benchmark.h"
    ],
    deps = [
        "@com_google_tcmalloc//tcmalloc:tcmalloc",
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

cc_binary(
    name = "pt2malloc_bench",
    srcs = [
        "pt2malloc_bench.cc", "run_benchmark.cc", "run_benchmark.h"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

lano1106 commented 2 months ago

I'll redo the test just after a reboot with nothing else running... just in case... but to be fair, pt2malloc and gperftools have been run in the exact same setup.

So I was comparing apples with apples in the first place...

I am just going to put all the odds for tcmalloc to shine...

for the record, the bench program does not appear to be multithreaded... My system has isolated NOHZ_FULL nodes... unless a thread/process is pinned on a particular core, all processes run on the same default CPU... but they can be context switched...

Update: I might have a small linking glitch... I am investigating this possibility...

Update 2: by running perf on the gperftools malloc_bench, I have discovered that it was using libc malloc despite linking with libtcmalloc_minimal.a. (the second time that I have rebuilt the project only. the provided numbers are valid)

This was caused by the --as-needed linker switch... I'll perform the same test with the google tcmalloc malloc bench binary but I think that this one was not impacted since I am not using the --as-needed switch in my bazel BUILD file... this switch was coming from a global setting on my system stored in /etc.

with -O3, std=c++26 and LTO, my last gperftools run is even faster than the first one!!!

Benchmark: bench_fastpath_throughput                        5.852132 nsec
Benchmark: bench_fastpath_throughput                        5.898217 nsec
Benchmark: bench_fastpath_throughput                        5.813867 nsec

lano1106 commented 2 months ago

it seems like google tcmalloc is slowed down by:

+   23.01%    23.00%  malloc_bench  malloc_bench          [.] tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]
+    8.29%     8.29%  malloc_bench  malloc_bench          [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)

those slowdowns does not appear to be present in the original gperftools version...

About the PageMap code... I am surprised that it show up in the perf report at all... just maybe, the template plumbing stops the compiler to inline this code... I use gcc 14.2.1

Since you have Chandler Carruth in your team, I assume that you do use clang... That would be very interesting that you build malloc_bench and you report the numbers you get

lano1106 commented 2 months ago

I have reproduced the benchmarks initially performed by Sanjay at https://gperftools.github.io/gperftools/tcmalloc.html

back in ~2006 when gperftools have been published, its tcmalloc was dominating glibc with every parameters. A lot of things have changed. glibc has improved a lot mops_vs_size mops_vs_threads

The takeaway that I see when looking at the results are:

glibc is the best allocator as soon as there 4 or more concurrent accesses
gperftools is the best option for very low contention scenarios and for bigger allocations
google tcmalloc is unfortunately the slowest allocator in all categories...

dvyukov commented 2 months ago

it seems like google tcmalloc is slowed down by:

23.01% 23.00% malloc_bench malloc_bench [.] tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]

8.29% 8.29% malloc_bench malloc_bench [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)

If you see these functions, you probably do debug build. IsNormalMemory is really a trivial one line helper that's always inlined in release build.

For tight new/delete benchmark you should see only 2 functions from tcmalloc: new and delete. These should not have a stack frame as well.

lano1106 commented 2 months ago

I provide the bazel build cmd line + the BUILD file used 2-3 entries ago...

something that I have added in the cmd line is --strip never so that perf can provide me the function names in its reports...

beside that, I am not seeing how I could be using the debug build...

the compile options are: -O3 -g

about the stack frame, bazel adds a bunch of crap by default including: FORTIFY_SOURCE and -fno-omit-frame-pointer

this is to not have all that force-feed crap that --features=-default_compile_flags is used...

you can quickly review what I do and tell me if there is something missing but I don't think that I have made anything that would give me a debug build...

ptmalloc/BUILD:

cc_binary(
    name = "t-test1",
    srcs = [
        "t-test1.c", "lran2.h", "t-test.h", "thread-st.h"
    ],
    deps = [
        "@com_google_tcmalloc//tcmalloc:tcmalloc",
    ],
    defines = [
        "USE_PTHREADS"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

cc_binary(
    name = "ptmalloc_test1",
    srcs = [
        "t-test1.c", "lran2.h", "t-test.h", "thread-st.h"
    ],
    defines = [
        "USE_PTHREADS"
    ],
    copts = [
        "-mtune=sapphirerapids",
    ],
    linkopts = [
        "-O3",
        "-flto=auto",
    ],
)

ckennelly commented 2 months ago

Are you running the background thread? Are the timings after things have warmed up?

lano1106 commented 2 months ago

no background thread... see my questions in the first post... I am seeking help to correctly setup the background thread...

Sanjay testing params are 1000000 iterations... plenty of time to warm up... beside, it is the exact same conditions for every allocators...

I am not sure what we are doing here... I have the impression some shady not well understood explanations are thrown at the test/build params as being responsible plausibly for the bad result instead of searching for the root of the problem.

if there are some mandatory settings needed for good performance on basic benchmarks, should it not be the default values so that tcmalloc is at least on par with its predecessor?

You can reproduce what I have done... All I did is run your allocator with gperftools testing programs... Give it a shot and if you are able to get different and much better results than I did, please document how you have achieve that...

ps.: t-test1.c got deleted from the project but you can retrieve it by digging the commits with git...

dvyukov commented 2 months ago

It matter how the tcmalloc library is compiled. If you add -O3 for cc_binary, it does not affect dependent libraries. Also -O3 should go into copts, not linkopts.

lano1106 commented 2 months ago

-O3 is included for the compiler and the linker. Passing -O3 to the linker is valid. from ld man page:

       -O level
           If  level  is  a numeric values greater than zero ld optimizes the output.  This might take significantly longer and therefore probably
           should only be enabled for the final binary.  At the moment this option only affects ELF shared library generation.  Future releases of
           the linker may make more use of this option.  Also currently there is no difference in the linker's behaviour  for  different  non-zero
           values of this option.  Again this may change with future releases.

it is also added to the bazel build cmd line... Optimization, AFAIK should be applied all over the place on every thing:

bazel build --features=-default_compile_flags --strip never --copt "-march=sapphirerapids" --copt "-O3" --copts "-g" --copt "-flto" --copt "-std=c++26" --copt "-fno-rtti" //benchmark:malloc_bench

lano1106 commented 2 months ago

it seems like google tcmalloc is slowed down by:

23.01% 23.00% malloc_bench malloc_bench [.] tcmalloc::tcmalloc_internal::PageMap::sizeclass(tcmalloc::tcmalloc_internal::PageId) [clone .constprop.0]
* 8.29%     8.29%  malloc_bench  malloc_bench          [.] tcmalloc::tcmalloc_internal::IsNormalMemory(void const*)
If you see these functions, you probably do debug build. IsNormalMemory is really a trivial one line helper that's always inlined in release build.

For tight new/delete benchmark you should see only 2 functions from tcmalloc: new and delete. These should not have a stack frame as well.

this is to be confirmed but I am pretty sure that profilers are able to tell in which inlined functions they are in when sampled in the corresponding code section when the debug symbols are available...

lano1106 commented 2 months ago

I have included jemalloc into the benchmark...

something suspicious is that all allocators appears to suffer from a lock contention except glibc... what could be the common denominator?

I have no clue on how the thread_local keyword is implemented but this could be this guy...

mops_vs_threads_jemalloc

update: I have just checked glibc/malloc/malloc.c, glibc also uses TLS so this is not it... maybe some false sharing when updating some global atomic stats variables...

lano1106 commented 2 months ago

there was a glitch in my measurement methodology. Here are the updated graphs: mops_vs_size_v2 mops_vs_threads_jemalloc_v2

google / tcmalloc

ProcessBackgroundActions and max cache sizes tuning questions #260