Consider using jemalloc as a replacement for the system allocator

The jemalloc memory allocator, introduced into the system in #35, can also be used as a replacement for the system memory allocator. This can offer a performance benefit in some applications, sometimes a substantial one, and is worth some data-driven experimentation. It might also mean that jemalloc's usage analytics could be available program-wide.

UPDATE: further reading into jemalloc shows that by default it might be installing itself as the default allocator on MacOS using the zone allocation hooks. It certainly also provides a C++ integration. So overriding the system allocator may be as simple as removing the --disable-cxx and --disable-zone-allocator flags, as documented in the jemalloc INSTALL file.

However, replacing the system allocator is non-trivial and platform-specific. It will come at a fairly high complexity cost and therefore may only be worthwhile if the performance benefits are substantial. What follows are some notes taken when doing the initial research on this.

One important note is that jemalloc can be built, at configure time, either with a user-supplied prefix for all functions (the default is je_) or without one. If no prefix is provided, the jemalloc functions will have the same names as the system C functions, meaning for example je_malloc() becomes just malloc(). The unprefixed option is not available on MacOS, as the compiler complains about redefinition of the system allocator functions.
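
As a rough illustration of what the prefix means for calling code, here is a minimal sketch of a wrapper that routes through either the prefixed or the system names. The je_malloc() and je_free() definitions below are stand-ins that only forward to the system allocator so the sketch compiles without jemalloc; the HADRON_PREFIXED_JEMALLOC macro and the hadron_sketch namespace are hypothetical names, not anything in the current build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-ins for a prefixed jemalloc build, so this sketch compiles without
// jemalloc. A real build would include <jemalloc/jemalloc.h> and delete these.
static std::size_t g_prefixedCalls = 0;
extern "C" void* je_malloc(std::size_t size) {
    ++g_prefixedCalls;  // lets us observe which path was taken
    return std::malloc(size);
}
extern "C" void je_free(void* ptr) { std::free(ptr); }

// Hypothetical build switch: 1 = call the je_-prefixed functions explicitly,
// 0 = call plain malloc/free (which an unprefixed jemalloc would replace).
#ifndef HADRON_PREFIXED_JEMALLOC
#define HADRON_PREFIXED_JEMALLOC 1
#endif

namespace hadron_sketch {

inline void* allocate(std::size_t size) {
#if HADRON_PREFIXED_JEMALLOC
    return je_malloc(size);
#else
    return std::malloc(size);
#endif
}

inline void deallocate(void* ptr) {
#if HADRON_PREFIXED_JEMALLOC
    je_free(ptr);
#else
    std::free(ptr);
#endif
}

// Allocate a block, write to it, and release it through the wrappers.
inline bool roundTrip(std::size_t size) {
    assert(size > 0);
    void* block = allocate(size);
    if (block == nullptr) return false;
    std::memset(block, 0xAB, size);
    const bool ok = static_cast<unsigned char*>(block)[size - 1] == 0xAB;
    deallocate(block);
    return ok;
}

}  // namespace hadron_sketch
```

With a prefixed build only code that goes through wrappers like these uses jemalloc, while everything else (including third-party libraries) stays on the system allocator.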

Three overall approaches to jemalloc usage are outlined in the Getting Started documentation:

1. Use dynamic linking to inject jemalloc at runtime, either directly (in the unprefixed case) or with proxy or wrapper functions.
2. Statically link jemalloc into your build and use the unprefixed functions to provide allocation. This approach won't work on MacOS because of the requirement to build a prefixed version of the library.
3. Statically link jemalloc and use the prefixed builds, using jemalloc distinctly from the system allocator.

If the goal is to replace the system allocator, and to work across all three operating systems, only the first seems like a viable approach. Below are some notes on different examples of this approach that I found while researching system allocator replacements:

Intel's Threaded Building Blocks code uses a tbbmalloc_proxy shared library to override system malloc, with specific instructions on MacOS, Windows, and Linux.
On MacOS it's also necessary to replace the default zone memory allocation functions. Mozilla Firefox (which uses jemalloc, at least for some uses) has some code to do this (note the very instructive comments about the different default zones on different versions of MacOS), as does TBBMalloc, although notably Heap-Layers, a library created by the author of the Hoard memory allocator, does not. It's not clear whether overriding the default zone allocation functions would make the symbol overriding via dylib injection unnecessary; it likely would, as the source code to malloc on MacOS seems to suggest.
As a side note, for experimentation it might be useful to try Heap-Layers as a convenient framework to compare different memory allocators on performance and reliability tests with Hadron.
Probably also worth digging further into Mozilla's malloc replacement code, as well as this ancient bug documenting the integration of jemalloc on the Mac.
There's a cautionary note about using a dynamic linker replacement for malloc and free, specifically when it comes to C++ allocator performance. The point is that for new and delete the compiler can make much better optimizations, specifically inlining, for statically linked malloc and free functions. That may push the pendulum back to static linking, given that jemalloc provides overrides for new and delete.
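
To make the new and delete point concrete, here is a minimal sketch of how global operator new and operator delete can be replaced so that they forward to a statically linked allocator. This is not jemalloc's actual C++ integration: allocatorMalloc() and allocatorFree() are hypothetical stand-ins for je_malloc() and je_free() that forward to the system allocator and count calls, so the sketch builds without jemalloc.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

namespace sketch {

// Hypothetical stand-ins for a statically linked allocator's entry points
// (for example je_malloc/je_free in a prefixed jemalloc build).
std::atomic<std::size_t> g_allocations{0};

void* allocatorMalloc(std::size_t size) noexcept {
    g_allocations.fetch_add(1, std::memory_order_relaxed);
    return std::malloc(size == 0 ? 1 : size);  // operator new must not return null for size 0
}

void allocatorFree(void* ptr) noexcept { std::free(ptr); }

}  // namespace sketch

// Replacement global operators. The default array and nothrow forms call
// these, so replacing the single-object forms covers ordinary allocations.
void* operator new(std::size_t size) {
    if (void* ptr = sketch::allocatorMalloc(size)) return ptr;
    throw std::bad_alloc();
}

void operator delete(void* ptr) noexcept { sketch::allocatorFree(ptr); }

void operator delete(void* ptr, std::size_t) noexcept { sketch::allocatorFree(ptr); }

namespace sketch {

// Checks that a direct call to ::operator new reaches the replacement above.
// The volatile pointer keeps the compiler from eliding the allocation.
inline bool replacementIsActive() {
    const std::size_t before = g_allocations.load();
    void* volatile block = ::operator new(32);
    const bool counted = g_allocations.load() > before;
    ::operator delete(block);
    return counted && block != nullptr;
}

}  // namespace sketch
```

Because the allocator functions here live in the same binary as the operators, the compiler is free to inline them into operator new and operator delete, which is the optimization a dynamically injected malloc and free would rule out.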
