Web site: http://www.nedprod.com/programs/portable/nedmalloc/
Enclosed is nedalloc, an alternative malloc implementation for multiple threads without lock contention, based on dlmalloc v2.8.4 and a specialised user mode page allocator (Windows Vista or later only). It has the following features:
It is licensed under the Boost Software License which basically means you can do anything you like with it. This does not apply to the malloc.c.h file which remains copyright to others. Commercial support is available from ned Productions Limited.
It has been tested on Win32 (x86), Win64 (x64), Linux (x64), FreeBSD (x64) and Apple Mac OS X (x86). It works very well on all of these and is very significantly faster than the system allocator on Windows XP and FreeBSD < v7. If you are on Apple Mac OS X 10.6 or later, or on Windows 7 or later, then you probably won't see much improvement without modifying your source to use the v2 malloc API (and kudos to Apple and Microsoft for adopting excellent allocators).
The user mode page allocator delivers jaw-dropping real world performance improvements, but requires running the process as the superuser. Without superuser privileges, nedalloc still offers sizeable gains on all older operating systems and, through the v2 malloc API, modest gains on all very recent operating systems.
The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache() for every pool you use on thread exit, and don't forget neddisablethreadcache(0) for the system pool if necessary. Run and enjoy!
To test, compile test.c (C) and test.cpp (C++). Both will run a comparison between your system allocator and nedalloc and tell you how much faster nedalloc is. They also serve as examples of usage.
If you'd like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest route is scons, which comes with a myriad of build options (list them with scons -h). If you want to build MSVC project files for use with Microsoft Visual Studio, then: (i) install Python; (ii) install scons; (iii) open a Visual Studio Command Prompt for the Visual Studio you wish to use via Start Menu => Programs => Microsoft Visual Studio XXXX => Visual Studio Tools => Visual Studio XXXX Command Prompt; (iv) change directory to the nedmalloc directory (e.g. by dragging in its folder); (v) type "!MakeMSVCProjs" and hit Return. Note that Visual Studio 2008 and later require scons v2.1 or later.
nedalloc comes with two new memory allocator APIs: one is for C++, and the other is for C. Full documentation for all nedalloc's APIs and features is provided in the enclosed nedalloc.chm which is in Microsoft HTML Help format (Linux and Apple Mac OS X will happily read this format too). If you don't want to use the CHM documentation, nedmalloc.h is extensively commented with doxygen markup.
For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc features to remedy many of the long standing problems and inefficiencies caused by C++'s traditional over-fondness for copying things. While its implementation is complex, usage is extremely easy - simply supply nedallocator<> as the custom allocator to STL container classes.
As nedmalloc can do even better for vector extension, nedmalloc.h also contains a nedvector<> implementation, which is the standard STL vector<> implementation except that it makes use of the non-relocating facilities of realloc2() (see below). This allows nedvector<> to avoid overallocating memory (most STL vector<> implementations will overallocate by 50%), which saves a lot of memory as well as completely avoiding the array copy construction which makes std::vector<>::resize() so very, very slow.
Even without nedalloc's major speed improvements as a simple C style allocator, the improvements to the C++ memory infrastructure alone can generate huge performance gains.
[Note: This API will be completely replaced in v1.2]
For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a new general purpose allocator API was designed which is intended to remedy many of the long standing problems and inefficiencies introduced by the ISO C allocator API. Internally nedalloc's implementations of nedmalloc(), nedcalloc(), nedmemalign() and nedrealloc() all call into this API:
void* malloc2(size_t bytes, size_t alignment, unsigned flags)
void* realloc2(void* mem, size_t bytes, size_t alignment, unsigned flags)
If nedmalloc.h is being included by C++ code, the alignment and flags parameters default to zero which makes the new API identical to the old API (roll on the introduction of default parameters to C!). The ability for realloc2() to take an alignment is particularly useful for extending aligned vector arrays such as SSE/AVX vector arrays. Hitherto SSE/AVX vector code had to jump through all sorts of unpleasant hoops to maintain alignment during array extension :(.
The flags supported include the ability to zero memory, to prevent realloc2() from moving a memory block, to force mmap() to be used from the beginning (useful when you know an array will be repeatedly extended) and to cause malloc2() to reserve additional address space after the allocation such that a realloc2() up to that reserved space will be very quick. On 32 bit Windows and Linux this reservation costs no address space in your process, so using it will NOT cause premature address space exhaustion.
You should note that realloc()'s thunk to realloc2() defaults the flags to M2_RESERVE_MULT(8) i.e. if realloc() needs to allocate a block larger than mmap_threshold, it will also reserve eight times the address space of that allocation in order to make future realloc()'s up to that point much faster. This catches the vast majority of situations where large arrays are repeatedly extended.
If you want the very latest version of this allocator, get it from the TnFOX GIT repository at either of (both are identical mirrors):
IF YOU THINK YOU HAVE FOUND A BUG, PLEASE CHECK ONE OF THESE REPOS FIRST BEFORE REPORTING IT!
Because nedalloc allocates an mspace per thread, it can cause severe bloating of memory usage under certain allocation patterns. You can substantially reduce this wastage by setting DEFAULTMAXTHREADSINPOOL, or the threads parameter to nedcreatepool(), to a fraction of the number of threads which would normally be in a pool at once. This will reduce bloating at the cost of an increase in lock contention, with DEFAULTMAXTHREADSINPOOL=1 removing almost all bloating. If the block sizes typically allocated are less than THREADCACHEMAX, locking is avoided 90-99% of the time; so if most of your allocations are below this value, you can safely set DEFAULTMAXTHREADSINPOOL or even MAXTHREADSINPOOL to one.
If you have LOTS of threads you may find that the threadcache held per thread is causing memory bloating. You can call nedtrimthreadcache() to trim the cache in a thread when you know that it won't be doing memory allocation (e.g. just before going to sleep), or alternatively you can set THREADCACHEMAXFREESPACE to something smaller than its default of 1Mb.
Lastly, some people find that memory is not returned to the system when they think it ought to be. dlmalloc only returns free memory to the system when there is DEFAULT_TRIM_THRESHOLD (default=2Mb) free in an mspace, and it only checks how much is free outside the topmost segment every MAX_RELEASE_CHECK_RATE free()'s. In other words, if your program very rapidly deallocates an awful lot of memory and then does not call free() for some time thereafter, dlmalloc will not release memory to the system. Generally in any real world code scenario free() will be called fairly frequently, and if not then you can always force release using nedmalloc_trim().
You will suffer memory leakage unless you call neddisablethreadcache() per pool for every thread which exits (unless you are using nedalloc from its DLL on Windows). This is because nedalloc cannot portably know when a thread exits and thus when its thread cache can be returned for use by other code. Don't forget pool zero, the system pool. On some POSIX threads implementations there exists a pthread_atexit() which registers a termination handler for thread exit - if you don't have one of these then you'll have to do it manually.
Equally if you use nedalloc from a dynamically loaded DLL or shared object which you later kick out of memory, you will leak memory if you don't disable all thread caches for all pools (as per the preceding paragraph), destroy all thread pools using neddestroypool() and destroy the system pool using neddestroysyspool().
For C++ type allocation patterns (where the same small sizes of memory are regularly allocated and deallocated as objects are created and destroyed), the threadcache always benefits performance as it will cache all malloc/free allocations under THREADCACHEMAX in size. If however your allocation patterns are different, searching the threadcache may significantly slow down your code - as a rule of thumb, if cache utilisation is below 80% (see the source for neddisablethreadcache() for how to enable debug printing in release mode) then you should disable the thread cache for that thread. You can compile out the threadcache code by setting THREADCACHEMAX to zero.
For some applications defining ENABLE_LARGE_PAGES can give a 10-15% performance increase by having nedalloc allocate using large pages only (which are 2Mb on x86/x64). Large pages take much less space in the TLB cache and can greatly benefit programs with a large working set, particularly on 64 bit systems.
Support for large pages is limited to Linux and Windows. On Linux one must employ the libhugetlbfs library anyway, as this is the "official" form of large page support, and setting it up and configuring it involves mounting a special hugetlbfs filing system. dlmalloc does not require a dependency on the libhugetlbfs headers; rather, it searches for the library in the current process and, if it is not found, silently disables support.
On Windows, large page support is only implemented on Windows Server 2003/Vista or later and they are only permitted to be allocated by users holding the "Lock pages in memory" local security setting which is DISABLED by default. Furthermore, the process using nedalloc must hold the SeLockMemoryPrivilege privilege. If you are using the DLL then the DLL attempts to enable the SeLockMemoryPrivilege during initialisation - therefore if you are not using the DLL you will have to do this manually yourself. As with Linux support, if at any stage large pages cannot be allocated, then dlmalloc silently disables support - this allows one binary to function correctly in any environment. Note that on Windows if your process allocates a lot of memory at once when the machine has been running for an extended period, then the whole computer may hang for several seconds as the Windows kernel copies memory around in order to coalesce a large page. This is a problem with the Windows kernel and its VM design, not nedmalloc! If you would like to see how large pages ought to be implemented, research how FreeBSD implemented them.
It is often very useful to have a log of the memory operations which an application performs - you would be amazed at the inefficiencies in memory usage that this can reveal. nedalloc contains a very fast memory operation logger which keeps a per-thread log of selected operations, including an optional stack backtrace. On pool destruction, or nedflushlogs(), nedalloc will write out the log as a Comma Separated Value format file which can be loaded into applications such as Excel for analysis.
To use, define ENABLE_LOGGING to the bitmask of enum LogEntryType items in which you are interested, so 0xffffffff would log absolutely everything. The macro NEDMALLOC_TESTLOGENTRY, whose default is (ENABLE_LOGGING & logentrytype), is then used to determine which items should be logged. You can also enable stack backtracing on MSVC and GCC using NEDMALLOC_STACKBACKTRACEDEPTH.
If you are running on Windows, there are quite a few extra options available thanks to work generously sponsored by Applied Research Associates (USA):
If you define REPLACE_SYSTEM_ALLOCATOR when building the DLL, then within any process it is loaded into the DLL will replace most usage of the MSVCRT allocator (release MSVCRT, not debug MSVCRTD) with nedalloc's routines, whilst remaining able to handle the odd free() of an MSVCRT block allocated during CRT init. This very conveniently lets you simply link with the nedalloc DLL and have your application magically use it with no code changes required; and because the MSVC implementations of operator new and operator delete both call malloc() and free(), it also covers all C++ code. The following code is suggested:
#pragma comment(lib, "nedmalloc.lib")
This asks the linker to link against nedmalloc.lib during linking - without this pragma the linker will generally leave out nedmalloc as there are no explicitly imported routines that it understands. This auto-patching feature can also be combined with Microsoft's Detours to run any arbitrary application using nedalloc instead of the system allocator:
withdll /d:nedmalloc.dll program.exe
For those not able to use Microsoft Detours, there is an enclosed unsupported/nedmalloc_loader program which does one variant of the same thing. It may or may not be useful to you - it is not intended to be maintained, and it probably doesn't work on newer systems.
The reason that only the release MSVCRT not the debug MSVCRTD is patched is twofold: (i) usually one wants the debug heap in debug builds so it does memory corruption checking and reports memory leaks and (ii) the MSVC CRT actually implements operator new and malloc using a completely different implementation based on the Windows kernel HeapAlloc() function and it does a lot of hoop jumping to handle mismatching CRT versions and lots of other stuff. You can enable patching of the debug memory allocation functions in winpatcher.c by uncommenting the relevant lines.
See Benchmarks.xls for details.
The enclosed test.c can do one of two things: it can be a torture test which mostly hammers realloc(), or it can be a pure speed test which sticks to simple malloc() and free(). If you enable C++ mode, half of the allocation sizes will be a power-of-two multiple less than 512 bytes, to mimic C++ stack-instantiated objects, which are extremely common in C++ code.
The torture test is designed to mercilessly work realloc() which is the most complex and complete code path in any memory allocator. Most allocators have very poor realloc() performance - not so nedalloc which makes use of mremap() support on Linux and Windows. Even without mremap() support nedalloc's realloc() tends to be significantly faster than any standard allocator.
The speed test is designed to be a representative synthetic memory allocator test where most allocations follow a stack pattern. It works by randomly mixing allocations with frees with sizes being a random value less than 16Kb.
The C++ test.cpp simply benchmarks how much difference nedalloc::nedallocatorise<> makes to std::vector<> performance, particularly the performance of push_back(), pop_back() and vector assignment all of which are very common in real world code. As you will see, the STL - even with C++0x move constructor support - does not perform anywhere close to nedalloc's version which achieves its gains by simply avoiding copy and move construction completely.
The real world code results are from Tn's TestIO benchmark. This is a heavily multithreaded and memory intensive benchmark with a lot of branching and other stuff modern processors don't like so much. As you'll note, the test doesn't show the benefits of the threadcache mostly due to the saturation of the memory bus being the limiting factor.
I get quite a few bug reports about code not working properly under nedalloc. I do not wish to sound presumptuous, however in an overwhelming majority of cases the problem is in your application code and not nedalloc (see below for all the bugs reported and fixed since 2006). Some of the largest corporations and IT deployments in the world use nedalloc pre-v1.10, and pre-v1.10 has been very heavily stress tested on everything from 32 processor SMP clusters right through to root DNS servers, ATM machine networks and embedded operating systems requiring a very high uptime. The v1.10 release adds a LOT of new code and features, and hence there are quite likely a lot of new bugs in the new code.
In particular, just because your application happens to appear to work under the system allocator does not mean that it is not riddled with memory corruption and non-ANSI usage of the API! And often this is not your own code's fault: it is usually the third party libraries being used, which sadly often include system libraries.
Even though debugging an application for memory errors is a true black art made possible only with a great deal of patience, intuition and skill, here is a checklist for things to do before reporting a bug in nedalloc:
I hope that these tips help. And I urge anyone considering simply dropping back to the system allocator as a quick fix to reconsider: squashing memory bugs often brings with it significant extra benefits in performance and reliability. It may cost what appears to be a lot extra now, but it usually will save itself many times its cost over the next few years. I know of one large multinational corporation that saved hundreds of millions of dollars due to the debugging of their system software performed when trying to get it working with nedalloc - they found one bug in nedalloc but over a hundred in their own code, and in the process improved performance threefold, which saved an expensive hardware upgrade and deployment. The conclusion can only be that fixing memory bugs now tends to be worth it in the long run.