llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.84k stars 11.47k forks source link

libc (embedded) `malloc` space and time improvements (comparing to dlmalloc) #98096

Open mysterymath opened 2 months ago

mysterymath commented 2 months ago

Despite its age, Doug Lea's malloc implementation (dlmalloc) still remains near the state of the art for embedded single-threaded or lightly multi-threaded allocators. While there has been much improvement in the space of general system mallocs since dlmalloc was SOTA, much of that work owes to Hoard, which shlfted design focus towards cache locality, limiting contention, and page friendliness. These concerns typically apply much less (or not at all) in the embedded space. There has been a fair amount of work in the embedded space too, but much of it centers around improving dlmalloc (e.g. TLSF). dlmalloc is also the main implementation in newlib and picolibc, so llvm-libc's malloc is likely to be directly compared against it.

Accordingly, this issue catalogs the characteristics of the current llvm-libc that compare unfavorably to dlmalloc. Many of these can be corrected without too much trouble; others might require some light algorithms work. It's also likely possible to build something smaller than dlmalloc that would better suite this space; after all, it wasn't originally designed for it.

All of the following assumes a 32-bit system. Take all the following with a huge grain of salt. I've only read the comments and implemented something similar to dlmalloc on the side. I haven't actually measured dlmalloc or llvm-libc's current allocator.

Space

Time

All dlmalloc operations finish in "constant time". This is only true in the same sense that it's true for every logarithmic time operation on a data structure stored in memory; dlmalloc uses a clever binary trie to organize its free lists.

The current allocator scans its free lists in time linear to the number of free blocks in that size class. Since the free lists are singly linked, it also does a scans when chunks are coalesced during free, which could otherwise be done in constant time.

Initialization cost

Embedded dlmalloc typically initializes the heap lazily. Since the heap doesn't need to be zeroed, the initialization simply establishes boundary tags for the free area, which takes constant time.

llvm-libc's malloc places the heap in .bss, which may require it to be unnecessarily zeroed on program startup. This is linear in the size of the heap.

llvmbot commented 2 months ago

@llvm/issue-subscribers-libc

Author: Daniel Thornburgh (mysterymath)

Despite its age, Doug Lea's `malloc` implementation (dlmalloc) still remains near the state of the art for embedded single-threaded or lightly multi-threaded allocators. While there has been much improvement in the space of general system `malloc`s since dlmalloc was SOTA, much of that work owes to Hoard, which shlfted design focus away towards cache efficiency and page friendliness, which mostly don't apply in the embedded space. There has been a fair amount of work in the embedded space too, but much of it also mostly centers around improving dlmalloc (e.g. TLSF). dlmalloc is also the main implementation in newlib and picolibc, so llvm-libc's `malloc` is likely to be directly compared against it. Accordingly, this issue catalogs the characteristics of the current llvm-libc that compare unfavorably to dlmalloc. Many of these can be corrected without too much trouble; others might require some light algorithms work. It's also likely possible to build something smaller and better than dlmalloc for this space; after all, it wasn't originally designed for it. All of the following assumes a 32-bit system. Take all the following with a huge grain of salt; I've only read the code and implemented something similar to dlmalloc on the side; I haven't actually measured dlmalloc or llvm-libc's current allocator. Space: - Minimum allocation size - dlmalloc's minimum effective allocation size is 16 bytes, which is the minimum possible for an 8-byte aligned `malloc` with any overhead. - It's a bit hard to read out of the implementation, but it appears that the current allocator has 20 bytes of overhead per free chunk, making the minimum size at least 20 bytes. More may be required for alignment, but it's difficult to tell what the minimum or average cases would be. - Used block efficiency - dlmalloc's used blocks contain only a single 4 byte word as overhead. Accordingly, the minimum 16-byte chunks size can fit a 14-byte allocation. - The current allocator has used blocks pay all but 8 bytes of the overhead of a free block. So a minimum-sized 20-byte chunk could only fit an 8 byte allocation. Time: All dlmalloc operations finish in "constant time". This is only true in the same sense that it's true for every logarithmic time operation on a data structure stored in memory; dlmalloc uses a clever binary trie to organize its free lists. But many operations do not involve the trie, and they function with a much smaller constant overhead. The current allocator scans its free lists in time linear to the number of free blocks in that size class. Since the free lists are singly linked, it also does a scans when chunks are coalesced during `free`, which could otherwise be done in constant time. Initialization cost: Embedded dlmalloc typically initializes the heap lazily. Since the heap doesn't need to be zeroed, the initialization simply establishes boundary tags for the free area, which takes constant time. llvm-libc's malloc places the heap in `.bss`, which requires it to be unnecessarily zeroed on program startup, which is an operation linear in the size of the heap.