
[Metaissue] Memory allocators in Gramine, their performance and possible alternatives #1767

Status: Open. dimakuv opened this issue 9 months ago.

dimakuv commented 9 months ago

Description of the feature

Memory allocators in Gramine were written ~15 years ago. It's time to re-evaluate them and propose alternatives.

I want to start a discussion on this topic. In particular, the next posts will contain information on:

Why should Gramine implement it?

We see more and more performance bottlenecks in memory/object allocation inside Gramine itself. They all stem from the fact that our memory allocators are old and ad-hoc: they are not scalable, not CPU- and cache-friendly, and they do not perform object caching.

Just a couple of recent examples:

For possible alternatives (in Rust), see this comment: https://github.com/gramineproject/gramine/pull/1723#pullrequestreview-1864216238

dimakuv commented 9 months ago

Description of current memory allocators

There are two memory allocators in Gramine: MEMMGR and SLAB.

Both allocators rely on the following shared logic:

MEMMGR fixed-size allocator

Used to allocate specific objects in specific subsystems. Currently used only in LibOS.

Each subsystem of LibOS that uses MEMMGR specifies its own (global to the subsystem) lock. Thus, MEMMGR object allocs/frees within the same subsystem are serialized on this lock, while object allocs/frees in different subsystems can run in parallel.

The current users of MEMMGR:

Every managed object is wrapped into a MEM_OBJ_TYPE struct. This struct doesn't have additional fields (if not built with ASan), so it's the most compact representation possible. When the object is "freed" and its underlying memory is moved to the free list, MEM_OBJ_TYPE's list field is used instead.
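A minimal sketch of this wrapper idea (all names here are illustrative, not the actual Gramine definitions):

```c
/* Sketch only: `struct payload` stands in for the real OBJ_TYPE. While
 * the object is live, the memory holds the payload; once freed, the
 * same bytes are reinterpreted as a free-list link, so the wrapper adds
 * no per-object overhead (without ASan). */
struct payload { long data[4]; };

union mem_obj {
    struct payload obj;       /* live state: the object itself */
    union mem_obj* next_free; /* freed state: link in the free list */
};
```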

Design and implementation are very simple:

  1. There is a single "memory manager" object created for each LibOS subsystem.
  2. This memory manager keeps two lists:
    • available areas (disjoint allocated memory regions)
    • free list of re-usable objects
  3. Each subsystem chooses the initial size (number of objects) that this manager allocates at startup. The first area, large enough to host this initial number of objects, is allocated and marked as active (when the first area becomes full, a second area is allocated and marked as active, and the first area becomes inactive).
  4. When the subsystem needs a new object (alloc), the manager first checks whether the currently active area has any untouched memory left; if yes, the object is "allocated" in this memory. If the active area has no memory left, the manager checks the free list, and if there is a free slot, the object is "allocated" in this slot. If there are no free slots, then (if the subsystem allows it) the manager creates a new area, makes it active, and "allocates" the object at the base address of this new area (see the sketch after this list).
  5. When the subsystem frees the object, there are two possibilities:
    • The to-be-freed object is inside the checkpoint memory region (i.e., the object was copied from the parent process). This corner case is not handled (checkpoint memory is never freed or reused), so the manager returns and the object is leaked.
    • Otherwise, the object is added to the free list.
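To make this concrete, here is a minimal sketch of the alloc/free logic just described; the names (mgr_alloc, mgr_free, struct area) and the calloc-based backing allocation are illustrative, not the real MEMMGR code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct payload { long data[4]; }; /* stand-in for the managed OBJ_TYPE */

union mem_obj {
    struct payload obj;       /* live state */
    union mem_obj* next_free; /* freed state */
};

struct area {
    struct area* next;    /* previously active (now inactive) areas */
    size_t used;          /* slots handed out from this area so far */
    size_t capacity;      /* total slots this area can host */
    union mem_obj objs[]; /* the area's raw object slots */
};

struct mem_mgr {
    struct area* active;      /* only the active area has untouched slots */
    union mem_obj* free_list; /* freed objects, ready for reuse */
};

static struct payload* mgr_alloc(struct mem_mgr* mgr, bool may_grow) {
    /* 1. Untouched memory in the active area wins. */
    if (mgr->active && mgr->active->used < mgr->active->capacity)
        return &mgr->active->objs[mgr->active->used++].obj;

    /* 2. Otherwise reuse a slot from the free list. */
    if (mgr->free_list) {
        union mem_obj* o = mgr->free_list;
        mgr->free_list = o->next_free;
        return &o->obj;
    }

    /* 3. Last resort (only if the subsystem allows growth): allocate a
     *    fresh area, mark it active, allocate at its base address. */
    if (!may_grow)
        return NULL;
    size_t capacity = mgr->active ? mgr->active->capacity : 16;
    struct area* a = calloc(1, sizeof(*a) + capacity * sizeof(union mem_obj));
    if (!a)
        return NULL;
    a->capacity = capacity;
    a->next = mgr->active;
    mgr->active = a;
    return &a->objs[a->used++].obj;
}

static void mgr_free(struct mem_mgr* mgr, struct payload* p) {
    /* The checkpoint-region corner case (object leaked on purpose) is
     * omitted here; everything else goes onto the free list. */
    union mem_obj* o = (union mem_obj*)p;
    o->next_free = mgr->free_list;
    mgr->free_list = o;
}
```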

The MEMMGR memory managers are never "reset", shrunk, or deleted. Thus, if LibOS allocates a lot of MEMMGR objects initially and then frees them all, this MEMMGR memory is retained forever (effectively leaked). This should be a rare and unimportant case, though.

Backend-memory (page-sized) allocation happens via __system_malloc() declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c

Backend-memory (page-sized) deallocation, as mentioned above, doesn't really happen. But if it did, it would go through __system_free(), also declared here: https://github.com/gramineproject/gramine/blob/master/libos/src/libos_malloc.c
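For illustration, such a page-granularity backend pair could look like the stand-in below, which uses raw mmap()/munmap(); Gramine's real __system_malloc()/__system_free() instead go through the LibOS's internal memory bookkeeping:

```c
#include <stddef.h>
#include <sys/mman.h>

/* Stand-in for __system_malloc(): grab fresh anonymous pages. */
static void* system_malloc_sketch(size_t size) {
    void* addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return addr == MAP_FAILED ? NULL : addr;
}

/* Stand-in for __system_free(): the caller passes the size back,
 * since there is no per-allocation header at this level. */
static void system_free_sketch(void* addr, size_t size) {
    munmap(addr, size);
}
```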

Support for ASan

Open issues

SLAB variable-size allocator

Generic backend for malloc and free in all other subsystems. Used both in LibOS and in PAL.

When an object of arbitrary size needs to be allocated/freed in LibOS or in PAL, the traditional malloc() and free() calls are used; they are wrappers around slab_alloc() and slab_free(). See:

Backend-memory (page-sized) allocation and deallocation are implemented via:

There is a single global slab manager and corresponding lock for LibOS and similarly a single global slab manager and corresponding lock for PAL. See these:
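The setup just described boils down to the pattern below. This is a sketch with stub declarations; all names and signatures here are assumptions for illustration, not the real Gramine API:

```c
#include <stddef.h>

/* Stubs standing in for the real slab manager and LibOS lock. */
typedef struct slab_mgr* SLAB_MGR;
void* slab_alloc(SLAB_MGR mgr, size_t size);
void  slab_free(SLAB_MGR mgr, void* ptr);
struct libos_lock { int _opaque; };
void lock(struct libos_lock* l);
void unlock(struct libos_lock* l);

/* One global manager and one global lock per binary (LibOS or PAL). */
static SLAB_MGR g_slab_mgr;
static struct libos_lock g_slab_lock;

void* malloc_sketch(size_t size) {
    lock(&g_slab_lock);
    void* ptr = slab_alloc(g_slab_mgr, size);
    unlock(&g_slab_lock);
    return ptr;
}

void free_sketch(void* ptr) {
    if (!ptr)
        return;
    lock(&g_slab_lock);
    slab_free(g_slab_mgr, ptr);
    unlock(&g_slab_lock);
}
```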

NOTE We have a struct libos_lock in LibOS. This lock is implemented via PalEventWait() and PalEventSet(), which do have a fast path, but the slow path results in ocall_futex(), which is very expensive in SGX. Maybe we could replace it with a spinlock? Most of the time we hit the slab allocator's cache, which is a fast operation, so there seems to be no real need for ocall_futex() in this case.
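For reference, a replacement along these lines could be as simple as a test-and-test-and-set spinlock; the following is a generic C11 sketch, not Gramine's actual spinlock implementation:

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked; /* 0 = free, 1 = held */
} spinlock_sketch_t;

static void spin_lock_sketch(spinlock_sketch_t* l) {
    for (;;) {
        /* Try to grab the lock with an atomic exchange. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
        /* Spin on plain loads to avoid cache-line ping-pong, with the
         * x86 PAUSE hint (SGX implies x86-64). */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            __builtin_ia32_pause();
    }
}

static void spin_unlock_sketch(spinlock_sketch_t* l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Whether this wins depends on contention: under heavy contention a spinlock burns CPU inside the enclave, whereas the futex path at least yields the core.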

Design and implementation are based on the MEMMGR allocator for the common case, with a trivial fallback for the large-object case:

Deallocation happens similarly to the allocation description above:

  1. Get the SLAB allocator's metadata for the to-be-freed object. Part of this metadata is the level value.
  2. If level == -1, it means that the object's size is greater than the max allowed in SLAB, so it was allocated via backend-memory allocation and thus must be deallocated via backend-memory free: https://github.com/gramineproject/gramine/blob/f35d8e034c1d9bf7b6e2da9125b64a25b622b1d9/common/include/slabmgr.h#L402-L412
  3. Otherwise, this object's memory is returned to the free list of the corresponding SLAB level (see the sketch after this list): https://github.com/gramineproject/gramine/blob/f35d8e034c1d9bf7b6e2da9125b64a25b622b1d9/common/include/slabmgr.h#L440-L443
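The same flow as a self-contained sketch; the header layout, the metadata lookup, and all names here are illustrative (the real logic is behind the slabmgr.h links above):

```c
#include <stddef.h>
#include <stdlib.h>

#define SLAB_LEVELS 8 /* illustrative number of size classes */

struct slab_hdr {
    int level; /* size-class index, or -1 for a large "raw" allocation */
};

struct free_slot {
    struct free_slot* next;
};

struct slab_mgr_sketch {
    struct free_slot* free_list[SLAB_LEVELS]; /* one free list per level */
};

static void slab_free_sketch(struct slab_mgr_sketch* mgr, void* mem) {
    /* 1. Recover the allocator's per-object metadata (stored right
     *    before the user pointer in this sketch). */
    struct slab_hdr* hdr = (struct slab_hdr*)mem - 1;

    /* 2. level == -1: the object was too big for any size class and was
     *    carved directly out of backend pages; return it there. */
    if (hdr->level == -1) {
        free(hdr); /* stand-in for the page-granularity backend free */
        return;
    }

    /* 3. Otherwise, push the slot onto its level's free list. */
    struct free_slot* slot = (struct free_slot*)mem;
    slot->next = mgr->free_list[hdr->level];
    mgr->free_list[hdr->level] = slot;
}
```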

Open issues