microsoft / snmalloc

Message passing based allocator
MIT License
1.57k stars 108 forks source link

__builtin_dynamic_object_size and snmalloc #531

Open mjp41 opened 2 years ago

mjp41 commented 2 years ago

On twitter @richfelker has said we should consider providing __builtin_dynamic_object_size as a more comprehensive way to provide guarded memcpy like features:

I'd love if there were some clean agreed upon way to get the size as a stronger version of __builtin_dynamic_object_size so this would be usable in all the FORTIFYable interfaces not just memcpy. twitter

and

The right way to do this is the way fortify works. Not putting linkage to malloc in the external memcpy, but providing an enhanced __builtin_dynamic_object_size that can query the allocator for knowledge of size. twitter

I'm raising this issue to build a discussion of what this should mean with snmalloc.

The Clang documentation is here: https://clang.llvm.org/docs/LanguageExtensions.html#evaluating-object-size-dynamically

Here is the LLVM review adding it https://reviews.llvm.org/D56760

It doesn't seem it works from an offset into an object. But I think this is worth experimenting with, as it would allow the compiler to remove some checks, and then what is left to be passed to the snmalloc routine.

I think we could define __builtin_dynamic_object_size as just remaining_bytes from snmalloc.

@davidchisnall thoughts? Is there something sensible that could be experimented with here?

davidchisnall commented 2 years ago

I'm not a huge fan of the FORTIFY_SOURCE approach because it doesn't catch any bugs that the clang static analyser doesn't also catch. When I asked Google's security team why they'd added it to Android, their response was that they don't run any kind of static analysis in CI (I believe that's changed now).

I don't think __builtin_dynamic_object_size does what your Twitter people think it does. It has the same behaviour as __builtin_object_size, except that it the result isn't required to be a compile-time constant. For example:

size_t size(void*a)
{
        return __builtin_dynamic_object_size(a, 0);
}

This does not generate any kind of library call, it simply evaluates to -1 unconditionally.

We could provide a library call that the compiler can lower this to but I suspect that the perf overhead would be too high (my first prototype did exactly this). There are basically three approaches:

Note that the last approach composes with a FORTIFY_SOURCE-like approach, because it allows the caller to check stack allocations and globals and compiles down to nothing for the cases where the compiler doesn't know the size at compile time, punting to the library version for heap allocations. We could possibly add an unsafe_memcpy entry point that skips the bounds checks so that a version in the header can dispatch to either of these: if it knows the bounds checks will succeed then it's fine.

We provide our own memcpy because the cost of an extra function call (which involves spilling and reloading arguments) is quite measurable if we do it any other way.

mjp41 commented 2 years ago

I think both of the comments were about expanding the behaviour of

__builtin_dynamic_object_size

rather than its current behaviour. I could easily imagine a "stronger"/"enhanced" version that would call snmalloc instead of returning -1.

I guess the bit that is most appealing is getting checks statically eliminated.

We provide our own memcpy because the cost of an extra function call (which involves spilling and reloading arguments) is quite measurable if we do it any other way.

If the compiler can eliminate sufficiently many checks, then for some instances, the cost of the checks being an additional function call could be mitigated, but we would have to measure.

@davidchisnall Could you post a link to your FreeBSD experiments here?

davidchisnall commented 2 years ago

If the compiler can eliminate sufficiently many checks, then for some instances, the cost of the checks being an additional function call could be mitigated, but we would have to measure.

Currently, LLVM can elide these only for stack allocations, so we'd still be inserting the extra call(s - two if we're checking the source as well as the destination) for things that are not statically provable to be stack allocations. Not sure if GCC does better here.

@davidchisnall Could you post a link to your FreeBSD experiments here?

The current version is here, it doesn't include the hardening for non-memcpy functions - I have an old version of that locally but it needs updating / rewriting for the newer snmalloc and newer libc.

rengolin commented 2 years ago

rather than its current behaviour. I could easily imagine a "stronger"/"enhanced" version that would call snmalloc instead of returning -1.

I'm not an expert, but here are some considerations...

There are a few issues with this:

  1. There is already a fixed (documented) API for both __builtin_object_size and __builtin_dynamic_object_size (here) which specify more or less what the clang docs said, but more explicitly. If we want different behaviour we'd have to create a new call (ex. __builtin_instrumented_object_size) and pass pointer, type and perhaps a function pointer to the malloc-specific checker (with some pre-defined API of its own).
  2. The new builtin will definitely be a dynamic check, but it should be forced inline and super simple (ex. load a value from a global table), with hopefully no branching but safe (returning -1 or 0 when not known).
  3. The builtin isn't used directly on malloc/free, but via __builtin___memcpy_chk and friends, which accept the additional size argument. This requires either user source modification, compiler instrumentation or a default "safe" malloc call and an additional unsafe_malloc as you suggested.
  4. Requiring user code changes is a non-starter, I think. The easiest is to change snmalloc to be safe0first, but that adds dynamic checks on every malloc/copy operation, which may be substantial.
  5. If we want the compiler to add that kind of instrumentation on malloc/free calls, we'll have to change both object_size and _chk variants to select the instrumented variants with some kind of attribute/pragma like fortify source.

In the end, you don't need to change the behaviour of __builtin_object_size and __builtin_dynamic_object_size, you can "just" instrument the _chk functions directly in your memcpy et al replacement with some local computation getting he right size, but that requires overriding those symbols and all that it entails.

fweimer-rh commented 2 years ago

Address Sanitizer already supports such object size checks and has an interface for them, but it does not provide ABI stability.

One complication is that you cannot assume that an arbitrary pointer points into the malloc heap. It could be inside a stack allocation, or the backing memory could have been allocated via a direct mmap system call.

(And it's true that __builtin_dynamic_object_size is not really related to any of this—it can be used to compare compile-inferred sizes with run-time size requirements. From an allocator perspective, this only needs allocation functions that are annotated as such, so that the compiler can recognize them even if they do not have standard names such as malloc.)

mjp41 commented 2 years ago

Address Sanitizer already supports such object size checks and has an interface for them, but it does not provide ABI stability.

Interesting, so could you point me at the documentation or some use cases of this? I wasn't aware that ASAN could determine the size of an object from an interior pointer without a linear scan.

One complication is that you cannot assume that an arbitrary pointer points into the malloc heap. It could be inside a stack allocation, or the backing memory could have been allocated via a direct mmap system call.

So the snmalloc functionality overapproximates the size to the whole address range if it is not backed by an snmalloc allocation. So it will only report actual errors, but if it is allocated elsewhere then it will not report a problem.

fweimer-rh commented 2 years ago

Address Sanitizer already supports such object size checks and has an interface for them, but it does not provide ABI stability.

Interesting, so could you point me at the documentation or some use cases of this?

The use case is the ASAN crash reporter. But I was mistaken: there is no unified API to get the size information directly, the crash reporter in https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/asan/asan_descriptions.cpp open-codes it.

The regular overflow checking happens differently in ASAN (using shadow memory).

I wasn't aware that ASAN could determine the size of an object from an interior pointer without a linear scan.

The ASAN heap allocator has GetBlockBegin and GetActuallyAllocatedSize interfaces (see https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/sanitizer_common/sanitizer_allocator_combined.h). I think those are fairly efficient, but they will not work for non-heap memory (I think).

One complication is that you cannot assume that an arbitrary pointer points into the malloc heap. It could be inside a stack allocation, or the backing memory could have been allocated via a direct mmap system call.

So the snmalloc functionality overapproximates the size to the whole address range if it is not backed by an snmalloc allocation. So it will only report actual errors, but if it is allocated elsewhere then it will not report a problem.

Right, that should work for size checking, it's how __builtin_dynamic_object_size decays as well.

rengolin commented 2 years ago

IIUC, the idea behind the __builtin_instrumented_object_size case would be to provide an actual simple implementation that returns a reasonable answer without having to instrument the whole code. Sure, sanitizers can guess well where allocations happen, class sizes and when they're freed, but they do so by intercepting calls to malloc, free and other library calls, which is costly.

But the allocator already has that information on its internal structures. If the allocator provides an API to do that without additional shadow data structures, then runtime checks become very cheap.

WRT API stability, sanitizers don't even have that stability guarantee on the same release, across different architectures, or different OSs, different word sizes. Some things exist in one and not others, some things are reasonably fast in some and ridiculously slow in others.

In contrast, asking the allocator what is the size of an object is a table lookup in most cases, but of course, it can't know about other types of allocations like mmap or even a larger allocation to serve an internal user-driven allocator, or stack objects.

It could be possible (though reasonably pedestrian and frail) to add another call for non-snmalloc allocations so that snmalloc knows about them but not necessarily manages them. If users are willing to part the beaten path with builtin calls, then they could also be happy to manually comlement their mmap calls with snmalloc helpers with pointer and size.

None of that helps with the stack, but the stack has a more controlled temporal usage pattern. For the cases where you want to use the builtin, you could use the static builtin (on the same frame), or propagate the sizeof through some function parameter, etc.

davidchisnall commented 2 years ago

@rengolin, a few comments:

The new builtin will definitely be a dynamic check, but it should be forced inline and super simple (ex. load a value from a global table), with hopefully no branching but safe (returning -1 or 0 when not known).

I believe the idea is to have an API that determines the memory size. You could implement this as something like:

extern size_t malloc_object_size(void*);
__attribute__((always_inline))
static inline size_t object_size(void *obj)
{
  if (__builtin_dynamic_object_size(obj) != SIZE_T_MAX)
    return __builtin_dynamic_object_size(obj);
  return malloc_object_size(obj);
};

If the compiler can statically determine the size of the object then this compiles to the SSA register containing the size, otherwise it compiles to a call.

We get better code generation in the snmalloc versions by expressing this as an explicit bounds-check API. We'd probably want something like (ignoring error messages):

static inline void bounds_check(void *obj, size_t size)
{
  static_assert(__builtin_object_size(obj) < size, "Bounds check failed statically");
  if (__builtin_dynamic_object_size(obj) != SIZE_T_MAX)
  {
    if (__builtin_dynamic_object_size(obj) < size)
      __builtin_trap();
  }
  malloc_bounds_check(obj, size);
}

This won't actually work because __builtin_object_size is not a compile-time constant, but removing the static assert this gives the right codegen for cases where the function call is acceptable.

The builtin isn't used directly on malloc/free, but via __builtin___memcpy_chk and friends, which accept the additional size argument. This requires either user source modification, compiler instrumentation or a default "safe" malloc call and an additional unsafe_malloc as you suggested.

The current checks are also not used directly on malloc/free, they are used in memcpy and other libc functions.

Requiring user code changes is a non-starter, I think. The easiest is to change snmalloc to be safe0first, but that adds dynamic checks on every malloc/copy operation, which may be substantial.

We're exploring integrating snmalloc with libc in several contexts and I think requiring modifications to libc and libc headers is fine (in particular, the _chk wrappers, which are inlined).

If we want the compiler to add that kind of instrumentation on malloc/free calls, we'll have to change both object_size and _chk variants to select the instrumented variants with some kind of attribute/pragma like fortify source.

We don't want to add instrumentation on malloc or free, we want to add a tool that allows you to write wrappers around the accessor functions. The problem with this is that the check must be one of:

If you're doing static linking with LTO, the first one of these is fine. In general, it's too expensive. We've (mostly @mjp41) done a lot of work in the memcpy implementation to ensure that the inlined bounds checks are fast. Any implementation without that level of tight coupling would have worse performance. This doesn't matter for things like sprintf, where the cost of an extra call is in the noise with respect to other functions, but it does matter for anything that's currently fast.