mingtaoy commented 3 years ago

Sized Deallocation Support

High performance memory allocators such as jemalloc and TCMalloc provide a non-standard sdallocx API which improves performance on the deallocation path. For jemalloc, sized deallocation is approximately ~30% cheaper in terms of CPU compared to unsized deallocations.

I'm opening this issue to discuss adding sized deallocation support in OpenSSL. Sized deallocation is supported in BoringSSL, and I believe this would also be useful in OpenSSL as well.

Possible implementation strategies include:

Prepending the size of the allocation to each object. Each OPENSSL_malloc or OPENSSL_zalloc of N bytes is actually an allocation of N + sizeof(size_t) bytes. OPENSSL_free reads the length prefix and invokes sdallocx(p, *(size_t*)(p-1)) bytes. This is the approach taken by BoringSSL. The advantage is that this approach minimizes code changes (this can be done almost entirely in crypto/mem.c). The disadvantages include the additional wasted space overhead and potential complexity of dealing with non-aligned objects (when interfacing with assembly routines).
Changing OPENSSL_free(p) calls throughout the codebase to OPENSSL_free_sized(p, sizeof(``*p))* or OPENSSL_free_sized(p, p->len). This can be done with the assistance of a refactoring tool, or done incrementally, as there are diminishing returns. The hottest deallocations paths we measured in the Facebook fleet are primarily in SSL3_BUFFER (due to SSL_MODE_RELEASE_BUFFERS) and in WPACKET_close., so we would start with those.

After these changes to the internals of OpenSSL so that deallocation is accompanied with a potential size hint, we would need to wire up the sdallocx calls somehow.

You can use some linking tricks with weak symbols, and rely on ELF symbol interposition to find the sdallocx API. BoringSSL takes this approach. The advantage is that this allows sized allocation support to be detected at runtime transparently. The disadvantage is that not all platforms support this, and there are situations where this cannot work.
Explicitly having the user supply a sized deallocation function, via something like CRYPTO_set_mem_funcs. If a sized deallocator is not set, fall back to using the traditional unsized free. a. Changing the signature of CRYPTO_free_fn from void(void``, const char, int) to void(void``, size_t, const char*, int) or adding an additional argument to CRYPTO_set_mem_funcs would be ABI and API breaking. b. Consequently, I'm inclined to add a separate setter for a sized deallocation func (e.g. CRYPTO_set_mem_func_sized(...)). If no sized deallocation function is set, then existing behavior is preserved.
Or this can be purely a compile time option. However, all of the major Linux distributions use glibc's allocator, which does not support sdallocx. Since OpenSSL is typically packaged as a dynamically linked library compiled against the glibc headers, this would limit the usability of this change.

If this feature request is reasonable, is there a preferred implementation approach before we start sending PRs?

paulidale commented 3 years ago

Why shouldn't the high performance memory allocator maintain the size as a prefix (or elsewhere), if that path is so much faster? I.e. I don't see that this is something OpenSSL should do.

davidtgoldblatt commented 3 years ago

@mingtaoy asked me to comment here for some background as a jemalloc developer.

I think there's a few reasons why most modern perf-sensitive allocators employ out-of-band metadata

Most of the heap in most real programs is occupied by small allocations. Storing size as a prefix would require adding (at least) _Alignof(max_align_t) bytes to every allocation, increasing memory consumption. Out-of-band metadata lets these allocations share space.
Thread-local caching as a lock-contention-avoidance strategy has better cache locality during fills and flushes if the metadata it has to access is more centralized.

I think I agree with @paulidale that the suggested strategy 1 probably doesn't make sense; e.g. having OpenSSL redundantly track sizes would hit the same problems as above. Rather, I think using the sized deallocation APIs makes more sense in those cases where the size of the deallocation is known anyways to the application, which in practice tends to be most of the time (the only common exception being NULL-terminated C-strings). I think strategy 2 makes more sense; in the common case where the user is already tracking allocation sizes anyways, letting the implementation know is free on the application side, and helps the malloc side out quite a bit.

paulidale commented 3 years ago

OpenSSL uses sized deallocations when the data needs to be zeroed. There is no reason to not pass sizes to other deallocations (where they are known), but I still suspect it would be better for the memory allocation library to track this.

marksantaniello commented 3 years ago

It's worth mentioning that the size passed into CRYPTO_clear_free seems to be, in at least some cases, the /used/ portion of the buffer, and not the original allocated size. Passing that size directly into sdallocx() resulted in heap corruption in our experiments.

marksantaniello commented 3 years ago

I'd like to mention one other thing: AFAIK, the C-style sdallocx() is not standard, but for what it's worth, C++14 did standardize its moral equivalent "sized delete": http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3536.html

TL;DR: sized deallocation is an increasingly common feature of high-performance allocators.

richsalz commented 3 years ago

It's worth mentioning that the size passed into CRYPTO_clear_free seems to be, in at least some cases, the /used/ portion of the buffer, and not the original allocated size. Passing that size directly into sdallocx() resulted in heap corruption in our experiments.

The secure heap rounds things up to a power of two size, so if you ask for 14bytes you'll get 16. The actual_size function in crypto/mem_sec.c handles this.

marksantaniello commented 3 years ago

The secure heap rounds things up to a power of two size, so if you ask for 14bytes you'll get 16. The actual_size function in crypto/mem_sec.c handles this.

Oh I see. In that case it should be easy to integrate sized deallocation into CRYPTO_secure_free and CRYPTO_secure_clear_free. Thanks!

paulidale commented 3 years ago

Just a note: the secure heap always clears the memory: there is no difference between CRYPTO_secure_free and CRYPTO_secure_clear_free.

I'm also not sure how suitable the secure memory is for conversion to use sized deallocation. In the case where it is not used, the calls could go through to specialised library calls. Where it is being used, it manages memory itself and never frees anything.

nhorman commented 2 months ago

marking as inactive, to be closed at the end of the 3.4 dev cycle barring further input

marksantaniello commented 2 months ago

It might be worth reconsidering this now that we have standard support in the C language as of C23: https://en.cppreference.com/w/c/memory/free_sized

nhorman commented 2 months ago

openssl standardizes on C89, so any use of c23 features would have to at least be opt-in

openssl / openssl

Sized deallocation support #13787

Sized Deallocation Support