gapisback commented 1 year ago

The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits.

Another driving factor for implementing free-fragment list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point. We would frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).

Design Overview

The main design changes to manage small-fragments are follows:

Managing memory allocation / free using `platform_memfrag{}` fragments

Allocation and free of memory is dealt with in terms of "memory fragments", a small structure that holds the memory->{addr, size}. All memory requests (as is being done previously) are aligned to the cacheline.
- Allocation: All clients of memory allocation have to "hand-in" an opaque platform_memfrag{} handle, which will be returned populated with the memory address, and more importantly, the size-of-the-fragment that was used to satisfy the memory request.
- Free: Clients now have to safely keep a handle to this returned platform_memfrag{}, and hand it back to the free() method. free() will rely "totally" on the size specified in this input fragment handle supplied. And the free'd memory fragment will be returned to the corresponding free-list bucket, if the fragment's size is one in a small set of free-fragments being tracked.
Upon free(), the freed-fragment is tracked in a few free-lists bucketed by size of the freed-fragment. For now, we support 4 buckets, size <= 64, <= 128, <= 256 & <= 512 bytes. (These sizes are sufficient for current benchmarking requirements.)

A free'd fragment is hung off of the corresponding list, threading the free-fragments using the fragment's memory itself.
New struct free_frag_hdr{} provides the threading structure. It tracks the current fragment's size and free_frag_next pointer. The 'size' provided to the free() call is is recorded as the free'd fragment's size.
Subsequently, a new alloc() request is 1st satisfied by searching the free-list corresponding to the memory request.

For example, a request from a client for 150 bytes will be rounded-up to a cacheline boundary, i.e. 192 bytes. The free-list for bucket 256 bytes will be searched to find the 1st free-fragment of the right size. If no free fragment is found in the target list, we then allocate a new fragment. The returned fragment will have a size of 256 (for an original request of 150 bytes).
An immediate consequence of this approach is that there is a small, but significant, change in the allocation, free APIs; i.e. TYPED_MALLOC(), TYPED_ARRAY_MALLOC() and TYPED_FLEXIBLE_STRUCT_MALLOC(), and their 'Z' equivalents, which return 0'ed out memory.
All existing clients of the various TYPED*() memory allocation calls have been updated to declare an on-stack platform_memfrag{} handle, which is passed back to platform_free().
In some places memory is allocated to initialize sub-systems and then torn down during deinit(). In a few places existing structures are extended to track an additional 'size' field. The size of the memory fragment allocated during init() is recorded here, and then used to invoke platform_free() as part of the deinit() method.
- An example is clockcache_init() where this kind of work to record the 'size' of the fragment is done and passed-down to clockcache_deinit(), where the memory fragment is then freed with the right 'size'.
This pattern is now to be seen in many such init()/deinit() methods of diff sub-systems; e.g. pcq_alloc(), pcq_free(), ...
Copious debug and platform asserts have been added in shmem alloc/free methods to cross-check to some extent illegal calls.

Cautionary Note

If the 'ptr' handed to platform_free() is not of type platform_memfrag{} , it is treated as a generic , and its sizeof() will be used as the 'size' of the fragment to free.

This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.)

Fingerprint Object Management

Managing memory for fingerprint arrays was particularly problematic.

This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.

The requirements from the new interfaces are handled as follows:

Added a new fingerprint{} object, struct fp_hdr{}, which embeds at its head a platform_memfrag{}. And few other short fields are added for tracking fingerprint memory mgmt gyrations.
Various accessor methods are added to manage memory for fingerprint arrays through this object. E.g.,
- fingerprint_init() - Will allocate required fingerprint for 'ntuples'.
- fingerprint_deinit() - Will dismantle object and free the memory
- fingerprint_start() - Returns start of fingerprint array's memory
- fingerprint_nth() - Returns n'th element of fingerprint
Packaging the handling of fingerprint array through this object and its interfaces helped greatly to stabilize the memory histrionics.

When SplinterDB is closed, shared memory dismantling routine will tag any large-fragments that are still found "in-use". This is percolated all the way back to splinterdb_close(), unmount() and to platform_heap_destory() as a failure $rc. Test will fail if they have left some un-freed large fragments. (Similar approach was considered to book-keep all small fragments used/freed, but due to some rounding errors, it cannot be a reliable check at this time. So hasn't been done.)

Test changes

Miscellaneous

Elaborate and illustrative tracing added to track memory mgmt done for fingerprint arrays, especially when they are bounced around queued / re-queued tasks. (Was a very problematic debugging issue.)
Extended tests to exercise core memory allocation / free APIs, and to exercise fingerprint object mgmt, and writable_buffer interfaces:
- platform_apis_test:
- splinter_shmem_test.c: Adds specific test-cases to verify that free-list mgmt is happening correctly.
Enhanced various diagnostics, asserts, tracing
Improved memory usage stats gathering and reporting
Added hooks to cross-check multiple-frees of fragments, and testing hooks to verify if a free'd fragment is relocated to the right free-list

netlify[bot] commented 1 year ago

Deploy Preview for splinterdb canceled.

Name	Link
Latest commit	9281c83f8d953edfaa6d8a338f97da32a53001ed
Latest deploy log	https://app.netlify.com/sites/splinterdb/deploys/65bb050c366a1500096d0cad

gapisback commented 1 year ago

NOTE: to the reviewers: @rtjohnso @ajhconway @rosenhouse :

This is part-3 of the shared memory support dev work.

I have layered this on top of the aguraajda/shmem-mp-support-Rev, so you will see 3 commits. ONLY review the diffs in the [3rd commit](Support free-fragment recycling in shared-segment. Add fingerprint ob…).

The other two previous commits are being reviewed under previous 2 PRs.

Here is the order to review files in, to get a good grip on this change-set:

platform.h, platform_inline.h: For changes in TYPED_*ALLOC interfaces. Introduction of platform_memfrag{}
util.h and util.c - See the changes to add struct fp_hdr{} and fingerprint object mgmt APIs. Some changes related to writable_buffers are also in these two files.
btree.h and btree.c - Initial exposure to finger print object changes
shmem.c - For the core of free-list management logic changes
Rest of the files where the interfaces changes are religiously plumbed through alloc / free actions.
Check test.sh to see extended test-coverage being brought-in.
Specifically enabling --use-shmem to functional filter_test.c was a BIG Help to stabilize fingerprint object mgmt
See also in unit: platform_apis_test.c, splinter_shmem_test.c, and few others that were enhanced.

gapisback commented 11 months ago

HI, @rtjohnso - Status update.

I have re-based this work on top of /main and have gone thru multiple commits to stabilize this work. However, it's not quite, yet, ready for review, for these reasons:

While finalizing these changes, I came across some calls to platform_free() which looked suspiciously wrong. They could result in unaccounted memory-usage being 'freed' causing, over time, a slow "memory leak".
Rather than fixing instances one-at-a-time, I started to implement some sanity checking to verify that the memory fragment being freed is, indeed, an allocated one and that the allocated fragment's size matches the size provided to free. Even early implementations of this sanity-checking tripped up quite easily.
I realized that in my previous implementation, I had not rototilled interfaces such as TYPED_MALLOC(), TYPED_ZALLOC and TYPED_ALIGNED_MALLOC(), TYPED_ALIGNED_ZALLOC to take the platform_memfrag * argument. (They were left to work in the current style, somewhat as a convenience.) I now think it's better to tighten all memory allocation and free interfaces to always take a platform_memfrag * argument. This will greatly simplify and tighten sanity and assertion checking during free.
~CI-jobs are jammed -- reasons are unknown to me. They are not getting unblocked.~ (Gabriel has resolved this ...) So, I now have ~no~ insight into overall stability of this work for the larger set of CI test-jobs. Most test runs have passed, except for one failure in one CI-job. Will investigate.

I'm off rest of this week of Thanksgiving. Will try to see if I can get some time off-line to work on (3) and finish-up (2) above. I will try to complete this work and bring something to review 1st week of Dec.

gapisback commented 8 months ago

@rtjohnso - The final part-3 shared memory support change-set is now ready for review.

The suggested order in which to review these diffs is:

src/platform_linux/platform.h
src/platform_linux/platform_inline.h
src/platform_linux/platform_types.h
src/util.h, src/util.c
src/platform_linux/shmem.h
src/platform_linux/shmem.c
src/trunk.h, src/trunk.c
src/routing_filter.c
Then the rest of the files.
Good luck!

rtjohnso commented 8 months ago

I think the current memfrag interface is leaky and not general.

I think the interface should look like this:

platform_status
platform_alloc(memfrag *mf, // OUT
               int size);

platform_status
platform_realloc(memfrag *mf, // IN/OUT
                 int newsize);

platform_status
platform_free(memfrag *mf); // IN

void *
memfrag_get_pointer(memfrag *mf);

(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.)

The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road.

As for names, I would advocate renaming memfrag to memory_allocation.

gapisback commented 8 months ago

HI, @rtjohnso --

Thanks for your initial approach on reworking the interfaces.

I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction.

I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times.

How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces.

(As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.)

gapisback commented 8 months ago

@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set.

You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616 )

gapisback commented 8 months ago

@rtjohnso : Fyi -- I want to log this one ASAN-instability the most recent round of CI-jobs ran into, as I am not going to remember all this later.

Here is the state of affairs and results of my investigations.

CI Job no. 109 (main-pr-asan) job failed with this error:

 build/release-asan/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

build/release-asan/bin/driver_test: splinterdb_build_version 9281c83f

Dispatch test splinter_test

Attempt to create shared segment of size 8589934592 bytes.

Created shared memory of size 8589934592 bytes (8 GiB), shmid=8617984.

Completed setup of shared memory of size 8589934592 bytes (8 GiB), shmaddr=0x7f6924570000, shmid=8617984, available memory = 8589894272 bytes (~7.99 GiB).

filter-index-size: 256 is too small, setting to 512

Running splinter_test with 1 caches

splinter_test: SplinterDB performance test started with 1 tables

splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...

Thread 2 inserting  37% complete for table 0 ... =================================================================

==2666==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f68fba16f80 at pc 0x7f6b276fef50 bp 0x7f69044aa080 sp 0x7f69044a9828

READ of size 589 at 0x7f68fba16f80 thread T1

Thread 2 inserting  42% complete for table 0 ... OS-pid=2666, OS-tid=2669, Thread-ID=3, Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

./test.sh: line 115:  2666 Aborted                 "$@"

make: *** [Makefile:558: run-tests] Error 134

Upon re-run this asan job no. 109.1 succeeded.
Attempted to manually re-run this specific test multiple times on my Nimbus-VM,but could not reproduce the ASAN heap-buffer-overflow error. Ran the exact test with different combinations (one run, 4 concurrent runs with the exact same params, 4 concurrent executions with increasing thread-count up to --num-insert-threads 8 --num-lookup-threads 8, and similar stress load on the VM), but could not repro the problem outside CI.

The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)

./driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 8 --num-lookup-threads 8 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts.

NOTE: In the original failure in CI, hard to tell exactly, but it seems like the thread ID 2 ran into the ASAN memory over flow and soon after, thread ID=3 ran into this assertion a few lines later:

OS-pid=2666, OS-tid=2669, Thread-ID=3,  Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion.

I suspect that there is something lurking there that popped up in the CI-run.

I cannot explain how / whether / if this assertion tripping is caused by the ASAN heap-buffer-overflow error or if they are even related. Unfortunately, I could not repro the ASAN issue outside CI, so have to give up on this investigation now.

The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier.

gapisback commented 8 months ago

@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself to get lots of comments as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, then that will reduce a bunch of rework rounds on me.

Let's get the new apis sorted and then I can review the whole PR.

Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.)

Once I go over all the changes, I will be better able to answer this question of yours:

Or is there anything else major?

... for which the answer now is, I don't think so, off-hand.

rtjohnso commented 8 months ago

I left a few comments on the fingerprint array code already.

I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together.

rtjohnso commented 8 months ago

I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code.

So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer.

vmware / splinterdb

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Design Overview

Managing memory allocation / free using `platform_memfrag{}` fragments

Cautionary Note

Fingerprint Object Management

Test changes

Deploy Preview for splinterdb canceled.

vmware / splinterdb

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Design Overview

Managing memory allocation / free using platform_memfrag{} fragments

Cautionary Note

Fingerprint Object Management

Test changes

✅ Deploy Preview for splinterdb canceled.

Managing memory allocation / free using `platform_memfrag{}` fragments

Deploy Preview for splinterdb canceled.