vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0
684 stars 57 forks source link

Support free-fragment recycling in shared-segment. Add fingerprint object management. #569

Open gapisback opened 1 year ago

gapisback commented 1 year ago

The main change with this commit is the support for free-fragment lists and recycling of small fragments from shared memory. This was a main limitation of the support added in previous commits.

Another driving factor for implementing free-fragment list support was that previous multi-user concurrent insert performance benchmarking was not functional beyond a point. We would frequently run into shmem Out-Of-Memory (OOMs), even with shmem sizes > 8 GiB (which worked in a prior dev/perf-test cycle).

Design Overview

The main design changes to manage small-fragments are follows:

Managing memory allocation / free using platform_memfrag{} fragments

Cautionary Note

If the 'ptr' handed to platform_free() is not of type platform_memfrag{} , it is treated as a generic , and its sizeof() will be used as the 'size' of the fragment to free.

This works in most cases. Except for some lapsed cases where, when allocating a structure, the allocator ended up selecting a "larger" fragment that just happened to be available in the free-list. The consequence is that we might end-up free'ing a larger fragment to a smaller-sized free-list. Or, even if we do free it to the right-sized bucket, we still end-up marking the free-fragment's size as smaller that what it really is. Over time, this may add up to a small memory leak, but hasn't been found to be crippling in current runs. (There is definitely no issue here with over-writing memory due to incorrect sizes.)

Fingerprint Object Management

Managing memory for fingerprint arrays was particularly problematic.

This was the case even in a previous commit, before the introduction of the memfrag{} approach. Managing fingerprint memory was found to be especially cantankerous due to the way filter-building and compaction tasks are queued and asynchronously processed by some other thread / process.

The requirements from the new interfaces are handled as follows:

Test changes

Miscellaneous

netlify[bot] commented 1 year ago

Deploy Preview for splinterdb canceled.

Name Link
Latest commit 9281c83f8d953edfaa6d8a338f97da32a53001ed
Latest deploy log https://app.netlify.com/sites/splinterdb/deploys/65bb050c366a1500096d0cad
gapisback commented 1 year ago

NOTE: to the reviewers: @rtjohnso @ajhconway @rosenhouse :

This is part-3 of the shared memory support dev work.

I have layered this on top of the aguraajda/shmem-mp-support-Rev, so you will see 3 commits. ONLY review the diffs in the [3rd commit](Support free-fragment recycling in shared-segment. Add fingerprint ob…).

The other two previous commits are being reviewed under previous 2 PRs.

Here is the order to review files in, to get a good grip on this change-set:

  1. platform.h, platform_inline.h: For changes in TYPED_*ALLOC interfaces. Introduction of platform_memfrag{}
  2. util.h and util.c - See the changes to add struct fp_hdr{} and fingerprint object mgmt APIs. Some changes related to writable_buffers are also in these two files.
  3. btree.h and btree.c - Initial exposure to finger print object changes
  4. shmem.c - For the core of free-list management logic changes
  5. Rest of the files where the interfaces changes are religiously plumbed through alloc / free actions.
  6. Check test.sh to see extended test-coverage being brought-in.
  7. Specifically enabling --use-shmem to functional filter_test.c was a BIG Help to stabilize fingerprint object mgmt
  8. See also in unit: platform_apis_test.c, splinter_shmem_test.c, and few others that were enhanced.
gapisback commented 11 months ago

HI, @rtjohnso - Status update.

I have re-based this work on top of /main and have gone thru multiple commits to stabilize this work. However, it's not quite, yet, ready for review, for these reasons:

  1. While finalizing these changes, I came across some calls to platform_free() which looked suspiciously wrong. They could result in unaccounted memory-usage being 'freed' causing, over time, a slow "memory leak".

  2. Rather than fixing instances one-at-a-time, I started to implement some sanity checking to verify that the memory fragment being freed is, indeed, an allocated one and that the allocated fragment's size matches the size provided to free. Even early implementations of this sanity-checking tripped up quite easily.

  3. I realized that in my previous implementation, I had not rototilled interfaces such as TYPED_MALLOC(), TYPED_ZALLOC and TYPED_ALIGNED_MALLOC(), TYPED_ALIGNED_ZALLOC to take the platform_memfrag * argument. (They were left to work in the current style, somewhat as a convenience.) I now think it's better to tighten all memory allocation and free interfaces to always take a platform_memfrag * argument. This will greatly simplify and tighten sanity and assertion checking during free.

  4. ~CI-jobs are jammed -- reasons are unknown to me. They are not getting unblocked.~ (Gabriel has resolved this ...) So, I now have ~no~ insight into overall stability of this work for the larger set of CI test-jobs. Most test runs have passed, except for one failure in one CI-job. Will investigate.

I'm off rest of this week of Thanksgiving. Will try to see if I can get some time off-line to work on (3) and finish-up (2) above. I will try to complete this work and bring something to review 1st week of Dec.

gapisback commented 8 months ago

@rtjohnso - The final part-3 shared memory support change-set is now ready for review.

The suggested order in which to review these diffs is:

  1. src/platform_linux/platform.h
  2. src/platform_linux/platform_inline.h
  3. src/platform_linux/platform_types.h
  4. src/util.h, src/util.c
  5. src/platform_linux/shmem.h
  6. src/platform_linux/shmem.c
  7. src/trunk.h, src/trunk.c
  8. src/routing_filter.c
  9. Then the rest of the files.
  10. Good luck!
rtjohnso commented 8 months ago

I think the current memfrag interface is leaky and not general.

I think the interface should look like this:

platform_status
platform_alloc(memfrag *mf, // OUT
               int size);

platform_status
platform_realloc(memfrag *mf, // IN/OUT
                 int newsize);

platform_status
platform_free(memfrag *mf); // IN

void *
memfrag_get_pointer(memfrag *mf);

(Note that details, like the exact names of the functions or the memfrag datatype are not too important in this example.)

The point is that the rest of the code should treat memfrags as opaque objects. In the current code, the rest of the code goes around pulling out fields and saving them for later use. It means that internal details of the current allocator implementation are being leaked all over the rest of the code. This will make it difficult to change the allocator implementation down the road.

As for names, I would advocate renaming memfrag to memory_allocation.

gapisback commented 8 months ago

HI, @rtjohnso --

Thanks for your initial approach on reworking the interfaces.

I'm happy to take this further, but I feel this round-trip discussion will become long and meandering. And this review panel UI exchange is not ideally suited for that kind of interaction.

I want to avoid re-doing the implementation till we've settled on and agreed to the new interfaces. Every bit of code rework requires massively editing the change-set and re-stabilizing - an effort I would like to avoid doing multiple times.

How about I start a new thread under Discussions tab, with your initial proposal? And, will give you my responses, rebuttal. I suspect we will have to go back-and-forth a few times before settling on the final interfaces.

(As a team, we haven't used the Discussions tab feature internally. As I am beginning my transition to fully out-of-VMware, it may be a good opportunity to engage using this GitHub feature, so it continues even when I'm a fully O-Sourced' engineer.)

gapisback commented 8 months ago

@rtjohnso - My CI-stabilization jobs have succeeded. I have squashed all changes arising from our proposal discussion thread into this one single commit and have refreshed this change-set.

You can restart your review on this amended change-set. (I expect CI-jobs will succeed as they did in the stabilization PR #616 )

gapisback commented 8 months ago

@rtjohnso : Fyi -- I want to log this one ASAN-instability the most recent round of CI-jobs ran into, as I am not going to remember all this later.

Here is the state of affairs and results of my investigations.

  1. CI Job no. 109 (main-pr-asan) job failed with this error:
 build/release-asan/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

build/release-asan/bin/driver_test: splinterdb_build_version 9281c83f

Dispatch test splinter_test

Attempt to create shared segment of size 8589934592 bytes.

Created shared memory of size 8589934592 bytes (8 GiB), shmid=8617984.

Completed setup of shared memory of size 8589934592 bytes (8 GiB), shmaddr=0x7f6924570000, shmid=8617984, available memory = 8589894272 bytes (~7.99 GiB).

filter-index-size: 256 is too small, setting to 512

Running splinter_test with 1 caches

splinter_test: SplinterDB performance test started with 1 tables

splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...

Thread 2 inserting  37% complete for table 0 ... =================================================================

==2666==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f68fba16f80 at pc 0x7f6b276fef50 bp 0x7f69044aa080 sp 0x7f69044a9828

READ of size 589 at 0x7f68fba16f80 thread T1

Thread 2 inserting  42% complete for table 0 ... OS-pid=2666, OS-tid=2669, Thread-ID=3, Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

./test.sh: line 115:  2666 Aborted                 "$@"

make: *** [Makefile:558: run-tests] Error 134
  1. Upon re-run this asan job no. 109.1 succeeded.

  2. Attempted to manually re-run this specific test multiple times on my Nimbus-VM,but could not reproduce the ASAN heap-buffer-overflow error. Ran the exact test with different combinations (one run, 4 concurrent runs with the exact same params, 4 concurrent executions with increasing thread-count up to --num-insert-threads 8 --num-lookup-threads 8, and similar stress load on the VM), but could not repro the problem outside CI.

The last variation of this test in manual repro attempts I tried is 4 concurrent invocations of this test: (Logging this here so I can refer to this later on.)

./driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 8 --num-lookup-threads 8 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512

The VM has 16 vCPUs, so I figured by running with 8 insert-threads and 4 concurrent instances, we'd load the CPU high-enough to tickle any bugs out. But the ASAN problem did not recur in these manual repro attempts.


NOTE: In the original failure in CI, hard to tell exactly, but it seems like the thread ID 2 ran into the ASAN memory over flow and soon after, thread ID=3 ran into this assertion a few lines later:

OS-pid=2666, OS-tid=2669, Thread-ID=3,  Assertion failed at src/trunk.c:2213:trunk_get_new_bundle(): "(node->hdr->end_bundle != node->hdr->start_bundle)". No available bundles in trunk node. page disk_addr=1513291776, end_bundle=3, start_bundle=3

You may recall that I had reported issue #474 some time ago for this trunk bundle mgmt assertion.

I suspect that there is something lurking there that popped up in the CI-run.

I cannot explain how / whether / if this assertion tripping is caused by the ASAN heap-buffer-overflow error or if they are even related. Unfortunately, I could not repro the ASAN issue outside CI, so have to give up on this investigation now.

The rest of the test runs are stable, and this ASAN-job did succeed on a re-run. I have re-reviewed the code-diffs applied recently and could not find anything obviously broken. For now, I will have to conclude that the changes are fine except there may be some hidden instability popping up, possibly triggered by issue #474 mentioned earlier.

gapisback commented 8 months ago

@rtjohnso - I've gone thru your review comments quickly. Most of those are easily implementable. I will get to it.

I've mostly just gone through the headers in the platform code, plus the fingerprint array api.

I am curious about your review of the fingerprint array API rework. Did you not find any issues with that? I was bracing myself to get lots of comments as this area is fragile and the rework is a bit tricky. If you think this array API is acceptable, then that will reduce a bunch of rework rounds on me.

Let's get the new apis sorted and then I can review the whole PR.

Let me apply the changes requested and then re-test. (CI-re-test stabilization will be a nightmare starting tomorrow.)

Once I go over all the changes, I will be better able to answer this question of yours:

Or is there anything else major?

... for which the answer now is, I don't think so, off-hand.

rtjohnso commented 8 months ago

I left a few comments on the fingerprint array code already.

I haven't done a full evaluation. It seemed more complex than I expected, but I see that it is trying to make explicit some of the complex sharing that goes on with the fingerprint arrays, which is a goal I like. I will want to do a more thorough review of how it is used to understand how it all fits together.

rtjohnso commented 8 months ago

I spoke with Alex today about the overall design, and he really doesn't like how the whole concept of memfrags puts a burden on the rest of the code.

So let's do the following. Whenever the shm code allocates memory, it allocates one extra cache line in front, and stores the memfrag on that cacheline. Later, during a free, you use pointer arithmetic to find the memfrag for that pointer.