vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0

Core changes to support running Splinter with allocated shared memory. #567

Closed gapisback closed 1 year ago

gapisback commented 1 year ago

Support for running SplinterDB with shared memory configured for most memory allocation is an -EXPERIMENTAL- feature added with this commit.

This commit brings in basic support to create a shared memory segment and to redirect all memory-allocation primitives to shared memory. Currently, we support only a simplistic memory-management scheme; i.e., only allocations, no frees. With shared segments of 1-2 GiB we can run many functional and unit tests.

The high points of the changes are:

netlify[bot] commented 1 year ago

Deploy Preview for splinterdb canceled.

Latest commit: 2c4e5331ae054e0a6c34bab15e7804c78f78548f
Latest deploy log: https://app.netlify.com/sites/splinterdb/deploys/64ee8f331924190008a61e11
gapisback commented 1 year ago

Note to the reviewers: @rtjohnso @ajhconway @rosenhouse :

The entire shared-memory support is being parceled out as a 3-part change-set. This is the 1st part of the whole work.

  1. Core shared memory support
  2. Support for multi-process, forked-processes executing with shared memory
  3. Support for free-fragment recycling, improved shared memory management.

I am working on putting together the PRs for (2) and (3).

Here is the order to review files in, to get a good grip on this change-set:

  1. platform.h, platform.c
  2. shmem.h, shmem.c: platform_shmcreate(), platform_shmdestroy(), platform_shm_alloc(), platform_shm_free()
  3. Misc other header files
  4. Review test.sh to gauge the scope of changed and enhanced testing brought in with this work.
gapisback commented 1 year ago

Hi, @rtjohnso - I have finally completed the work to address all (well, most) of your review comments, and have also rebased against the latest /main.

Here's the overall status of this update cycle:

  1. A couple of rebases against /main went through smoothly; i.e., there were no conflicts, so I don't think there was any major code churn due to this work. There was one small merge issue in trunk.c which I found and fixed during test cycles.
  2. Removed most usages of platform_heap_handle from shmem.c and from many external interfaces. There are some vestigial references that will need to be cleaned up in the future.
  3. The splinterdb_create() const issue has been fixed. A new test has been added to exercise these gymnastics.
  4. The writable_buffer_resize() interface change has been corrected.
  5. Other minor cleanup, and fixes for a few bugs that you spotted.

There are a few things I didn't take on, as they would simply balloon this change-set, and they are not super-critical (I think). Some examples:

  1. Merging the parsing of --use-shmem into config_parse(), and using only that interface to parse this argument.
  2. Some other code cleanup ... that I don't recall now.

I expect the CI runs to complete ... there IS a lot of new testing going on, so I won't be surprised if there are still some failures. I'll watch and stabilize them.

This updated PR is now ready for a final review. Hopefully we can close out the loose threads and try to land this largish piece of work.

gapisback commented 1 year ago

@rtjohnso - I have addressed the last remaining comment that we agreed, during our discussion yesterday, needs to be fixed.

Please see this commit, and let me know if this version of the fix is less unacceptable to you.

The new unit-test case added in this commit tripped up in a CI-job, in non-shared-memory execution. I patched an assertion under this commit.

I think it's fine to have this white-box type of unit-test case to clearly understand the gyrations of shared memory allocation. If you strongly object to this tweaking, I can just delete this specific assertion check in the affected test-case, unit/writable_buffer_test.c, but that would not be my 1st choice:

    // Currently, reallocation from shared-memory will not reuse the existing
    // memory fragment, even if there is some room in it for the append. (That's
    // an optimization which needs additional memory fragment-size info which
    // is currently not available to the allocator.)
    if (data->use_shmem) {
       const void *new_data2 = writable_buffer_data(wb);
       ASSERT_TRUE((new_data != new_data2));
    }

For the rest -- the ball is in your court. I await any other changes you'd like to see or other things you'd like to discuss. Thanks.

gapisback commented 1 year ago

@rtjohnso - I've addressed your last batch of comments that needed code changes.

Only a couple, to do with config-parsing, were not addressed -- I've left my reasons in the responses on the appropriate files.

The one thing remaining is: What would you like the random value to define SPLINTERDB_SHMEM_MAGIC to be?

I could not think of anything better than what's already in the code. In any case, it's only for diagnostics, so does it really matter that the value be more random than what it's currently defined to be?

gapisback commented 1 year ago

Summary comments:

gapisback commented 1 year ago

Hi, @rtjohnso -- I investigated the issues around padding for cache-alignment that you raised in this review comment and in this comment.

The affected code chunk in /main is this one, and it has been in the code base since the initial commit that moved this code into GitHub.

static inline void *
platform_aligned_malloc(const platform_heap_id UNUSED_PARAM(heap_id),
                        const size_t           alignment, // IN
   ...
   const size_t padding = (alignment - (size % alignment)) % alignment;
   return aligned_alloc(alignment, size + padding);

I looked at the semantics of aligned_alloc(), and here is what I found in the GNU libc docs (Aligned-Memory-Blocks.html):

The aligned_alloc function allocates a block of size bytes whose address is a multiple of alignment. The alignment must be a power of two and size must be a multiple of alignment. 

Linux man page also documents something similar:

The function aligned_alloc() is the same as memalign(), except for the added restriction that size should be a multiple of alignment. 

I think this answers a question you raised in a previous comment; i.e. "Alignment is a request on the position of the start of the returned allocation. Why would we need to pad the size?"

Based on what the GNU and Linux docs state, I don't think it would be right (as you suggested in our discussion) to delete the padding code as it exists today for the non-shared allocation interface.

Technically, I could skip this padding logic in platform_aligned_malloc() for the shared-memory allocator and instead handle the alignment inside platform_shm_alloc(). But I do think it's a reasonable choice to keep both schemes symmetric in this regard.

Given that this is the documented behaviour of aligned_alloc(), do you still see the need to rework the shared-memory based allocator, which currently also inherits this padding-up of the requested size?


In any case, I did try to actually re-work platform_shm_alloc() to receive an alignment argument and then to do the alignment of shm->shm_next inside platform_shm_alloc(). It will need a bit of pointer-arithmetic / bit-masking and boundary condition handling to get this logic right.

But before embarking on finalizing that code rework, I wanted to raise this semantics issue vis-a-vis aligned_alloc() to see if it prompts a rethink on your side.

As mentioned earlier, having to rework this now will make it a bit more tricky to merge with the upcoming memfrag{}-based memory allocation / free interfaces (in part-3).

If you still do think that this padding business must be reworked for shared-memory based allocation, let me suggest that we revisit that once part-3 lands, at which time this whole memory alloc / free picture will become clearer.

--

Let me know your thoughts based on this new information.

gapisback commented 1 year ago

Hi, @rtjohnso -- Barring the one remaining issue of how to handle requests for aligned memory (see my previous response), here is the status of the remaining rework based on our discussion this past week.

All items you raised as needing changes have been addressed, with either code-rework, test-changes or extended enablement of test executions via test.sh.

The main changes arising in the last round of updates, addressing your review comments, are summarized in this entry:

test.sh has a new driver function, run_tests_with_shared_memory(), which invokes the following sub-tests with the --use-shmem option.

[1] Unit tests: unit_test "--use-shmem" and run_slower_unit_tests "--use-shmem" invoke these unit-tests.

cc-sdb-vm:[69] $ build/release/bin/unit_test --list

List of test suites that can be run with shared-memory configured:

  task_system                   - Yes
  splinterdb_stress             - Yes
  splinterdb_quick              - Yes
  splinterdb_heap_id_mgmt       - Yes
  splinter_shmem                - Yes
  limitations                   - Yes
  btree                         - Yes
  btree_stress                  - Yes
  writable_buffer               - Yes
  util                          - No
  platform_api                  - No
  misc                          - No
  splinter                      - Yes

 - unit/config_parse_test.c : Not relevant to feature
 - unit/misc_test.c : Not relevant to feature
 - unit/platform_apis_test.c: Does not add value to convert this test;
                              platform_heap_create() with use_shmem is
                              tested in many other tests.
 - unit/util_test.c: Not relevant to feature

[2] Functional tests that can be run with shared-memory configured

See calls to run_splinter_perf_tests(), run_btree_tests() and run_other_driver_tests() in test.sh

 - splinter_test : Yes; Both --functionality and --perf
 - btree_test    : Yes; Both basic and --perf
 - cache_test    : Yes
 - log_test      : Yes
 - filter_test   : Yes