vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0
680 stars 57 forks source link

Support for multi-process, forked processes using shared memory #568

Closed gapisback closed 10 months ago

gapisback commented 1 year ago

This commit extends core shared memory support to now allow for a multi-process execution model, where multiple processes can now attach to Splinter shared memory. Core thread-specific concurrency primitives are modified, slightly, to now also support a multi-process execution model.

Collection of lower-level changes to move to this execution model:

Testing changes added:

netlify[bot] commented 1 year ago

Deploy Preview for splinterdb canceled.

Name Link
Latest commit 19e6ec0b1a5e2ea87a51737031e64a30e2a3dc41
Latest deploy log https://app.netlify.com/sites/splinterdb/deploys/654e9c6dd6a573000806663a
gapisback commented 1 year ago

NOTE: to the reviewers: @rtjohnso @ajhconway @rosenhouse :

This is part-2 of the shared memory support dev work: Support for multi-process, forked-processes executing with shared memory

I have rebased this on top of main, so the core shared memory support work (under SHA 42799b1d) is already included.

Here is the order to review files in, to get a good grip on this change-set:

  1. laio.h, laio.c - see my annotations
  2. Read / Write APIs tests in functional/io_apis_test.c have been enhanced to run in --fork-child execution mode, which exercises all this IO sub-system code in its full glory. I'm pretty comfortable overall with these changes.
  3. task.c - see my annotations
  4. shmem.c - In part-2, the isn't much real logic change , other than adding more tracing, testing hooks and better metrics. You can give this a pass or just a quick-look.
  5. Tests: Review or take a look at functional/io_apis_test.c, unit/splinterdb_forked_child_test.c.
  6. test.sh - Review the new tests being folded-in as part of normal CI runs.
gapisback commented 10 months ago

@rtjohnso - I am assigning this PR to you for review. All tests have passed in previous runs, but after the most recent cleanup change (to engage some new unit-tests), one splinter_test --perf --use-shmem job has failed as follows:

+ build/release/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512
build/release/bin/driver_test: splinterdb_build_version 4876ae59
Dispatch test splinter_test
[...]
splinter_test: SplinterDB performance test started with 1 tables
splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...
Thread 2 inserting  34% complete for table 0 ... OS-pid=1084, OS-tid=1085, Thread-ID=1, Assertion failed at src/routing_filter.c:184:routing_get_header(): "hdr_raw_addr != 0". 
./test.sh: line 110:  1084 Aborted                 "$@"

I recall having seen such failures previously, too, even without shared memory configured, but could not spot an identical open issue reporting this failure.

I am re-running this job, just to see if will succeed. I suspect there is a lurking issue that may randomly pop-up.

I suggest you do start on the review while I try to figure this out in the background.

gapisback commented 10 months ago

HI, @rtjohnso - Update on testing status: My re-run of this failed CI/gcc job now succeeded.

As suspected there is some flaky condition that seems to pop-up occasionally.

gapisback commented 10 months ago

@rtjohnso -- I have updated this change-set to address your last round of comments.

  1. This commit has the bulk of the fixes.

  2. This additional commit was needed to re-add a change I had withdrawn.

  3. This commit comments out the enablement of some new test case (done as part of this round of changes) which is tripping up in CI runs.

(In the test.sh where new tests were attempted to be engaged, I have recorded the signature of the trunk failure / assertion.)

I think there is some inherent instability in trunk management that is surfacing if you pump through shared memory with more than a few processes.

  1. In the final CI job, all tests passed, except this one debug-gcc-no978 -- which successfully completed. But recording the status seems to have hiccupped; some CI-infra issue. I have re-started this run in CI (no. 978.1), and expect it will pass ... watching ...

I think I have processed all your feedback and have amended the change set appropriately.

Kindly take a re-look.

rtjohnso commented 10 months ago

The latest code changes look good.

However, note my two comments on the void * accessor methods and the #include <unistd.h> stuff. Both of those need to be fixed, although the latter could occur later if it would cause difficulty merging the third multiprocess PR.

gapisback commented 10 months ago

@rtjohnso -- I've made the one code change you requested (see this commit), and peeled-off new issue #599 to platformize the use of getpid().

Let me know if you see need for any further changes.