Closed gapisback closed 10 months ago
Name | Link |
---|---|
Latest commit | 19e6ec0b1a5e2ea87a51737031e64a30e2a3dc41 |
Latest deploy log | https://app.netlify.com/sites/splinterdb/deploys/654e9c6dd6a573000806663a |
NOTE: to the reviewers: @rtjohnso @ajhconway @rosenhouse :
This is part-2 of the shared memory support dev work: Support for multi-process, forked-processes executing with shared memory
I have rebased this on top of main, so the core shared memory support work (under SHA 42799b1d) is already included.
Here is the order to review files in, to get a good grip on this change-set:
1. laio.h, laio.c - see my annotations. functional/io_apis_test.c has been enhanced to run in --fork-child execution mode, which exercises all this IO sub-system code in its full glory. I'm pretty comfortable overall with these changes.
2. task.c - see my annotations.
3. shmem.c - In part-2, there isn't much real logic change, other than adding more tracing, testing hooks and better metrics. You can give this a pass or just a quick look.
4. functional/io_apis_test.c, unit/splinterdb_forked_child_test.c.
5. test.sh - Review the new tests being folded-in as part of normal CI runs.

@rtjohnso - I am assigning this PR to you for review. All tests have passed in previous runs, but after the most recent cleanup change (to engage some new unit-tests), one splinter_test --perf --use-shmem job has failed as follows:
+ build/release/bin/driver_test splinter_test --perf --use-shmem --max-async-inflight 0 --num-insert-threads 4 --num-lookup-threads 4 --num-range-lookup-threads 0 --tree-size-gib 2 --cache-capacity-mib 512
build/release/bin/driver_test: splinterdb_build_version 4876ae59
Dispatch test splinter_test
[...]
splinter_test: SplinterDB performance test started with 1 tables
splinter_perf_inserts() starting num_insert_threads=4, num_threads=4, num_inserts=27185152 (~27 million) ...
Thread 2 inserting 34% complete for table 0 ... OS-pid=1084, OS-tid=1085, Thread-ID=1, Assertion failed at src/routing_filter.c:184:routing_get_header(): "hdr_raw_addr != 0".
./test.sh: line 110: 1084 Aborted "$@"
I recall having seen such failures previously, too, even without shared memory configured, but could not spot an identical open issue reporting this failure.
I am re-running this job, just to see if it will succeed. I suspect there is a lurking issue that may randomly pop up.
I suggest you do start on the review while I try to figure this out in the background.
Hi, @rtjohnso - Update on testing status: My re-run of this failed CI/gcc job has now succeeded.
As suspected there is some flaky condition that seems to pop-up occasionally.
@rtjohnso -- I have updated this change-set to address your last round of comments.
This commit has the bulk of the fixes.
This additional commit was needed to re-add a change I had withdrawn.
This commit comments out the enablement of some new test case (done as part of this round of changes) which is tripping up in CI runs.
(In the test.sh where new tests were attempted to be engaged, I have recorded the signature of the trunk failure / assertion.)
I think there is some inherent instability in trunk management that is surfacing if you pump through shared memory with more than a few processes.
I think I have processed all your feedback and have amended the change set appropriately.
Kindly take a re-look.
The latest code changes look good.
However, note my two comments on the void * accessor methods and the #include <unistd.h> stuff. Both of those need to be fixed, although the latter could occur later if it would cause difficulty merging the third multiprocess PR.
@rtjohnso -- I've made the one code change you requested (see this commit), and peeled-off new issue #599 to platformize the use of getpid().
Let me know if you see need for any further changes.
This commit extends core shared memory support to now allow for a multi-process execution model, where multiple processes can now attach to Splinter shared memory. Core thread-specific concurrency primitives are modified, slightly, to now also support a multi-process execution model.
This commit sets the stage to support fork()'ed or other OS-processes running with the --use-shmem option, where each process will [in future] masquerade as a Splinter thread. A core change needed to move to that execution model is to support thread-specific IO-context structures. Otherwise, if another OS-process tries to do IO using the AIO-context established by the main thread (i.e. by the process that started up SplinterDB), we will immediately run into hard IO-system call errors.
This commit:
An alternative could be to localize this change in behaviour (setting up thread-specific IO-context structs) to only when the process-model of execution comes around. That execution model requires configuring SplinterDB with shared-memory support. But, just by looking at --use-shmem (or the corresponding config setting), we cannot be sure whether the process-model will be used or whether we are just re-running the rest of the test suites with the shared segment enabled. So, rather than further complicating this choice, with this commit we will always set up thread-specific AIO-context structures whenever a shared memory configuration is detected.
Collection of lower-level changes to move to this execution model:
platform_buffer_init(), which mmap()s memory for the buffer cache, will now use MAP_SHARED (vs. MAP_PRIVATE). The issue is that some structures, e.g. the buffer cache, are allocated using mmap(). The flags for this were MAP_PRIVATE, which means this memory is only accessible to the main process that set up Splinter. All child processes work on a COW-version of this mapped memory. So the changes done by a child process to the BTree in the buffer cache are not visible to the parent process.
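The visibility difference described above can be reproduced outside SplinterDB with a few lines of POSIX code. The following is a minimal sketch, not SplinterDB's platform_buffer_init(); the function name value_seen_by_parent_after_child_write() is hypothetical and exists only for this illustration. It maps one int anonymously, lets a forked child write to it, and reports what the parent reads afterwards.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of SplinterDB): map a single int with the
 * given sharing flag, fork a child that stores 42 into it, and return the
 * value the parent reads once the child has exited.
 * Returns -1 if mmap(), fork() or waitpid() fails.
 */
int
value_seen_by_parent_after_child_write(int sharing_flag)
{
   assert(sharing_flag == MAP_SHARED || sharing_flag == MAP_PRIVATE);

   int *cell = mmap(NULL, sizeof(*cell), PROT_READ | PROT_WRITE,
                    sharing_flag | MAP_ANONYMOUS, -1, 0);
   if (cell == MAP_FAILED) {
      return -1;
   }
   *cell = 0;

   pid_t pid = fork();
   if (pid < 0) {
      munmap(cell, sizeof(*cell));
      return -1;
   }
   if (pid == 0) {
      /* Child: with MAP_PRIVATE this write lands on a private COW page. */
      *cell = 42;
      _exit(0);
   }

   /* Parent: wait for the child, then look at the mapping. */
   int status = 0;
   int seen   = -1;
   if (waitpid(pid, &status, 0) == pid) {
      seen = *cell;
   }
   munmap(cell, sizeof(*cell));
   return seen;
}
```

Called with MAP_SHARED, the parent reads the child's 42; called with MAP_PRIVATE, the parent still reads its own 0, which is the behaviour that made child-side buffer cache updates invisible.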
Convert synchronization primitives to be shared across processes.
This commit reworks core synchronization APIs to use interfaces that allow the sync-hook across child processes. This affects:
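As general background rather than a description of SplinterDB's own code, one standard POSIX way to make a lock usable across processes is to place a pthread mutex in shared memory and mark it PTHREAD_PROCESS_SHARED. The sketch below assumes that approach; the type shared_counter_sketch and both functions are hypothetical names invented for this example, and on older glibc versions it may need to be linked with -pthread.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical example type: a counter guarded by a process-shared mutex. */
typedef struct {
   pthread_mutex_t lock;
   long            counter;
} shared_counter_sketch;

/* Allocate the counter in MAP_SHARED memory and init a pshared mutex. */
shared_counter_sketch *
shared_counter_create(void)
{
   shared_counter_sketch *sc =
      mmap(NULL, sizeof(*sc), PROT_READ | PROT_WRITE,
           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
   if (sc == MAP_FAILED) {
      return NULL;
   }
   pthread_mutexattr_t attr;
   if (pthread_mutexattr_init(&attr) != 0) {
      munmap(sc, sizeof(*sc));
      return NULL;
   }
   int rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
   if (rc == 0) {
      rc = pthread_mutex_init(&sc->lock, &attr);
   }
   pthread_mutexattr_destroy(&attr);
   if (rc != 0) {
      munmap(sc, sizeof(*sc));
      return NULL;
   }
   sc->counter = 0;
   return sc;
}

/*
 * Fork nchildren processes that each bump the counter under the lock.
 * Returns the final count, or -1 if setup, fork() or a child failed.
 */
long
count_with_forked_children(int nchildren, long increments_per_child)
{
   shared_counter_sketch *sc = shared_counter_create();
   if (sc == NULL) {
      return -1;
   }
   int forked = 0;
   int failed = 0;
   for (int i = 0; i < nchildren; i++) {
      pid_t pid = fork();
      if (pid < 0) {
         failed = 1;
         break;
      }
      if (pid == 0) {
         for (long j = 0; j < increments_per_child; j++) {
            pthread_mutex_lock(&sc->lock);
            sc->counter++;
            pthread_mutex_unlock(&sc->lock);
         }
         _exit(0);
      }
      forked++;
   }
   /* Reap every child that was actually started. */
   for (int i = 0; i < forked; i++) {
      int status = 0;
      if (wait(&status) < 0 || !WIFEXITED(status)
          || WEXITSTATUS(status) != 0) {
         failed = 1;
      }
   }
   long total = failed ? -1 : sc->counter;
   pthread_mutex_destroy(&sc->lock);
   munmap(sc, sizeof(*sc));
   return total;
}
```

With a default (process-private) mutex attribute the same program has undefined behaviour across fork(), which is why the attribute and the MAP_SHARED placement both matter.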
Now that we have thread-specific IO-context setup, as part of thread register / deregister, we now also do io_register_thread(), io_deregister_thread(). This is basically book-keeping state of thread w.r.t IO setup & context.
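To make the book-keeping idea concrete, here is a small sketch of a per-thread IO-context table. It is an assumption-laden illustration, not the code in laio.c: every name carries a _sketch suffix to mark it as hypothetical, the table size is arbitrary, and the real AIO context handle is replaced by a comment. The only point it shows is that a slot is claimed on register, released on deregister, and handed out only to the process that set it up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_THREADS_SKETCH 64 /* arbitrary size for this illustration */

/* Hypothetical per-thread IO state; a real one would hold an AIO context. */
typedef struct {
   bool  in_use;
   pid_t owner_pid; /* process that set up this slot's IO context */
} io_thread_ctx_sketch;

typedef struct {
   io_thread_ctx_sketch ctx[MAX_THREADS_SKETCH];
} io_handle_sketch;

/* Claim the slot for thread 'tid'; fails if out of range or already taken. */
bool
io_register_thread_sketch(io_handle_sketch *io, size_t tid)
{
   if (tid >= MAX_THREADS_SKETCH || io->ctx[tid].in_use) {
      return false;
   }
   io->ctx[tid].in_use    = true;
   io->ctx[tid].owner_pid = getpid();
   /* A real implementation would create the thread's AIO context here. */
   return true;
}

/* Release the slot for thread 'tid'; a no-op if it was never registered. */
void
io_deregister_thread_sketch(io_handle_sketch *io, size_t tid)
{
   if (tid < MAX_THREADS_SKETCH && io->ctx[tid].in_use) {
      /* A real implementation would tear down the AIO context here. */
      io->ctx[tid].in_use = false;
   }
}

/*
 * Return the caller's IO context, or NULL if 'tid' is not registered or
 * the slot was set up by a different process than the one asking.
 */
io_thread_ctx_sketch *
io_get_thread_ctx_sketch(io_handle_sketch *io, size_t tid)
{
   if (tid >= MAX_THREADS_SKETCH || !io->ctx[tid].in_use
       || io->ctx[tid].owner_pid != getpid()) {
      return NULL;
   }
   return &io->ctx[tid];
}
```

Recording the owning pid is one way, under the assumptions above, to catch the case where a forked process tries to reuse an IO context that its parent established.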
Testing changes added:
Support --fork-child to test execution options. Some new tests will honor this argument, and will exercise activity using a forked-process execution model.
New test splinterdb_forked_child_test added: This covers the cases to show that IO errors could be repro'ed when running Splinter activity from a forked child process. Many other cases are added to this framework to exercise different cases of forked process doing SplinterDB activity. Much code/dev stabilization was achieved through this single new test.
Add case test_seq_key_seq_values_inserts_forked to large_inserts_stress test.
Extend the existing functional io_apis_test to run with the --fork-child option, thereby creating the scenario(s) of forked processes exercising the basic IO APIs.
Add new & extended tests to test.sh, for extended coverage using shared-memory and multi-process execution.
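The forked-process tests listed above share a common shape: do the work in a child, then have the parent check how the child exited. A minimal sketch of that pattern follows, assuming nothing about the actual test harness; run_in_forked_child() and the two sample work functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Work to run in the child; returns 0 on success, non-zero on failure. */
typedef int (*child_work_fn)(void *arg);

/*
 * Hypothetical helper: run work(arg) in a forked child.
 * Returns 0 if the child reported success, 1 if it reported failure,
 * and -1 if fork()/waitpid() failed or the child died abnormally
 * (for example, on a failed assertion that calls abort()).
 */
int
run_in_forked_child(child_work_fn work, void *arg)
{
   pid_t pid = fork();
   if (pid < 0) {
      return -1;
   }
   if (pid == 0) {
      /* Child: use _exit() so the parent's stdio buffers are not flushed. */
      _exit(work(arg) == 0 ? 0 : 1);
   }
   int status = 0;
   if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
      return -1;
   }
   return WEXITSTATUS(status);
}

/* Sample work functions for the illustration. */
int
work_ok_sketch(void *arg)
{
   (void)arg;
   return 0;
}

int
work_fail_sketch(void *arg)
{
   (void)arg;
   return 7;
}
```

Separating "child reported failure" from "child died abnormally" keeps an assertion failure inside the child, like the one quoted earlier in this thread, distinguishable from an ordinary test failure.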
Add support for --wait-for-gdb and wait_for_gdb_hook() function.
To debug forked child processes, add support for new command-line flag: --wait-for-gdb . And add a looping function where we can set a breakpoint, wait_for_gdb_hook(). Use this facility in splinterdb_forked_child_test.c, which has helped debug errors seen while running test_multiple_forked_process_doing_IOs().
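A looping hook of this kind is typically just a poll on a flag that the debugger clears. The sketch below is an assumption about the general technique, not the body of wait_for_gdb_hook(); the names wait_for_gdb_hook_sketch() and gdb_keep_waiting_sketch are hypothetical, and the max_polls bound is added here only so the example cannot hang.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical flag: clear it from the debugger after attaching, e.g.
 *   (gdb) set var gdb_keep_waiting_sketch = 0
 * 'volatile' forces the loop to re-read it on every iteration.
 */
volatile int gdb_keep_waiting_sketch = 1;

/*
 * Hypothetical hook: print this process's pid, then sleep in one-second
 * steps until the flag is cleared or max_polls steps have elapsed.
 * Returns the number of one-second polls that were performed.
 */
unsigned
wait_for_gdb_hook_sketch(unsigned max_polls)
{
   unsigned polls = 0;
   fprintf(stderr,
           "pid %d waiting for debugger to attach\n",
           (int)getpid());
   while (gdb_keep_waiting_sketch && polls < max_polls) {
      sleep(1); /* a convenient place for a breakpoint */
      polls++;
   }
   return polls;
}
```

In use, the forked child would call the hook early, and the developer would attach with gdb -p using the printed pid, set a breakpoint, clear the flag, and continue.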