Closed mikaylathompson closed 1 week ago
Attention: Patch coverage is 59.25926% with 33 lines in your changes missing coverage. Please review.
Project coverage is 80.79%. Comparing base (43d8b0a) to head (9320855). Report is 1 commits behind head on main.
Files with missing lines | Patch % | Lines |
---|---|---|
...ad/workcoordination/OpenSearchWorkCoordinator.java | 58.75% | 29 Missing and 4 partials :warning: |
Description
One component of sub-shard RFS is a mechanism to robustly create new work items from a given parent. This PR implements a function in the work coordinator that takes a parent work item id and a list of one or more successor item ids.
`createSuccessorItemsAndMarkComplete` is called with the current work item id and a list of successor item ids. It does the following:

1. Updates the current work item with the `successor_items` list, subject to the following criteria:
   a. the client is using the correct script version
   b. the worker writing still holds the lease
   c. the work item's current `successor_items` list is either not set or the same as the one being pushed.
2. Creates the successor work items with a `create` call that will only create the new items if their ids don't exist (e.g. it won't overwrite an item with the same id).
3. Marks the current work item as completed.

The checks built in make this function idempotent. If it fails at any point, it can be run again without creating any inconsistent or conflicting state. Indeed, one of the expected behaviors is that if another worker picks up a lease where step 1 and any portion of step 2 were completed, it will immediately re-run the function with the same parameters, which will attempt again to create the successors and complete the process.
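The idempotency argument can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the `SuccessorSketch` class, the `WorkItem` fields, and the exception types are illustrative stand-ins rather than the `OpenSearchWorkCoordinator` implementation, the script-version check is omitted, and the final "mark complete" step is inferred from the function's name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the coordinator; not the real API.
class SuccessorSketch {
    static final class WorkItem {
        String leaseHolder;           // worker holding the lease, or null if unassigned
        List<String> successorItems;  // null until the successor list has been recorded
        boolean completed;
    }

    private final Map<String, WorkItem> store = new HashMap<>();

    // Create-only semantics: an existing item with the same id is never overwritten.
    boolean createIfAbsent(String id, String leaseHolder) {
        if (store.containsKey(id)) {
            return false;
        }
        WorkItem item = new WorkItem();
        item.leaseHolder = leaseHolder;
        store.put(id, item);
        return true;
    }

    WorkItem get(String id) {
        return store.get(id);
    }

    void createSuccessorItemsAndMarkComplete(String workerId, String parentId, List<String> successors) {
        WorkItem parent = store.get(parentId);
        if (parent == null) {
            throw new IllegalArgumentException("unknown work item: " + parentId);
        }
        // The worker writing must still hold the lease.
        if (!workerId.equals(parent.leaseHolder)) {
            throw new IllegalStateException("lease not held by " + workerId);
        }
        // An existing successor list must match the one being pushed.
        if (parent.successorItems != null && !parent.successorItems.equals(successors)) {
            throw new IllegalStateException("conflicting successor_items for " + parentId);
        }
        // Record the successor list on the parent (a no-op on a re-run).
        parent.successorItems = new ArrayList<>(successors);
        // Create each successor only if its id is unused, so a re-run cannot clobber one.
        for (String id : successors) {
            createIfAbsent(id, null);
        }
        // Mark the parent completed.
        parent.completed = true;
    }

    public static void main(String[] args) {
        SuccessorSketch coordinator = new SuccessorSketch();
        coordinator.createIfAbsent("parent", "worker-1");
        List<String> successors = List.of("successor-1");
        // Running the call twice with the same arguments leaves the same state.
        coordinator.createSuccessorItemsAndMarkComplete("worker-1", "parent", successors);
        coordinator.createSuccessorItemsAndMarkComplete("worker-1", "parent", successors);
        System.out.println("parent completed: " + coordinator.get("parent").completed);
    }
}
```

In this model a second call with the same arguments finds the list already recorded and the successors already present, so it changes nothing, while a call with a different list is rejected before any write happens.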
If the first step (updating the item with the `successor_items` list) fails and the worker exits, the mechanism fails and the progress will be lost (TODO: I should have much more robust retries around this step).

It fits into the broader workflow as follows:

- [...] `successor_items`, jump to step 4.
- [...] `createSuccessorItemsAndMarkComplete` with a single-item list containing a work item that defines the current shard starting where this worker is finishing.
  a. This creates one new work item, and marks the current one as completed.

In the future, this workflow can be modified to split the remaining work into multiple shards instead of one, or to split before starting a given shard, instead of after (e.g. a set of workers first runs through and splits every shard into sub-shards, and then new workers actually tackle writing the documents).
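As a rough illustration of the single-item successor list, the sketch below builds one successor id for the remainder of a shard. The id layout (`<index>__<shard>__<startingDocId>`), the class name, and the method are assumptions made for illustration only; the PR text does not specify how successor ids are encoded. It also includes the self-successor check that the Testing section lists as a desired future addition, not as existing behavior.

```java
import java.util.List;

// Hypothetical helper; the id layout below is an assumption, not the PR's format.
class SuccessorIdSketch {
    // Builds the single-item successor list for the rest of the current shard.
    static List<String> singleSuccessor(String currentId, String index, int shard, long nextDocId) {
        String successorId = index + "__" + shard + "__" + nextDocId;
        // A work item should not name itself as its own successor.
        if (successorId.equals(currentId)) {
            throw new IllegalArgumentException("work item cannot be its own successor: " + currentId);
        }
        return List.of(successorId);
    }

    public static void main(String[] args) {
        // The worker started at doc 0 and is stopping before doc 5000.
        System.out.println(singleSuccessor("logs__0__0", "logs", 0, 5000L));
    }
}
```

Splitting the remaining work into several shards, as the last paragraph suggests, would amount to returning more than one id from a helper like this.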
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-2127
Testing
Tests are added that cover the basic behavior, the behavior under contention (40 simultaneous workers), and the error-recovery behavior (e.g. that the function can be run idempotently). There are also some tests of error-checking behavior (e.g. attempting to supply a new `successor_items` list).

I'd like to add a toxi-proxy test, a test (and check) to prevent adding oneself as a successor work item, and retries around setting the initial `successor_items` list.

Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.