opensearch-project / opensearch-migrations

Migrate, upgrade, compare, and replicate OpenSearch clusters with ease.
https://aws.amazon.com/solutions/implementations/migration-assistant-for-amazon-opensearch-service/
Apache License 2.0
39 stars 28 forks source link

Create RFS successor work items #1104

Closed mikaylathompson closed 1 week ago

mikaylathompson commented 4 weeks ago

Description

One component of sub-shard RFS is creating a mechanism to robustly create new work items from a given parent. This PR implements a function in the work coordinator to take a parent work item id, and a list of one or more successor item ids.

createSuccessorItemsAndMarkComplete is called with the current work item id, and a list of successor item ids. It does the following:

  1. format the list of successor item ids into a comma separated list and update the current work item's successor_items list, subject to the following criteria: a. client is using the correct script version b. the worker writing still holds the lease c. the work item's current successor_items list is either not set or the same as the one being pushed.
  2. Send a bulk request to create each of the successor items as a new work item. This uses a create call that will only create the new items if their ids don't exist (e.g. it won't overwrite an item with the same id).
  3. If the following step succeeds with only 201 (created) or 409 (already exists) error codes, mark the current item as completed.

The checks built in make this function idempotent. If it fails at any point, it can be run again without creating any inconsistent or conflicting state. Indeed, one of the expected behaviors is that if another work picks up a lease where 1 and any portion of 2 were completed, it will immediately re-run the function with the same parameters and this will attempt again to create the successors and complete the process.

If the first step (updating the item with the successor_items list) fails and the worker exits, the mechanism fails and the progress will be lost (TODO: I should have much more robust retries around this step).

It fits into the broader workflow as follows:

  1. A work item is claimed, as currently done.
  2. If it already has successor_items, jump to step 4.
  3. Write documents until the lease has nearly expired or a SIGTERM signal is received.
  4. Call createSuccessorItemsAndMarkComplete with a single item list containing a work item that defines the current shard starting where this worker is finishing. a. This creates one new work item, and marks the current one as completed.

In the future, this workflow can be modified to split the remaining work into multiple shards instead of one, or to split before starting a given shard, instead of after (e.g. a set of workers first runs through and splits every shard into sub-shards, and then new workers actually tackle writing the documents).

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-2127

Testing

Tests are added that test the basic behavior, the behavior in a contentious situation (40 simultaneous workers), and the error-recovery behavior (e.g. that the function can be run idempotently). There are also some tests of error-checking behavior (e.g. attempting to supply a new successor_items list.

I'd like to add a toxi-proxy test, a test (and check) to prevent adding one's self as a successor work item, and retries around setting the initial successor_items list.

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 59.25926% with 33 lines in your changes missing coverage. Please review.

Project coverage is 80.79%. Comparing base (43d8b0a) to head (9320855). Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...ad/workcoordination/OpenSearchWorkCoordinator.java 58.75% 29 Missing and 4 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1104 +/- ## ============================================ + Coverage 80.75% 80.79% +0.04% - Complexity 2925 2928 +3 ============================================ Files 399 399 Lines 14845 14845 Branches 1007 1007 ============================================ + Hits 11988 11994 +6 Misses 2248 2248 + Partials 609 603 -6 ``` | [Flag](https://app.codecov.io/gh/opensearch-project/opensearch-migrations/pull/1104/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | Coverage Δ | | |---|---|---| | [gradle-test](https://app.codecov.io/gh/opensearch-project/opensearch-migrations/pull/1104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | `78.82% <59.25%> (+0.04%)` | :arrow_up: | | [python-test](https://app.codecov.io/gh/opensearch-project/opensearch-migrations/pull/1104/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | `89.93% <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.