alexashley commented 3 years ago

This PR adds automated migrations to update index mappings.

Originally we tried to implement migrations using a lock in order to prevent multiple instances of grafeas-elasticsearch from attempting to do migrations all at once. However, as we started to dive into error cases, we realized that would require some manual intervention if the app crashed or was terminated in the middle of a migration.

After that, we stumbled on this migration RFC for Kibana, which outlines a lockless migration that can happen in parallel with multiple instances. There's potentially some duplicate work, but the end result should be the same and shouldn't require manual intervention. We deviated some from what's outlined there since it's also concerned with data migrations, but it seems like a good template to follow if we want to add those in the future. We can also certainly go down the lock route as well, as long as we document what needs to be done if the migration is interrupted.

Some assumptions:

all indices that start with grafeas belong to grafeas-elasticsearch
all grafeas-elasticsearch indices have a version and it's specified in the index name (i.e., grafeas-$VERSION).
each index only has a single document kind/type (e.g., occurrences, notes, or projects)

Migration steps:

load the latest mappings for all document kinds
query all indices
if the index belongs to grafeas, determine its version by parsing the index name
if the version doesn't match the latest version for its document kind, add it to a list of indices to migrate
for each index that should be migrated:
- check if the index already has a write block, if not place the block on it (there's a read_only block as well, but that would prevent us from switching the alias name over).
- create the new index, continuing if it already exists
- start a reindex from the old to the new, configured so that we move documents over that aren't already in the new index
- poll the reindex task until its complete or we run out of attempts (right now that's set to 10 attempts with a 10 second sleep in between -- we may want to do some load testing to determine what the right number should be)
- delete the reindex task document
- cut the alias over to the new index
- delete the old index

We ran a few manual tests for this, the setup for those is in the migration-tests branch. The first tests runs a migration, the next starts multiple instances of grafeas-elasticsearch that will all attempt to migrate, and the final one fails the migration after every step to check that the process is reentrant. Potentially if we were to extract this into its own library, we could flesh those tests out more.

Lastly, some of this is too specific to grafeas-elasticsearch, but if we wanted to extract this out for use in Rode, then I think we could make the IndexManager implementation more generic and pass in a function that could be used to filter indices.

mrparkers commented 3 years ago

I'll have more time to review this later, but I wanted to address this assumption:

all indices that start with grafeas belong to grafeas-elasticsearch

This may not necessarily be true - and to account for this, we added a _meta index mapping that contains a type field that's set to grafeas:

https://github.com/rode/grafeas-elasticsearch/blob/d76bb845f4170dcffb00a4af8011eeaa45df5ff8/go/v1beta1/storage/elasticsearch.go#L879-L881

Would this be a better way to determine whether or not an index is owned by grafeas-elasticsearch?

codecov-io commented 3 years ago

Codecov Report

Merging #58 (d205794) into main (d76bb84) will increase coverage by 1.02%. The diff coverage is 92.38%.

@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
+ Coverage   86.64%   87.67%   +1.02%     
==========================================
  Files           4        8       +4     
  Lines         644      868     +224     
==========================================
+ Hits          558      761     +203     
- Misses         50       67      +17     
- Partials       36       40       +4

Impacted Files	Coverage Δ
go/v1beta1/storage/grafeas.go	`71.42% <0.00%> (-22.69%)`	:arrow_down:
go/v1beta1/storage/migration/util.go	`60.00% <60.00%> (ø)`
go/v1beta1/storage/migration/index.go	`88.69% <88.69%> (ø)`
go/v1beta1/storage/migration/migration.go	`94.91% <94.91%> (ø)`
go/v1beta1/storage/elasticsearch.go	`84.23% <98.36%> (+0.32%)`	:arrow_up:
go/v1beta1/storage/migration/orchestrator.go	`100.00% <100.00%> (ø)`
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d76bb84...d205794. Read the comment docs.

alexashley commented 3 years ago

Closed in favor of #59 and another follow up PR

rode / grafeas-elasticsearch

Add migrations for mapping changes #58

Codecov Report