Zero-downtime re-indexing of annotations

chdorner commented 7 years ago

With recent changes to the re-index code we lost the ability to re-index all annotations without stopping writes to the index for the duration. A full re-index currently takes around 2 hours, so suspending writes to the index is a terrible user experience: users will think that their annotation failed to save, even though it did not.

I've been thinking about ways to re-index without downtime and after researching solutions on Friday and talking to @nickstenning we came up with the following:

During a re-index:

Reads will go to the old index
Writes will go to both the old and the new index
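
A minimal sketch of that routing rule, to make the intent concrete. The `IndexTargets` class and the index names are hypothetical and only for illustration; they are not existing code in h.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IndexTargets:
    """Hypothetical helper: which index names to use for reads and writes."""

    current: str                          # index that serves search traffic today
    reindex_target: Optional[str] = None  # set only while a re-index is running

    def read_index(self) -> str:
        # Reads always go to the old (current) index.
        return self.current

    def write_indices(self) -> List[str]:
        # Writes go to the old index and, during a re-index, also to the new one.
        if self.reindex_target is None:
            return [self.current]
        return [self.current, self.reindex_target]
```

Outside of a re-index `write_indices()` returns only the current index, so the normal write path stays unchanged.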

There are two operations that we need to be careful about: update and delete.

Problem with update:

Re-indexer starts at annotation A
Re-indexer loads data for annotations F, G, H
User changes annotation G
Background worker updates annotation G in the old index and creates it in the new index
Re-indexer sends annotation F, G, and H to Elasticsearch, including annotation G with the out-of-date data
The database and search index are out-of-sync 💥

Problem with delete:

Re-indexer starts at annotation A
Re-indexer loads data for annotations F, G, H
User deletes annotation G
Background worker deletes annotation G in the old and the new index
Re-indexer sends annotation F, G, and H to Elasticsearch, including the now-deleted annotation G
The database and search index are out-of-sync 💥

The solution:

We will stop deleting annotations from the index, but rather mark them as deleted by writing the body {"deleted": true}.
The re-indexer should only ever create annotations, never overwrite them. By using op_type=create we ensure that the re-indexer never overwrites an annotation with stale data (or re-creates a deleted one).
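
To illustrate why these two rules together close both races, here is a small simulation. It uses an in-memory stand-in for the new index rather than a real Elasticsearch client; `FakeIndex`, `ConflictError` and `reindex_batch` are made-up names for this sketch, and the annotation bodies are placeholders.

```python
class ConflictError(Exception):
    """Stands in for the conflict Elasticsearch reports when op_type=create hits an existing id."""


class FakeIndex:
    """In-memory stand-in for a search index, keyed by annotation id."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, body):
        # Normal write path used by the background worker: create or overwrite.
        self.docs[doc_id] = body

    def create(self, doc_id, body):
        # op_type=create semantics: fail if the id already exists.
        if doc_id in self.docs:
            raise ConflictError(doc_id)
        self.docs[doc_id] = body


def reindex_batch(index, batch):
    """Re-indexer write path: create only, and skip ids that already exist."""
    skipped = []
    for doc_id, body in batch:
        try:
            index.create(doc_id, body)
        except ConflictError:
            skipped.append(doc_id)
    return skipped


new_index = FakeIndex()

# The re-indexer has already loaded F, G and H from the database.
stale_batch = [("F", {"text": "f"}), ("G", {"text": "old g"}), ("H", {"text": "h"})]

# Meanwhile the background worker handles an edit to G and a delete of H.
new_index.index("G", {"text": "new g"})
new_index.index("H", {"deleted": True})  # tombstone instead of a real delete

# The stale batch arrives afterwards; G and H are skipped, F is created.
skipped = reindex_batch(new_index, stale_batch)
```

After the batch is applied, G still holds the user's latest edit and H is still a tombstone, because both ids already existed when the re-indexer tried to create them.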

Most of these ideas are from: https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/

Done when:

[x] All search queries are filtering out deleted annotations (hypothesis/h#4242).
[x] memex.search.index.delete marks annotations as deleted (hypothesis/h#4242).
[x] Re-indexer uses op_type=create and makes sure that errors are handled (op_type=create related errors should be ignored, hypothesis/h#4245).
[x] Re-indexer stores new index name in a new database table (key/value setting table) (hypothesis/h#4243, hypothesis/h#4249).
[x] h.indexer.add_annotation will write to both indices when new index name setting is configured (hypothesis/h#4250)
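
For the error-handling item, a rough sketch of how the re-indexer could separate expected conflicts from real failures when reading a bulk response. It assumes the documented shape of the Elasticsearch bulk response (an `items` list of single-key dicts, each with a `status`); the function name and the sample response are hypothetical, and this is not the code from hypothesis/h#4245.

```python
def unexpected_bulk_failures(bulk_response):
    """Return failed bulk items, ignoring 409 conflicts on create actions."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict such as {"create": {...}}.
        for action, result in item.items():
            status = result.get("status", 200)
            if action == "create" and status == 409:
                # The document already exists in the new index, which is
                # exactly what op_type=create is meant to protect.
                continue
            if status >= 400:
                failures.append(result)
    return failures


# Hypothetical response for a batch of three annotations.
sample_response = {
    "errors": True,
    "items": [
        {"create": {"_id": "F", "status": 201}},
        {"create": {"_id": "G", "status": 409}},
        {"create": {"_id": "H", "status": 503}},
    ],
}

failures = unexpected_bulk_failures(sample_response)
```

Keying on the status code rather than on the error message keeps the check independent of how a given Elasticsearch version words the conflict error.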

judell commented 7 years ago

Thanks @chdorner. Re: https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/, what is the trigger, in our case, for a reindex?

chdorner commented 7 years ago

@judell so far we've had to do it whenever we changed the index mapping. But it's a good idea to periodically re-index into a new index anyway, because when we delete documents from Elasticsearch it doesn't actually remove them from its own Lucene shards, it just marks them as deleted.

nickstenning commented 7 years ago

We're in a good place here thanks to @chdorner's tireless efforts. We can now do an online reindex with no downtime in ~15m.

dwhly commented 7 years ago

That is completely amazing. Thank you!

hypothesis / product-backlog

Zero-downtime re-indexing of annotations #106