hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

Zero-downtime re-indexing of annotations #106

Closed by chdorner 7 years ago

chdorner commented 7 years ago

With recent changes to the re-index code, we lost the ability to re-index all annotations without stopping writes to the index in the meantime. A full re-index currently takes around 2 hours, so suspending writes to the index for that long is a terrible user experience: users will think that their annotation failed to save, even though it was saved.

I've been thinking about ways to re-index without downtime, and after researching solutions on Friday and talking to @nickstenning, we came up with the following:

During a re-index:

  1. Reads will go to the old index
  2. Writes will go to both the old and the new index
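
The routing rule above could be sketched roughly as follows. This is only an illustration: the `DualWriteRouter` class is hypothetical, and plain dicts stand in for the Elasticsearch indices.

```python
class DualWriteRouter:
    """Hypothetical helper: reads hit the old index, writes hit both."""

    def __init__(self, old_index, new_index=None):
        self.old_index = old_index
        # new_index is only set while a re-index is in progress.
        self.new_index = new_index

    def write_targets(self):
        """Return every index a write must be applied to."""
        targets = [self.old_index]
        if self.new_index is not None:
            targets.append(self.new_index)
        return targets

    def index(self, annotation_id, body):
        """Write the annotation to all current write targets."""
        for target in self.write_targets():
            target[annotation_id] = body

    def get(self, annotation_id):
        """Reads always go to the old index during a re-index."""
        return self.old_index.get(annotation_id)


# Dicts stand in for the two indices (an assumption for this sketch).
old, new = {}, {}
router = DualWriteRouter(old, new)
router.index("G", {"text": "v1"})
```

Once the re-index has finished, reads and writes would presumably both be switched over to the new index.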

There are two operations that we need to be careful about: update and delete.

Problem with update:

  1. Re-indexer starts at annotation A
  2. Re-indexer loads data for annotations F, G, H
  3. User changes annotation G
  4. A background worker updates annotation G in the old index and creates it in the new index
  5. Re-indexer sends annotations F, G, and H to Elasticsearch, including annotation G with the out-of-date data
  6. The database and search index are out-of-sync 💥
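
The race can be reproduced in a few lines. This is a toy simulation, not the real re-indexer: dicts stand in for the database and the indices, and the string values are made up.

```python
# Toy state: the database and old index agree; the new index is empty.
database = {"F": "f-v1", "G": "g-v1", "H": "h-v1"}
old_index = dict(database)
new_index = {}

# Step 2: the re-indexer loads a batch from the database.
batch = {aid: database[aid] for aid in ("F", "G", "H")}

# Steps 3-4: the user edits G and a worker writes it to both indices.
database["G"] = "g-v2"
old_index["G"] = "g-v2"
new_index["G"] = "g-v2"

# Step 5: the re-indexer sends its (now stale) batch with a plain
# overwriting index operation.
new_index.update(batch)

# Step 6: the new index holds the old version of G.
out_of_sync = new_index["G"] != database["G"]
```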

Problem with delete:

  1. Re-indexer starts at annotation A
  2. Re-indexer loads data for annotations F, G, H
  3. User deletes annotation G
  4. A background worker deletes annotation G in the old and new index
  5. Re-indexer sends annotations F, G, and H to Elasticsearch, including the now-deleted annotation G
  6. The database and search index are out-of-sync 💥

The solution:

  1. Instead of deleting annotations from the index, we will mark them as deleted by writing the body {"deleted": true}.
  2. The re-indexer should only ever create annotations. By using op_type=create we can ensure that it never overwrites an annotation with stale data (or re-creates a deleted one).
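
A minimal sketch of how the two rules interact, using an in-memory stand-in rather than a real Elasticsearch client. `FakeIndex`, `worker_update`, `worker_delete`, and `reindex_batch` are hypothetical names; `FakeIndex.create` mimics the create-only behaviour of op_type=create, where Elasticsearch rejects the request if a document with that ID already exists.

```python
TOMBSTONE = {"deleted": True}


class FakeIndex:
    """In-memory stand-in for a search index (an assumption of this sketch)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, body):
        """Overwriting write, as used by the background workers."""
        self.docs[doc_id] = body

    def create(self, doc_id, body):
        """Create-only write; returns False if the document already exists."""
        if doc_id in self.docs:
            return False
        self.docs[doc_id] = body
        return True


def worker_update(index, doc_id, body):
    index.index(doc_id, body)


def worker_delete(index, doc_id):
    # Rule 1: write a tombstone instead of removing the document.
    index.index(doc_id, dict(TOMBSTONE))


def reindex_batch(index, batch):
    # Rule 2: the re-indexer only creates; existing documents win.
    return [doc_id for doc_id, body in batch.items()
            if not index.create(doc_id, body)]


new_index = FakeIndex()
batch = {"F": {"text": "f-v1"}, "G": {"text": "g-v1"}, "H": {"text": "h-v1"}}
worker_update(new_index, "G", {"text": "g-v2"})  # user edits G mid-batch
worker_delete(new_index, "H")                    # user deletes H mid-batch
skipped = reindex_batch(new_index, batch)        # stale G and H are skipped
```

With this approach, search queries would presumably also need to exclude documents marked as deleted.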

Most of these ideas are from: https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/


Done when:

judell commented 7 years ago

Thanks @chdorner. Re: https://blog.codecentric.de/en/2014/09/elasticsearch-zero-downtime-reindexing-problems-solutions/, what is the trigger, in our case, for a reindex?

chdorner commented 7 years ago

@judell so far we've had to do it whenever we changed the index mapping. But it's also a good idea to periodically re-index into a new index: when we delete documents from Elasticsearch, it doesn't actually remove them from its underlying Lucene shards, but just marks them as deleted.

nickstenning commented 7 years ago

We're in a good place here thanks to @chdorner's tireless efforts. We can now do an online reindex with no downtime in ~15m.

dwhly commented 7 years ago

That is completely amazing. Thank you!