mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

evaluate new ES-ILM backup / redundancy strategy #235

Closed rahulbot closed 9 months ago

rahulbot commented 9 months ago

With the ILM ES index architecture, @philbudne raised a question about reconsidering our redundancy approach. We now know that we can restore 2-3 months of data from WARC files in ~2 days. What if we roll over via ILM to a new index every 2 months, and immediately back up the rolled-over index off-site? Then if we crash, restoration is ~2 days of downloading the backed-up indexes and recreating the latest (un-backed-up) index from WARC files. I think this is an acceptable downtime, and we can always later add some kind of "hot" duplicate of the latest index if we want. The task here is to consider how to design and implement this, whether it would really work, and to make sure it is a good idea.

Related to #157, #231, #54.
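
For reference, a minimal sketch of what the rollover condition for this proposal could look like as an ILM policy body (the thresholds and structure below are assumptions for illustration, not the project's actual settings):

```python
# Hypothetical ILM policy body matching the "roll over every ~2 months" idea.
# Values are illustrative placeholders, not the production configuration.
ILM_POLICY_BODY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "60d",                   # ~2 month cadence
                        "max_primary_shard_size": "50gb",   # size-based safety cap
                    }
                }
            }
        }
    }
}
```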

rahulbot commented 9 months ago

Notes from mtg: sounds doable via an API call (to test), perhaps use 90 days as rough time limit, validate how easy/hard it is to restore an archived index, double-check shard size spec, make sure changes for this don't require re-indexing
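
One quick way to double-check the shard-size spec mentioned above could be the `_cat/shards` API; a rough sketch (the endpoint URL and the `mc_search-*` index pattern are assumptions):

```python
import requests

ES_URL = "http://localhost:9200"  # assumed ES endpoint

# List shard sizes for the search indices to sanity-check the
# shard-size rollover condition.
resp = requests.get(
    f"{ES_URL}/_cat/shards/mc_search-*",
    params={"format": "json", "bytes": "gb", "h": "index,shard,prirep,store"},
    timeout=30,
)
resp.raise_for_status()
for shard in resp.json():
    if shard["prirep"] == "p":  # primaries only
        print(shard["index"], "shard", shard["shard"], shard["store"], "GB")
```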

thepsalmist commented 9 months ago

We can do the ILM policy update via the ILM put-policy API: `PUT _ilm/policy/<mc_ILM_policy_id>`.

When the policy is updated, the changes won't take effect on our current index mc_search-0001; they would only apply from mc_search-0002 onward. So we'll have to let the current index roll over as per the current ILM rollover definitions as documented here
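
A minimal sketch of that policy update over the HTTP API, assuming a local endpoint and using placeholder policy id and thresholds:

```python
import requests

ES_URL = "http://localhost:9200"   # assumed ES endpoint
POLICY_ID = "mc_ILM_policy_id"     # placeholder for the real policy id

# Updated rollover conditions (illustrative values only).
policy_body = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "90d", "max_primary_shard_size": "50gb"}
                }
            }
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/{POLICY_ID}", json=policy_body, timeout=30)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```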

As per the screenshot below, our current max shard size is 17.5GB, so we should anticipate a rollover once we have ingested roughly triple our current data (which should happen sooner than the alternate rollover trigger of 365 days).
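
To watch when the rollover condition actually trips, the `_ilm/explain` API reports the current phase and action per index; a rough sketch (endpoint and index pattern assumed):

```python
import requests

ES_URL = "http://localhost:9200"  # assumed ES endpoint

# Ask ILM which phase/action each mc_search index is in and how old it is.
resp = requests.get(f"{ES_URL}/mc_search-*/_ilm/explain", timeout=30)
resp.raise_for_status()
for name, info in resp.json()["indices"].items():
    print(name, info.get("phase"), info.get("action"), info.get("age"))
```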

[Screenshot 2024-02-21 17:02: current shard sizes]

rahulbot commented 9 months ago

My proposal for setting up backups is:

  1. after the first ILM rollover (early-March?) we take an image of that newly-frozen index and back it up off-site (S3?)
  2. in March we work on automating that process so the next rollover automatically takes a binary snapshot of the index and backs it up off-site
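
A possible sketch of step 2 using the snapshot APIs (repository name, bucket, and index/snapshot names are placeholders; assumes the `repository-s3` plugin is installed and AWS credentials are in the ES keystore):

```python
import requests

ES_URL = "http://localhost:9200"   # assumed ES endpoint
REPO = "mc_s3_backups"             # hypothetical snapshot repository name
BUCKET = "mediacloud-es-backups"   # hypothetical S3 bucket

# 1. Register an S3 snapshot repository (one-time setup).
requests.put(
    f"{ES_URL}/_snapshot/{REPO}",
    json={"type": "s3", "settings": {"bucket": BUCKET}},
    timeout=30,
).raise_for_status()

# 2. Snapshot just the rolled-over index (index/snapshot names are placeholders).
requests.put(
    f"{ES_URL}/_snapshot/{REPO}/mc_search-0001-snap",
    params={"wait_for_completion": "false"},
    json={"indices": "mc_search-0001", "include_global_state": False},
    timeout=30,
).raise_for_status()
```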

Closing this; I'll open separate issues to capture these two tasks, to be done at different times.

This supports a two-pronged overall strategy for catastrophic index failure recovery:

  1. restore the previously rolled-over, backed-up indexes from off-site storage
  2. re-create the latest (un-backed-up) index from WARC files
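
For the first prong, the recovery path would presumably be the snapshot restore API; a rough sketch with placeholder names (the latest index would still be rebuilt from WARC files by the existing pipeline):

```python
import requests

ES_URL = "http://localhost:9200"   # assumed ES endpoint
REPO = "mc_s3_backups"             # hypothetical snapshot repository name

# Restore one archived index from the off-site repository. The target index
# must not already exist as an open index in the cluster.
resp = requests.post(
    f"{ES_URL}/_snapshot/{REPO}/mc_search-0001-snap/_restore",
    json={"indices": "mc_search-0001"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```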