Validate backup strategies

pgulley commented 4 months ago

We just want to observe that the backups exist the way we expect

kilemensi commented 4 months ago

Off the top of my head, our Elasticsearch backup strategy:

We're using ES incremental snapshots.

Current policy: Backups to S3 every 2 weeks

We have verified that the snapshots are indeed in S3 ✅
Should we test a full restore from S3 snapshots (or have we done it already)?

Future policy: Backup to B2 every 2 weeks

Manually take the first snapshot & verify snapshot files are indeed in B2 ✅
Switch active SLM repository from S3 to B2 (July 15?)
Verify that SLM correctly sends snapshots to B2
Should we test a full restore from B2 snapshots?
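One way to make the "verify SLM sends snapshots" step repeatable is to inspect the body returned by Elasticsearch's Get Snapshot API (`GET _snapshot/<repository>/_all`). Below is a minimal Python sketch, not our actual tooling: the function name `check_snapshots`, the 14-day threshold (taken from the 2-week policy above), and the sample snapshot names and timestamps are all assumptions for illustration. It only checks that snapshots are reported as successful and recent; it does not prove that they can be restored.

```python
from datetime import datetime, timedelta, timezone


def check_snapshots(response, now, max_age=timedelta(days=14)):
    """Return a list of problems found in a Get Snapshot API response body."""
    snapshots = response.get("snapshots", [])
    if not snapshots:
        return ["repository contains no snapshots"]

    problems = []
    for snap in snapshots:
        # Completed snapshots report SUCCESS; anything else needs a closer look.
        if snap["state"] != "SUCCESS":
            problems.append(f"{snap['snapshot']}: state is {snap['state']}")

    finished = [s["end_time_in_millis"] for s in snapshots if s["state"] == "SUCCESS"]
    if not finished:
        problems.append("no successful snapshot found")
        return problems

    # Compare the newest successful snapshot against the backup interval.
    newest = datetime.fromtimestamp(max(finished) / 1000, tz=timezone.utc)
    if now - newest > max_age:
        problems.append(f"newest successful snapshot is older than {max_age.days} days")
    return problems


# Hypothetical response body, trimmed to the fields used above.
sample = {
    "snapshots": [
        {"snapshot": "example-snap-1", "state": "SUCCESS", "end_time_in_millis": 1719792000000},
        {"snapshot": "example-snap-2", "state": "FAILED", "end_time_in_millis": 1721001600000},
    ]
}
report = check_snapshots(sample, now=datetime(2024, 7, 20, tzinfo=timezone.utc))
```

With the hypothetical sample, `report` flags both the failed snapshot and the stale last success, which is roughly the situation we would want an alert for.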

cc @thepsalmist for any additional context/correction.

thepsalmist commented 4 months ago

Created the B2 repository mediacloud-elasticsearch-snapshot and made the first manual backup to B2.

pgulley commented 4 months ago

B2 backups failed and needed to be retried for this most recent period; eventually the upload succeeded (per @kilemensi).

pgulley commented 3 months ago

Let's run a test restore of the ILM, using the B2 backups. The next step for Paige is to look at the cost of this; if it is exorbitant, we can restore from S3 into an E2.

thepsalmist commented 3 months ago

Elasticsearch's Restore API allows us to perform various restorations from a snapshot, including restoring a single index or restoring an entire cluster.

Restoring an index - We validated restoring a single index on the Staging ES instance. Even though the snapshot is taken for all the indices, we can restore a single index from it individually. To avoid deleting existing data, the restoration involved renaming the restored index.
Restoring an entire cluster - We should be able to restore the entire cluster from the snapshots in case of a catastrophic failure. We can either restore the existing cluster in place or restore our snapshots to a different cluster. For our validation strategy, only restoring from the snapshots to a different cluster is practical.
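For reference, a single-index restore with a rename could look roughly like the sketch below. It only builds the request for the Restore API (`POST /_snapshot/<repository>/<snapshot>/_restore`) and does not send it. The helper name `build_restore_request`, the `restored-` prefix, and the snapshot name `example-snapshot` are hypothetical, and using `mediacloud-elasticsearch-snapshot` as the repository name is an assumption based on the comment above. `rename_pattern` and `rename_replacement` are the Restore API parameters that make the restored index land under a new name instead of replacing the existing one.

```python
import json


def build_restore_request(repository, snapshot, index, prefix="restored-"):
    """Build the path and JSON body for a single-index restore with a rename."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Restore only this index, even if the snapshot holds all indices.
        "indices": index,
        # Leave cluster-wide state (templates, settings) untouched.
        "include_global_state": False,
        # Capture the full index name and prepend the prefix to it.
        "rename_pattern": "(.+)",
        "rename_replacement": f"{prefix}$1",
    }
    return path, body


path, body = build_restore_request(
    "mediacloud-elasticsearch-snapshot",  # assumed repository name
    "example-snapshot",                   # hypothetical snapshot name
    "mc_search-000003",
)
print("POST", path)
print(json.dumps(body, indent=2))
```

Under these assumptions, the restored copy would be created as `restored-mc_search-000003`, leaving the live index untouched.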

The existing index stats are as follows:

Index mc_search-000001 ~ 3TB

    "_shards": {
        "total": 60,
        "successful": 60,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "store": {
                "size_in_bytes": 1604710273530,
                "total_data_set_size_in_bytes": 1604710273530,
                "reserved_in_bytes": 0
            }
        },
        "total": {
            "store": {
                "size_in_bytes": 3209420547060,
                "total_data_set_size_in_bytes": 3209420547060,
                "reserved_in_bytes": 0
            }
        }
    },

Index mc_search-000002 ~ 2.5TB

    "_shards": {
        "total": 60,
        "successful": 60,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "store": {
                "size_in_bytes": 1250744283074,
                "total_data_set_size_in_bytes": 1250744283074,
                "reserved_in_bytes": 0
            }
        },
        "total": {
            "store": {
                "size_in_bytes": 2501488566148,
                "total_data_set_size_in_bytes": 2501488566148,
                "reserved_in_bytes": 0
            }
        }
    },

Index mc_search-000003 ~ 0.8TB

    "_shards": {
        "total": 60,
        "successful": 60,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "store": {
                "size_in_bytes": 417504383254,
                "total_data_set_size_in_bytes": 417504383254,
                "reserved_in_bytes": 0
            }
        },
        "total": {
            "store": {
                "size_in_bytes": 835127786743,
                "total_data_set_size_in_bytes": 835127786743,
                "reserved_in_bytes": 0
            }
        }
    },

To restore any one of these indices to a different cluster, we'd need at least 0.8TB of disk storage.
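The sizes quoted above can be reproduced from the `size_in_bytes` values in the stats. The short Python sketch below does the arithmetic, assuming decimal terabytes (10^12 bytes) and keying each index by its numeric suffix. Note that `total` includes replica copies while `primaries` does not, so a restore onto a cluster with replicas disabled might need closer to the `primaries` figure. That is an assumption to verify, not something we have tested.

```python
# Store sizes in bytes, copied from the index stats above,
# keyed by the numeric suffix of each index name.
STATS = {
    "000001": {"primaries": 1604710273530, "total": 3209420547060},
    "000002": {"primaries": 1250744283074, "total": 2501488566148},
    "000003": {"primaries": 417504383254, "total": 835127786743},
}


def to_tb(size_in_bytes):
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return size_in_bytes / 1e12


# The smallest index by total size sets the minimum disk needed
# to restore any single index with its replicas.
smallest = min(STATS, key=lambda suffix: STATS[suffix]["total"])

for suffix, sizes in STATS.items():
    print(
        f"{suffix}: primaries {to_tb(sizes['primaries']):.2f} TB, "
        f"with replicas {to_tb(sizes['total']):.2f} TB"
    )
print("smallest index:", smallest)
```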

kilemensi commented 3 months ago

To restore any one of these indices to a different cluster, we'd need at least 0.8TB of disk storage.

Does ☝🏽 mean none of the current servers have such a capacity @thepsalmist?

thepsalmist commented 3 months ago

To restore any one of these indices to a different cluster, we'd need at least 0.8TB of disk storage.

Does ☝🏽 mean none of the current servers have such a capacity @thepsalmist?

Yes, none.

pgulley commented 3 months ago

OK, so the options for a restore are:

Buy some new hardware to expand our storage capacity.
Run the validation on an ephemeral cloud service.

mediacloud / story-indexer

Validate backup strategies #308