nasa / opera-sds-ops

Apache License 2.0

[New Feature]: Create, test and document a backup procedure #13

Open LucaCinquini opened 1 year ago

LucaCinquini commented 1 year ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

No response

Describe the feature request

We need a reliable procedure to back up and restore the state of the SDS Elasticsearch clusters: Mozart, GRQ, and possibly Metrics. Please use either the Dev Common or one of the I&T venues, and check with other developers first to make sure they are not currently using it.

riverma commented 1 year ago

@LucaCinquini might be good for us to define some backup parameters / constraints.

For example, how does the following sound?

LucaCinquini commented 1 year ago

@riverma: good idea. Those parameters are OK with me. To save money, if necessary, we could also back up every 24 hours and keep the backups for 7 days.

niarenaw commented 1 year ago

Backup procedure documented here: https://wiki.jpl.nasa.gov/display/operasds/ElasticSearch+Backup+and+Restoration

Tested with GRQ and Mozart by (a) backing up all ES docs, (b) purging all documents from each cluster, and (c) restoring all documents. Confirmed the document count matched before step (a) and after step (c).
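The count check in steps (a) through (c) could be scripted along these lines. This is a minimal sketch, not the documented procedure: it assumes the per-index counts were captured from Elasticsearch's `_cat/indices?format=json` output before the backup and again after the restore, and the function names are hypothetical.

```python
def counts_from_cat_indices(rows):
    """Map index name -> document count from parsed `_cat/indices?format=json` rows."""
    counts = {}
    for row in rows:
        raw = row.get("docs.count")
        # Closed indices report no count; treating that as 0 is an assumption.
        counts[row["index"]] = int(raw) if raw is not None else 0
    return counts


def count_mismatches(before, after):
    """Return {index: (before_count, after_count)} for each index whose count differs.

    An index missing on one side is reported with None on that side.
    """
    mismatches = {}
    for index in sorted(set(before) | set(after)):
        b, a = before.get(index), after.get(index)
        if b != a:
            mismatches[index] = (b, a)
    return mismatches
```

An empty result from `count_mismatches` means every index has the same document count before step (a) and after step (c). Comparing per index, rather than one cluster-wide total, would also catch the case where documents end up in the wrong index.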

riverma commented 1 year ago

Looks good @niarenaw! Please make sure to use the new template that @LalaP set up. See the OPERA SDS OPS PROCEDURES main page for a link to creating a template wiki page from scratch.

LucaCinquini commented 1 year ago

I also had a look. Thanks for testing and documenting, Nick. May I suggest that others test this procedure as well, perhaps Lala and Sri (separately), after successfully completing a regression test so that the Elasticsearch indices are populated? We probably also need to set up a cron job to back up these indices every 24 hours.

niarenaw commented 1 year ago

I've updated the procedure to follow the Ops Procedure template. A nightly backup should be pretty easy to add as a cron job. I think it probably makes more sense to store the backups in S3 rather than on Mozart, to avoid any additional cleanup or disk space monitoring. Maybe we set up a new bucket and add a 14-day retention period as a lifecycle rule? Or 30 days?
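For reference, an expiration lifecycle rule of this kind could look roughly like the sketch below. It assumes boto3 is used to configure the bucket; the bucket name, rule ID, and function names are hypothetical, and the retention period (14 or 30 days) is still undecided.

```python
def build_expiration_rule(days, prefix="", rule_id="expire-es-backups"):
    """Build a lifecycle configuration that deletes objects `days` days after creation."""
    if days <= 0:
        raise ValueError("retention period must be a positive number of days")
    return {
        "Rules": [
            {
                "ID": rule_id,                 # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(bucket, days):
    """Attach the rule to `bucket`. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3 installed

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_expiration_rule(days),
    )
```

Calling `apply_expiration_rule("example-es-backups", 14)` would apply a 14-day rule to a placeholder bucket. With the rule in place, S3 expires old backups on its own, so no cleanup job is needed on Mozart.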

riverma commented 1 year ago

+1 @niarenaw to a lifecycle rule. Let's discuss the details for this.