nasa / opera-sds-ops

Apache License 2.0
[New Feature]: System Recovery Procedures #14

Closed riverma closed 1 year ago

riverma commented 1 year ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

N/A

Describe the feature request

@LucaCinquini mentioned that we need a set of procedures for system recovery of the SDS.

The key items that need to be included are:

  1. Shutting down a system in the case of a severe problem
  2. Restarting a stopped / paused system from the point where processing stopped
  3. Examining logs for diagnosis purposes
  4. Redeploying to a fresh cluster if needed

LucaCinquini commented 1 year ago

I think this procedure is not about forensics on the old system (which will probably be the responsibility of the JPL security team) but rather about deploying a new cluster (in the Ops venue?) and restarting processing. As such, the key elements are:

  - Restart forward processing right away
  - Identify the missing data gap
  - Execute a script to resubmit processing for the jobs that were missed
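
To make the gap-identification and resubmission steps concrete, here is a rough Python sketch. It is only an illustration, not the actual SDS tooling: the function names `find_gaps` and `split_into_jobs`, the idea of tracking processed time intervals, and the one-hour job window are all assumptions made for the example. The real resubmission call would depend on the deployed system and is left as a placeholder print.

```python
from datetime import datetime, timedelta


def find_gaps(processed, start, end):
    """Return the sub-intervals of [start, end) not covered by `processed`.

    `processed` is a list of (start, end) datetime pairs (hypothetical input,
    e.g. derived from a catalog of products that were successfully generated).
    """
    gaps = []
    cursor = start
    for seg_start, seg_end in sorted(processed):
        if seg_end <= cursor:
            continue  # segment lies entirely before the current position
        if seg_start >= end:
            break  # segment lies entirely after the window of interest
        if seg_start > cursor:
            gaps.append((cursor, seg_start))  # uncovered stretch found
        cursor = max(cursor, seg_end)
    if cursor < end:
        gaps.append((cursor, end))  # trailing uncovered stretch
    return gaps


def split_into_jobs(gaps, window):
    """Split each gap into job-sized (start, end) windows of at most `window`."""
    jobs = []
    for gap_start, gap_end in gaps:
        t = gap_start
        while t < gap_end:
            jobs.append((t, min(t + window, gap_end)))
            t += window
    return jobs


if __name__ == "__main__":
    # Hypothetical outage window and processed intervals.
    outage_start = datetime(2023, 1, 1, 0, 0)
    outage_end = datetime(2023, 1, 1, 6, 0)
    processed = [
        (datetime(2023, 1, 1, 0, 0), datetime(2023, 1, 1, 2, 0)),
        (datetime(2023, 1, 1, 5, 0), datetime(2023, 1, 1, 6, 0)),
    ]
    gaps = find_gaps(processed, outage_start, outage_end)
    for job_start, job_end in split_into_jobs(gaps, timedelta(hours=1)):
        # Placeholder: a real script would call the system's job submission here.
        print(f"resubmit {job_start.isoformat()} -> {job_end.isoformat()}")
```

Forward processing can be restarted independently of this; the sketch only covers backfilling the window during which the old cluster was unavailable.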

riverma commented 1 year ago

Thanks for the clarifying comments, @LucaCinquini. I will share a draft of this soon to get this squared away.

riverma commented 1 year ago

Closing this ticket as the task has been completed.