nasa / opera-sds-pcm

Observational Products for End-Users from Remote Sensing Analysis (OPERA)
Apache License 2.0

[New Feature]: Auto-restart of all PCM services needed for OPS #625

Open riverma opened 11 months ago

riverma commented 11 months ago

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

We encountered some downtime a few weeks ago. One observation was that although the AWS EC2 instances auto-restarted, the full stack of PCM services on machines like GRQ did not. This led to 10+ hours of downtime before personnel detected the issue.

Describe the feature request

We should ensure that all PCM services essential for daily operations automatically restart upon any VM reboot or process exit (up to a maximum number of times).

riverma commented 11 months ago

Suggestions on implementation:

  1. Identify all essential PCM services needed for OPS by consulting OPS and PCM teams
  2. Ensure all services are wrapped as systemd services
  3. Enable auto-restart policies as needed
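
As a rough illustration of steps 2 and 3, here is a minimal sketch that renders a systemd unit file with a bounded auto-restart policy. The service name, command, and the `render_unit` helper are all hypothetical and not taken from the PCM codebase; the restart limits are placeholder values that would need to be agreed with the OPS and PCM teams. Only standard systemd directives are used (`Restart=`, `RestartSec=`, `StartLimitIntervalSec=`, `StartLimitBurst=`, `WantedBy=`).

```python
def render_unit(name, exec_start, max_restarts=5, window_sec=600, restart_delay_sec=10):
    """Return the text of a systemd unit with a bounded restart policy.

    All arguments are illustrative placeholders, not real PCM settings.
    """
    lines = [
        "[Unit]",
        f"Description={name} (hypothetical example service)",
        "After=network-online.target",
        "Wants=network-online.target",
        # Allow at most `max_restarts` starts within `window_sec` seconds;
        # once exceeded, systemd stops trying and marks the unit as failed.
        f"StartLimitIntervalSec={window_sec}",
        f"StartLimitBurst={max_restarts}",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
        # Restart when the process exits uncleanly (non-zero exit, signal, timeout).
        "Restart=on-failure",
        f"RestartSec={restart_delay_sec}",
        "",
        "[Install]",
        # With `systemctl enable`, the unit is started at boot (covers VM reboots).
        "WantedBy=multi-user.target",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical service name and command, for illustration only.
    print(render_unit("example-pcm-service", "/usr/bin/env example-pcm-service"))
```

The generated text would be saved under `/etc/systemd/system/` and activated with `systemctl enable --now`. `Restart=on-failure` is one possible choice; `Restart=always` would also restart after clean exits, which may or may not be desirable per service.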

riverma commented 11 months ago

CC @hhlee445 - let’s triage this. I can help with step 1, and it would be great to work with PCM on steps 2 and 3.