ucsf-wynton / wynton-website-hpc

The Official Wynton HPC User Website
https://wynton.ucsf.edu/hpc/

STATUS/SGE: Automatically report on scheduled downtimes #149

Open HenrikBengtsson opened 4 months ago

HenrikBengtsson commented 4 months ago

SGE maintenance windows are scheduled using SGE calendars. We can use that information to automatically populate docs/hpc/status/incidents-upcoming.md to give a heads-up to users.

All available SGE calendars:

$ qconf -scall
beegfs_outage
cc_outage
maint_downtime
n106_outage
rowA_downtime

Example of one of the SGE calendars:

$ qconf -scal maint_downtime
calendar_name    maint_downtime
year             30.10.2023,31.10.2023,1.11.2023,2.11.2023,3.11.2023=off
week             NONE

This defines a one-time event (week = NONE) that runs from 2023-10-30 through 2023-11-03.
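
The reading above (first and last listed date) can be sketched as a small filter. This is only a sketch: the function name `sge_cal_date_range` is hypothetical, it ignores any time-of-day restrictions in the calendar, and it assumes that the dates listed on the `year` line form one contiguous window.

```shell
# Hypothetical helper: read the output of 'qconf -scal <name>' on stdin and
# print the earliest and latest date found on the "year" line as ISO 8601.
sge_cal_date_range() {
  grep -E '^year[[:space:]]' |
    # Pull out every D.M.YYYY token, whether listed, ranged, or timed
    grep -oE '[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}' |
    # Reorder to YYYY-MM-DD with zero padding so that a plain sort is chronological
    awk -F. '{ printf "%04d-%02d-%02d\n", $3, $2, $1 }' |
    sort |
    awk 'NR == 1 { first = $0 } { last = $0 } END { if (NR > 0) print first, last }'
}
```

With the calendar shown above, `qconf -scal maint_downtime | sge_cal_date_range` would be expected to print `2023-10-30 2023-11-03`. A calendar whose `year` entry is `NONE` produces no output.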

Our upcoming downtime, 2024-06-17T09:00 through 2024-06-18, is encoded as:

$ qconf -scal maint_downtime
calendar_name    maint_downtime
year             17.6.2024,18.6.2024=off
week             NONE
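
To report only downtimes that are still ahead, one possible approach is to test whether a calendar mentions any date on or after today. The sketch below is an assumption about how this could be done, not an existing tool: `sge_cal_is_upcoming` is a hypothetical name, the check looks only at dates on the `year` line, and it ignores times of day.

```shell
# Hypothetical helper: read the output of 'qconf -scal <name>' on stdin and
# succeed if the "year" line mentions a date on or after the ISO 8601 date
# given as the first argument.
sge_cal_is_upcoming() {
  grep -E '^year[[:space:]]' |
    grep -oE '[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{4}' |
    awk -F. -v today="$1" '
      # Zero-padded ISO dates compare correctly as plain strings
      { date = sprintf("%04d-%02d-%02d", $3, $2, $1) }
      date >= today { found = 1 }
      END { exit(found ? 0 : 1) }'
}

# Example use; skipped on machines without SGE installed.
if command -v qconf >/dev/null 2>&1; then
  for cal in $(qconf -scall); do
    if qconf -scal "$cal" | sge_cal_is_upcoming "$(date +%Y-%m-%d)"; then
      printf '%s\n' "$cal"
    fi
  done
fi
```

The names printed by the loop are the calendars that could feed docs/hpc/status/incidents-upcoming.md. How each entry should be worded in that file is left open here.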

HenrikBengtsson commented 1 month ago

How to show the definitions of all calendars:

$ mapfile -t cals < <(qconf -scall)
$ for cal in "${cals[@]}"; do qconf -scal "${cal}"; done
calendar_name    beegfs_outage
year             8.7.2019-12.7.2019=off
week             NONE
calendar_name    cc_outage
year             23.10.2019=1-7=off
week             NONE
calendar_name    maint_downtime
year             17.6.2024,18.6.2024,19.6.2024,20.6.2024=off
week             NONE
calendar_name    n106_outage
year             4.8.2021=14:30-23:59=off 5.8.2021=0:00-8:00=off
week             NONE
calendar_name    rowA_downtime
year             27.4.2022=8-15=off
week             NONE

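Ditto, but with ISO 8601 dates and zero-padded HH:MM timestamps. Note that the padding step also turns the hour range 1-7 into 1-07, and that bare hours such as 8-15 are left unpadded: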

$ for cal in "${cals[@]}"; do qconf -scal "${cal}" | sed -E 's/\b([[:digit:]]+)[.]([[:digit:]]+)[.]([[:digit:]]+)\b/\3-\2-\1/g' | sed -E 's/-([[:digit:]])\b/-0\1/g' | sed -E 's/\b([[:digit:]]):/0\1:/g'; done
calendar_name    beegfs_outage
year             2019-07-08-2019-07-12=off
week             NONE
calendar_name    cc_outage
year             2019-10-23=1-07=off
week             NONE
calendar_name    maint_downtime
year             2024-06-17,2024-06-18,2024-06-19,2024-06-20=off
week             NONE
calendar_name    n106_outage
year             2021-08-04=14:30-23:59=off 2021-08-05=00:00-08:00=off
week             NONE
calendar_name    rowA_downtime
year             2022-04-27=8-15=off
week             NONE
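
The sed pipeline above pads any single digit that follows a hyphen, which is why the hour range in cc_outage comes out as 1-07. A possible variant, sketched below, pads the day and month only inside complete D.M.YYYY dates before reordering them, so hour ranges are left alone. The function name `sge_cal_iso8601` is hypothetical, and like the original command it relies on the GNU sed `\b` extension.

```shell
# Hypothetical helper: rewrite D.M.YYYY dates on stdin as YYYY-MM-DD and
# zero-pad single-digit hours in H:MM timestamps. Requires GNU sed (\b).
sge_cal_iso8601() {
  sed -E \
    -e 's/\b([0-9])\.([0-9]{1,2})\.([0-9]{4})\b/0\1.\2.\3/g' \
    -e 's/\b([0-9]{2})\.([0-9])\.([0-9]{4})\b/\1.0\2.\3/g' \
    -e 's/\b([0-9]{2})\.([0-9]{2})\.([0-9]{4})\b/\3-\2-\1/g' \
    -e 's/\b([0-9]):([0-9]{2})\b/0\1:\2/g'
}
```

The first two expressions pad the day and then the month, the third reorders the now fixed-width date, and the last pads single-digit hours that are followed by minutes. It would be used as `qconf -scal "${cal}" | sge_cal_iso8601`. Bare hour ranges such as 1-7 or 8-15 pass through unchanged.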