tl-its-umich-edu / mpr-research-data


raising alerts on errors #16

Open lsloan opened 2 years ago

lsloan commented 2 years ago

Originally posted by @lsloan in Slack at https://umich-its-annarbor.slack.com/messages/GNHFUDEEN/p1652882938672849

One of the few requirements we received from Amber about this data handling process is that it should run once daily, probably a little after midnight. Instructors often use Canvas' default time of day for assignment due dates, which is 23:59. So, if we set the job to run at 00:10, it should start shortly after MPR's 00:00 job has completed.

However, I wonder what the application should do if an error occurs that prevents it from running, e.g., outages in DB, GCP, or network; application or OpenShift problems; etc. This is not a continuously running app that could provide a status service to Nagios for alerts. (Nagios may be overkill anyway.) We could have it send the console output or logs via email to our group and to Amber, but those messages often get overlooked or become bothersome. Maybe if email were sent only when there was a failure, it would be less trouble and more useful.
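To make the "email only on failure" idea concrete, here is a rough sketch in Python. It is only an illustration, not a proposal for the final design. The names `run_with_failure_alert` and `build_failure_email`, the subject line, and the addresses are all hypothetical. The `send` argument is any callable that accepts an `EmailMessage`, so the mail transport is left open.

```python
import traceback
from email.message import EmailMessage
from typing import Callable, Sequence


def build_failure_email(
    error_text: str, sender: str, recipients: Sequence[str]
) -> EmailMessage:
    """Build the alert message. The subject line is a placeholder."""
    message = EmailMessage()
    message["Subject"] = "mpr-research-data: daily run failed"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message.set_content(error_text)
    return message


def run_with_failure_alert(
    job: Callable[[], None],
    send: Callable[[EmailMessage], object],
    sender: str,
    recipients: Sequence[str],
) -> int:
    """Run the job; send an email only if it raises.

    Returns 0 on success and 1 on failure, so the caller can pass the
    value to sys.exit() and let the scheduler see the failure too.
    """
    try:
        job()
    except Exception:
        # Include the full traceback so the email is actionable on its own.
        send(build_failure_email(traceback.format_exc(), sender, recipients))
        return 1
    return 0
```

In a real run, `send` might be the `send_message` method of an `smtplib.SMTP` connection, assuming an SMTP relay is reachable from the container. Note that this only covers errors raised while the application is running. If the job never starts (for example, an OpenShift scheduling problem), no email would be sent, so that case would still need some external check.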

Aside from email, I'm not sure where the console output or logs from past runs are stored. Were we using Splunk for that or was it something else? Would that log store service be useful for alerting about problems? Perhaps not raising the alerts itself, but as a data source for something that would raise the alert?

As written in the thread on Slack, this needs some research and discussion with the group before implementation can begin. See the thread for details.

Additional info