reichlab / forecast-repository

Codebase for Zoltar forecast repository
https://zoltardata.com/
GNU General Public License v3.0

review how to detect failed score updates #258

Closed matthewcornell closed 3 years ago

matthewcornell commented 4 years ago

...and about notifications of problems in general. Examples:

Possible solutions:

At a higher level, we need to do a systematic review of all the kinds of production errors we regularly see, and make a plan for mitigating them.

matthewcornell commented 3 years ago

Here's a simple solution: Use Papertrail's alerts feature:

  1. Save a search in Events, e.g., JobTimeoutException
  2. Set up an alert, e.g., with Slack

We should probably make log messages more consistent, so that one saved search can match them all. Currently the messages vary depending on the error, such as here. Maybe something like <function_name>(): error: <error class> ....
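A rough sketch of what that consistent format could look like in Python, using only the standard logging module. The names format_error, log_errors, and update_scores are hypothetical placeholders for illustration, not functions from this codebase, and the exact message layout after the error class is an assumption:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def format_error(function_name, exc):
    """Build '<function_name>(): error: <error class>: <details>' (hypothetical helper)."""
    return f"{function_name}(): error: {type(exc).__name__}: {exc}"


def log_errors(func):
    """Log any exception raised by func in the consistent format, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            logger.error(format_error(func.__name__, exc))
            raise  # keep the original failure visible to the caller
    return wrapper


@log_errors
def update_scores(model_id):
    # Hypothetical stand-in for a long-running job that fails.
    raise TimeoutError(f"score update timed out for model {model_id}")
```

With every message sharing the "(): error: " fragment, a single saved search on that fragment should catch all logged failures, and a search on a specific error class name narrows it to one kind.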

Links: Papertrail integration with Slack: