PSC-STM-A8: Monitoring of PSC Stream Ingester

tiredpixel commented 8 months ago

Since it should be running continuously, it is important to get notifications of when it has crashed.

It ought to recover from a brief crash, since the stream pointer is stored and it can resume from where it stopped. However, frequent crashes could indicate that something needs fixing, and a long outage could mean the stream data is no longer available (a sign that we will need to ingest the snapshots to get any missing data).
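
To illustrate the resume behaviour described above, here is a minimal sketch of a stored stream pointer. It is not taken from the ingester's code: `PointerStore`, `ingest`, the file-based storage, and the integer `:id` field on events are all assumptions made for the example.

```ruby
# Hypothetical file-backed store for the last processed stream position.
class PointerStore
  def initialize(path)
    @path = path
  end

  # Returns the saved pointer as a string, or nil when nothing is stored yet.
  def load
    File.exist?(@path) ? File.read(@path).strip : nil
  end

  # Writes to a temporary file first, then renames it into place, so a crash
  # mid-write does not leave a truncated pointer behind.
  def save(pointer)
    tmp = "#{@path}.tmp"
    File.write(tmp, pointer.to_s)
    File.rename(tmp, @path)
  end
end

# Yields only events newer than the stored pointer, recording progress after
# each one. Assumes events arrive in order with increasing integer ids.
def ingest(events, store)
  last = store.load
  events.each do |event|
    next if last && event[:id] <= last.to_i
    yield event
    store.save(event[:id])
  end
end
```

After a restart, calling `ingest` again with the same store skips everything up to the saved pointer, which is why a brief crash need not lose data as long as the stream still holds the events after that pointer.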

Add monitoring so that a notification is sent whenever it goes down, and so that uptime status is reported.

Estimate: 4 hours

tiredpixel commented 6 months ago

I'm not up-to-date with Heroku monitoring extensions, so I'll evaluate the options briefly.

The Heroku Errors and Exceptions add-ons are: https://elements.heroku.com/addons#errors-exceptions

tiredpixel commented 6 months ago

Bugsnag

https://elements.heroku.com/addons/bugsnag

7500 exceptions/month @ free
sets BUGSNAG_API_KEY env var
bugsnag Ruby library
doesn't work without configuration, even with RACK_ENV env var set
documentation has interesting example: to work on-exit for unhandled exceptions, rather than catching exceptions directly
test exception took a couple of minutes to be received
basic plan ('tauron') has limit of 1 collaborator; not sure if it's possible to add additional emails in such a case
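
The on-exit approach noted above can be sketched in plain Ruby roughly as follows. This is only an illustration of the idea, not the example from the Bugsnag documentation: `REPORTER` and `crash?` are hypothetical names, and the reporter just writes to stderr where a real app would call its monitoring library.

```ruby
# Hypothetical reporter: anything that responds to #call and takes an exception.
# Here it only writes to stderr; a real app would notify its monitoring service.
REPORTER = ->(error) { warn "reporting: #{error.class}: #{error.message}" }

# Decides whether the exception that ended the process counts as a crash.
# nil means a clean finish, SystemExit is a normal `exit`, and signals such as
# SIGTERM or Ctrl-C arrive as SignalException, so none of those are reported.
def crash?(error)
  return false if error.nil?
  return false if error.is_a?(SystemExit) || error.is_a?(SignalException)
  true
end

# At exit, $! holds the exception that is terminating the process, if any.
at_exit do
  error = $!
  REPORTER.call(error) if crash?(error)
end
```

The appeal is that nothing in the main loop needs a `begin`/`rescue` around it; the hook sees whatever unhandled exception brings the process down. Whether signals should be excluded is a judgement call made here for the sketch.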

Raygun Crash Reporting

https://elements.heroku.com/addons/raygun

5K errors/month @ free
sets RAYGUN_APIKEY env var
raygun4ruby Ruby library
compulsory company survey
doesn't work without configuration, even with RACK_ENV env var set
works with configuration
not immediately clear how to add email recipient

Sentry

https://elements.heroku.com/addons/sentry

5K errors/month @ free
sets SENTRY_DSN env var
sentry-ruby Ruby library
compulsory newsletter choice
doesn't work without configuration, even with RACK_ENV env var set
works with configuration
simple account-level alerts don't work even after email confirmation

Honeybadger

https://elements.heroku.com/addons/honeybadger

5K errors/month @ free
sets HONEYBADGER_API_KEY env var
honeybadger Ruby library
doesn't work without configuration, even with RACK_ENV env var set
quick-start instructions don't work in our case
works with configuration
attempting to add extra email alert fails with exception ( :| )
alert still went to main account email, rather than additional email

AppSignal APM

https://elements.heroku.com/addons/appsignal

250K requests/month @ $15/month
skipping since more complex and expensive than the others for our use case (full APM rather than just exception monitoring)

Airbrake Error Monitoring

https://elements.heroku.com/addons/airbrake

2K errors/month @ free
sets AIRBRAKE_API_KEY, AIRBRAKE_PROJECT_ID env vars
airbrake-ruby Ruby library
doesn't work without configuration, even with RACK_ENV env var set
works with configuration
simple account-level email alerts work
I haven't used it for many years, but it still appears to be simple to use out-the-box
errors take a couple of minutes to be detected ( :( )
the next pricing tier is more expensive than some of the others

Rollbar

https://elements.heroku.com/addons/rollbar

5K events/month @ free
sets ROLLBAR_ACCESS_TOKEN, ROLLBAR_ENDPOINT env vars
rollbar Ruby library
doesn't work without configuration, even with RACK_ENV env var set
works with configuration
not immediately clear how to add email alerts

tiredpixel commented 6 months ago

Monitoring extensions experiment conclusion

There are likely multiple options possible for us, here; I evaluated each option for only a few minutes. Had I had more experience with each option, it is likely that I would know better how to configure them for our use case. However, our use case is actually very simple: detect a failure, and send an email. I was surprised that configuring this wasn't immediately obvious in some of the options.

Nothing worked immediately out-the-box; this is typical for Ruby (or at least was some years ago), but I wondered if one would have a Ruby library which injected into the exceptions stack and worked with zero configuration, even for a basic Ruby app (not Rails, not Sinatra, not Rake, etc.). I can fully understand why they didn't opt for this approach—but it would be less work. Perhaps there are alternative configurations which support this, but I didn't spot them when glancing through the documentation for each.

From this experiment, I would say that Airbrake was the easiest to set up, understand, and configure. However, I should note that I have previous experience with Airbrake (albeit many years, perhaps closer to a decade, ago… :! )—yet I don't think this was a major influence. However, it's more expensive than the others for the next (non-free) tier. It's also not clear to me how easy it will be to configure multiple users (which themselves require a non-free event tier).

This judgement is rather arbitrary, since it's likely there are multiple good options, here—just with a little more work. However, I'm not sure where the work would be best invested (not being familiar with Heroku monitoring add-ons in recent years). So, I would suggest starting with Airbrake (which also has a free tier), and if necessary, considering whether to pay the (rather costly) non-free tier upgrade, or whether to switch to an alternative. However, from my experiments, Airbrake will easily support what we need for the time being, with minimal fuss. But as I say, this judgement is rather arbitrary; given a little more time or previous experience with some of the other options, I might well have selected one of them instead…

What's most important here is to have something that works for what we need currently—alerting us when there are crashes in the streaming app. Unfortunately, it appears that Heroku neither has this functionality natively, nor has a way of automatically restarting crashed apps. This is, quite frankly, a bitter disappointment; not only the gold-standard Kubernetes, but also other alternatives, have long supported this sort of fatal crash and restart scenario. Perhaps I missed something, but I don't see anything indicating to the contrary, at present.

Neither does there seem to be a recommended path for accomplishing this in Heroku, even with the installation of plugins. Thus, given that the primary objectives of error detection and email alerting are achieved, and given that evaluating even these Heroku add-on options has taken a fair amount of time, I recommend selecting Airbrake in the first instance, and then re-evaluating on a usage and pricing basis once those become the dominant factors.

This whole experiment puts me in mind of simply ignoring exception monitoring at the Ruby level altogether, and instead dealing with it at the ops level by wrapping the Procfile script and capturing stdout and the exit status. This would also allow reporting monitoring statuses such as run duration and time spent not running to a solution outside Heroku, such as is typically used to monitor crontabs and similar processes. I wouldn't be surprised if I ended up recommending such an approach instead. However, in the first instance, I'm trying to keep within the Heroku and typical Ruby solutions as much as possible, rather than simply removing it from that stack layer entirely and taking an 'old-school' devops/sysadmin approach (which would likely solve our use case, and more than we're currently able to monitor, far more simply…).
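
As a rough idea of what wrapping the Procfile script could look like, here is a minimal sketch. It is an assumption-laden illustration rather than a worked-out proposal: `run_monitored`, `report`, and the `notifier` callable are hypothetical names, and the notifier only writes to stderr where a real setup would call an external monitoring service.

```ruby
require "open3"

# Runs a command, passing its combined stdout/stderr through to `log` while
# keeping the last few lines, and records the exit status and run duration.
def run_monitored(*command, tail_lines: 50, log: $stdout)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  tail = []
  status = Open3.popen2e(*command) do |stdin, out, wait_thread|
    stdin.close
    out.each_line do |line|
      log.print(line)                       # keep output visible in the logs
      tail << line
      tail.shift if tail.size > tail_lines  # bound memory for long runs
    end
    wait_thread.value                       # Process::Status of the child
  end
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { exit_code: status.exitstatus, success: status.success?,
    duration: duration, tail: tail.join }
end

# Sends a message through the (hypothetical) notifier when the run failed.
def report(result, notifier: ->(message) { warn message })
  return if result[:success]
  notifier.call(
    "process exited with #{result[:exit_code].inspect} " \
    "after #{result[:duration].round(1)}s\n#{result[:tail]}"
  )
end
```

A wrapper script would then call something like `report(run_monitored("bundle", "exec", "ruby", "ingester.rb"))`, where the command itself is a placeholder. Because the check sits outside the app, it catches any non-zero exit regardless of what happened inside Ruby, and the duration gives a crude uptime figure.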
