openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0

PSC-STM-A8: Monitoring of PSC Stream Ingester #247

Open tiredpixel opened 8 months ago

tiredpixel commented 8 months ago

Since it should be running continuously, it is important to get notifications of when it has crashed.

It ought to recover from a brief crash, since the stream pointer is stored and it can resume from where it left off. However, frequent crashes could indicate that something needs fixing, and a long outage could mean the stream data is no longer available (a sign that we will need to ingest the snapshots to get any missing data).
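As a rough illustration of the resume behaviour described above, here is a minimal sketch, assuming the pointer is a monotonically increasing integer and is kept in a local file. The `PointerStore` class, the `ingest` method, and the `:position` field are hypothetical names for illustration only; the real ingester may store its pointer quite differently.

```ruby
# Hypothetical file-backed store for the stream pointer (an assumption;
# the real ingester may keep it in a database or elsewhere).
class PointerStore
  def initialize(path)
    @path = path
  end

  # Returns the last stored position, or nil if nothing has been stored yet.
  def read
    File.exist?(@path) ? Integer(File.read(@path).strip) : nil
  end

  # Writes to a temporary file first, then renames it over the old one,
  # so a crash mid-write does not leave a half-written pointer behind.
  def write(position)
    tmp = "#{@path}.tmp"
    File.write(tmp, position.to_s)
    File.rename(tmp, @path)
  end
end

# Hands each event to the block, skipping anything at or before the stored
# pointer, and advances the pointer only after the block has succeeded.
def ingest(events, store)
  last = store.read
  events.each do |event|
    next if last && event[:position] <= last

    yield event
    store.write(event[:position])
  end
end
```

Because the pointer only advances after an event has been handled, an event that was in flight during a crash is processed again after the restart, so the handler in such a design needs to tolerate seeing the same event twice.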

Add monitoring so that a notification is sent whenever it goes down, and so that its uptime status is reported.

Estimate: 4 hours

tiredpixel commented 6 months ago

I'm not up-to-date with Heroku monitoring extensions, so I'll evaluate the options briefly.

The Heroku Errors and Exceptions add-ons are: https://elements.heroku.com/addons#errors-exceptions

tiredpixel commented 6 months ago

Bugsnag

https://elements.heroku.com/addons/bugsnag

Raygun Crash Reporting

https://elements.heroku.com/addons/raygun

Sentry

https://elements.heroku.com/addons/sentry

Honeybadger

https://elements.heroku.com/addons/honeybadger

AppSignal APM

https://elements.heroku.com/addons/appsignal

Airbrake Error Monitoring

https://elements.heroku.com/addons/airbrake

Rollbar

https://elements.heroku.com/addons/rollbar

tiredpixel commented 6 months ago

Monitoring extensions experiment conclusion

There are likely multiple options possible for us, here; I evaluated each option for only a few minutes. Had I had more experience with each option, it is likely that I would know better how to configure them for our use case. However, our use case is actually very simple: detect a failure, and send an email. I was surprised that configuring this wasn't immediately obvious in some of the options.

Nothing worked immediately out of the box; this is typical for Ruby (or at least was some years ago), but I wondered whether one of them would have a Ruby library which hooked into the exception stack and worked with zero configuration, even for a basic Ruby app (not Rails, not Sinatra, not Rake, etc.). I can fully understand why they didn't opt for this approach, but it would be less work. Perhaps there are alternative configurations which support this, but I didn't spot them when glancing through the documentation for each.
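For a plain Ruby script, the kind of zero-configuration hook I had in mind could look roughly like the following sketch. It relies only on core Ruby (`at_exit` and the `$!` global); `notify_crash` and `reportable?` are hypothetical helpers, and a real notifier would send an email or call a monitoring service rather than print. It is not how any of the add-ons above work.

```ruby
# Hypothetical notifier: a real one would email or call a monitoring API.
def notify_crash(error, out: $stderr)
  out.puts "[crash] #{error.class}: #{error.message}"
end

# Decides whether the exception that ended the process is worth reporting.
# A clean `exit` (status 0) raises SystemExit too, so that case is skipped.
def reportable?(error)
  return false if error.nil?
  return false if error.is_a?(SystemExit) && error.success?

  true
end

# Runs when the interpreter shuts down; `$!` holds the exception that is
# terminating the process, or nil on a normal exit.
at_exit do
  error = $!
  notify_crash(error) if reportable?(error)
end
```

A hook like this only sees exceptions raised inside the Ruby process. It will not fire if the process is killed outright (for example by `SIGKILL` or an out-of-memory kill), which is one reason to monitor from outside the process as well.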

From this experiment, I would say that Airbrake was the easiest to set up, understand, and configure. I should note that I have previous experience with Airbrake (albeit many years ago, perhaps closer to a decade), yet I don't think this was a major influence. However, it's more expensive than the others at the next (non-free) tier. It's also not clear to me how easy it will be to configure multiple users (which themselves require a non-free event tier).

This judgement is rather arbitrary, since it's likely there are multiple good options, here—just with a little more work. However, I'm not sure where the work would be best invested (not being familiar with Heroku monitoring add-ons in recent years). So, I would suggest starting with Airbrake (which also has a free tier), and if necessary, considering whether to pay the (rather costly) non-free tier upgrade, or whether to switch to an alternative. However, from my experiments, Airbrake will easily support what we need for the time being, with minimal fuss. But as I say, this judgement is rather arbitrary; given a little more time or previous experience with some of the other options, I might well have selected one of them instead…

What's most important here is to have something that works for what we need currently—alerting us when there are crashes in the streaming app. Unfortunately, it appears that Heroku neither has this functionality natively, nor has a way of automatically restarting crashed apps. This is, quite frankly, a bitter disappointment; not only the gold-standard Kubernetes, but also other alternatives, have long supported this sort of fatal crash and restart scenario. Perhaps I missed something, but I don't see anything indicating to the contrary, at present.

Neither does there seem to be a recommended path for accomplishing this in Heroku, even with the installation of plugins. Thus, given that the primary objectives of error detection and email alerting are achieved, and given that evaluating even these Heroku add-on options has taken a fair amount of time, I recommend selecting Airbrake in the first instance, and then re-evaluating on a usage and pricing basis once those become the dominant factors.

This whole experiment puts me in mind of simply ignoring the exception monitoring altogether at the Ruby level, and instead dealing with it at an ops level by wrapping the Procfile script and capturing its stdout and exit status. This would also allow reporting monitoring data such as run duration and time spent not running to a solution outside of Heroku, such as is typically used to monitor cron jobs and similar processes. I wouldn't be surprised if I ended up recommending such an approach instead. However, in the first instance, I'm trying to keep within the Heroku and typical Ruby solutions as much as possible, rather than removing it from that stack layer entirely and taking an 'old-school' devops/sysadmin approach (which would likely solve our use case, and cover more than we're currently able to monitor, far more simply…).
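To make the wrapper idea concrete, here is a minimal sketch using Ruby's standard `open3` library. It assumes the wrapper is itself the Procfile command and is given the real command as arguments. `run_monitored`, `report`, and the script names in the comment are hypothetical, and the reporter only prints where a real one would email or ping an external monitoring service.

```ruby
require 'open3'

# Hypothetical reporter: a real one might email, or ping an external
# cron-style monitoring service. Here it only prints a summary line.
def report(result, out: $stderr)
  state = result[:success] ? 'ok' : 'FAILED'
  out.puts "[monitor] #{state} exit=#{result[:exit_code].inspect} " \
           "duration=#{result[:duration].round(1)}s"
end

# Runs the command, passes its combined stdout/stderr through, and returns
# the exit status, run duration, and the last few lines of output.
def run_monitored(*command, out: $stdout, keep_lines: 20)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  tail = []
  status = Open3.popen2e(*command) do |stdin, output, wait_thread|
    stdin.close
    output.each_line do |line|
      out.print line
      tail << line
      tail.shift if tail.size > keep_lines
    end
    wait_thread.value # Process::Status of the finished command
  end
  {
    exit_code: status.exitstatus, # nil if the process was killed by a signal
    success: status.success? == true,
    duration: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started,
    tail: tail.join
  }
end

# Usage as a wrapper script. A hypothetical Procfile entry might be:
#   stream: ruby bin/monitor bin/ingest-stream
# (both script names are made up for illustration).
if __FILE__ == $PROGRAM_NAME && !ARGV.empty?
  result = run_monitored(*ARGV)
  report(result)
  exit(result[:exit_code] || 1)
end
```

Since the wrapper sits outside the ingester, it also sees failures that never raise a Ruby exception, and the captured tail of output can be attached to whatever alert is sent.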