pronoiac / mefi.social


investigate monitoring #33

Open pronoiac opened 1 year ago

pronoiac commented 1 year ago

Our instance is on a managed host, so we don't have access to direct monitoring of services. Brainstorming:

majick commented 1 year ago

Throwing other shit at the wall, in increasing levels of complexity.

Is there a box from whence to run these things? Is one needed?

pronoiac commented 1 year ago

Dear Reader, before you go further, you should probably know that I'm an SRE - Site Reliability Engineer - and so this falls deeply into an area of interest for me, and so I can go on in waaay too much detail on this sort of thing. sorry not sorry.

For the 2022-12-28 outage, the site was still responding, but updates were (mostly) stalled. (In fact, I'm not absolutely positive the RSS feeds weren't updating, but it seems like a safe bet.) For the 2023-01-15 outage, the site responded with 504s. (Oops, I haven't pushed the "we're back up" edit.)
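The "responding but stalled" case is the one a plain status-code check misses, so here is a rough sketch of a freshness check. It assumes an RSS 2.0 feed with a `pubDate` on each item; the function names and the staleness window are made up for illustration, and fetching the feed is left out on purpose.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
import xml.etree.ElementTree as ET


def newest_item_time(feed_xml: str) -> Optional[datetime]:
    """Return the most recent item pubDate in an RSS 2.0 document, if any."""
    root = ET.fromstring(feed_xml)
    times = []
    for item in root.iter("item"):
        text = item.findtext("pubDate")
        if not text:
            continue
        stamp = parsedate_to_datetime(text.strip())
        if stamp.tzinfo is None:
            # Assumption: treat timestamps without an offset as UTC.
            stamp = stamp.replace(tzinfo=timezone.utc)
        times.append(stamp)
    return max(times) if times else None


def is_stale(feed_xml: str, now: datetime, max_age: timedelta) -> bool:
    """True when the newest item is older than max_age, or there are no items."""
    newest = newest_item_time(feed_xml)
    if newest is None:
        return True
    return now - newest > max_age
```

A sensible `max_age` depends on how busy the feed normally is, so a quiet account would make a poor canary.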

I'm not sure yet what it looks like if a request hits during an upgrade. They're expected to take about a minute, so successive requests - even 5 min later - would be expected to be ok. We'd alert on repeated errors.
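A minimal sketch of the "alert on repeated errors" idea, assuming probes run every 5 minutes and that two failures in a row is the threshold. Both numbers, and the names `FailureTracker` and `is_healthy`, are illustrative rather than anything settled here. The point is that one failed probe during a roughly one-minute upgrade stays quiet, while a second consecutive failure fires.

```python
from dataclasses import dataclass


def is_healthy(status_code: int) -> bool:
    """Treat 2xx and 3xx as healthy; a 504 counts as a failure."""
    return 200 <= status_code < 400


@dataclass
class FailureTracker:
    threshold: int = 2  # assumption: alert on the second consecutive failure
    consecutive_failures: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

With a 5 minute probe interval and `threshold=2`, an outage would need to span two probes before anyone gets paged, which trades a few minutes of detection time for not alerting on upgrades.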

I should look into more official metrics, and what hosting offers. I jumped into "treating this as a black box" (possibly too) readily, as I built something similar at a previous job. ... from that experience, I'd suggest, as separate metrics:

I already have a personal VPS running Docker, to start with, though sharing access is currently non-trivial.

Because I had to refresh my memory on this:

majick commented 1 year ago

My friend, let us enjoy sitting together companionably in this wheelhouse! I am the SRE they send to make other people's RCAs meatier/less-blamey/more-actual-rootley, so, also, yes, very sorry-not-sorry to enjoy being in this place at this time.

The way I've been pushing SLIs is as "measure of what a person I care about actually gives a crap about" and SLOs as "how hard they give a crap about it." SLAs are, for me, eh, more about setting release policy or revenue clawbacks or clubbing executive people over the head for budget and other non-applicable stuff.
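To make the SLI/SLO split concrete, here is a small sketch that treats the SLI as the fraction of good probes and the SLO as the target that fraction is held to. The function names and the 99% figure are assumptions for illustration only, not a proposed target.

```python
from typing import Sequence


def availability_sli(results: Sequence[bool]) -> float:
    """Fraction of probes that succeeded (the SLI)."""
    if not results:
        raise ValueError("need at least one probe result")
    return sum(results) / len(results)


def error_budget_remaining(results: Sequence[bool], slo_target: float) -> float:
    """Failed probes still allowed before the SLO is missed.

    A negative value means the budget is already spent.
    """
    if not results:
        raise ValueError("need at least one probe result")
    allowed_failures = (1.0 - slo_target) * len(results)
    actual_failures = len(results) - sum(results)
    return allowed_failures - actual_failures
```

For example, 1000 probes with 5 failures against a hypothetical 0.99 target gives an SLI of 0.995 and roughly 5 failures of budget left.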

From that, you can derive all kinds of cool stuff without having to stand up a TSDB pod and Grafana and all that -- which, given my personal proclivities, would probably be what I'd actually wind up doing myself. Instead you can just say:

This is looooots of fun for me to deep dive on, especially in ways that are actively useful as opposed to potentially bikesheddy, so please do feel free to ask that be amped or attenuated.