opensafely-core / sysadmin

Various scripts and tools for administering OpenSAFELY organisation and infrastructure

Link up freshstatus with Job Runner Last Seen Field #36

Closed by lucyb 2 years ago

lucyb commented 2 years ago

To investigate the feasibility of using freshstatus as a tool, we'll get it to pick up the job runner last seen field from job server, report it on the dashboard and into a slack channel.

Test out the incident management process, by checking what happens when the last seen date exceeds a certain time. It should be possible to notify a slack channel that there's an incident and when the incident is resolved.

https://bennettoxford.freshstatus.io/

Parent: #33

bloodearnest commented 2 years ago

Rather than figure out how to push this to freshping's API, I think it would be better to expose a new url on job-server that freshping can poll, as per its normal operations.
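A minimal sketch of the logic such a URL could implement, assuming the monitor treats any non-2xx response as "down". The function name, the five-minute threshold, and the choice of 503 are illustrative assumptions, not job-server's actual code:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how stale "last seen" may be before we report "down".
MAX_AGE = timedelta(minutes=5)


def last_seen_status(last_seen, now=None, max_age=MAX_AGE):
    """Return the HTTP status code a polling monitor should see.

    200 if the job runner was seen within max_age, 503 if it is stale
    or has never been seen.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_seen is None:
        return 503
    return 200 if now - last_seen <= max_age else 503
```

A view on job-server could look up the backend's last seen timestamp, call a function like this, and return an empty response with the resulting status code, so the monitor needs no knowledge of the timestamp itself.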

ghickman commented 2 years ago

General plan:

@tomodwyer – have we automated any part of our freshstatus page yet?

tomodwyer commented 2 years ago

@ghickman No, we have set up the page as a test, but no automation work has been done.

You will probably need to set up Freshping to ping these endpoints, and then pass that information on to Freshstatus, which will be able to open/close/update incidents based on the ping status.
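The open/close behaviour described above boils down to reacting only when a check changes state. A rough sketch of that decision, with hypothetical names (this is not the Freshping or Freshstatus API, just the underlying rule):

```python
def incident_action(previous_up, current_up):
    """Decide what a status page should do when a check reports a new state.

    Returns "open" when a check goes from up to down, "resolve" when it
    recovers, and None when the state has not changed.
    """
    if previous_up and not current_up:
        return "open"
    if not previous_up and current_up:
        return "resolve"
    return None
```

Because only transitions trigger an action, repeated "down" pings during an outage would not open duplicate incidents.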

ghickman commented 2 years ago

I've set up TPP to drive the TPP database status on our page. I'll hold off adding any others until we decide if we like it or not.

lucyb commented 2 years ago

@ghickman is the TPP database status just the last seen field or does it cover this issue too? https://github.com/opensafely-core/sysadmin/issues/38

ghickman commented 2 years ago

@lucyb – still just the last seen field. I still need to hook this up to the status page. The automation I've already put in place didn't update the status page during our recent downtime.

ghickman commented 2 years ago

This is blocked while the TPP backend is up.

Rather than rely on the (very!) rare occasions when we lose access to TPP we should find another view to test on.

One option is to add /500 and /404 views which test out the relevant pages with relevant status codes.
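As an illustration of the idea, here is a standalone sketch using only Python's standard library. The handler and helper names are made up for this example, and job-server itself would implement these as views in its own web framework rather than like this:

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical test paths mapped to the status code each should return.
TEST_STATUSES = {"/404": 404, "/500": 500}


def status_for_path(path):
    """Return the status code for a request path; anything unlisted is 200."""
    return TEST_STATUSES.get(path, 200)


class TestStatusHandler(BaseHTTPRequestHandler):
    """Respond to GET requests with the status code configured for the path."""

    def do_GET(self):
        status = status_for_path(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(f"{status}\n".encode())
```

To try it locally, something like `HTTPServer(("127.0.0.1", 8000), TestStatusHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) would serve these paths for a monitor to poll.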

ghickman commented 2 years ago

Realised I could just point a new check at any known 404ing URL and it would work.

Going through the docs again, I found that I'd missed setting up the webhook on freshping's side to push changes to freshstatus. That explains why TPP being down a few weeks ago didn't trigger anything.

Testing now by editing the test check I added to freshping.

ghickman commented 2 years ago

Webhook worked with the dummy URL:

ghickman commented 2 years ago

Testing this was a bit of a pain, so I'm leaving this here for the next person working on it.

I created a new check in freshping pointing at https://jobs.opensafely.org/TEST, a known broken URL. The integration between the two services uses webhooks, so it is push based. You have to wait for freshping to mark that check as down (5+ minutes), confirm it's worked on the status page, then edit the check to point at https://jobs.opensafely.org (i.e. a working URL), wait for freshping to mark the check as back up (another 5+ minutes), and confirm it's worked on the status page.


I've switched the rule driving the TPP service in freshstatus back to the TPP check from freshping.