opensafely-core / sysadmin

Various scripts and tools for administering OpenSAFELY organisation and infrastructure

EPIC Better system degradation monitoring #33

Closed lucyb closed 2 years ago

lucyb commented 2 years ago

GOAL: Improve our responsiveness to system problems so that we don’t rely on researchers reporting problems.

Tech support should be notified of important incidents with enough information to be able to investigate the problem.

Researchers should be provided with information about the current system status, so they don’t get frustrated and don’t need to contact tech support.

Measure

Solution

https://bennettoxford.freshstatus.io/

lucyb commented 2 years ago

New issues:

1: ~Investigate feasibility of using freshstatus as a tool, including integration with our services.~

1.a ~Get everyone set up on freshstatus, with seb as superadmin~

  1. ~Start with https://github.com/opensafely-core/sysadmin/issues/36~

  2. Then, integrate database maintenance mode (as "under maintenance"). Consider the wording used, so that it's clear to users. Describe it as "TPP database availability".

  3. ~#35~

  4. ~VPN availability (stretch goal): low value, as the VPN has generally been up but fails after the auth step, which we can't automate.~

  5. ~Database availability check by job-runner (stretch goal)~

  6. ~Move fresh ping alerts into #tech-support-channel https://github.com/opensafely-core/sysadmin/issues/61~

  7. ~Integrate existing website fresh pings into the status page https://github.com/opensafely-core/sysadmin/issues/62~

  8. ~Link to the status page in the relevant opensafely channels in slack~

  9. ~Let researchers and the tech group know that the status page exists.~

  10. ~Add fresh status to the tools and systems page in the Team Manual. Document how to set up an alert from freshping. https://github.com/opensafely-core/sysadmin/issues/63~

lucyb commented 2 years ago

Job runner should perform a generic db check.

Each component should be able to report its status.

This is picked up by job server and then scraped by freshping.
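
A rough sketch of how that flow might look, not the actual job-runner or job-server code. The function names (`check_db`, `aggregate`, `http_code`), the status strings, and the component names are all hypothetical, and sqlite3 stands in for the real database. It also assumes the monitor scraping the endpoint treats a non-2xx response as "down".

```python
import sqlite3


def check_db(connect):
    """Generic db check: open a connection and run a trivial query.

    `connect` is any zero-argument callable returning a DB-API style
    connection (hypothetical interface, for illustration only).
    """
    try:
        conn = connect()
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return "ok"
    except Exception:
        # Any failure to connect or query counts as unavailable.
        return "unavailable"


def aggregate(component_statuses):
    """Combine per-component reports into one overall report.

    The overall status is "ok" only if every component reports "ok".
    """
    all_ok = all(s == "ok" for s in component_statuses.values())
    return {
        "status": "ok" if all_ok else "degraded",
        "components": dict(component_statuses),
    }


def http_code(report):
    """HTTP status a status endpoint could return for this report."""
    return 200 if report["status"] == "ok" else 503


# Example: job runner reports its db check, job server aggregates it.
report = aggregate({
    "job-runner": "ok",
    "tpp-database": check_db(lambda: sqlite3.connect(":memory:")),
})
print(report, http_code(report))
```

Returning a 503 for a degraded report keeps the monitoring side simple: it only needs a plain HTTP check against one URL, while the JSON body carries the per-component detail for tech support.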

benbc commented 2 years ago

I've moved this epic into the pipeline here.