norment / tsd_issues

Repo to track issues with TSD as tickets

"Kitchen duties" for TSD - daily tasks to check health status, weekly rotation #36

Status: Closed (ofrei closed this issue 3 years ago)

ofrei commented 4 years ago

I suggest defining a checklist for p33 TSD health status that can be validated manually within 10 minutes, with a quick e-mail report sent daily (5 days a week) to all users (e.g. tsd-p33@medisin.uio.no). To implement this, we can define a group of people who are interested and have weekly "on call" rotations (as with kitchen duties).

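The checklist can include

- login / ssh to all Linux VMs (p33-rhel7-login, p33-submit, p33-tl01-l, p33-tl02-l)
- login to both Windows VMs
- /cluster/projects/p33/ mount point is accessible from p33-rhel7-login, p33-submit
- report free disk space on /cluster and /durable
- report remaining number of CPU hours
- run "top" on p33-rhel7-login and p33-submit, and report excessive memory or CPU usage by users
- import (and export, for those with export rights) via https://data.tsd.usit.no/ is working
- encourage people to reply if they have issues not detected by this checklist

Anything else that should be in the checklist?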

This can be implemented as a Google form that the responsible user has to fill out daily (similar to how it works with the SUMSTAT inventory for submitting new data).

denvdm commented 4 years ago

Cool, happy to do this. Is there some sort of benchmark that would best capture whether the system is generally 'slow'? For me, one of the most common annoyances is when TSD slows down to the point that you literally have to wait for typed text to appear. Can that be formalised? I assume this is mainly a way to monitor health and quickly find issues, as well as to get an overview of how often issues occur. Is there any other goal you have in mind? Anyway, great initiative.

danielroelfs commented 4 years ago

Hi Alex,

I can be part of this! Good initiative!

Best, Daniel

On 14 Apr 2020, at 20:30, Oleksandr Frei notifications@github.com wrote:

I suggest defining a checklist for p33 TSD health status that can be validated manually within 10 minutes, with a quick e-mail report sent daily (5 days a week) to all users (e.g. tsd-p33@medisin.uio.no). To implement this, we can define a group of people who are interested and have weekly "on call" rotations (as with kitchen duties).

The checklist can include

- login / ssh to all Linux VMs (p33-rhel7-login, p33-submit, p33-tl01-l, p33-tl02-l)
- login to both Windows VMs
- /cluster/projects/p33/ mount point is accessible from p33-rhel7-login, p33-submit
- report free disk space on /cluster and /durable
- report remaining number of CPU hours
- run "top" on p33-rhel7-login and p33-submit, and report excessive memory or CPU usage by users
- import (and export, for those with export rights) via https://data.tsd.usit.no/ is working
- encourage people to reply if they have issues not detected by this checklist

Anything else that should be in the checklist?

This can be implemented as a Google form that the responsible user has to fill out daily (similar to how it works with the SUMSTAT inventory for submitting new data).


ofrei commented 4 years ago

The main aim I have in mind would be to discover issues earlier, instead of hitting a wall when you run an analysis. Ideally this should be automated and run every 10 minutes :) But because TSD is a closed system, I don't really know how to automate such a dashboard. At some point we can do it within TSD, and then the duties can be reduced to checking that dashboard.
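As a rough illustration of what an automated check run inside TSD might look like, here is a minimal Python sketch covering two checklist items: whether the project mount point is accessible, and how much free disk space is left. The paths come from the checklist; the 10% warning threshold and all function names are assumptions made for this example, not an agreed design.

```python
import os
import shutil

# Paths taken from the checklist; the 10% threshold is an arbitrary assumption.
MOUNT_POINTS = ["/cluster/projects/p33/"]
DISK_PATHS = ["/cluster", "/durable"]
MIN_FREE_FRACTION = 0.10


def check_accessible(path):
    """Return True if the directory exists and can be listed."""
    try:
        os.listdir(path)
        return True
    except OSError:
        return False


def free_fraction(path):
    """Return the fraction of free space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def build_report(mount_points=MOUNT_POINTS, disk_paths=DISK_PATHS):
    """Collect one human-readable status line per check."""
    lines = []
    for path in mount_points:
        status = "OK" if check_accessible(path) else "NOT ACCESSIBLE"
        lines.append(f"{path}: {status}")
    for path in disk_paths:
        try:
            frac = free_fraction(path)
        except OSError:
            lines.append(f"{path}: cannot read disk usage")
            continue
        flag = "" if frac >= MIN_FREE_FRACTION else " (LOW)"
        lines.append(f"{path}: {frac:.0%} free{flag}")
    return lines


if __name__ == "__main__":
    print("\n".join(build_report()))
```

Something like this could be run from a scheduler every 10 minutes, with the output pasted into the daily report until a proper dashboard exists.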

Re slowness, I think we can do some internal performance checks, e.g. run ping and measure the time to rsync some small files. Slow typing is hard to measure, but let's add a note about reporting "other unusual issues with user experience".
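A minimal sketch of such a timing check, assuming Python is available on the login node: it runs a command and records the wall-clock time. The host name is from the checklist, while the rsync source and destination directories are hypothetical placeholders that would need to be replaced with real paths.

```python
import subprocess
import time


def time_command(cmd, timeout=60):
    """Run cmd and return (elapsed_seconds, return_code).

    A timeout or a missing executable is reported as return code -1.
    """
    start = time.perf_counter()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = result.returncode
    except (subprocess.TimeoutExpired, OSError):
        code = -1
    return time.perf_counter() - start, code


if __name__ == "__main__":
    # Host name from the checklist; both rsync directories are hypothetical.
    checks = {
        "ping": ["ping", "-c", "3", "p33-submit"],
        "rsync": [
            "rsync", "-a",
            "/tmp/healthcheck_small_files/",
            "/cluster/projects/p33/healthcheck_tmp/",
        ],
    }
    for name, cmd in checks.items():
        elapsed, code = time_command(cmd)
        print(f"{name}: {elapsed:.2f}s (exit code {code})")
```

Logging these timings over a few weeks would give a baseline, so "slow" can be defined relative to normal days rather than by feel.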

interCM commented 4 years ago

Good idea. Happy to join.

ofrei commented 3 years ago

Closing for now, but it's a good idea and we can reconsider it later.