add timeout for jobs - GitHub Issues

lkeegan commented 1 month ago

Currently, once a runner claims a job it remains "running" until that runner uploads a result or failure.

So if the runner is killed or crashes without uploading anything, the sample will never be processed.

The server should go through all "running" jobs and reset to "queued" any that have been running for longer than some timeout (e.g. 2 hours), so that another runner can take the job.
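
A minimal sketch of that reset pass, assuming each job records when it was claimed. The names `Job`, `started_at`, `reset_stale_jobs` and `JOB_TIMEOUT` are hypothetical and not taken from the predicTCR code; a real version would query the database rather than loop over a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Hypothetical default, matching the "e.g. 2 hours" suggested above.
JOB_TIMEOUT = timedelta(hours=2)


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for illustration.
    sample_id: int
    status: str  # "queued", "running", "completed" or "failed"
    started_at: Optional[datetime] = None


def reset_stale_jobs(
    jobs: List[Job], now: datetime, timeout: timedelta = JOB_TIMEOUT
) -> int:
    """Re-queue every running job older than `timeout`; return how many were reset."""
    n_reset = 0
    for job in jobs:
        if (
            job.status == "running"
            and job.started_at is not None
            and now - job.started_at > timeout
        ):
            job.status = "queued"  # another runner can now claim it
            job.started_at = None
            n_reset += 1
    return n_reset


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job(1, "running", now - timedelta(hours=3)),     # stale: gets re-queued
    Job(2, "running", now - timedelta(minutes=10)),  # still within the timeout
    Job(3, "completed"),
]
n_reset = reset_stale_jobs(jobs, now)
```

The server could call something like this periodically, or each time a runner asks for work. If the original runner later uploads a result for a job that was re-queued, the server would also need a rule for whether to accept it.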

EdGreen21 commented 1 month ago

Agreed: there's currently one stuck job that's blocking the queue. I'm not sure how many runners we can use; I guess that depends on which backend service we're using for compute (EC2, Azure, Paperspace, etc.).

EdGreen21 commented 1 month ago

Add a 'kill' button to the admin panel (as well as a 're-run')?

lkeegan commented 1 month ago

The runners are not controlled by the web service: it just provides a list of samples awaiting analysis, then any runner can claim a sample, do the analysis, and upload the results.

This means there's not really a limit on the number of runners that can be used: you're free to run as many as you want, wherever you want (so you could have some on EC2, some on a local server, one on your laptop, etc.), and they are completely independent of whatever server is running the web service. This should make it fairly easy to scale up and down depending on demand.
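
To illustrate the pull-based design described here, this is a rough in-memory sketch of the claim-and-upload cycle. `SampleQueue`, `claim` and `upload_result` are hypothetical names, and the lock stands in for whatever the real web service uses to keep two runners from claiming the same sample.

```python
import threading
from typing import Dict, Iterable, Optional


class SampleQueue:
    """Hypothetical stand-in for the web service's list of samples awaiting analysis."""

    def __init__(self, sample_ids: Iterable[int]) -> None:
        self._lock = threading.Lock()
        self._status: Dict[int, str] = {sid: "queued" for sid in sample_ids}

    def claim(self) -> Optional[int]:
        """Give the first queued sample to the calling runner, or None if there is none."""
        with self._lock:  # makes check-and-mark atomic so no sample is claimed twice
            for sid, status in self._status.items():
                if status == "queued":
                    self._status[sid] = "running"
                    return sid
        return None

    def upload_result(self, sample_id: int, success: bool) -> None:
        """Record the outcome reported by a runner."""
        with self._lock:
            self._status[sample_id] = "completed" if success else "failed"

    def status(self, sample_id: int) -> str:
        with self._lock:
            return self._status[sample_id]


queue = SampleQueue([101, 102])
first = queue.claim()    # e.g. a runner on EC2
second = queue.claim()   # e.g. a runner on a laptop
third = queue.claim()    # nothing left to claim
queue.upload_result(first, success=True)
```

The service never starts or stops runners in this model, which is why a crashed runner simply leaves its sample marked "running" until something resets it.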

ssciwr / predicTCR

add timeout for jobs #32