labdao / plex

Platform for running comp bio applications on distributed compute and storage infrastructure
https://lab.bio
MIT License

LAB-1450 job force fail #950

Closed. supraja-968 closed this 6 months ago

supraja-968 commented 6 months ago

What type of PR is this?

Description

Max running time is set in the tool manifest (labsay: 20 seconds, colabdesign: 45 minutes) and can be changed later by updating the DB. If no value is set in the tool manifest JSON, the default is 45 minutes. With this change, a job runs until it reaches the time limit, is then force-failed, and is resubmitted exactly once. If the second attempt also fails, the job is not resubmitted again.

This feature keeps the queue from clogging with CPU fallback jobs that end up running for a long time. At the moment we are seeing a CPU fallback rate of about 10%. With force-fail plus a single retry, the chance that a job ends up on CPU fallback drops to about 1% (0.1 × 0.1, assuming the two attempts fall back independently).

New columns added as part of this feature:

- jobs table: retry_count. Tracks whether the job succeeded on the first try (retry_count = 0) or needed a retry (retry_count = 1). If retry_count = 1, the job is not resubmitted again.
- tools table: max_running_time. Defaults to 2700 seconds (45 minutes). When a tool is onboarded without MaxRunningTime set (in seconds), the value is set to 2700. If a tool was onboarded with a different max_running_time, we can update the value later, based on the behaviour we observe over time, without onboarding a new version of the tool.

Steps to Test

1. Onboard a tool with your desired MaxRunningTime, set to a value smaller than the time your tool usually takes to complete. Example: 20 seconds for labsay without speedup.
2. Submit a job and watch the logs stream.
3. After 20 seconds, watch the backend logs show messages related to retrying.
4. Watch the job get updated with the new Bacalhau job ID in the frontend. When it fails this second time, it is not resubmitted again.
5. You should see the job status go queued -> running -> (new job ID) queued -> running -> failed.
6. In the jobs table, you should see retry_count = 1.
7. Another aspect to test: update the tools table with a different max_running_time value.

linear[bot] commented 6 months ago

LAB-1450 force fail CPU fallback jobs

vercel[bot] commented 6 months ago

The latest updates on your projects.

1 Ignored Deployment

| Name | Status | Preview | Comments | Updated (UTC) |
| :--- | :----- | :------ | :------- | :------ |
| **docs** | ⬜️ Ignored ([Inspect](https://vercel.com/convexitylabs/docs/Fvn1T6bZZ3XxD584Atdj7SVf5Njj)) | [Visit Preview](https://docs-git-lab-1450-job-force-fail-convexitylabs.vercel.app) | | Apr 18, 2024 2:31pm |