labdao / plex

Platform for running comp bio applications on distributed compute and storage infrastructure
https://lab.bio
MIT License
54 stars 14 forks

Queue (multiple redundant job submission) + UI file upload false WHERE error fixes #1002

Closed supraja-968 closed 1 month ago

supraja-968 commented 1 month ago

What type of PR is this?

Description

  1. A minor change in files.go to fix the spurious "WHERE condition is required" error. The handler was trying to upload tags for existing files without a WHERE condition. I moved the tag upload inside the block that runs only for newly uploaded files; for an existing file, the handler now only adds a user_file record.
  2. Fix for multiple redundant job submissions when only one job is submitted. I changed the queue-initiated processRayJob calls to worker-initiated calls. I also fixed the endless spawning of workers by adding a once check, so that only MAX_WORKERS workers are started. With this fix, a worker calls processRayJob whenever it is free. The method fetchOldestQueuedJob now fetches the oldest queued job and updates it to the "Pending" state. This introduces a new state, "Pending", into the list of possible job statuses.
  3. Similar to /queue-summary, there is now a /worker-summary endpoint that shows the job handled by each worker. It was introduced to double-check that the same job is not processed by multiple workers. (This should already be fixed by the fetch-and-mark-as-pending change.)
  4. Stopping dangling jobs after an hour. With Ray Services, there is an issue during the initial boot-up where more jobs than the MAX_WORKERS value get picked up, and the excess jobs are left dangling as the queue proceeds to the next set of jobs. This is fixed by an intervention in which the queue fetches the oldest running job and marks it as "stopped", which is distinct from "failed". This introduces a new state, "stopped", into the list of possible job statuses. It is important that we can differentiate between jobs that failed in Ray and jobs that were stopped by manual intervention for our later analysis.
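The worker-initiated flow from item 2 can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR: the `Queue`, `Pool`, `claimOldestQueued`, and `runDemo` names are hypothetical, a mutex stands in for whatever database update the real fetchOldestQueuedJob performs, and the workers exit when the queue is empty instead of polling.

```go
package main

import (
	"fmt"
	"sync"
)

// Job and its status strings are illustrative stand-ins for the real job model.
type Job struct {
	ID     int
	Status string // "queued" or "pending" in this sketch
}

// Queue is an in-memory substitute for the jobs table, ordered oldest first.
type Queue struct {
	mu   sync.Mutex
	jobs []*Job
}

// claimOldestQueued finds the oldest queued job and marks it pending while
// holding the lock, so two workers can never claim the same job.
func (q *Queue) claimOldestQueued() (*Job, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for _, j := range q.jobs {
		if j.Status == "queued" {
			j.Status = "pending"
			return j, true
		}
	}
	return nil, false
}

// Pool starts a fixed number of workers exactly once.
type Pool struct {
	once    sync.Once
	wg      sync.WaitGroup
	mu      sync.Mutex
	started int
	handled map[int]int // job ID -> number of times it was processed
}

// Start spins up maxWorkers workers; repeated calls are no-ops because of once.
func (p *Pool) Start(q *Queue, maxWorkers int) {
	p.once.Do(func() {
		p.handled = make(map[int]int)
		for w := 0; w < maxWorkers; w++ {
			p.started++
			p.wg.Add(1)
			go p.run(q)
		}
	})
}

// run is the worker loop: a free worker pulls the next job itself.
func (p *Pool) run(q *Queue) {
	defer p.wg.Done()
	for {
		job, ok := q.claimOldestQueued()
		if !ok {
			return // a long-lived worker would sleep and poll again instead
		}
		p.mu.Lock()
		p.handled[job.ID]++ // placeholder for the real job processing
		p.mu.Unlock()
	}
}

// runDemo queues nJobs jobs, calls Start startCalls times, and waits for the workers.
func runDemo(nJobs, maxWorkers, startCalls int) (int, map[int]int) {
	q := &Queue{}
	for i := 1; i <= nJobs; i++ {
		q.jobs = append(q.jobs, &Job{ID: i, Status: "queued"})
	}
	p := &Pool{}
	for i := 0; i < startCalls; i++ {
		p.Start(q, maxWorkers)
	}
	p.wg.Wait()
	return p.started, p.handled
}

func main() {
	started, handled := runDemo(10, 3, 5)
	fmt.Println("workers started:", started)
	fmt.Println("distinct jobs handled:", len(handled))
}
```

The point of the design is that claiming and marking happen as one guarded step, so a job leaves the "queued" state before any second worker can see it. In a database-backed queue the same effect would need a transaction or row lock, which this sketch does not show.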

Note: The change above will be incorporated into the upcoming Stripe PR so that users are not charged for manually stopped dangling jobs.

image

Steps to Test

A successful test should spin up only MAX_WORKERS containers per service, and the remaining jobs should stay in the queued state. There shouldn't be endless scaling up (up to the maximum scale-up capacity) with all submitted jobs shown as running. Dangling jobs should be marked as stopped after an hour; this can be tested by setting 1*time.Hour to a shorter duration in the fetchAndMarkOldestRunningJobAsStopped function. Example of stopped dangling jobs:

image
vercel[bot] commented 1 month ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

| Name | Status | Preview | Comments | Updated (UTC) |
| :--- | :----- | :------ | :------- | :------ |
| **docs** | ⬜️ Ignored ([Inspect](https://vercel.com/convexitylabs/docs/HtjtJ2KXUn5DKataL33o4ckfgAyp)) | [Visit Preview](https://docs-git-file-upload-in-exp-patch-convexitylabs.vercel.app) | | Jul 19, 2024 3:43pm |