labdao / plex

Platform for running comp bio applications on distributed compute and storage infrastructure
https://lab.bio
MIT License

Handle 404 job not found correctly in GetRayJobStatus #1009

Closed. supraja-968 closed this 1 month ago

supraja-968 commented 1 month ago

What type of PR is this?

Description

The GetRayJobStatus() function, which runs as part of the MonitorRunningJobs goroutine, fetches all jobs with running status from the database and uses each job's Ray job ID (submission ID) to fetch its status. However, after we deploy a new version of the base Ray job on testeks, the previous job state is lost, and the status request returns a 404 job not found to the gateway. This was causing a CrashLoopBackOff on the backend container. To fix this, GetRayJobStatus() now handles the 404 explicitly: it marks the job as failed and moves on instead of crashing.


After deploying to test: [image]

vercel[bot] commented 1 month ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

| Name | Status | Preview | Comments | Updated (UTC) |
| :--- | :----- | :------ | :------- | :------ |
| **docs** | ⬜️ Ignored ([Inspect](https://vercel.com/convexitylabs/docs/8dHm28cwuHuv5PKwPyyjRYVMADTf)) | | | Jul 29, 2024 9:13am |

supraja-968 commented 1 month ago

Marked this PR as a draft to see whether a different infrastructure approach fixes this issue more robustly.