labdao / plex

Platform for running comp bio applications on distributed compute and storage infrastructure
https://lab.bio
MIT License
55 stars, 14 forks

Small fixes + UI jobs refresh + spam submission failure #1013

Closed: supraja-968 closed this 2 months ago

supraja-968 commented 2 months ago

What type of PR is this?

Description

This PR makes the following changes:

  1. Bug: a user could exceed the tier threshold with a combinatorial submission. Example: the user's compute tally is 450 so far, the tier threshold is 500, and each colabdesign job costs 50 credits, so the user should not be able to submit more than 1 job in this tier without being prompted to subscribe. Because we updated the compute tally only after job creation, the user could still submit more than 1 job combinatorially. With 3 jobs submitted combinatorially, all 3 were submitted and the DB was updated with compute tally = 600, tier = 1. That tier = 1 only triggered the subscription prompt the next time the user submitted 1 or more jobs combinatorially. Fix: calculate the projected compute tally before job creation (the DB update itself still happens after job creation) and, if the submission would exceed the threshold, redirect to the subscribe page without submitting any of these jobs.

  2. Bug: jobs weren't updating live in the UI; the user had to refresh the page every time to see the current state of the experiment. Fix: a polling mechanism scoped to the jobs accordion, so the jobs refresh with their current state without the whole page refreshing. (Note to dev: the dot next to the experiment name still lags behind and needs a refresh to catch up. This can be addressed in a follow-up PR.)

  3. Bug: API keys were being created, but they disappeared after a refresh, so creation worked and the fetch did not. Fix: the fetch was filtering on wallet_address, whereas the column is named user_id (it still holds the wallet address). This was missed in the big DB migration. As a temporary fix, the fetch now looks up user_id, instead of migrating the column and renaming it to wallet_address.

  4. Bug: with a combinatorial submission or a spam of resubmissions, some jobs were failing with 'unexpected Ray state running'. Fix: this was caused by logic carried over from Ray services when we migrated to Ray jobs. The gateway was setting a job to pending and then to running BEFORE submitting the job to Ray's internal queue. The fix removes these state updates before submission. The status lifecycle is now: queued -> processing -> submit to Ray -> set to pending -> start monitoring -> set to running/stopped/failed/succeeded based on the response. With this fix, we monitor jobs that are in the running state as well as the pending state. Note: 'pending' is Ray's internal convention for pending jobs, so in a previous PR we introduced another status, 'processing', to differentiate jobs waiting on the gateway side to be submitted from jobs in the internal Ray queue waiting to be picked up by a worker.

  5. Bug: PDB files were only being used to display checkpoints, and there was no way to download them. Fix: the addFilesToDB function handled only non-PDB files, because PDB files are categorized separately in the RayJobResponse struct. PDB files are now added to the DB separately, after the rest of the files are added.
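
For reviewers, the pre-submission check from item 1 could look roughly like the sketch below. It is illustrative only and written in Python rather than the gateway's own language; `User`, `can_submit`, and `submit_batch` are hypothetical names, and the assumption that a batch landing exactly on the threshold is still allowed is inferred from the 450/500/50 example.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical stand-in for the user record; only the tally is modeled.
    compute_tally: int


def can_submit(user: User, n_jobs: int, cost_per_job: int, tier_threshold: int) -> bool:
    """Check the projected tally for the whole batch BEFORE any job is created."""
    projected = user.compute_tally + n_jobs * cost_per_job
    # Assumption: reaching the threshold exactly is allowed, exceeding it is not.
    return projected <= tier_threshold


def submit_batch(user: User, n_jobs: int, cost_per_job: int = 50, tier_threshold: int = 500) -> str:
    if not can_submit(user, n_jobs, cost_per_job, tier_threshold):
        # Nothing is submitted and the tally is left untouched.
        return "redirect_to_subscribe"
    # ... job creation would happen here ...
    # The stored tally is still updated only after job creation.
    user.compute_tally += n_jobs * cost_per_job
    return "submitted"
```

With a tally of 450, a batch of 3 jobs is rejected as a whole, while a single job goes through and brings the tally to 500.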
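
The status lifecycle from item 4 can be summarized as a small state machine. This is a minimal sketch under assumptions, not the gateway code: `submit_job` and `apply_ray_status` are hypothetical helpers, the job is modeled as a plain dict, and Ray's reported status is assumed to arrive as a string such as "RUNNING".

```python
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"  # waiting on the gateway side to be submitted
    PENDING = "pending"        # in Ray's internal queue
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


# Jobs in either of these states are monitored.
MONITORED = {JobState.PENDING, JobState.RUNNING}


def submit_job(job: dict, submit_to_ray) -> None:
    """Mark the job as processing, submit it, and only then set it to pending."""
    job["state"] = JobState.PROCESSING
    submit_to_ray(job)  # hypothetical callable standing in for the Ray submission
    job["state"] = JobState.PENDING


def apply_ray_status(job: dict, ray_status: str) -> None:
    """Update a monitored job from the status Ray reports."""
    if job["state"] not in MONITORED:
        raise ValueError(f"job is not being monitored: {job['state'].value}")
    job["state"] = JobState(ray_status.lower())
```

The key point is ordering: the gateway never writes pending or running before the submission call returns, so Ray reporting "running" on the first poll is a normal transition instead of an unexpected state.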
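
The shape of the item 5 fix, sketched in Python for illustration. The real `RayJobResponse` struct and `addFilesToDB` function live in the gateway; the field names `files` and `pdb_files` and the list used as a "DB" here are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RayJobResponse:
    # Hypothetical stand-in: PDB files are kept apart from all other files.
    files: list = field(default_factory=list)
    pdb_files: list = field(default_factory=list)


def add_files_to_db(response: RayJobResponse, db: list) -> int:
    """Store the regular files first, then the PDB files; return the count stored."""
    before = len(db)
    for f in response.files:
        db.append(f)
    # Previously this second pass was missing, so PDB files were never stored.
    for f in response.pdb_files:
        db.append(f)
    return len(db) - before
```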

vercel[bot] commented 2 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment

| Name | Status | Preview | Comments | Updated (UTC) |
| :--- | :----- | :------ | :------- | :------ |
| **docs** | ⬜️ Ignored ([Inspect](https://vercel.com/convexitylabs/docs/ErFs81NV7iSkfYAguyptsqQAXo2i)) | [Visit Preview](https://docs-git-supportdemojob-convexitylabs.vercel.app) | | Aug 7, 2024 2:14pm |

supraja-968 commented 2 months ago

This PR has already been included in the plex migration PR to convexity. The changes are in main and deployed to test and prod. Closing this PR.