theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
37 stars 17 forks source link

bug fix to BaseSpace_Fetch_PHB #385

Closed kapsakcj closed 6 months ago

kapsakcj commented 6 months ago

Updated basespace_fetch WDL task to use the correct bash variable $project_id to enable server-side (in other words basespace-side) filtering, previously the code was set to use a different bash variable $run_id, which was incorrect.

This section of the code is only triggered when $run_id was explicitly NOT set, meaning that the variable was empty, meaning the --project-id flag was not used at all in the one-liner on line 53 of the WDL task.

When that flag is not used, there is no filtering done on the response, which means very slow responses from basespace for Basespace accounts that have LOTS (>1000s) of datasets in basespace.

So when the bs list dataset --project-id=${run_id} command was run, it essentially equated to (see the blank?):

bs list dataset --project-id= | <rest of one-liner>

Now, with this PR, the correct variable is used, and the $project_id bash variable is actually used in the one liner, and this truly speeds things up when basespace accounts have lots of datasets (and the bs commandline tool doesn't need to list out every single dataset on the account)

For one lab in particular, this small change reduced runtime of the Basespace_fetch workflow from >1hr (and often encountered timeout errors from basespace, meaning they had to re-run the workflow many times) down to <15 minutes.

TL;DR this change speeds up the querying of basespace datasets. tested successfully with miniwdl (and greatly reduced runtime). Also tested in Terra

This PR closes #.

πŸ—‘οΈ This dev branch should NOT be deleted after merging to main.

:brain: Aim, Context and Functionality

Speed up basespace_fetch workflow and reduce timeout errors when a basespace account has a large number of "datasets" associated with the account.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, it might run faster than before

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

πŸ”„ Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes, ever so slightly

File processing changed: No

Compute resources changed: No

➑️ Inputs

N/A

⬅️ Outputs

N/A

:test_tube: Testing

Test Dataset

4 samples from a lab that has a basespace account with many "datasets" (in the 1000s or higher) accessible through the account. They reported that the basespace_fetch workflow was taking over an hour to run and it often failed. With little success even after rerunning the workflows.

Commandline Testing with MiniWDL or Cromwell (optional)

Tested successfully with miniwdl [screenshot not shown, just trust me]

Terra Testing

❌ Failures OR long runtime using v1.3.0 of basespace_fetch workflow: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/3c005332-5603-481c-bdf8-66e0f5118a34

βœ… Successful workflow using this dev branch: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/4845ad5a-5489-44d1-928e-415e4f4bfaff

Suggested Scenarios for Reviewer to Test

This change will impact users of the workflow that use basespace "projects" for downloading data. This lab does not use basespace Runs (because the sequencing core only shares the basespace "Project" with them instead of the basespace "Run").

It will not impact the majority of users of this workflow, but should greatly benefit the small proportion of users with this kind of basespace configuration.

I would recommend testing on sequencing runs where there is no associated "run ID", but rather a "basespace_collection_id" that corresponds with a Basespace Project (and not a Basespace Run).

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

πŸ—‚οΈ Associated Documentation (to be completed by Theiagen developer)

frankambrosio3 commented 6 months ago

Tested on alternative dataset: https://app.terra.bio/#workspaces/theiagen-validations/ambrosio_validation_sandbox/job_history/435fff9a-24ba-4e5e-adf7-ec9f81320a41