Updated basespace_fetch WDL task to use the correct bash variable $project_id to enable server-side (in other words basespace-side) filtering, previously the code was set to use a different bash variable $run_id, which was incorrect.
This section of the code is only triggered when $run_id was explicitly NOT set, meaning that the variable was empty, meaning the --project-id flag was not used at all in the one-liner on line 53 of the WDL task.
When that flag is not used, there is no filtering done on the response, which means very slow responses from basespace for Basespace accounts that have LOTS (>1000s) of datasets in basespace.
So when the bs list dataset --project-id=${run_id} command was run, it essentially equated to (see the blank?):
bs list dataset --project-id= | <rest of one-liner>
Now, with this PR, the correct variable is used, and the $project_id bash variable is actually used in the one liner, and this truly speeds things up when basespace accounts have lots of datasets (and the bs commandline tool doesn't need to list out every single dataset on the account)
For one lab in particular, this small change reduced runtime of the Basespace_fetch workflow from >1hr (and often encountered timeout errors from basespace, meaning they had to re-run the workflow many times) down to <15 minutes.
TL;DR this change speeds up the querying of basespace datasets. tested successfully with miniwdl (and greatly reduced runtime). Also tested in Terra
This PR closes #.
ποΈ This dev branch should NOT be deleted after merging to main.
:brain: Aim, Context and Functionality
Speed up basespace_fetch workflow and reduce timeout errors when a basespace account has a large number of "datasets" associated with the account.
:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users donβt change any workflow inputs relative to the last version : Yes, it might run faster than before
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No
:clipboard: Workflow/Task Step Changes
π Data Processing
Docker/software or software versions changed: No
Databases or database versions changed: No
Data processing/commands changed: Yes, ever so slightly
File processing changed: No
Compute resources changed: No
β‘οΈ Inputs
N/A
β¬ οΈ Outputs
N/A
:test_tube: Testing
Test Dataset
4 samples from a lab that has a basespace account with many "datasets" (in the 1000s or higher) accessible through the account. They reported that the basespace_fetch workflow was taking over an hour to run and it often failed. With little success even after rerunning the workflows.
Commandline Testing with MiniWDL or Cromwell (optional)
Tested successfully with miniwdl [screenshot not shown, just trust me]
Runtime (workflows submission to all finishing): No failures, but it did take ~1hr for 4 samples to finish successfully. When this lab was having issues, the runtime was longer (~2hrs) and failures were frequent, likely because there were more samples (roughly 80 samples) being downloaded in parallel and thus way more pings to basespace API.
Runtime (workflows submission to all finishing): 6 minutes
Suggested Scenarios for Reviewer to Test
This change will impact users of the workflow that use basespace "projects" for downloading data. This lab does not use basespace Runs (because the sequencing core only shares the basespace "Project" with them instead of the basespace "Run").
It will not impact the majority of users of this workflow, but should greatly benefit the small proportion of users with this kind of basespace configuration.
I would recommend testing on sequencing runs where there is no associated "run ID", but rather a "basespace_collection_id" that corresponds with a Basespace Project (and not a Basespace Run).
Theiagen Version Release Testing (optional)
Will changes require functional or validation testing (checking outputs etc) during the release? No
Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. No
-Are there any output files that should be checked after running the version release testing? No
:microscope: Final Developer Checklist
[x] The workflow/task has been tested locally and results, including file contents, are as anticipated
[x] The workflow/task has been tested on Terra and results, including file contents, are as anticipated
[x] The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
Updated basespace_fetch WDL task to use the correct bash variable
$project_id
to enable server-side (in other words basespace-side) filtering, previously the code was set to use a different bash variable$run_id
, which was incorrect.This section of the code is only triggered when
$run_id
was explicitly NOT set, meaning that the variable was empty, meaning the--project-id
flag was not used at all in the one-liner on line 53 of the WDL task.When that flag is not used, there is no filtering done on the response, which means very slow responses from basespace for Basespace accounts that have LOTS (>1000s) of datasets in basespace.
So when the
bs list dataset --project-id=${run_id}
command was run, it essentially equated to (see the blank?):Now, with this PR, the correct variable is used, and the
$project_id
bash variable is actually used in the one liner, and this truly speeds things up when basespace accounts have lots of datasets (and thebs
commandline tool doesn't need to list out every single dataset on the account)For one lab in particular, this small change reduced runtime of the Basespace_fetch workflow from >1hr (and often encountered timeout errors from basespace, meaning they had to re-run the workflow many times) down to <15 minutes.
TL;DR this change speeds up the querying of basespace datasets. tested successfully with miniwdl (and greatly reduced runtime). Also tested in Terra
This PR closes #.
ποΈ This dev branch should NOT be deleted after merging to main.
:brain: Aim, Context and Functionality
Speed up basespace_fetch workflow and reduce timeout errors when a basespace account has a large number of "datasets" associated with the account.
:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made
This will affect the behavior of the workflow(s) even if users donβt change any workflow inputs relative to the last version : Yes, it might run faster than before
Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No
:clipboard: Workflow/Task Step Changes
π Data Processing
Docker/software or software versions changed: No
Databases or database versions changed: No
Data processing/commands changed: Yes, ever so slightly
File processing changed: No
Compute resources changed: No
β‘οΈ Inputs
N/A
β¬ οΈ Outputs
N/A
:test_tube: Testing
Test Dataset
4 samples from a lab that has a basespace account with many "datasets" (in the 1000s or higher) accessible through the account. They reported that the basespace_fetch workflow was taking over an hour to run and it often failed. With little success even after rerunning the workflows.
Commandline Testing with MiniWDL or Cromwell (optional)
Tested successfully with miniwdl [screenshot not shown, just trust me]
Terra Testing
β
Failures ORlong runtime using v1.3.0 of basespace_fetch workflow: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/3c005332-5603-481c-bdf8-66e0f5118a34β Successful workflow using this dev branch: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/4845ad5a-5489-44d1-928e-415e4f4bfaff
Suggested Scenarios for Reviewer to Test
This change will impact users of the workflow that use basespace "projects" for downloading data. This lab does not use basespace Runs (because the sequencing core only shares the basespace "Project" with them instead of the basespace "Run").
It will not impact the majority of users of this workflow, but should greatly benefit the small proportion of users with this kind of basespace configuration.
I would recommend testing on sequencing runs where there is no associated "run ID", but rather a "basespace_collection_id" that corresponds with a Basespace Project (and not a Basespace Run).
Theiagen Version Release Testing (optional)
:microscope: Final Developer Checklist
π― Reviewer Checklist
ποΈ Associated Documentation (to be completed by Theiagen developer)