bug fix to BaseSpace_Fetch_PHB

Updated basespace_fetch WDL task to use the correct bash variable $project_id to enable server-side (in other words basespace-side) filtering, previously the code was set to use a different bash variable $run_id, which was incorrect.

This section of the code is only triggered when $run_id was explicitly NOT set, meaning that the variable was empty, meaning the --project-id flag was not used at all in the one-liner on line 53 of the WDL task.

When that flag is not used, there is no filtering done on the response, which means very slow responses from basespace for Basespace accounts that have LOTS (>1000s) of datasets in basespace.

So when the bs list dataset --project-id=${run_id} command was run, it essentially equated to (see the blank?):

bs list dataset --project-id= | <rest of one-liner>

Now, with this PR, the correct variable is used, and the $project_id bash variable is actually used in the one liner, and this truly speeds things up when basespace accounts have lots of datasets (and the bs commandline tool doesn't need to list out every single dataset on the account)

For one lab in particular, this small change reduced runtime of the Basespace_fetch workflow from >1hr (and often encountered timeout errors from basespace, meaning they had to re-run the workflow many times) down to <15 minutes.

TL;DR this change speeds up the querying of basespace datasets. tested successfully with miniwdl (and greatly reduced runtime). Also tested in Terra

This PR closes #.

🗑️ This dev branch should NOT be deleted after merging to main.

:brain: Aim, Context and Functionality

Speed up basespace_fetch workflow and reduce timeout errors when a basespace account has a large number of "datasets" associated with the account.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, it might run faster than before

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: Yes, ever so slightly

File processing changed: No

Compute resources changed: No

➡️ Inputs

N/A

⬅️ Outputs

N/A

:test_tube: Testing

Test Dataset

4 samples from a lab that has a basespace account with many "datasets" (in the 1000s or higher) accessible through the account. They reported that the basespace_fetch workflow was taking over an hour to run and it often failed. With little success even after rerunning the workflows.

Commandline Testing with MiniWDL or Cromwell (optional)

Tested successfully with miniwdl [screenshot not shown, just trust me]

Terra Testing

❌ ~~Failures OR~~ long runtime using v1.3.0 of basespace_fetch workflow: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/3c005332-5603-481c-bdf8-66e0f5118a34

Runtime (workflows submission to all finishing): No failures, but it did take ~1hr for 4 samples to finish successfully. When this lab was having issues, the runtime was longer (~2hrs) and failures were frequent, likely because there were more samples (roughly 80 samples) being downloaded in parallel and thus way more pings to basespace API.

✅ Successful workflow using this dev branch: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/4845ad5a-5489-44d1-928e-415e4f4bfaff

Runtime (workflows submission to all finishing): 6 minutes

Suggested Scenarios for Reviewer to Test

This change will impact users of the workflow that use basespace "projects" for downloading data. This lab does not use basespace Runs (because the sequencing core only shares the basespace "Project" with them instead of the basespace "Run").

It will not impact the majority of users of this workflow, but should greatly benefit the small proportion of users with this kind of basespace configuration.

I would recommend testing on sequencing runs where there is no associated "run ID", but rather a "basespace_collection_id" that corresponds with a Basespace Project (and not a Basespace Run).

Theiagen Version Release Testing (optional)

Will changes require functional or validation testing (checking outputs etc) during the release? No
Do new samples need to be added to validation datasets? If so, upload these to the appropriate validation workspace Google bucket (). Please describe the new samples here and why these have been chosen. No -Are there any output files that should be checked after running the version release testing? No

:microscope: Final Developer Checklist

[x] The workflow/task has been tested locally and results, including file contents, are as anticipated
[x] The workflow/task has been tested on Terra and results, including file contents, are as anticipated
[x] The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
[x] Code changes follow the style guide

🎯 Reviewer Checklist

[x] All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
[x] All reviewer-suggested scenarios have been tested and any additional
[x] All changed results have been confirmed to be accurate
[x] All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
[x] All code adheres to the style guide
[ ] MD5 sums have been updated
[x] The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

[x] Relevant documentation on the Public Health Resources "PHB Main" has been updated
[x] Workflow diagrams have been updated to reflect changes

theiagen / public_health_bioinformatics