theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0
34 stars 16 forks source link

expose optional input parameter disk_size for kraken2 standalone wfs #316

Closed kapsakcj closed 5 months ago

kapsakcj commented 6 months ago

This PR closes #315

πŸ—‘οΈ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

[copied from GH issue]

disk_size is not exposed as an optional input parameter in the kraken2_PE_PHB and kraken2_SE_PHB workflows. A user ran into disk space issues/errors using this workflow with large databases and has no way of requesting a VM with a larger disk.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

This code change affects both standalone kraken2 workflows:

:clipboard: Workflow/Task Step Changes

πŸ”„ Data Processing

Nothing has changed about the way the workflow runs

Docker/software or software versions changed: No

Databases or database versions changed: No

Data processing/commands changed: No

File processing changed: No

Compute resources changed: Yes, user can now alter disk_size from the default of 100 GB

➑️ Inputs

None

⬅️ Outputs

None

:test_tube: Testing

Test Dataset

tested on one ILMN PE bacterial sample (see link to test Terra workflow below), and requested 500 as the disk_size for the task, and Terra/cromwell did use a VM with a disk size of 500 GB πŸ‘

Commandline Testing with MiniWDL or Cromwell (optional)

Unnecessary, only test in Terra.

Terra Testing

successful workflow here: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/dccf8cf5-ab64-4df6-bad8-9cbdd9fd718f

proof that the VM used a 500GB disk: image

Suggested Scenarios for Reviewer to Test

Would be good to test out with large kraken2 databases. My test uses the "kraken2 standard" database which is 38GB compressed and estimated 70-ish GB when uncompressed.

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

πŸ—‚οΈ Associated Documentation (to be completed by Theiagen developer)

kapsakcj commented 6 months ago

Reminder: tag Ines when ready for review

kapsakcj commented 5 months ago

I ran one final test of kraken2_SE_PHB workflow after making the last commit to update the CI: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/46106d91-fe26-4a13-8456-ddb27ca6743a

It allocated a VM with 500GB as I requested, so it ran as expected βœ…

@cimendes Would be great if you could test TheiaMeta_Illumina_PE (when you have time of course, this is lower priority)

kapsakcj commented 5 months ago

Just realized that this change also impacts the readQC subworkflows as well...

workflows/utilities/wf_read_QC_trim_pe.wdl
workflows/utilities/wf_read_QC_trim_se.wdl:

both of these subworkflows call the kraken2_standalone task located in tasks/taxon_id/task_kraken2.wdl

so I'll run at least one theiaprok_illumina_PE_PHB workflow with the call_kraken boolean enabled

kapsakcj commented 5 months ago

and just realized that I have to add this option to the read_QC subworkflows as well....

commit incoming. Maybe hold off on the review for a moment

kapsakcj commented 5 months ago

OK here's a theiaprok_illumina_pe_PHB test where I set the new optional inputs (for read_QC_pe subworkflow) kraken_disk_size to 500 and kraken_memory to 64: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/workflows/theiagen-validations/TheiaProk_Illumina_PE_PHB

EDIT: ran as expected and allocated a VM with 500GB disk and 64GB memory βœ…

kapsakcj commented 5 months ago

and here's a test with TheiaProk_Illumina_SE_PHB with kraken_disk_size set to 250 and kraken_memory set to 64: https://app.terra.bio/#workspaces/theiagen-validations/curtis-sandbox-theiagen-validations/job_history/3ed3bae5-ba78-47d1-b75b-1c99131fdeda

EDIT: ran as expected and allocated a VM with 250GB disk and 64GB memory βœ…

cimendes commented 5 months ago

Tests already performed:

Workflows to be tested:

cimendes commented 5 months ago

TheiaMeta_Illumina_PE:

New input variables are present as expected: image

βœ… Run test successful: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/24bbace4-3074-4c66-883c-aa2fba93bfaf

cimendes commented 5 months ago

Kraken2_PE:

New input parameter present as expected: image

βœ… Successful run: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/785c725e-8d55-4327-9bc8-a324b5ec2ad5

cimendes commented 5 months ago

TheiaProk_Illumina_PE:

Inputs present as expected (pity WDL is depressing accessing these optional inputs directly through the Kraken2 task.. workflows in workflows 🀷🏻) image

βœ… Successful run: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/3968fa05-d2f6-4693-b7df-4833a095ff79

cimendes commented 5 months ago

Last but not least, documentation!!

cimendes commented 5 months ago

All is looking good! Great job! I'm approving the PR!