theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

update BUSCO to v5.7.1 and small tweaks to WDL task #401

Closed kapsakcj closed 2 months ago

kapsakcj commented 3 months ago

This PR closes #345

πŸ—‘οΈ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

Update BUSCO to the latest available version

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don't change any workflow inputs relative to the last version: Yes. The BUSCO task auto-downloads its database at runtime, and the database is updated periodically (not sure how often, but the last update for the enterobacterales database was 2024-01-08).

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing: Yes

Impacted workflows:

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: upgraded to a Theiagen-hosted copy of the Docker image published by ezlabgva (the BUSCO authors): us-docker.pkg.dev/general-theiagen/ezlabgva/busco:v5.7.1_cv1

Databases or database versions changed: the database is downloaded at runtime, so it can change without warning

Data processing/commands changed: added the --cpu option to the main busco command

File processing changed: adjustments to parsing of output files; see code for details

Compute resources changed: none
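
For reviewers who want to see what the --cpu change amounts to, here is a minimal Python sketch of assembling such a command. It is not the task's actual command block: the helper name `build_busco_command`, the file names, and the choice of `--auto-lineage-prok` as the default are assumptions for illustration. Only the `--cpu` addition comes from this PR.

```python
import shlex


def build_busco_command(assembly, out_name, cpu, lineage=None):
    """Assemble a BUSCO command line as an argument list.

    Hypothetical helper for illustration; the real WDL task may pass
    different options. Only the --cpu flag is taken from this PR.
    """
    if cpu < 1:
        raise ValueError("cpu must be a positive integer")
    cmd = [
        "busco",
        "-i", assembly,      # input assembly (FASTA)
        "-o", out_name,      # name of the output run
        "-m", "genome",      # assessment mode
        "--cpu", str(cpu),   # number of threads, the option added here
    ]
    if lineage is None:
        # Assumed default: let BUSCO pick a prokaryotic lineage itself.
        cmd.append("--auto-lineage-prok")
    else:
        cmd += ["-l", lineage]
    return cmd


command = build_busco_command("sample.fasta", "sample_busco", cpu=2)
print(shlex.join(command))
```

Passing the thread count explicitly keeps BUSCO's parallelism in line with the CPUs requested in the task's runtime block, rather than relying on the tool's default.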

➡️ Inputs

⬅️ Outputs

Added String busco_docker output to WDL task

TODO:

:test_tube: Testing

Test Dataset

Will update later, but will likely test across a diverse set of bacterial species and at least one eukaryotic pathogen (Candida auris?)

Commandline Testing with MiniWDL or Cromwell (optional)

Tested the WDL task changes locally:

2024-04-04 17:12:04.718 wdl.t:busco done
2024-04-04 17:12:04.719 miniwdl-run.CallCache call cache insert :: cache_file: "/home/curtis_kapsak/.cache/miniwdl/busco/x3vxz4e57qtwr25ajw5hutw77eakiukd/vbsofqftg2typeltmjrgqa5cvw26m27k.json"
{
  "outputs": {
    "busco.busco_report": "/home/curtis_kapsak/github/public_health_bioinformatics/20240404_170838_busco/out/busco_report/03-98DDCS_busco-summary.txt",
    "busco.busco_results": "C:99.8%[S:99.3%,D:0.5%],F:0.2%,M:0.0%,n:440",
    "busco.busco_database": "enterobacterales_odb10 (2024-01-08)",
    "busco.busco_docker": "us-docker.pkg.dev/general-theiagen/ezlabgva/busco:v5.7.1_cv1",
    "busco.busco_version": "BUSCO 5.7.1"
  },
  "dir": "/home/curtis_kapsak/github/public_health_bioinformatics/20240404_170838_busco"
}
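
When comparing outputs across runs (for example, to spot a database that changed between two executions), it can help to split the `busco_results` and `busco_database` strings into fields. The sketch below is one way to do that in Python, assuming the strings always follow the format shown in the output above. The function names are hypothetical and are not part of the WDL task.

```python
import re

# Pattern for the one-line summary, e.g.
# "C:99.8%[S:99.3%,D:0.5%],F:0.2%,M:0.0%,n:440"
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<n>\d+)"
)

# Pattern for the database string, e.g. "enterobacterales_odb10 (2024-01-08)"
DATABASE_RE = re.compile(r"(?P<name>\S+) \((?P<date>\d{4}-\d{2}-\d{2})\)")


def parse_busco_results(summary):
    """Return the percentages as floats and the marker count n as an int."""
    match = SUMMARY_RE.search(summary)
    if match is None:
        raise ValueError(f"unrecognized BUSCO summary: {summary!r}")
    fields = {
        key: float(value)
        for key, value in match.groupdict().items()
        if key != "n"
    }
    fields["n"] = int(match.group("n"))
    return fields


def parse_busco_database(value):
    """Split the database string into (lineage name, creation date)."""
    match = DATABASE_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognized BUSCO database string: {value!r}")
    return match.group("name"), match.group("date")


results = parse_busco_results("C:99.8%[S:99.3%,D:0.5%],F:0.2%,M:0.0%,n:440")
database = parse_busco_database("enterobacterales_odb10 (2024-01-08)")
print(results["complete"], results["n"], database)
```

Recording the database date alongside the completeness score makes it easier to tell whether a shift in results comes from the assembly or from a database refresh.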

Will test workflows in Terra after code has been updated

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

kevinlibuit commented 2 months ago

@kapsakcj any hesitation in taking this out of draft state? Changes are looking pretty solid to me.

kapsakcj commented 2 months ago

I can mark it ready for review, but I haven't finished testing & reviewing outputs. Only ran TheiaProk_FASTA workflow linked above, haven't tested the other workflows yet.

I would recommend testing TheiaEuk to confirm it still works as intended for eukaryotes before merging.

cimendes commented 2 months ago

Testing time!

cimendes commented 2 months ago

@kapsakcj BUSCO keeps failing on TheiaEuk 😢 It "fails successfully" so it's hard for me to understand why it's so unhappy. I'll try to dig a bit and I shall report back!

cimendes commented 2 months ago

I just retried the TheiaEuk workflow with the memory set to 16GB: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/98834755-7d1c-43a3-aab3-fc0e66c53c03

kapsakcj commented 2 months ago

Testing TheiaEuk with 3 Candida auris genomes here, now that the default RAM is set to 24GB for TheiaEuk specifically: https://app.terra.bio/#workspaces/theiagen-validations/PHB_Validation_nextcladeV3testing/job_history/b03b3a98-99a5-443e-b15a-9fd879f56b6d

kapsakcj commented 2 months ago

BUSCO ran successfully (without memory failure) with the new default of 24GB.

I think we are good to merge?