theiagen / public_health_bacterial_genomics

GNU Affero General Public License v3.0
26 stars 14 forks source link

Adds Staph aureus subwf #213

Closed kapsakcj closed 1 year ago

kapsakcj commented 1 year ago

setting as a draft for now, will provide further details soon

Still have a good bit of code to edit and needs testing in Terra before we set this as "ready to review"

EDIT: I think we need to adjust the spatyper output parsing code. Set this PR back to draft state until that is resolved.

I think we're almost wrapped up with code additions and testing in Terra. The only other thing would be to add the tool agravate, which we already have a WDL task for, but it would need updating and testing. I propose we delay that the addition of that tool

Added agrvate in PR #218 , will mark this PR as ready for review once that is merged and the CI has been fixed

PR is finally ready for review

Summary

This PR adds 3 Staphylococcus aureus-specific tools to the merlin_magic subworkflow as well as TheiaProk_Illumina_SE and TheiaProk_Illumina_PE workflows.

Changes for staphopia_sccmec and spatyper

Changes for agrvate

    File agrvate_summary = summary TSV
    File agrvate_results = tarball of all agrvate output files
    String agrvate_agr_group = final agr group desitnation
    String agrvate_agr_match_score = confidence in agr match, higher number the better. ranges 0-15. 15 is max
    String agrvate_agr_canonical = canonical or non-canonical agrD gene
    String agrvate_agr_multiple = number of agr groups found (expect only 1 per isolate)
    String agrvate_agr_num_frameshifts = number of frameshifts found in CDS of extracted agr operon (done during mummer step)
    String agrvate_version = version
    String agrvate_docker = docker image used

Testing

Tested both PE and SE workflows on Terra with 4 Staphylococcus aureus Illumina samples. Both ran successfully and export_taxon_tables ran as expected as well.

kapsakcj commented 1 year ago

@cimendes FYI I think we will need to update the spatyper parsing code to account scenarios where more than one match is found (example: different hits from different contigs)

One of the samples I tested with had 2 hits in the output spatyper TSV, leading to parsing issues with the spatyper_repeats and spatyper_type output strings

Sample is SAMEA2238216 in my sandbox workspace within the staph_illumina_pe data table, also uploaded the TSV here: SAMEA2238216.tsv.txt

Let's be careful with pushing commits now, because we have 2 non-Theiagen labs that may be using this dev branch for testing. Only push commits when we're confident they will run properly in Terra, please!

kapsakcj commented 1 year ago

retested in Terra successfully after the recent changes to spatyper results parsing:

SE - https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/curtis_sandbox/job_history/78fe80bb-929c-47ba-8ea0-70ca8635c290 PE - https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/curtis_sandbox/job_history/6ca6c9a8-8190-4633-90aa-71b858cc9915

kapsakcj commented 1 year ago

Successful testing in Terra following the merge of PR #218 (addition of agrvate to S aureus subworkflow)

Single End: https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/curtis_sandbox/job_history/37dde661-73fa-48b3-b439-1ddbdebc72a6

Paired end (1 failure is related to data and expected, failure not related to code changes): https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/curtis_sandbox/job_history/c276ff93-3a08-411b-af42-d36c1b0ec2f4

cimendes commented 1 year ago

Tested on Terra with a different set of saureus and everything worked as expected: https://job-manager.dsde-prod.broadinstitute.org/jobs/2bae52ca-80e2-42c2-9ce9-fa7a317352f6