theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaMeta] Binning with SemiBin2 #323

Closed cimendes closed 5 months ago

cimendes commented 6 months ago

This PR closes #321

🗑️ This dev branch should be deleted after merging with main.

:brain: Aim, Context and Functionality

Binning is the next logical step in metagenomic analysis through assembly and genomic characterization. It allows us to (ideally) group the contigs of an assembly by the community member they originate from.

Two processes are needed for binning:

  1. The clean reads are mapped to the metagenomic assembly to produce abundance information for each contig.
  2. The contigs and the sorted, indexed BAM files are passed to the binning algorithm, which (ideally) produces as many bins as there are individual species in the community.
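
To illustrate what "abundance information for each contig" can look like, here is a minimal sketch that computes mean read depth per contig from three-column, tab-separated rows (contig, position, depth), the layout `samtools depth` writes. This is only an illustration and not necessarily how the task in this PR or SemiBin2 derives abundance; the function name `mean_depth_per_contig` and the `contig_lengths` argument are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def mean_depth_per_contig(
    depth_lines: Iterable[str], contig_lengths: Dict[str, int]
) -> Dict[str, float]:
    """Return mean depth per contig from tab-separated (contig, pos, depth) rows.

    Contig lengths are supplied separately (an assumption of this sketch)
    because positions with zero coverage may be absent from the depth rows.
    """
    totals: Dict[str, int] = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        contig, _position, depth = line.split("\t")
        totals[contig] += int(depth)

    # Divide summed depth by the full contig length; unseen contigs get 0.0.
    return {
        name: totals.get(name, 0) / length
        for name, length in contig_lengths.items()
        if length > 0
    }
```

Contigs with no mapped reads still appear in the result with a depth of 0.0, which keeps the abundance table aligned with the assembly.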

Downstream characterization is not yet done in this PR.

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behaviour of the workflow(s) even if users don’t change any workflow inputs relative to the last version: No

Running this workflow on different occasions could result in different results, e.g. due to the use of a live database, "latest" docker image, or stochastic data processing: Yes (binning is stochastic, so some variation between runs is expected)

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

An additional step has been introduced in the TheiaMeta workflow. Currently, this is a terminal step. After assembly, the clean reads are mapped back to the assembled contigs to create a coverage report. The resulting BAM files and the assembly file are then binned with SemiBin2, which may produce multiple bin FASTA files.

A check was added to the SemiBin task that skips binning when fewer than two contigs are over the minimum length threshold. This avoids failures in the SemiBin software.
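
The check described above could look roughly like the following sketch. It is not the code from the task itself: the function names and the `min_contigs` parameter are hypothetical, and treating "over the threshold" as greater-than-or-equal is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (contig_name, sequence_length) for each record in FASTA text lines."""
    name = None
    length = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def should_run_binning(
    lines: Iterable[str], min_length: int, min_contigs: int = 2
) -> bool:
    """Return True when enough contigs pass the length filter to attempt binning.

    Assumption: a contig passes when its length is >= min_length.
    """
    passing = sum(1 for _, length in fasta_lengths(lines) if length >= min_length)
    return passing >= min_contigs
```

A caller would run the binner only when `should_run_binning(...)` returns `True`, and otherwise finish the task without bin outputs.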

Docker/software or software versions changed: N/A

Databases or database versions changed: N/A

Data processing/commands changed: N/A

File processing changed: N/A

Compute resources changed: N/A

➡️ Inputs

New optional inputs:

⬅️ Outputs

New outputs:

:test_tube: Testing

Test Dataset

Locally:

On Terra:

Commandline Testing with MiniWDL or Cromwell (optional)

The SemiBin task was tested locally and completed successfully.

Terra Testing

https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/25c1eab1-4e9a-4390-b792-fc3e61daf519

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

jrotieno commented 5 months ago

@cimendes yes, if you can make the mem memory and len length, that'd be great.

cimendes commented 5 months ago

> @cimendes yes, if you can make the mem memory and len length, that'd be great.

Done! Thank you!

cimendes commented 5 months ago

TODO: merge main in!

kapsakcj commented 5 months ago

Since this is a new tool added to TheiaMeta - will you add this to the workflow diagram and documentation? I think that may be the last thing needed before merging the PR. Please let me know when you're ready to merge and I can hit the button

cimendes commented 5 months ago

> Since this is a new tool added to TheiaMeta - will you add this to the workflow diagram and documentation? I think that may be the last thing needed before merging the PR. Please let me know when you're ready to merge and I can hit the button

yes!! I was waiting on a semi-approval to get that going :) Will update now

cimendes commented 5 months ago

@kapsakcj docs have been updated!