theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaProk_ONT] add patch fix to kmc estimated genome size to not go over 10Mbp #459

Closed cimendes closed 4 months ago

cimendes commented 4 months ago

This PR partially closes #301

🗑️ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

This PR adds a simple fix for kmc overestimating genome length on ONT data. This tends to happen when the input FASTQ files are extremely large (over 2 GB in size).

To address this, a simple catch has been implemented to prevent the estimated genome length from exceeding 10M bases (as directed in #301).
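For reviewers who want to reason about the behavior, the catch amounts to clamping the estimate at a ceiling. The snippet below is only an illustrative sketch in Python, not the code changed in this PR: the function name `cap_genome_length` and the constant `MAX_GENOME_LENGTH` are hypothetical, and the only value taken from this PR is the 10M-base limit.

```python
# Illustrative sketch only; names are hypothetical and not taken from the PR diff.
MAX_GENOME_LENGTH = 10_000_000  # 10M-base ceiling described in this PR


def cap_genome_length(estimated: int, ceiling: int = MAX_GENOME_LENGTH) -> int:
    """Return the estimate unchanged unless it exceeds the ceiling."""
    if estimated < 0:
        raise ValueError("estimated genome length must be non-negative")
    # Estimates at or below the ceiling pass through; larger ones are clamped.
    return min(estimated, ceiling)


if __name__ == "__main__":
    for estimate in (4_800_000, 10_000_000, 25_000_000):
        print(estimate, "->", cap_genome_length(estimate))
```

Note that a clamp like this only bounds the value passed downstream; it does not correct the underlying overestimate, which is consistent with this PR only partially closing #301.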

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, the kmc estimated genome size is now limited to a maximum of 10M bases

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: None

Databases or database versions changed: None

Data processing/commands changed: A small catch has been implemented to prevent the estimated genome length reported by kmc from exceeding 10M bases

File processing changed: None

Compute resources changed: None

➡️ Inputs

No inputs have been added

⬅️ Outputs

No outputs have been adjusted

:test_tube: Testing

Test Dataset

Commandline Testing with MiniWDL or Cromwell (optional)

Terra Testing

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

🎯 Reviewer Checklist

🗂️ Associated Documentation (to be completed by Theiagen developer)

frankambrosio3 commented 4 months ago

Tested on an extremely large ONT input FASTQ (11.5 GB). The sample failed due to insufficient compute resource allocation on the nanoq task. Compute resource (runtime) parameters are not exposed for any of the read_qc_trim tasks. See issue #470.

sage-wright commented 4 months ago

Testing on large files here and regular sized files here.

Code changes look good; will approve upon successful testing using the appropriate genome size.