[TheiaProk_ONT] add patch fix to kmc estimated genome size to not go over 10Mbp

cimendes commented 4 months ago

This PR partially closes #301

🗑️ This dev branch should be deleted after merging to main.

:brain: Aim, Context and Functionality

This PR adds a simple fix for kmc over-estimating genome length on ONT data. This tends to happen when the FASTQs are extremely large (over 2GB in size).

To address this, a simple catch has been implemented to prevent the estimated genome length from exceeding 10M bases (as per #301 direction).
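For illustration only, the catch described above amounts to clamping the estimate at a fixed ceiling. The following is a minimal sketch, assuming the kmc estimate is already available as an integer number of bases. The function name `cap_genome_length` and the constant `MAX_GENOME_LENGTH` are hypothetical and are not taken from the task code in this PR.

```python
# Hypothetical illustration; these names do not come from the TheiaProk_ONT task code.
MAX_GENOME_LENGTH = 10_000_000  # 10M bases, the ceiling described in this PR


def cap_genome_length(estimated_length: int, ceiling: int = MAX_GENOME_LENGTH) -> int:
    """Return the kmc estimate, clamped so that it never exceeds the ceiling."""
    return min(estimated_length, ceiling)


if __name__ == "__main__":
    # An over-estimate is clamped to the ceiling; a smaller estimate passes through unchanged.
    print(cap_genome_length(48_500_000))
    print(cap_genome_length(4_400_000))
```

An estimate already at or below the ceiling is returned untouched, so samples with typical bacterial genome sizes should see no change in behavior under this assumption.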

:hammer_and_wrench: Impacted Workflows/Tasks & Changes Being Made

This will affect the behavior of the workflow(s) even if users don’t change any workflow inputs relative to the last version : Yes, kmc genome size is now limited to a maximum of 10M bases

Running this workflow on different occasions could result in different results, e.g. due to use of a live database, "latest" docker image, or stochastic data processing : No

:clipboard: Workflow/Task Step Changes

🔄 Data Processing

Docker/software or software versions changed: None

Databases or database versions changed: None

Data processing/commands changed: A small catch has been implemented to prevent the estimated genome length output by kmc from exceeding 10M bases

File processing changed: None

Compute resources changed: None

➡️ Inputs

No inputs have been added

⬅️ Outputs

No outputs have been adjusted

:test_tube: Testing

Test Dataset

TheiaProk ONT large: Set of 4 very large (over 6GB in size) bacterial ONT datasets
TheiaProk ONT "normal": Set of 4 bacterial (mTb) ONT datasets, each less than 1 Gb in size

Commandline Testing with MiniWDL or Cromwell (optional)

No local testing was performed

Terra Testing

TheiaProk_ONT on large ONT data: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/a11b71b4-9364-412c-9453-4c11acb50610
TheiaProk_ONT on "normal" ONT data: https://app.terra.bio/#workspaces/theiagen-validations/Theiagen_Mendes_Sandbox/job_history/8c6f29d0-414c-4ba5-afc6-4af186ac419c

Suggested Scenarios for Reviewer to Test

Theiagen Version Release Testing (optional)

:microscope: Final Developer Checklist

[ ] The workflow/task has been tested locally and results, including file contents, are as anticipated
[x] The workflow/task has been tested on Terra and results, including file contents, are as anticipated
[x] The CI/CD has been adjusted and tests are passing (to be completed by Theiagen developer)
[x] Code changes follow the style guide

🎯 Reviewer Checklist

[ ] All impacted workflows/tasks have been tested on Terra with a different dataset than used for development
[ ] All reviewer-suggested scenarios have been tested and any additional
[ ] All changed results have been confirmed to be accurate
[ ] All workflows/tasks impacted by change/s have been tested using a standard validation dataset to ensure no unintended change of functionality
[ ] All code adheres to the style guide
[ ] MD5 sums have been updated
[ ] The PR author has addressed all comments

🗂️ Associated Documentation (to be completed by Theiagen developer)

[ ] Relevant documentation on the Public Health Resources "PHB Main" has been updated
[ ] Workflow diagrams have been updated to reflect changes

frankambrosio3 commented 4 months ago

Tested on extremely large ONT input fastq (11.5 GB). Sample failed due to insufficient compute resource allocation on the nanoq task. Compute resource (runtime) parameters are not exposed for any of the read_qc_trim tasks. See issue #470

sage-wright commented 4 months ago

Testing on large files here and regular sized files here.

Code changes look good, will approve upon successful testing using the appropriate genome size

theiagen / public_health_bioinformatics