theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

Maximum estimated genome size #301

Closed andrewjpage closed 4 months ago

andrewjpage commented 8 months ago

:pushpin: Explain the Request

KMC occasionally returns implausibly large estimated genome sizes, which leads to long runtimes. (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_read_QC_trim_ont.wdl#L59)

:books: Context

Nanopore data has a lot of base-level errors, which confuse k-mer-based genome size estimators. Bad sequencing runs, or runs with high yield, compound the issue.

:chart_with_upwards_trend: Desired Behavior

Check the estimated genome size in the WDL task and set an upper limit.

Looking at the latest version of GTDB, which contains nearly 400,000 genomes, I built a histogram of the assembly sizes, binned to the nearest million (rounded down). Perhaps an upper limit of 10 million bases would do the trick? Some of the very large assembly sizes are probably errors.

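| Assembly size (bases) | No. of genomes |
| --- | --- |
| 0 | 9498 |
| 1000000 | 53195 |
| 2000000 | 102297 |
| 3000000 | 50923 |
| 4000000 | 76130 |
| 5000000 | 68168 |
| 6000000 | 20501 |
| 7000000 | 8778 |
| 8000000 | 3205 |
| 9000000 | 1357 |
| 10000000 | 570 |
| 11000000 | 197 |
| 12000000 | 75 |
| 13000000 | 23 |
| 14000000 | 11 |
| 15000000 | 1 |
| 16000000 | 1 |
| 23000000 | 1 |
| 25000000 | 1 |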
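To put the proposed 10 million base cutoff in context, the counts above can be tallied directly. This is a minimal Python sketch that uses only the numbers from the histogram; the variable names are illustrative and not part of the workflow.

```python
# Histogram from this issue: bin start (assembly size rounded down
# to the nearest million bases) -> number of genomes in that bin.
histogram = {
    0: 9498,
    1_000_000: 53195,
    2_000_000: 102297,
    3_000_000: 50923,
    4_000_000: 76130,
    5_000_000: 68168,
    6_000_000: 20501,
    7_000_000: 8778,
    8_000_000: 3205,
    9_000_000: 1357,
    10_000_000: 570,
    11_000_000: 197,
    12_000_000: 75,
    13_000_000: 23,
    14_000_000: 11,
    15_000_000: 1,
    16_000_000: 1,
    23_000_000: 1,
    25_000_000: 1,
}

cutoff = 10_000_000  # proposed upper limit from this issue

total = sum(histogram.values())
# Genomes whose bin starts at or above the cutoff.
above = sum(count for size, count in histogram.items() if size >= cutoff)

print(f"{above} of {total} genomes ({above / total:.2%}) are at or above {cutoff:,} bases")
```

With these counts, 880 of 394,932 genomes (about 0.22%) fall in the 10 million bin or higher, so a cap at that size would leave the vast majority of GTDB assemblies unaffected.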

cimendes commented 8 months ago

A possible solution from @sage-wright:

Define `length = 5000` inside an `if (len < 5000)` block and `other_length = len` inside an `if (len >= 5000)` block, so that exactly one of the two is always defined, then take `select_first([length, other_length])`.
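The pattern above is effectively a clamp. As written with `5000` it sets a floor; the request in this issue is for a ceiling, which flips the comparison. Here is a minimal Python sketch of the ceiling version, assuming the 10 million base limit suggested earlier in the thread. The names `MAX_GENOME_SIZE` and `cap_genome_size` are hypothetical and do not exist in the workflow.

```python
# Assumed ceiling, taken from the 10 million base suggestion in this issue.
MAX_GENOME_SIZE = 10_000_000


def cap_genome_size(estimated: int, ceiling: int = MAX_GENOME_SIZE) -> int:
    """Return the estimate unchanged unless it exceeds the ceiling."""
    return min(estimated, ceiling)


# A plausible bacterial estimate passes through untouched.
print(cap_genome_size(5_000_000))
# An inflated estimate is cut back to the ceiling.
print(cap_genome_size(250_000_000))
```

In WDL the same result would come from two complementary `if` blocks followed by `select_first`, as described above.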