Maximum estimated genome size

:cool:

:pushpin: Explain the Request

KMC occasionally returns implausibly large estimated genome sizes, which causes long runtimes. (https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/wf_read_QC_trim_ont.wdl#L59)

:books: Context

Nanopore data has many base-level errors, which confuse k-mer-based genome size estimators. Bad sequencing runs, or runs with high yield, compound the issue.

:chart_with_upwards_trend: Desired Behavior

Check the estimated genome size in the WDL task and set an upper limit.
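A minimal sketch of the check, written in Python for illustration only. The function name `cap_genome_size` and the constant `MAX_GENOME_SIZE` are hypothetical, the 10,000,000 default is a placeholder rather than an agreed value, and the same logic would still need to be ported into the WDL task's command block.

```python
# Placeholder cap in bases; the real value is still open for discussion.
MAX_GENOME_SIZE = 10_000_000


def cap_genome_size(estimate: int, max_size: int = MAX_GENOME_SIZE) -> int:
    """Return the estimated genome size, limited to max_size bases."""
    if estimate < 0:
        raise ValueError(f"genome size estimate must be non-negative, got {estimate}")
    # Estimates at or below the cap pass through unchanged;
    # anything larger is replaced by the cap itself.
    return min(estimate, max_size)


if __name__ == "__main__":
    print(cap_genome_size(4_500_000))    # plausible estimate, kept as-is
    print(cap_genome_size(250_000_000))  # inflated estimate, capped
```

Capping (rather than failing the task) keeps the workflow running on noisy samples; whether a capped value should also be flagged in the outputs is a separate design choice.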

Looking at the latest version of GTDB, which contains nearly 400,000 genomes (the counts below sum to 394,932), I built a histogram of the assembly sizes, binned to the nearest million bases (rounded down). Perhaps an upper limit of 10 million bases would do the trick? Some of the very large assembly sizes are probably errors.

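| Assembly size (bases, rounded down to the million) | No. of genomes |
|---:|---:|
| 0 | 9498 |
| 1000000 | 53195 |
| 2000000 | 102297 |
| 3000000 | 50923 |
| 4000000 | 76130 |
| 5000000 | 68168 |
| 6000000 | 20501 |
| 7000000 | 8778 |
| 8000000 | 3205 |
| 9000000 | 1357 |
| 10000000 | 570 |
| 11000000 | 197 |
| 12000000 | 75 |
| 13000000 | 23 |
| 14000000 | 11 |
| 15000000 | 1 |
| 16000000 | 1 |
| 23000000 | 1 |
| 25000000 | 1 |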
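To gauge how many genomes a 10 million base limit would affect, the histogram can be summed directly. This is a rough sketch using only the counts listed above; the names `HISTOGRAM` and `genomes_at_or_above` are made up for the example, and because bins are rounded down, the 10,000,000 bin is counted in full.

```python
# Bin start (bases, rounded down to the million) -> number of genomes,
# copied from the histogram in this issue.
HISTOGRAM = {
    0: 9498, 1_000_000: 53195, 2_000_000: 102297, 3_000_000: 50923,
    4_000_000: 76130, 5_000_000: 68168, 6_000_000: 20501, 7_000_000: 8778,
    8_000_000: 3205, 9_000_000: 1357, 10_000_000: 570, 11_000_000: 197,
    12_000_000: 75, 13_000_000: 23, 14_000_000: 11, 15_000_000: 1,
    16_000_000: 1, 23_000_000: 1, 25_000_000: 1,
}


def genomes_at_or_above(histogram: dict[int, int], limit: int) -> int:
    """Count genomes in bins whose start is at or above the limit."""
    return sum(count for size, count in histogram.items() if size >= limit)


total = sum(HISTOGRAM.values())
above = genomes_at_or_above(HISTOGRAM, 10_000_000)
print(f"{above} of {total} genomes ({above / total:.2%})")
# prints "880 of 394932 genomes (0.22%)"
```

By this count, a 10 million base limit sits above all but a small fraction of the assemblies in the histogram.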
