Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).
I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.

Thanks, Donovan
The cutoffs in v2 are based on a classification model that I trained on sets of simulated genomes. The cutoffs for v1 were decided based on the methodology described in this manuscript.
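For readers following the arithmetic in the question, here is a minimal sketch of the interpretation the asker proposes: a contig is flagged when its GC content differs from the genome mean by more than the cutoff. This is only the asker's guess, not confirmed behavior of the tool, and the function names `gc_flag_bounds` and `is_flagged` are hypothetical.

```python
# Sketch of the asker's guessed reading of the gc-content cutoff.
# Assumption: values are GC percentages and the rule is a strict
# absolute deviation from the mean; the tool itself may differ.

def gc_flag_bounds(mean_gc, cutoff=15.75):
    """Return the (low, high) GC% bounds outside which a contig would be flagged."""
    return mean_gc - cutoff, mean_gc + cutoff


def is_flagged(contig_gc, mean_gc, cutoff=15.75):
    """True if the contig's GC% deviates from the mean by more than the cutoff."""
    return abs(contig_gc - mean_gc) > cutoff


low, high = gc_flag_bounds(50.0)
print(low, high)  # 34.25 65.75
```

Under this reading, a genome with a mean GC of 50% would only flag contigs below 34.25% or above 65.75%, which matches the example given in the question.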