snayfach / MAGpurify

Improvement of metagenome-assembled genomes
GNU General Public License v3.0

Interpretation of cutoff values #25

Open donovan-h-parks opened 1 year ago

donovan-h-parks commented 1 year ago

Hi,

Can you clarify what the cutoff values are for gc-content and tetra-freq and how these were established? My guess is that for gc-content the cutoff of 15.75 means that only contigs that deviate from the mean GC content by more than this value are flagged as contaminated. This seems like a very, very conservative value though (e.g., mean GC of 50% only flags contigs at <34.25% or >65.75%?).
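To make that guess concrete, here is a minimal sketch of the interpretation described above. It is not MAGpurify's actual code: the function names `gc_content` and `flag_gc_outliers` are hypothetical, and treating the cutoff as an absolute deviation in percentage points from the genome's mean GC is only the assumption being asked about.

```python
def gc_content(seq):
    """Return GC content of a sequence as a percentage (0-100).

    Only A, C, G and T are counted; ambiguous bases such as N are ignored.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


def flag_gc_outliers(contig_gc, mean_gc, cutoff=15.75):
    """Return names of contigs whose GC deviates from mean_gc by more than cutoff.

    contig_gc: dict mapping contig name -> GC percentage.
    mean_gc:   mean GC percentage of the genome (how it is averaged is left
               to the caller; this sketch does not assume any weighting).
    cutoff:    ASSUMED to be an absolute difference in percentage points.
    """
    return [name for name, gc in contig_gc.items() if abs(gc - mean_gc) > cutoff]


# With a mean GC of 50%, only contigs below 34.25% or above 65.75% are flagged.
example = {"c1": 50.0, "c2": 34.0, "c3": 66.0, "c4": 35.0, "c5": 65.0}
flagged = flag_gc_outliers(example, mean_gc=50.0)
```

Under this reading, `c2` and `c3` would be flagged, while `c4` and `c5` (each 15 points from the mean) would pass.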

I appreciate that the tetra-freq measure is more abstract, so I'm more interested in how the 0.06 default was established.

Thanks, Donovan

adityabandla commented 4 months ago

@apcamargo I'm quite interested to learn about these cutoffs as well. In v2, are these cutoffs the same or are they specific to each dataset?

apcamargo commented 4 months ago

The cutoffs in v2 are based on a classification model that I trained on sets of simulated genomes. The cutoffs for v1 were decided based on the methodology described in this manuscript.