snayfach / MIDAS

An integrated pipeline for estimating strain-level genomic variation from metagenomic data
http://dx.doi.org/10.1101/gr.201863.115
GNU General Public License v3.0

Species detection threshold #88

Closed: palomo11 closed this issue 6 years ago

palomo11 commented 6 years ago

Hi,

I have a database with 50 genomes. I have run run_midas.py species on several hundred metagenomes. Based on the coverage file, I want to do some kind of multivariate analysis (PCA, ...), but instead of using the absolute coverage values I want to use presence and absence (that is, a matrix with a 1 if the species is present in the metagenome and a 0 if it is absent).

My question is which coverage threshold I should choose to determine whether a species is present or absent. Should every value above 0 be considered presence? 0.01? 0.1? Or which value would you recommend?

Thank you very much in advance.

snayfach commented 6 years ago

First, I would recommend using the '-n' option to estimate species abundance using the same number of reads per metagenome. As far as a cutoff is concerned, that's really up to you. You might try thresholding based on the number of mapped reads (e.g. at least one, two, ten, etc.) and seeing how the cutoff affects the ordination of samples in PCA space.

Hope that helps.

Thanks, Stephen
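
A minimal sketch of the read-count thresholding suggested above, in plain Python. It is not taken from MIDAS itself: the `counts` table, the species and sample names, and the helper functions `presence_absence` and `n_present` are all hypothetical, and it assumes you have already loaded the number of mapped reads per species and sample into a nested dictionary.

```python
# Hypothetical mapped-read counts: species -> sample -> number of mapped reads.
# These values are made up for illustration, not real MIDAS output.
counts = {
    "species_A": {"sample1": 0, "sample2": 1, "sample3": 25},
    "species_B": {"sample1": 3, "sample2": 0, "sample3": 12},
    "species_C": {"sample1": 10, "sample2": 2, "sample3": 0},
}


def presence_absence(counts, min_reads):
    """Return 1 where a species has at least `min_reads` mapped reads, else 0."""
    return {
        species: {sample: int(n >= min_reads) for sample, n in per_sample.items()}
        for species, per_sample in counts.items()
    }


def n_present(matrix):
    """Count the total number of species/sample pairs called present."""
    return sum(sum(row.values()) for row in matrix.values())


# Try several cutoffs and see how many presence calls survive each one.
for cutoff in (1, 2, 10):
    matrix = presence_absence(counts, cutoff)
    print(f"min_reads={cutoff}: {n_present(matrix)} presence calls")
```

Each resulting 0/1 matrix could then be passed to whatever ordination method you use, so that the PCA plots for different cutoffs can be compared side by side.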

palomo11 commented 6 years ago

Thanks for the advice. I will try that!