nlapier2 / Metalign

Metalign: efficient alignment-based metagenomic profiling via containment min hash
MIT License

Possible missing genomes? #33

Open vrou1995 opened 4 years ago

vrou1995 commented 4 years ago

Hi,

Thank you very much for making this software available.

As the paper suggested, the sensitivity of Metalign is much higher than that of Kraken2, with over 90% of my sequences classified.

However, both Kraken2 and my 16S rRNA data suggest a significant (>10%) contribution from the phylum Armatimonadetes, which is not found at all by Metalign. Is it possible that the setup.sh is not pulling these sequences into the data folder?

If so, how can I fix this?

Thanks!

nlapier2 commented 4 years ago

Thanks for letting me know! We do have this phylum in our database, apparently, so I'm not sure why it's not showing up if you expect it to be there.

To check for this, I ran "grep Armatimonadetes data/db_info.txt | head" to find an organism in this phylum. The top organism was "Chthonomonas sp. UBA5584" with TaxID 1946340. To check that we had that file in our database, I ran "ls data/organism_files/taxid_1946340_genomic.fna.gz". You could try to do the same to make sure the database was pulled successfully.
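The two checks above can be sketched as a pair of shell functions. This is only an illustration: the function names `list_db_entries` and `check_genome_file` are hypothetical, the paths assume the `data/db_info.txt` and `data/organism_files/` layout quoted above, and the extra `gzip -t` integrity test is an addition that the original commands did not include.

```shell
# Sketch only: helper names are hypothetical, not part of Metalign.

# Print up to 10 lines of db_info.txt that mention NAME (case-insensitive).
list_db_entries() {
    data_dir="$1"
    name="$2"
    grep -i -- "$name" "$data_dir/db_info.txt" | head -n 10
}

# Return 0 if the genome file for TAXID exists and is a readable gzip archive,
# 1 if it is missing, 2 if it exists but fails the gzip integrity test.
check_genome_file() {
    data_dir="$1"
    taxid="$2"
    path="$data_dir/organism_files/taxid_${taxid}_genomic.fna.gz"
    if [ ! -f "$path" ]; then
        echo "missing: $path" >&2
        return 1
    fi
    if ! gzip -t "$path" 2>/dev/null; then
        echo "corrupt: $path" >&2
        return 2
    fi
    echo "ok: $path"
}

# Example calls, run from the Metalign directory:
#   list_db_entries data Armatimonadetes
#   check_genome_file data 1946340
```

A "missing" or "corrupt" result would point to an incomplete download rather than a profiling problem.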

Also, it does seem like this phylum was added fairly recently and used to be a candidate phylum called OP10. Because we tried to construct the most comprehensive database possible, we may have some legacy genomes from that candidate phylum that look similar. I would also try looking at the other phyla we picked up on and see if anything has a similar name to the one you expected.
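One way to look for such similar names is a case-insensitive search for several spellings at once. The sketch below is an assumption-laden example: `find_name_variants` is a hypothetical helper, and the file you point it at (for instance `data/db_info.txt`, or wherever you saved your Metalign output) is up to you.

```shell
# Sketch only: find_name_variants is a hypothetical helper, not part of Metalign.

# Print every line of FILE that matches any of the given names, ignoring case.
# Names are joined into one extended regular expression, so avoid regex
# metacharacters in them.
find_name_variants() {
    file="$1"
    shift
    pattern=$(printf '%s|' "$@")   # e.g. "Armatimonadetes|OP10|"
    pattern=${pattern%|}           # drop the trailing "|"
    grep -i -E -- "$pattern" "$file"
}

# Example call:
#   find_name_variants data/db_info.txt Armatimonadetes OP10
```

If only the legacy OP10 spelling turns up in your results, the phylum may be present under its older name.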

If neither of these explains your finding, then my guess is that Metalign just doesn't think that anything from that phylum is there, for whatever reason.

vrou1995 commented 4 years ago

Thanks for getting back to me. I ran the tests you suggested. A significant fraction (32%) is classified as Actinobacteria. Interestingly, in addition to the 16S data, I have a second confirmation that the Armatimonadetes classification is correct.

When I do metagenome assembly and binning, I get a bin (with good contig lengths) that GTDB-Tk classifies to the genus Chthonomonas and that has an average coverage depth of 100x, which is very unlikely if the phylum is only present at <1%.
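The coverage argument can be made concrete with a rough calculation: the share of sequenced bases coming from one genome is approximately depth × genome length / total bases sequenced. The sketch below uses a hypothetical helper, `expected_read_fraction`, and the genome length and total base count in the example call are placeholder values, not measurements from this dataset. It also estimates a fraction of sequenced bases, which need not equal the relative abundance a profiler reports.

```shell
# Sketch only: expected_read_fraction is a hypothetical helper.

# Approximate fraction of all sequenced bases that belong to one genome:
#   fraction = depth * genome_length / total_bases
expected_read_fraction() {
    depth="$1"         # average coverage depth of the bin, e.g. 100
    genome_len="$2"    # genome (or bin) length in bases
    total_bases="$3"   # total bases sequenced in the sample
    awk -v d="$depth" -v g="$genome_len" -v t="$total_bases" \
        'BEGIN { printf "%.4f\n", d * g / t }'
}

# Example call with placeholder numbers (3 Mb bin, 10 Gb of sequence):
#   expected_read_fraction 100 3000000 10000000000
```

Plugging in the real bin length and sequencing yield shows whether 100x coverage is compatible with an abundance below 1%.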

Is there any way I could test whether it's a sensitivity issue?

Thanks!