shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License
176 stars 13 forks

Dealing with novel/non-sequenced species #26

Closed: mverce closed this issue 1 year ago

mverce commented 1 year ago

Hi shenwei356,

Thank you for this very promising tool! Based on your instructions, I was able to build an up-to-date database and have been testing KMCP on some previously characterised metagenomes. I found the results very accurate, even at the species level! However, there is one aspect of the tool that I am uncertain about. It happens sometimes that a metagenome contains novel species or species that have not been sequenced before and are therefore not in the database. Is it possible to have KMCP account for that possibility?

For example, if there is a novel Lactobacillus species in a sample metagenome along with a known Lactobacillus species like Lactobacillus amylovorus, is it possible for KMCP to assign a certain percentage to Lactobacillus amylovorus but leave the rest assigned only to the genus level? I guess this wouldn't work with relative abundances as currently calculated, only with read percentages, but I think it would be helpful in screening for potential new species.

Kind regards, Marko
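
The read-percentage idea in the question above can be illustrated with a minimal sketch. This is not something KMCP does; the function name `read_percentages`, the input format (one `(genus, species)` pair per classified read, with `None` where no species-level match exists) and the toy read counts are all assumptions made up for illustration.

```python
from collections import Counter


def read_percentages(assignments):
    """Return the percentage of reads per label.

    `assignments` is a hypothetical input: one (genus, species) tuple per
    classified read, where species is None if the read could only be
    placed at the genus level.
    """
    counts = Counter()
    total = 0
    for genus, species in assignments:
        total += 1
        # Reads with a species-level match are counted under the species;
        # all other reads stay at the genus level.
        label = species if species is not None else f"{genus} (genus level only)"
        counts[label] += 1
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy input, invented for illustration: 6 reads match a known species,
# 4 reads can only be placed in the genus.
reads = (
    [("Lactobacillus", "Lactobacillus amylovorus")] * 6
    + [("Lactobacillus", None)] * 4
)
percentages = read_percentages(reads)
for label, pct in sorted(percentages.items()):
    print(f"{label}: {pct:.1f}%")
```

A large genus-only share would then be a hint, not proof, that the sample contains a species missing from the database.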

shenwei356 commented 1 year ago

Hi, thanks for using KMCP. I appreciate your feedback.

> It happens sometimes that a metagenome contains novel species or species that have not been sequenced before and are therefore not in the database.

Yes, that's common in environments that are not well studied, like soil. KMCP is a reference-based tool, so it cannot detect species that are not in the database.

> Is it possible for KMCP to assign a certain percentage to Lactobacillus amylovorus but leave the rest assigned only to the genus level?

No, KMCP does not support that.

I'd recommend performing metagenomic binning for less well-studied environments, provided there are enough samples and sufficient sequencing depth. The resulting MAGs can then be used to build a KMCP database for metagenomic profiling if you need to compare the composition of different samples.
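
As a rough sketch of the second step (building a KMCP database from MAGs and profiling against it), the helper below only assembles the command lines and does not run them. The function name `kmcp_mag_workflow` and all file and directory names are hypothetical. The subcommands and flags are written as I recall them from the KMCP documentation, so please check them against `kmcp <subcommand> --help` before use. It also assumes you have already prepared a `taxid.map` file and a taxdump directory describing the taxonomy of your MAGs.

```python
import shlex


def kmcp_mag_workflow(mag_dir, db_dir, read1, read2, taxid_map, taxdump_dir,
                      sample="sample"):
    """Assemble (but do not run) KMCP commands for a MAG-based database.

    All paths are placeholders supplied by the caller; the flag values
    (k=21, 10 fragments with 150 bp overlap) follow the settings I recall
    from the KMCP docs for profiling and should be verified there.
    """
    kmers_dir = f"{db_dir}-kmers"
    search_out = f"{sample}.kmcp.tsv.gz"
    profile_out = f"{sample}.kmcp.profile"
    return [
        # 1. Compute k-mers from the MAG FASTA files.
        ["kmcp", "compute", "-k", "21", "--split-number", "10",
         "--split-overlap", "150", "--in-dir", mag_dir, "--out-dir", kmers_dir],
        # 2. Build the searchable index (the database) from the k-mers.
        ["kmcp", "index", "--in-dir", kmers_dir, "--out-dir", db_dir],
        # 3. Search paired-end reads against the database.
        ["kmcp", "search", "--db-dir", db_dir, "--read1", read1,
         "--read2", read2, "--out-file", search_out],
        # 4. Turn the search result into a taxonomic profile.
        ["kmcp", "profile", "--taxid-map", taxid_map, "--taxdump", taxdump_dir,
         search_out, "--out-file", profile_out],
    ]


# Hypothetical paths, for illustration only.
commands = kmcp_mag_workflow(
    mag_dir="mags/", db_dir="mags.kmcp",
    read1="sample_1.fq.gz", read2="sample_2.fq.gz",
    taxid_map="taxid.map", taxdump_dir="taxdump/",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Repeating steps 3 and 4 for each sample against the same database gives profiles that can be compared across samples.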

mverce commented 1 year ago

Thank you for the prompt reply! What you propose sounds reasonable. I suppose we could also use an additional tool to make sure we're not missing anything interesting (but low-abundance) at higher taxonomic levels, combined with KMCP for the species level. As your comment answered my questions, I will close this issue.

Kind regards, Marko