shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License
182 stars, 13 forks

Building database with MAGs #12

Closed durrantmm closed 2 years ago

durrantmm commented 2 years ago

Can I expect the tool to work if I use incomplete draft genomes or MAGs as inputs? What if I use collections of CDS sequences rather than assemblies?

shenwei356 commented 2 years ago

I haven't tested it with MAGs, but I think it should work, even with CDS.

2.1 Indexing

KMCP efficiently builds a database from a collection of genome sequences and taxonomic information. The microbial genomes are split into ten (for archaea, bacteria, and fungi) or five (for viruses) chunks with 100-bp overlap, and the k-mer location information is further utilized in taxonomic profiling. For genomes without a single complete genome sequence, chromosomes or contigs are concatenated with intervals of k-1 bases of N to avoid introducing fake k-mers.

https://www.biorxiv.org/content/10.1101/2022.03.07.482835v2

Some genomes in GTDB are draft genomes made up of contigs, and CDS sequences could be treated as contigs in the same way.
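To illustrate why concatenating contigs (or CDS) with k-1 bases of N does not introduce fake k-mers, here is a minimal Python sketch. It is not KMCP's actual code: the function names `concat_contigs`, `kmers`, and `split_chunks` are hypothetical, and the chunking logic is only an assumption about how "chunks with overlap" might be computed.

```python
def concat_contigs(contigs, k):
    """Join contigs with k-1 'N' bases between them.

    Any window of length k that touches two contigs would need at least
    k+1 bases, so every window crossing a junction contains an N.
    """
    return ("N" * (k - 1)).join(contigs)


def kmers(seq, k):
    """Return all k-mers of seq, skipping any k-mer that contains an N."""
    out = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            out.append(kmer)
    return out


def split_chunks(seq, n_chunks, overlap):
    """Split seq into n_chunks pieces of near-equal length.

    Each piece is extended by `overlap` bases into the next one
    (the last piece simply ends at the end of the sequence).
    Assumes len(seq) >= n_chunks.
    """
    n = len(seq)
    chunks = []
    for i in range(n_chunks):
        start = i * n // n_chunks
        end = min((i + 1) * n // n_chunks + overlap, n)
        chunks.append(seq[start:end])
    return chunks


# Toy example: two short "contigs" and k = 4 (hypothetical values).
contigs = ["ACGTAC", "TTGCA"]
joined = concat_contigs(contigs, 4)
print(joined)
print(kmers(joined, 4))
print(split_chunks(joined, 2, 2))
```

In this toy case, the k-mers extracted from the joined sequence are the same ones you would get from each contig separately, which is why a MAG or a set of CDS sequences can be handled like any other draft genome.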

durrantmm commented 2 years ago

Great, thank you!