shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

KMCP database building tutorial #25

Closed: Cryphonectria closed this issue 1 year ago

Cryphonectria commented 1 year ago

Dear Wei Shen,

Thanks for this great tool! I just have a question regarding the viral database building tutorial here: https://bioinf.shenwei.me/kmcp/database/#refseq-viral-or-fungi

It's recommended to split viral genomes into 5 chunks, but the flag in the tutorial is set to --split-number 10. Is this maybe a typo, or is this set elsewhere?

Thanks, Lea

shenwei356 commented 1 year ago

Thanks for pointing this out. It should be 10.

Cryphonectria commented 1 year ago

So do I understand correctly that it's now recommended that all genomes (virus/bacteria/fungi) are split into 10 chunks? In your bioRxiv pre-print it was still stated that viruses are split into 5 chunks because of their small genomes. But in your published paper I cannot find this statement anymore. Many thanks

shenwei356 commented 1 year ago

Yes, we split them all into 10 chunks, so there's no need to distinguish between them.


This was also questioned by reviewers.

Reviewer:

... On the other hand, the numbers 10 and 5 seem rather arbitrary. Why are these numbers of bins chosen for bacterial/archaeal and viral genomes respectively? If viral genomes are broken into fewer bins than bacterial genomes this is, presumably, because they are shorter on average. Should one then consider a more general length effect? What about megaviruses? Their genomes can be quite large; is 5 bins really appropriate for such references? ...

Response:

Thanks for the suggestions. We have added benchmarks with different parameters, including the number of chunks, in Supplementary Fig. S6 to illustrate why we choose ten chunks by default. We have also added a sentence to explain the reason.

Section 2.5: “We choose the genome chunk number 10 for a balance of taxon identification accuracy and analysis time (Supplementary Fig. S6a and S6b) and chunk overlap of 150 bp for searching with common short reads of 150 bp in a single-end mode which has higher accuracy (Supplementary Fig. S6c).”
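As a rough illustration of the parameters quoted above, the Python sketch below computes coordinates for splitting one genome into 10 chunks with a 150 bp overlap. It is not KMCP's actual implementation: the function name `chunk_coordinates` and the rounding rule are assumptions made only for this example. It also shows that, with a fixed chunk number, chunk length scales with genome length, which is relevant to the reviewer's question about short and long viral genomes.

```python
def chunk_coordinates(length, n_chunks=10, overlap=150):
    """Return half-open (start, end) coordinates of overlapping chunks.

    Hypothetical helper for illustration; not taken from the KMCP source.
    """
    if n_chunks < 1 or overlap < 0:
        raise ValueError("n_chunks must be >= 1 and overlap must be >= 0")
    if length <= overlap:
        # Sequence too short to split meaningfully: keep it as one chunk.
        return [(0, length)]

    # Distance between consecutive chunk starts: ceil((length - overlap) / n_chunks).
    step = -(-(length - overlap) // n_chunks)
    size = step + overlap  # every chunk extends `overlap` bases into the next one

    chunks = []
    for i in range(n_chunks):
        start = i * step
        if start >= length:
            break  # very short sequences may yield fewer than n_chunks chunks
        chunks.append((start, min(start + size, length)))
    return chunks


# Assumed genome lengths, chosen only to contrast a small and a large virus.
small_virus = chunk_coordinates(30_000)
giant_virus = chunk_coordinates(1_200_000)
print(small_virus[0], small_virus[-1])
print(giant_virus[0], giant_virus[-1])
```

Under these assumptions, a 30 kb genome yields chunks of 3,135 bp and a 1.2 Mb genome yields chunks of 120,135 bp, so the same chunk number adapts to genome length without a separate setting for viruses.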