Detecting closest reference in custom DB

jodyphelan commented 1 year ago

I've got some dengue NGS data and I'm wondering if I can use kmcp to find the best matching reference. I've built a database from ~4800 reference sequnces using the following:

kmcp compute -I ~/.dengue-ngs/ref/ -k 31 -D 10 -B partial -O temp
kmcp index -j 32 -I temp/ -O db -n 1 -f 0.3

Doing an assembly and then using kmcp search contig.fa -d kmcpdb/ works quite nicely and identifies the same reference sequence I get when I use blast.

#query  qLen    qKmers  FPR hits    target  chunkIdx    chunks  tLen    kSize   mKmers  qCov    tCov    jacc    queryIdx
k141_133    10523   1975    0.0000e+00  231 MH823208.1  0   1   10723   31  1503    0.7610  0.7444  0.6034  0

However sometimes it is difficult to do a genome assembly and in this case I would like to search directly using the reads. Is this possible? I've tried with kmcp search -d kmcpdb reads.fq.gz --query-whole-file -o result.tsv.gz, however this does not seem to return any hits. Any guidance would be appreciated.

shenwei356 commented 1 year ago

Yes, it does support search reads to detect the most similar strain against many reference genomes. Please follow the recently added tutorial: https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens/ .

You may need to remove duplicated sequences first and choose a limited number of representative genomes for each strain. Because, for thousands of reference genomes of small genomes like HBV and dengue viruses, KMCP might fail to detect the target species.

Some tips:

Use a larger k value, e.g., 31 or 51, cause you have so many reference genomes and they are highly similar.
In the profiling stage, use -n 1 if no targets are detected.

Anyway, just have a try, and let me know if you have any issues.

jodyphelan commented 1 year ago

Thanks, I'll give it a try. Great tool btw!

jodyphelan commented 1 year ago

Amazing that works like a charm!

shenwei356 commented 1 year ago

Glad it helps!

I just remembered I forgot to paste the latest unreleased version, which fixes a bug in chunk computation when splitting circular genomes (--circular).

https://github.com/shenwei356/kmcp/files/11702458/kmcp_linux_amd64.tar.gz

shenwei356 / kmcp

Detecting closest reference in custom DB #32