shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

Add a tutorial on detecting a specific pathogen in sequencing data #31

shenwei356 closed this issue 12 months ago

shenwei356 commented 1 year ago

Sample data:

Creating a KMCP database:

# split reference genomes into 10 chunks with 150-bp overlaps
kmcp compute -k 21 -n 10 -l 150 -I refs/ -O refs-n10-l150

# index with a small FPR for small genomes
kmcp index -f 0.001 -I refs-n10-l150/ -O refs.kmcp

Searching reads against the KMCP database:

kmcp search -d refs.kmcp/ testdata.fq.gz -o testdata.fq.gz.kmcp.tsv.gz

23:19:42.530 [INFO] processed queries: 676694, speed: 32.606 million queries per minute
23:19:42.530 [INFO] 8.0837% (54702/676694) queries matched
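
As a quick sanity check, the matched fraction in the log can be recomputed from the two counts it reports. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the numbers are copied by hand from the log lines above rather than parsed from KMCP output.

```shell
# Counts copied from the kmcp search log above.
matched=54702
total=676694

# awk does the floating-point division; four decimals match the log's precision.
pct=$(awk -v m="$matched" -v t="$total" 'BEGIN { printf "%.4f", m / t * 100 }')

echo "${pct}% (${matched}/${total}) queries matched"
```

A low matched fraction is expected here, since only reads from genomes present in the database can match.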

Profiling:

# --level strain is used when no taxonomy is given.
# some preset profiling modes are available.
kmcp profile --level strain testdata.fq.gz.kmcp.tsv.gz \
    | tee profile.tsv

csvtk cut -t -f ref,percentage,coverage,score,chunksFrac,reads profile.tsv \
    | csvtk pretty -t
ref           percentage   coverage     score    chunksFrac   reads
-----------   ----------   ----------   ------   ----------   -----
NC_045512.2   100.000000   275.461793   100.00   1.00         54702

In this table, coverage is the vertical coverage (sequencing depth) of the reference, score is a similarity score, and chunksFrac is the horizontal coverage, i.e. the fraction of genome chunks with matched reads.
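
The columns described above could also be used to screen a profile for confidently detected references. The sketch below is only an illustration: it rebuilds the six columns shown in the table as a small demo file (profile.demo.tsv is a hypothetical name, and a real profile.tsv contains more columns), and the two thresholds are arbitrary example values, not KMCP recommendations.

```shell
# Recreate the columns shown above as a tab-separated demo file.
printf 'ref\tpercentage\tcoverage\tscore\tchunksFrac\treads\n' > profile.demo.tsv
printf 'NC_045512.2\t100.000000\t275.461793\t100.00\t1.00\t54702\n' >> profile.demo.tsv

# Hypothetical thresholds, chosen only for illustration.
min_chunks_frac=0.8
min_coverage=10

# Look columns up by header name so the filter does not depend on column order.
hits=$(awk -F'\t' -v cf="$min_chunks_frac" -v cov="$min_coverage" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["chunksFrac"]) >= cf && $(col["coverage"]) >= cov { print $(col["ref"]) }
' profile.demo.tsv)

echo "$hits"
```

Requiring both a high chunksFrac and a reasonable coverage helps separate a genome that is covered end to end from one where reads pile up on a single conserved region.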

shenwei356 commented 1 year ago

Added: https://bioinf.shenwei.me/kmcp/tutorial/detecting-pathogens/

KMCP v0.9.3 or later is required; that release fixed a bug in chunk computation when splitting circular genomes.
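
A version requirement like this can be checked in a script before running the workflow. The following is a rough sketch under stated assumptions: the installed version is a hard-coded placeholder (how it is obtained from the installed binary is left out on purpose), and `sort -V` assumes a sort implementation with version-sort support, such as GNU coreutils.

```shell
required="0.9.3"
installed="0.9.4"   # placeholder value; substitute the version actually installed

# With version sort, the smaller of the two versions comes first.
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n 1)

if [ "$lowest" = "$required" ]; then
    status="ok"
else
    status="too old"
fi

echo "KMCP ${installed}: ${status}"
```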