Add a tutorial on detecting a specific pathogen in sequencing data

Data:

SARS-CoV-2 reference genome: NC_045512.2. Genomes of multiple strains can also be used.
Sample data containing SARS-CoV-2 reads: http://opengene.org/fastv/testdata.fq.gz

Building the database:

# split the reference genomes in refs/ into 10 chunks with 150-bp overlaps
kmcp compute -k 21 -n 10 -l 150 -I refs/ -O refs-n10-l150

# build the index with a low false-positive rate (FPR), which suits small genomes
kmcp index -f 0.001 -I refs-n10-l150/ -O refs.kmcp
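
To make the chunking step more concrete, here is a minimal sketch of how a genome could be split evenly into overlapping chunks. It is an illustration only: the `chunks` helper and its formula are assumptions, not taken from KMCP, so the boundaries KMCP actually uses may differ. The length 29903 is that of NC_045512.2.

```shell
# Illustrative only: evenly split a genome into N overlapping chunks.
# NOT KMCP's implementation; `chunks` is a hypothetical helper.
# Usage: chunks <genome length> <number of chunks> <overlap>
chunks() {
    awk -v len="$1" -v n="$2" -v ovl="$3" 'BEGIN {
        # chunk size such that n chunks with overlaps cover the genome (ceiling division)
        size = int((len + (n - 1) * ovl + n - 1) / n)
        step = size - ovl
        for (i = 0; i < n; i++) {
            start = i * step + 1          # 1-based start position
            end = start + size - 1
            if (end > len) end = len      # clip the last chunk
            printf "chunk %d\t%d\t%d\n", i + 1, start, end
        }
    }'
}

# 10 chunks with 150-bp overlaps, as in the kmcp compute command above
chunks 29903 10 150
```

Each printed line gives a chunk number with its start and end positions. Under this scheme, neighbouring chunks share exactly 150 bases, so a read spanning a chunk boundary can still fall entirely within a single chunk.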

Searching reads against the KMCP database:

kmcp search -d refs.kmcp/ testdata.fq.gz -o testdata.fq.gz.kmcp.tsv.gz

23:19:42.530 [INFO] processed queries: 676694, speed: 32.606 million queries per minute
23:19:42.530 [INFO] 8.0837% (54702/676694) queries matched
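
The percentage in the log is simply the number of matched queries divided by the number of processed queries. As a quick check, it can be recomputed with a small helper (`match_rate` is a hypothetical name, not part of KMCP):

```shell
# Hypothetical helper: percentage of matched queries, to four decimal places.
# Usage: match_rate <matched queries> <processed queries>
match_rate() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.4f%%\n", 100 * m / t }'
}

# numbers taken from the log lines above
match_rate 54702 676694    # prints 8.0837%
```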

Profiling:

# --level strain is used when no taxonomy data is given.
# several preset profiling modes are also available.
kmcp profile --level strain testdata.fq.gz.kmcp.tsv.gz \
    | tee profile.tsv

csvtk cut -t -f ref,percentage,coverage,score,chunksFrac,reads profile.tsv \
    | csvtk pretty -t
ref           percentage   coverage     score    chunksFrac   reads
-----------   ----------   ----------   ------   ----------   -----
NC_045512.2   100.000000   275.461793   100.00   1.00         54702

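Here, coverage is the vertical coverage (sequencing depth), score is a similarity score, and chunksFrac is the horizontal coverage of the genome, i.e., the fraction of genome chunks with matched reads. A chunksFrac of 1.00 means that reads matched all 10 chunks of the reference genome.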
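
These columns can be turned into a simple yes/no call. The sketch below reports a reference as detected when both chunksFrac and reads reach a minimum value. The `detect` function and both thresholds are assumptions made for illustration, not recommendations from KMCP, and should be tuned for real data.

```shell
# Hypothetical thresholds, chosen for illustration only.
MIN_CHUNKS_FRAC=0.8
MIN_READS=50

# Read a tab-separated profile from stdin and print "<ref> detected"
# for every reference that passes both thresholds.
# Columns are looked up by header name, so their order does not matter.
detect() {
    awk -F'\t' -v min_frac="$MIN_CHUNKS_FRAC" -v min_reads="$MIN_READS" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["chunksFrac"]) + 0 >= min_frac + 0 && $(col["reads"]) + 0 >= min_reads + 0 {
            print $(col["ref"]), "detected"
        }
    '
}

# Demo input: the row from the table above, tab-separated.
{
    printf 'ref\tpercentage\tcoverage\tscore\tchunksFrac\treads\n'
    printf 'NC_045512.2\t100.000000\t275.461793\t100.00\t1.00\t54702\n'
} | detect
```

With a real result, the same function could be run as `detect < profile.tsv`, assuming the file keeps the header names used in the csvtk command above.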

Add a tutorial on detecting a specific pathogen in sequencing data #31