shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

QUESTION: Taxdump #15

alienzj closed this issue 2 years ago

alienzj commented 2 years ago

Dear Dr. Shen, I have a question about taxdump in KMCP documentation. The documentation mentioned:

# taxid mapping files, multiple files supported.
taxid_map=gtdb.kmcp/taxid.map,refseq-viral.kmcp/taxid.map,refseq-fungi.kmcp/taxid.map

# or concatenate them into a big taxid.map
#    cat gtdb.kmcp/taxid.map refseq-viral.kmcp/taxid.map refseq-fungi.kmcp/taxid.map > taxid.map
# taxid_map=taxid.map

# taxdump directory
taxdump=taxdump

sfile=$file.kmcp.tsv.gz

kmcp profile \
    --taxid-map      $taxid_map \
    --taxdump         $taxdump/ \
    --level             species \
    --min-query-cov        0.55 \
    --min-chunks-reads       50 \
    --min-chunks-fraction   0.8 \
    --max-chunks-depth-stdev  2 \
    --min-uniq-reads         20 \
    --min-hic-ureads          5 \
    --min-hic-ureads-qcov  0.75 \
    --min-hic-ureads-prop   0.1 \
    $sfile                      \
    --out-prefix       $sfile.kmcp.profile \
    --metaphlan-report $sfile.metaphlan.profile \
    --cami-report      $sfile.cami.profile \
    --sample-id        "0" \
    --binning-result   $sfile.binning.gz

Is the taxdump here the one downloaded from the NCBI taxonomy database? I don't think so. When taxonkit is used to generate a taxdump, each database gets its own separate directory, so does the taxdump here also need to be specified as multiple paths, like the taxid map?

Thanks!

shenwei356 commented 2 years ago

> Is the taxdump here downloaded from the NCBI taxonomy database?

Right. Custom taxdump files, e.g., those created by taxonkit create-taxdump, are also acceptable.

> but each database is a separate directory when using taxonkit to generate a taxdump, so does the taxdump here also need to be specified like taxid map?

The reference genomes in the pre-built databases, including GTDB, RefSeq-fungi, and GenBank-viral, can all be mapped to NCBI TaxIds, so a single NCBI taxonomy database is enough.
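As a rough sanity check for that setup, the sketch below lists references in a taxid.map whose TaxIds do not appear in nodes.dmp of the taxdump directory. The tiny files and the `demo` directory are made up for illustration; it assumes taxid.map is tab-separated (reference ID, then TaxId) and that nodes.dmp uses the usual NCBI `<tab>|<tab>` field separator. TaxIds that were merged or deleted in NCBI (tracked in merged.dmp and delnodes.dmp) would also show up as missing here.

```shell
#!/bin/sh
set -eu

# Made-up miniature files standing in for a real taxdump and taxid.map
demo=$(mktemp -d)
mkdir -p "$demo/taxdump"

# nodes.dmp: taxid, parent taxid, rank, ... separated by "\t|\t"
printf '1\t|\t1\t|\tno rank\t|\n562\t|\t1\t|\tspecies\t|\n' > "$demo/taxdump/nodes.dmp"

# taxid.map: reference ID <tab> taxid (999999 is deliberately unknown)
printf 'GCF_A\t562\nGCF_B\t999999\n' > "$demo/taxid.map"

# First file: remember every taxid in nodes.dmp.
# Second file: print reference IDs whose taxid was never seen.
missing=$(awk -F '\t' '
    NR == FNR { known[$1] = 1; next }
    !($2 in known) { print $1 }
' "$demo/taxdump/nodes.dmp" "$demo/taxid.map")

echo "references with unknown taxids: $missing"
```

With several taxid.map files, the same check can be run on each file or on the concatenated one.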

I see. For multiple custom KMCP databases with different custom taxdump files, you can't simply use one of them, or merge the taxdump files directly.

I'd suggest creating a new taxdump from the concatenated (row-wise merged) taxonomy data, so that all reference genomes belong to one unified taxonomy database.
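A minimal sketch of that row-wise merge, under stated assumptions: each custom database has a tab-separated lineage table with the same columns (genome ID first, then one column per rank), and all file names, IDs, and lineages below are invented placeholders. The final step assumes taxonkit create-taxdump accepts a header row as rank names together with -A (accession column), -O (output directory), and --force; please confirm with `taxonkit create-taxdump --help` before relying on it.

```shell
#!/bin/sh
set -eu

# Hypothetical per-database lineage tables (placeholders, not real data)
work=$(mktemp -d)
printf 'id\tsuperkingdom\tgenus\tspecies\n' > "$work/header.tsv"
printf 'genomeA\tBacteria\tGenusA\tGenusA alpha\n' > "$work/bacteria.tsv"
printf 'genomeB\tViruses\tGenusB\tGenusB beta\n' > "$work/viruses.tsv"
printf 'genomeC\tEukaryota\tGenusC\tGenusC gamma\n' > "$work/fungi.tsv"

# Row-wise merge: one shared header, then the rows of every table
cat "$work/header.tsv" "$work/bacteria.tsv" "$work/viruses.tsv" "$work/fungi.tsv" \
    > "$work/merged.tsv"

# Genome IDs must stay unique across databases, or the mapping is ambiguous
dup_ids=$(cut -f 1 "$work/merged.tsv" | sort | uniq -d)
if [ -n "$dup_ids" ]; then
    echo "duplicate genome IDs: $dup_ids" >&2
    exit 1
fi

# Build one taxdump from the merged table (skipped if taxonkit is absent).
# Flags are an assumption; see the lead-in.
if command -v taxonkit >/dev/null 2>&1; then
    taxonkit create-taxdump -A 1 -O "$work/taxdump" --force "$work/merged.tsv" \
        || echo "taxonkit failed; check 'taxonkit create-taxdump --help'" >&2
fi
```

Because every genome then gets its TaxId from the same run, the resulting directory can be passed once via --taxdump, while the KMCP databases themselves stay separate.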

alienzj commented 2 years ago

Dear Dr. Shen,

Great! Thanks very much for your reply. Based on your suggestions, I will try concatenating the taxonomy data and doing multi-kingdom profiling with KMCP. Thanks again!