shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

Kmcp profile empty #39

Closed. sofstam closed this issue 10 months ago.

sofstam commented 10 months ago

Hello,

Thanks for developing kmcp and for providing detailed documentation. I am getting an empty output file after running the kmcp profile command.

The commands I am using are:

kmcp compute -k 21 -n 10 -l 150 -O tmp-k21-n10-l150 -I gtdb-genomes

kmcp index -f 0.3 -n 1 -j 32 -I tmp-k21-n10-l150/ -O gtdb.kmcp

kmcp search -d gtdb.kmcp -o ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz \
           -1 ERX5474932_ERR5766176_1.fastq.gz -2 ERX5474932_ERR5766176_2.fastq.gz
kmcp profile -X taxdump_custom/ -T seqid2taxid.map -m 3 \
           ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz -o ERX5474932_ERR5766176.k.profile

The output of the kmcp profile command is:

16:01:52.145 [INFO] using a lot of threads does not always accelerate processing, 4-threads is fast enough
16:01:52.145 [INFO] kmcp v0.9.3
16:01:52.145 [INFO]   https://github.com/shenwei356/kmcp
16:01:52.145 [INFO] 
16:01:52.145 [INFO] checking input files ...
16:01:52.145 [INFO]   1 input file(s) given
16:01:52.145 [INFO] loading TaxId mapping file ...
16:01:52.145 [INFO]   2 pairs of TaxId mapping values from 1 file(s) loaded
16:01:52.145 [INFO] loading Taxonomy from: taxdump_custom/
16:01:52.145 [INFO]   1 nodes in 1 ranks loaded
16:01:52.145 [INFO]   0 merged nodes loaded
16:01:52.145 [INFO]   0 deleted nodes loaded
16:01:52.145 [INFO]   1 names loaded
16:01:52.145 [INFO] 
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] match filtration: 
16:01:52.145 [INFO]   maximum false positive rate: 0.010000
16:01:52.145 [INFO]   minimum query coverage: 0.550000
16:01:52.145 [INFO]   keep matches with the top N scores: N=0
16:01:52.145 [INFO]   only keep the full matches: false
16:01:52.145 [INFO]   only keep main matches: false, maximum score gap: 0.400000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] deciding the existence of a reference:
16:01:52.145 [INFO]   preset profiling mode: 3
16:01:52.145 [INFO]   minimum number of reads per reference chunk: 50
16:01:52.145 [INFO]   minimum number of uniquely matched reads: 20
16:01:52.145 [INFO]   minimum proportion of matched reference chunks: 0.800000
16:01:52.145 [INFO]   maximum standard deviation of relative depths of all chunks: 2.000000
16:01:52.145 [INFO] 
16:01:52.145 [INFO]   minimum number of high-confidence uniquely matched reads: 5
16:01:52.145 [INFO]   minimum query coverage of high-confidence uniquely matched reads: 0.750000
16:01:52.145 [INFO]   minimum proportion of high-confidence uniquely matched reads: 0.100000
16:01:52.145 [INFO] 
16:01:52.145 [INFO] taxonomy data:
16:01:52.145 [INFO]   taxdump directory: taxdump_custom/
16:01:52.145 [INFO]   mapping reference IDs to TaxIds: [seqid2taxid.map]
16:01:52.145 [INFO] 
16:01:52.145 [INFO] reporting:
16:01:52.145 [INFO]   default format   : ERX5474932_ERR5766176.k.profile
16:01:52.145 [INFO] -------------------- [main parameters] --------------------
16:01:52.145 [INFO] 
16:01:52.145 [INFO] stage 1/4: counting matches and unique matches for filtering out low-confidence references
16:01:52.145 [INFO]   parsing file: ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz
16:01:52.147 [INFO]   number of references in search result: 1
16:01:52.147 [INFO]   number of estimated references: 1
16:01:52.147 [INFO]   elapsed time: 2.129055ms
16:01:52.147 [INFO] 
16:01:52.147 [INFO] stage 2/4: counting ambiguous matches for correcting matches
16:01:52.147 [INFO]   parsing file: ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz
16:01:52.149 [INFO]   elapsed time: 1.777231ms
16:01:52.149 [INFO] 
16:01:52.149 [INFO] stage 3/4: recounting matches and unique matches
16:01:52.149 [INFO]   parsing file: ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz
16:01:52.152 [INFO]   number of estimated references: 0
16:01:52.152 [INFO]   elapsed time: 2.885897ms
16:01:52.152 [INFO] 
16:01:52.152 [INFO] stage 4/4: estimating abundance using EM algorithm
16:01:52.152 [INFO]   initialization step
16:01:52.152 [INFO]     parsing file: ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz
16:01:52.157 [INFO]     number of estimated references: 0
16:01:52.157 [INFO]     elapsed time: 4.52852ms
16:01:52.157 [INFO]   number of estimated references: 0
16:01:52.157 [INFO]   elapsed time: 4.575221ms
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/632060) reads matched
16:01:52.157 [INFO] 
16:01:52.157 [INFO] 0.0000% (0/81) matched reads belong to the 0 references in the profile
16:01:52.158 [INFO] 
16:01:52.158 [INFO] elapsed time: 13.189091ms
16:01:52.158 [INFO] 
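
For triaging a log like the one above, a small script can pull out the counts that matter: taxonomy nodes loaded, TaxId mapping pairs, estimated references after each stage, and matched reads. This is only a sketch. The regular expressions are copied from the wording of the kmcp v0.9.3 log lines shown here and may not match other versions, and `summarize_profile_log` and `diagnose` are hypothetical helper names, not part of kmcp.

```python
import re

# Patterns taken from the log lines in this issue (kmcp v0.9.3);
# assumption: other versions may phrase these lines differently.
NODES_RE = re.compile(r"(\d+) nodes in (\d+) ranks loaded")
PAIRS_RE = re.compile(r"(\d+) pairs of TaxId mapping values")
REFS_RE = re.compile(r"number of estimated references: (\d+)")
READS_RE = re.compile(r"\((\d+)/(\d+)\) reads matched")


def summarize_profile_log(text):
    """Collect the key counts from the text of a `kmcp profile` log."""
    summary = {"nodes": None, "taxid_pairs": None,
               "estimated_refs": [], "reads": None}
    for line in text.splitlines():
        if (m := NODES_RE.search(line)):
            summary["nodes"] = int(m.group(1))
        elif (m := PAIRS_RE.search(line)):
            summary["taxid_pairs"] = int(m.group(1))
        elif (m := REFS_RE.search(line)):
            # One entry per occurrence, in log order.
            summary["estimated_refs"].append(int(m.group(1)))
        elif (m := READS_RE.search(line)):
            # (matched reads, total reads)
            summary["reads"] = (int(m.group(1)), int(m.group(2)))
    return summary


def diagnose(summary):
    """Return human-readable hints; heuristics only, not kmcp's own logic."""
    notes = []
    if summary["nodes"] is not None and summary["nodes"] <= 1:
        notes.append("taxonomy has at most one node; check the taxdump directory")
    refs = summary["estimated_refs"]
    if refs and refs[-1] == 0:
        notes.append("no reference survived filtering; the profile will be empty")
    return notes
```

Applied to the log in this issue, `summarize_profile_log` would report a single taxonomy node and a reference count that drops from 1 to 0 between stage 1 and stage 3.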

I am also attaching the input and output files to make the issue easier to reproduce.

Could you please help me figure out what I might be doing wrong here?

ERX5474932_ERR5766176.k.profile.zip seqid2taxid.map.zip gtdb-genomes.zip gtdb.kmcp.zip ERX5474932_ERR5766176.kmcp@gtdb.kmcp.tsv.gz taxdump_custom.zip

shenwei356 commented 10 months ago

The custom taxdump files you created seem to be incorrect. You can create them with taxonkit create-dump.

The log shows that only one node was loaded:

16:01:52.145 [INFO] loading Taxonomy from: taxdump_custom/
16:01:52.145 [INFO]   1 nodes in 1 ranks loaded
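
One way to check a custom taxdump before profiling is to confirm that every TaxId in the mapping file has a node in nodes.dmp. The sketch below assumes the standard NCBI taxdump layout (fields in nodes.dmp separated by `|`, TaxId first) and a two-column, tab-separated mapping file of reference ID and TaxId. The function names are hypothetical and this is not part of kmcp or taxonkit.

```python
def load_node_taxids(nodes_dmp):
    """Return the set of TaxIds defined in an NCBI-style nodes.dmp file."""
    taxids = set()
    with open(nodes_dmp) as fh:
        for line in fh:
            if not line.strip():
                continue
            # The first '|'-separated field is the TaxId.
            taxids.add(line.split("|")[0].strip())
    return taxids


def load_mapping(map_file):
    """Return {reference ID: TaxId} from a two-column tab-separated file."""
    mapping = {}
    with open(map_file) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            mapping[fields[0]] = fields[1].strip()
    return mapping


def find_missing_taxids(nodes_dmp, map_file):
    """Return references whose TaxId has no node in the taxdump."""
    nodes = load_node_taxids(nodes_dmp)
    mapping = load_mapping(map_file)
    return {ref: taxid for ref, taxid in mapping.items() if taxid not in nodes}
```

For the files in this issue, the call would be `find_missing_taxids("taxdump_custom/nodes.dmp", "seqid2taxid.map")`, assuming nodes.dmp sits directly inside the taxdump directory. A non-empty result, or a node set with a single entry, suggests the taxdump needs to be regenerated.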

sofstam commented 10 months ago

Thank you for your reply. I was still getting the same logs after following your suggestion, but it worked with mode=0. I will close the issue for now.