shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

Error when installing KMCP #16

Closed (davidmaimoun closed this 2 years ago)

davidmaimoun commented 2 years ago

Hi Dr Chen, you helped me and advised me to try KMCP (Biostars forum). Sorry to disturb you again. When I try to install it via conda, it throws an error:

What happened? The initial connection between Cloudflare's network and the origin web server timed out. As a result, the web page can not be displayed. If you're the owner of this website: Contact your hosting provider letting them know your web server is not completing requests. An Error 522 means that the request was able to connect to your web server, but that the request didn't finish. The most likely cause is that something on your server is hogging resources. https://support.cloudflare.com/hc/en-us/articles/200171906-Error-522

Do you think we could fix it?

Thank you Dr

shenwei356 commented 2 years ago

Maybe one of the conda servers is down. Don't worry, just download the binary file here: https://github.com/shenwei356/kmcp/releases/tag/v0.8.3-alpha

davidmaimoun commented 2 years ago

Thank you very much again Dr Shen!

shenwei356 commented 2 years ago

The docs may help:

The 661K dataset

Index/Database Building

# cobs ---------------------------------

# 10h
# index file: 873G
time cobs compact-construct --file-type fasta Assemblies/ ena-bact-661k.cobs_compact  --clobber

# kmcp ---------------------------------

# ~24h
# tmp files: ~19T
kmcp compute -e -I Assemblies/ -O ena-bact-661k.kmcp-k31 --log ena-bact-661k.kmcp-k31.log  

# 11h31m
# index file 843.02 GB
kmcp index -I ena-bact-661k.kmcp-k31/ -O ena-bact-661k.kmcp-k31.db --log ena-bact-661k.kmcp-k31.db.log

davidmaimoun commented 2 years ago

Great! Thank you!

shenwei356 commented 2 years ago

You may try other de Bruijn graph-based tools for highly redundant datasets, including Cuttlefish 2, BCALM 2, and Bifrost. Some sequence-to-graph tools may also help, such as minigraph and MetaGraph.

davidmaimoun commented 2 years ago

Yes, it would be smart to try all of these tools and compare the outputs.

Thank you so much

davidmaimoun commented 2 years ago

Good afternoon Dr Shen, could you please explain what chunks mean in the kmcp output, and why they are needed? Thank you

shenwei356 commented 2 years ago

Search result format: Tab-delimited format with 15 columns:

 1. query,    Identifier of the query sequence
 2. qLen,     Query length
 3. qKmers,   K-mer number of the query sequence
 4. FPR,      False positive rate of the match
 5. hits,     Number of matches
 6. target,   Identifier of the target sequence
 7. chunkIdx, Index of reference chunk
 8. chunks,   Number of reference chunks
 9. tLen,     Reference length
10. kSize,    K-mer size
11. mKmers,   Number of matched k-mers
12. qCov,     Query coverage,  equal to: mKmers / qKmers
13. tCov,     Target coverage, equal to: mKmers / K-mer number of reference chunk
14. jacc,     Jaccard index
15. queryIdx, Index of query sequence, only for merging
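
To make the column layout above concrete, here is a minimal Python sketch that reads such a 15-column tab-delimited table and recomputes qCov from mKmers and qKmers. It assumes that any header line starts with "#" and that the count columns are integers; the example row is invented purely for illustration and is not real KMCP output.

```python
import io

# Column names in the order listed above.
COLUMNS = [
    "query", "qLen", "qKmers", "FPR", "hits", "target", "chunkIdx",
    "chunks", "tLen", "kSize", "mKmers", "qCov", "tCov", "jacc", "queryIdx",
]
# Assumption: these columns hold integers and these hold floats.
INT_COLUMNS = {"qLen", "qKmers", "hits", "chunkIdx", "chunks",
               "tLen", "kSize", "mKmers", "queryIdx"}
FLOAT_COLUMNS = {"FPR", "qCov", "tCov", "jacc"}


def parse_search_results(handle):
    """Yield one dict per match; lines starting with '#' are skipped."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        record = dict(zip(COLUMNS, fields))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        yield record


# Invented example: one header line and one match row (not real output).
EXAMPLE = (
    "#" + "\t".join(COLUMNS) + "\n"
    + "\t".join([
        "read1", "150", "130", "1e-04", "1", "ref1", "3",
        "10", "5000000", "21", "104", "0.8", "0.0002", "0.0002", "0",
    ]) + "\n"
)

records = list(parse_search_results(io.StringIO(EXAMPLE)))
first = records[0]

# qCov is defined above as mKmers / qKmers.
recomputed_qcov = first["mKmers"] / first["qKmers"]
print(first["target"], first["chunkIdx"], first["chunks"], recomputed_qcov)
```

For a real result file, the same function could be given an open file handle instead of the in-memory example.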

Chunks are only used in taxonomic profiling:

kmcp compute also supports splitting sequences into chunks; this can increase the specificity of profiling results at the cost of slower searching.

See also Fig1a.
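
As a rough picture of what splitting a reference into chunks means, the Python sketch below cuts a sequence into a fixed number of roughly equal pieces with an optional overlap. It is only an illustration under simple assumptions, not KMCP's actual implementation; the function name and parameters are hypothetical.

```python
def split_into_chunks(seq, n_chunks, overlap=0):
    """Split seq into n_chunks roughly equal pieces.

    Each piece except the last is extended by `overlap` bases into the
    next one. Illustrative only; not KMCP's actual algorithm.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if overlap < 0:
        raise ValueError("overlap must not be negative")
    size = -(-len(seq) // n_chunks)  # ceiling division
    chunks = []
    for i in range(n_chunks):
        start = i * size
        end = min(len(seq), start + size + overlap)
        chunks.append(seq[start:end])
    return chunks


# A 12-base toy reference split into 3 chunks with a 2-base overlap.
toy_chunks = split_into_chunks("ACGTACGTACGT", n_chunks=3, overlap=2)
for idx, chunk in enumerate(toy_chunks):
    print(idx, chunk)
```

In the search result, chunkIdx and chunks then report which piece of the reference a query matched and how many pieces that reference has in total.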

davidmaimoun commented 2 years ago

It was very helpful

Thank you very much!