shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

long read metagenomic profiling #27

Open JensUweUlrich opened 1 year ago

JensUweUlrich commented 1 year ago

Dear Wei Shen, I really like your tool and your tutorials. I have a question about long-read metagenomic profiling: is there a specific parameter combination you would recommend for taxonomic profiling? I seem to be missing some organisms from the Zymo Mock Community, even when using profiling mode m=0. Thanks, Jens

shenwei356 commented 1 year ago

Thanks for your interest.

KMCP is only suitable for short-read metagenomic profiling; its sensitivity on long-read datasets is much lower. My initial plan was to support both short and long reads. However, the read-matching strategy, i.e., keeping only reads with enough (>= 50%) k-mers contained in a genome chunk, has low sensitivity for long reads, even for HiFi reads.
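
To give a rough feel for why a 50% k-mer threshold is hard for error-prone reads, here is a minimal sketch of a textbook model, not KMCP's own code. It assumes independent, uniformly distributed per-base errors, so a k-mer survives intact with probability (1 - e)^k. The value k = 21 and the error rates are illustrative assumptions, and the function names are hypothetical.

```python
def intact_kmer_fraction(error_rate: float, k: int) -> float:
    """Expected fraction of a read's k-mers that contain no sequencing error.

    Assumption: errors hit each base independently with the same probability.
    """
    return (1.0 - error_rate) ** k


def max_error_rate(threshold: float, k: int) -> float:
    """Largest per-base error rate at which the expected intact fraction
    still reaches `threshold` (obtained by solving (1 - e)^k = threshold)."""
    return 1.0 - threshold ** (1.0 / k)


if __name__ == "__main__":
    k = 21  # illustrative k-mer size, not taken from the thread
    # Illustrative per-base error rates, chosen only to span a plausible range.
    for error_rate in (0.01, 0.05, 0.10):
        fraction = intact_kmer_fraction(error_rate, k)
        print(f"error rate {error_rate:.2f}: ~{fraction:.1%} of k-mers intact")
    print(f"break-even error rate for a 50% threshold: {max_error_rate(0.5, k):.2%}")
```

Under these assumptions the expected intact fraction drops below 50% once the per-base error rate exceeds roughly 3%. The model covers sequencing errors only, so it does not by itself explain the low sensitivity reported above for HiFi reads.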

Several strategies were tried, but the results fell short of expectations.

  1. Setting a lower similarity threshold. For our probabilistic data structure, a lower threshold significantly increases the false-positive rate of a read, though the FPR can be reduced at the cost of a bigger database.
  2. Using sketching algorithms. Scaled MinHash, Closed Syncmers, and Minimizers were all implemented (available in the current version) and tested, but they did not work well on error-prone long reads, giving lower sensitivity. Although tools like minimap2 benefit from minimizers with location information for seeding and chaining in sequence alignment, we failed to make use of them in taxonomic profiling.
  3. Using multiple k-mer sizes. K-mers of different lengths, e.g., 17, 21, 31, did no better than a single value and doubled the database size.
  4. Using SimHash, which has a higher tolerance to base substitutions than k-mers. It was slower and, unexpectedly, had lower sensitivity.
  5. Breaking long reads into short ones. This only applies to HiFi reads, and the strength of long reads is wasted.
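
The trade-off in item 1 can be illustrated with a simple binomial model. This is a sketch under stated assumptions, not KMCP's actual computation: each of a read's k-mers is assumed to be a false positive independently with the same probability, and a read counts as a false match when at least `min_matches` k-mers hit. The function name, the 130 k-mers per read, and the per-k-mer false-positive rates are all illustrative.

```python
from math import comb


def read_false_positive_rate(n_kmers: int, kmer_fpr: float, min_matches: int) -> float:
    """Probability that at least `min_matches` of `n_kmers` k-mers are
    false positives, i.e. the upper tail of Binomial(n_kmers, kmer_fpr).

    Assumption: per-k-mer false positives are independent and equally likely.
    """
    return sum(
        comb(n_kmers, i) * kmer_fpr**i * (1.0 - kmer_fpr) ** (n_kmers - i)
        for i in range(min_matches, n_kmers + 1)
    )


if __name__ == "__main__":
    n = 130  # e.g. a 150-bp read with k = 21 yields 150 - 21 + 1 k-mers
    for kmer_fpr in (0.3, 0.1):  # illustrative per-k-mer false-positive rates
        for min_matches in (65, 26):  # 50% and 20% of the 130 k-mers
            fpr = read_false_positive_rate(n, kmer_fpr, min_matches)
            print(f"k-mer FPR {kmer_fpr}, >= {min_matches}/{n} matches: read FPR {fpr:.3g}")
```

In this model, lowering the required fraction from 50% to 20% at a per-k-mer rate of 0.3 takes the read-level false-positive rate from negligible to nearly certain. A smaller per-k-mer rate brings it back down, which corresponds to the "bigger database" cost mentioned in item 1.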

shenwei356 commented 1 year ago

The answer has been added to the FAQ page: https://bioinf.shenwei.me/kmcp/faq/#does-kmcp-support-long-read-metagenomic-profiling