liaoherui / StrainScan

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers
https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-023-01615-w
MIT License

Output explanation - different total kmers and relationship between depth and coverage #6

Closed: flashton2003 closed this issue 1 year ago

flashton2003 commented 1 year ago

Hi,

I'm interested in using StrainScan to analyse P. copri genomes. I cloned the GitHub version (the conda version doesn't take paired reads yet) and installed the conda env to get the dependencies.

It ran successfully on an example dataset, which is great! I used your example P. copri database.

However, I was wondering if you could explain two things.

This is my output:

[Screenshot of StrainScan output, 2023-05-01]

I was surprised by two things:

  1. There doesn't appear to be a correlation between predicted depth and coverage. In sequencing experiments we normally expect similar sequencing depth to give similar coverage, but here the two results have very similar depth and very different coverage.
  2. Perhaps related to the first point, why is the total number of k-mers different between the two clusters? Both clusters contain only a single genome of similar size, so I'd expect the number of k-mers to be the same. Or are these the k-mers that are unique within the overall database, with C15 being more divergent from the other clusters than C11?

All the best,

Phil

liaoherui commented 1 year ago

Hi, thanks for your interest in using StrainScan!

StrainScan's algorithm first divides the strains into clusters and determines the candidate clusters. Then, for each identified cluster, it finds informative k-mers that distinguish the strains inside that cluster. As a result, the k-mers indexed for each strain can differ, which explains why the number of k-mers differs between the strains from the two clusters.
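To illustrate why the indexed k-mer count can differ between genomes of the same size, here is a toy sketch. It is not StrainScan's code: the function names, the tiny sequences, and the simple "not found in any other genome" rule are all assumptions made for illustration, standing in for whatever selection StrainScan actually performs.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinguishing_kmers(genomes, k):
    """For each genome, keep only the k-mers found in no other genome.

    Hypothetical stand-in for 'informative k-mers'; the real tool's
    selection criteria may differ.
    """
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    result = {}
    for name, own in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        result[name] = own - others
    return result


# Three made-up genomes of identical length (10 bp).
genomes = {
    "A": "ACGTACGTAA",
    "B": "ACGTACGTTT",
    "C": "GGGCCCATAT",
}
counts = {name: len(s) for name, s in distinguishing_kmers(genomes, 4).items()}
print(counts)
```

Genomes A and B share most of their sequence, so few k-mers set them apart, while the divergent genome C keeps all of its k-mers. Equal genome size therefore does not imply an equal number of distinguishing k-mers.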

For 1, the coverage reported by StrainScan is not whole-genome coverage as in alignment-based methods. It indicates how many of this strain's informative k-mers are covered (the higher it is, the more confident we are that the strain is present). The depth is estimated from the k-mer frequency and our regression model. As a result, the relationship between coverage and depth in StrainScan's output differs from the usual case.
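The decoupling of the two numbers can be shown with a small sketch. This is only an assumption-laden illustration: `kmer_coverage_and_depth` is a hypothetical helper, the k-mers and counts are invented, and the median of k-mer counts is used as a simple stand-in for StrainScan's regression model, which is not reproduced here.

```python
from statistics import median


def kmer_coverage_and_depth(informative, read_kmer_counts):
    """Return (fraction of informative k-mers seen in reads, depth proxy).

    The depth proxy is the median count of the covered k-mers; this is an
    assumed simplification, not StrainScan's regression model.
    """
    covered = [km for km in informative if read_kmer_counts.get(km, 0) > 0]
    coverage = len(covered) / len(informative) if informative else 0.0
    depth = median(read_kmer_counts[km] for km in covered) if covered else 0.0
    return coverage, depth


# Invented k-mer counts observed in the reads.
read_kmer_counts = {"AAAA": 10, "CCCC": 12, "GGGG": 11, "TTTT": 10}

# Invented informative k-mer sets for two strains.
strain_x = {"AAAA", "CCCC", "GGGG"}
strain_y = {"TTTT", "ACAC", "GTGT", "TGCA"}

cov_x, depth_x = kmer_coverage_and_depth(strain_x, read_kmer_counts)
cov_y, depth_y = kmer_coverage_and_depth(strain_y, read_kmer_counts)
print(cov_x, depth_x)
print(cov_y, depth_y)
```

In this toy case both strains get a similar depth estimate because the k-mers that are observed occur about as often, yet the fraction of informative k-mers observed is very different.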

I hope this helps. I will consider renaming "Coverage" to avoid confusion, because it is not the coverage most people have in mind.

flashton2003 commented 1 year ago

Thank you for the quick response! Your explanation makes sense.

Perhaps using "Covered/Unique kmers" or "Covered/Distinguishing kmers" would make it clearer? And similarly "Coverage of unique k-mers"?

liaoherui commented 1 year ago

Thanks for your suggestions! "Coverage of distinguishing k-mers" seems a good name to me. Will rename it asap. :)