Closed flashton2003 closed 1 year ago
Hi, Thanks for your interest in using StrainScan!
StrainScan first divides strains into clusters and determines which clusters are potentially present in the sample. Then, for each identified cluster, the algorithm finds informative k-mers that distinguish the strains inside that cluster. As a result, the k-mers indexed for each strain can be different, which explains why the number of k-mers differs between the strains from the two clusters.
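To illustrate why the counts can differ between clusters, here is a minimal sketch in Python. It is not StrainScan's actual implementation: it assumes, for illustration only, that an "informative" k-mer is one not shared by every strain in the cluster. The function names (`kmers`, `distinguishing_kmers`) and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinguishing_kmers(cluster, k):
    """For each strain, keep only k-mers NOT shared by every strain in the cluster.

    Assumption for this sketch: this is a simplified stand-in for the
    informative k-mer selection, not the method StrainScan uses internally.
    """
    sets = {name: kmers(seq, k) for name, seq in cluster.items()}
    shared = set.intersection(*sets.values())
    return {name: s - shared for name, s in sets.items()}


# Toy clusters (hypothetical sequences): strains in cluster A differ near the
# end of the sequence, strains in cluster B differ in the middle.
cluster_a = {"strain1": "ACGTACGTGA", "strain2": "ACGTACGTCA"}
cluster_b = {"strain3": "TTGACCATGG", "strain4": "TTGATCATGG"}

result_a = distinguishing_kmers(cluster_a, k=4)
result_b = distinguishing_kmers(cluster_b, k=4)

for name, informative in {**result_a, **result_b}.items():
    print(name, len(informative), sorted(informative))
```

In this toy case the strains in cluster A each end up with fewer distinguishing k-mers than the strains in cluster B, simply because the strains within each cluster differ from one another in different ways.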
For 1, the coverage reported by StrainScan is not whole-genome coverage as in alignment-based methods. It indicates how many of this strain's informative k-mers are covered (the higher the value, the more confident we are that the strain is present). The depth is estimated from the k-mer frequency and our regression model. As a result, the relationship between coverage and depth in StrainScan's output differs from the usual case.
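As a rough illustration of this kind of k-mer-based coverage, the sketch below computes the share of a strain's informative k-mers that appear in the reads. It is only a sketch under stated assumptions, not StrainScan's code: it assumes coverage is expressed as a fraction, the names `count_kmers` and `kmer_coverage` are hypothetical, and the regression-based depth estimate is not reproduced.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_coverage(informative, read_counts):
    """Fraction of informative k-mers seen at least once in the reads.

    Assumption for this sketch: coverage is a simple covered/total ratio.
    """
    if not informative:
        return 0.0
    covered = sum(1 for km in informative if read_counts[km] > 0)
    return covered / len(informative)


# Hypothetical informative k-mers for one strain, and two toy reads.
informative = {"CGTG", "GTGA", "AAAA", "CCCC"}
reads = ["ACGTGA", "CGTGAT"]

read_counts = count_kmers(reads, k=4)
coverage = kmer_coverage(informative, read_counts)
print(f"covered fraction of informative k-mers: {coverage:.2f}")
```

Because only the informative k-mers are considered, a strain can show a high value here even when the reads span a small part of its genome, which is why this number does not track depth the way alignment-based coverage does.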
I hope this reply is helpful. I will consider renaming "Coverage" to avoid confusion, because it is not what most people expect the term to mean.
Thank you for the quick response! Your explanation makes sense.
Perhaps using "Covered/Unique kmers" or "Covered/Distinguishing kmers" would make it clearer? And similarly "Coverage of unique k-mers"?
Thanks for your suggestions! "Coverage of distinguishing k-mers" seems like a good name to me. Will rename it asap. :)
Hi,
I'm interested in using StrainScan to analyse P. copri genomes. I cloned the GitHub version (the conda version doesn't take paired reads yet) and installed the conda env to get the dependencies.
It ran successfully on an example dataset, which is great! I used your example P. copri database.
However, I was wondering if you could explain two things.
This is my output:
I was surprised by two things:
All the best,
Phil