Closed hchetia closed 2 years ago
Because we are estimating a large allele from short reads, we cannot measure its size directly; we have to infer it. To estimate the size of alleles longer than the read length, STRling counts the number of STR k-mers in all anchored and overlapping reads assigned to that locus. Using simulations, we created a linear model of the relationship between these counts and the size of the allele, and we use that model to estimate the size of the allele in a given individual. The number you see is the output of that linear model. We do not round it to the closest integer, as it's just an estimate.
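As a rough illustration of the idea (not STRling's actual code), the sketch below counts repeat-unit k-mers in a handful of reads and passes the count through a straight line. The function names, the sliding-window counting rule, and the `slope` and `intercept` values are all assumptions made up for this example. It also ignores anything a real tool might do with the raw counts before modelling them.

```python
def rotations(unit):
    """All cyclic rotations of a repeat unit, e.g. CAG -> {CAG, AGC, GCA}."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def count_str_kmers(reads, unit):
    """Count sliding-window k-mers that match the repeat unit in any rotation.

    This counting rule is an assumption for illustration only.
    """
    k = len(unit)
    targets = rotations(unit)
    total = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in targets:
                total += 1
    return total


def estimate_allele_size(kmer_count, slope, intercept):
    """Apply a linear model: size = intercept + slope * count."""
    return intercept + slope * kmer_count


# Toy reads standing in for the anchored and overlapping reads at one locus.
reads = ["CAGCAGCAGCAG", "TTTCAGCAGTTT"]
count = count_str_kmers(reads, "CAG")

# Hypothetical coefficients; the real ones come from the simulations.
estimate = estimate_allele_size(count, slope=0.37, intercept=2.0)
print(count, estimate)
```

Because the estimate is the output of a line rather than a direct count of repeat units, it will generally not be a whole number, which is why the reported sizes have decimals.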
Thanks @hdashnow. Is there a way to visualize the k-mers in the anchored and overlapping reads of a locus? Will a linear model be created for each STR locus? Is there a diagram I can refer to that explains how you arrive at the linear model?
We don't have a visualization tool. If you use the debug compiled version (see https://github.com/quinlan-lab/STRling/releases), then strling call will output a file sample-reads.txt that lists the names of the reads that were used in each call. You could then look at those specific reads in IGV.
The linear model is shared across loci. It's intuitive that the longer an allele is, the more reads arise from it. I simply simulated alleles of varying sizes (and loci and repeat units) and confirmed that this was true and consistent. Then I estimated the intercept and slope of the line. I'm writing up the preprint right now, so I can post it here when it's out. The figures should be helpful in explaining the algorithm.
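For readers who want to see what estimating an intercept and slope from simulated alleles looks like, here is a minimal sketch using ordinary least squares. It is not the code used for STRling. The "simulated" values are toy numbers generated from a known line purely so the fit can be checked, and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy stand-in for simulation output: k-mer counts paired with allele sizes.
# The sizes are generated from a known line, so these are not real results.
true_slope, true_intercept = 0.5, 3.0
sim_counts = [10, 50, 100, 200, 400]
sim_sizes = [true_intercept + true_slope * c for c in sim_counts]

slope, intercept = fit_line(sim_counts, sim_sizes)
print(slope, intercept)

# One shared line is then applied to the count observed at any locus.
observed_count = 137
print(intercept + slope * observed_count)
```

Fitting a single line over simulations that vary locus and repeat unit is what allows the same model to be shared across loci, as described above.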
Thanks @hdashnow that's very helpful.
Hi, could you share how the allele sizes are estimated for the samples? The numbers being decimals rather than whole numbers is a bit confusing.
Regards, Hasna