How are allele sizes estimated?

quinlan-lab / STRling

Detect novel (and reference) STR expansions from short-read data

MIT License


How are allele sizes estimated? #99

Closed hchetia closed 2 years ago

hchetia commented 2 years ago

Hi, could you share how the allele sizes are estimated for the samples? The numbers are decimals rather than whole numbers, which is kind of confusing.

Regards, Hasna

hdashnow commented 2 years ago

Because we are estimating a large allele from short reads, we cannot directly measure the size. Instead, we have to infer it. To estimate the size of alleles greater than the read length, STRling counts the number of STR k-mers in all anchored and overlapping reads assigned to that locus. Using simulations, we created a linear model of the relationship between these counts and the size of the allele, and we then use this model to estimate the size of the allele in a given individual. The number you see is the output of that linear model. We do not round it to the closest integer, as it's just an estimate.
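To illustrate why the output is not a whole number, here is a minimal sketch of applying a linear model to a k-mer count. The coefficients `SLOPE` and `INTERCEPT` and the function `estimate_allele_size` are hypothetical names and values chosen for illustration; they are not STRling's fitted parameters, and the sketch ignores any normalisation the real tool may apply.

```python
# Hypothetical coefficients for illustration only; NOT STRling's fitted values.
SLOPE = 0.5       # assumed change in estimated allele size per STR k-mer counted
INTERCEPT = 10.0  # assumed estimate when the k-mer count is zero


def estimate_allele_size(str_kmer_count, slope=SLOPE, intercept=INTERCEPT):
    """Return a linear-model estimate of allele size from an STR k-mer count."""
    return slope * str_kmer_count + intercept


# Made-up k-mer counts summed over the reads assigned to three loci.
counts = [37, 120, 455]
estimates = [estimate_allele_size(c) for c in counts]

for count, estimate in zip(counts, estimates):
    # The estimate is left unrounded, so it is generally not an integer.
    print(f"k-mer count {count:>4} -> estimated size {estimate}")
```

Because the estimate is a continuous function of the count, a value such as 28.5 is expected and carries no more precision than the model itself.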

hchetia commented 2 years ago

Thanks @hdashnow. Is there a way to visualize the k-mers in anchored and overlapping reads of a locus? Will a linear model be created for each STR locus? Is there any diagram that I can refer to that explains how you arrive at the linear model?

hdashnow commented 2 years ago

We don't have a visualization tool. If you use the debug compiled version (see https://github.com/quinlan-lab/STRling/releases) then strling call will output a file sample-reads.txt that lists the names of the reads that were used in each call. You could then look at those specific reads in IGV. The linear model is shared across loci. It's intuitive that the longer an allele is, the more reads arise from it. I simply simulated alleles of varying sizes (and loci and repeat units) and confirmed that this was true and consistent. Then I estimated the intercept and slope of the line. I'm writing up the preprint right now, so I can post that here when it's out. The figures should be helpful in explaining the algorithm.
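Until the preprint figures are available, the idea of estimating a single line from simulated alleles can be sketched with ordinary least squares. Everything below is a toy assumption rather than STRling's actual simulation or code: the "true" coefficients, the noise level, the range of allele sizes, and the helper `fit_line` are all made up for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy "simulation": made-up relationship between allele size and k-mer count.
rng = random.Random(0)                  # fixed seed so the sketch is repeatable
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 5.0   # hypothetical values, not from STRling
allele_sizes = list(range(50, 1050, 50))  # simulated allele sizes
kmer_counts = [
    (size - TRUE_INTERCEPT) / TRUE_SLOPE + rng.gauss(0, 3)  # count plus noise
    for size in allele_sizes
]

# Fit allele size as a linear function of k-mer count.
slope, intercept = fit_line(kmer_counts, allele_sizes)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A single pair of fitted coefficients like this, shared across loci, could then be applied to the count observed at any locus to produce an allele-size estimate.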

hchetia commented 2 years ago

Thanks @hdashnow, that's very helpful.