gymreklab / GangSTR

A tool for profiling long STRs from short reads
GNU General Public License v2.0

Fixing bugs in local realignment when flanking region matches the repeat #83

Closed gymreklab closed 4 years ago

gymreklab commented 4 years ago

This fix is based on debugging one locus in hg19, chr10:25309168-25309203, a "TATATC" repeat. The locus is problematic because the flanking region is similar to the repeat. Multiple genotypes with short repeats were called, all of which looked incorrect based on capillary electrophoresis data.

Here is the reference region:

TGAACTGTCATAATTTTCTTTAAAA[TATATCTATATCTATATCTATATCTATATCTATATCT]ATATATATCTACCCCAAAGTCTTG

Example read misclassified as "enclosing" with 4 copies. It was classified as enclosing because both its start and its end match the flanking sequence:

AAGGAACATGTTTGTGGGAAATAATATTGATACAGAATTTACAAATTGAACTGTCATAATTTTCTTTAAAATATATCTATATCTATATCTATATCTATAT
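To illustrate why this read is ambiguous, here is a minimal sketch (not GangSTR code; all names are made up for illustration). It takes the bases left over after the read's last full copy of the motif and checks them against two candidates: the start of another repeat copy, and the reference sequence that follows the sixth full copy in the region above.

```python
MOTIF = "TATATC"

# Reference sequence following the sixth full copy, copied from the region
# above: the trailing "T" inside the brackets plus the right flank.
AFTER_REPEAT = "T" + "ATATATATCTACCCCAAAGTCTTG"

# Bases left in the example read after its last full copy of the motif.
READ_TAIL = "TATAT"


def is_prefix_of_repeat(tail, motif):
    """Return True if `tail` could be the start of further repeat copies."""
    periodic = motif * (len(tail) // len(motif) + 1)
    return periodic.startswith(tail)


tail_fits_repeat = is_prefix_of_repeat(READ_TAIL, MOTIF)
tail_fits_flank = AFTER_REPEAT.startswith(READ_TAIL)
print(tail_fits_repeat, tail_fits_flank)
```

Both checks succeed, so the sequence alone cannot tell whether the read stops inside the repeat or has reached the flank.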

I added a check to make sure that reads that start or end in the repeat region cannot be classified as enclosing.
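A rough sketch of the idea behind such a check, in Python rather than the GangSTR source. The function names and the rotation-based heuristic are assumptions made for illustration; the actual change may well work on aligned coordinates instead.

```python
def is_rotation_of_motif(seq, motif):
    """True if `seq` is one period of the repeat in any phase."""
    return len(seq) == len(motif) and seq in motif + motif


def starts_or_ends_in_repeat(read, motif):
    """Crude heuristic: does the first or last motif-length window look like repeat?"""
    k = len(motif)
    if len(read) < k:
        return False
    return (is_rotation_of_motif(read[:k], motif)
            or is_rotation_of_motif(read[-k:], motif))


def can_be_enclosing(read, motif):
    """A read that starts or ends in the repeat cannot enclose it."""
    return not starts_or_ends_in_repeat(read, motif)


MOTIF = "TATATC"
LEFT_FLANK = "TGAACTGTCATAATTTTCTTTAAAA"

# Synthetic reads modelled on the locus above (hypothetical, not real data).
truncated_read = LEFT_FLANK + MOTIF * 4 + "TATAT"
spanning_read = LEFT_FLANK + MOTIF * 6 + "TATATATATCTACCCCAAAGTCTTG"

print(can_be_enclosing(truncated_read, MOTIF))
print(can_be_enclosing(spanning_read, MOTIF))
```

This heuristic is deliberately simple. At this locus the right flank itself contains "ATATCT", which is a rotation of the motif, so a real implementation would need more than a sequence-only test.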

Another example:

GTGGGAAATAATATTGATACAGAATTTACAAATTGAACTGTCATAATTTTCTTTAAAATATATCTATATATATATCTATATCTATATCTATATCTATATC

This read was also classified as enclosing with 4 copies. This happened because expansion-aware realignment breaks if the start and end of the read match the flanking region. I don't think that is a good check, so I commented it out.

nmmsv commented 4 years ago

Changes look reasonable to me! Merging.

nmmsv commented 4 years ago

Ouch, forgot to merge.