tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Report only repetitive portions of alleles #53

Closed cadlag1 closed 6 years ago

cadlag1 commented 6 years ago

Hello,

I'm interested in analysing the copy number of specific STR loci with the possibility of SNP's within the repetitive regions, although it seems that HipSTR sometimes reports polymorphisms in regions flanking the STR and classifies them as seperate alleles. For example, a specific STR locus with a GT motif, where the reference allele has the sequence ATATGTGTGTGTGTGAA as reported by HipSTR and TGTGTGTGTGTG (copy number = 6) as reported by TRF (Tandem repeat finder), has the following alleles reported by HipSTR: ATATCTGTGTGTGAT (BPDIFF = -2, copy number = 4) ATAAGTGTGTGTGTGAA (BPDIFF = 0, copy number = 5) ATATCTGTGTGTGTGAA (BPDIFF = 0, copy number = 5) AGATGTGTGTGTGTGAA (BPDIFF = 0, copy number = 6) ATATGTGTGTGTGTGAT (BPDIFF = 0, copy number = 6)

Where BPDIFF is reported by HipSTR, while copy number isn't.

Some of these alleles have the same number of TG motifs and only differ due to bases in the flanking regions. I am well aware that HipSTR reports the coordinates of the repetitive portion of the reference allele, which would be the starting and ending coordinates of the TGTGTGTGTGTG sequence mentioned above, although I was wondering whether there is a function to specify these coordinates for repetitive regions within the reported alleles for samples and not just for the reference allele? Or if there is a BPDIFFS equivalent only for the repetitive portions rather than the entire allele?

Thank you

tfwillems commented 6 years ago

Hi @cadlag1,

Unfortunately there's currently no easy way to force HipSTR to do this, although I completely understand why it's desirable in terms of studying STR mutation patterns and dynamics.

The complexity comes from the fact that variants very near STRs will negatively impact genotype accuracy if they're not properly considered during read alignment. As a result, when HipSTR builds candidate haplotypes, it identifies both STR mutations and mutations within a few base pairs of the repetitive region and uses those full sequences as candidate variants during variant calling.

Typically, the variants near to the STR are SNPs and not indels. In these cases, BPDIFFS reported for the whole variant will exactly match the BPDIFFS that would be reported for just the STR. So one simple approach might be to make this assumption, although there will of course be loci where this doesn't hold.

To assess whether this is a reasonable assumption on a per-locus basis, one option would be to determine the amount of flanking sequence HipSTR added to the variant using the reference allele starting and end coordinates as well as the START and END INFO fields. If the flanking sequences across all variants seem to match apart from a SNP, you could trim off the flanking sequences and proceed to analyze the STR-only sequences.

Another solution would be to only consider variants in the VCF where the variant's POS matches the START INFO field and the variant's end position matches the END INFO field. For these variants, no flanking mutations were considered during genotyping and so all reported mutational information should be from the STR.

Hope that's helpful but let me know if you have any follow-up questions/suggestions

Best, Thomas