tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0

How does HipSTR do re-alignment #49

Closed nh13 closed 6 years ago

nh13 commented 6 years ago

I have reads where, after reading through either the STR or flanking homopolymers, the read quality is extremely poor (likely due to phasing issues within the cluster). I was wondering:

  1. if HipSTR does any quality trimming before alignment
  2. if HipSTR performs local, semi-local, or global alignment of the read. That is, does it allow only a substring of the read to align (local), only a prefix/suffix of the read to align (semi-local), or does it require that the alignment contain all query bases (global)? I could imagine that semi-local would be useful to allow the 3' end in sequencing order to be trimmed via alignment for the cases I am seeing, just like BWA soft-clips (albeit on both ends).

Thanks for any insight you could offer.

tfwillems commented 6 years ago

Hi Nils,

I've encountered this read quality issue quite frequently myself, especially in homopolymer regions. It really limits us from genotyping very long homopolymers (>~15 bp) using Illumina data. In regards to your questions:

  1. Yes, HipSTR does quality trimming. See the TrimAlignment() function here. The approach is very basic: it scans from each end of the read and removes bases until it encounters a sufficiently high base quality score. This obviously breaks down if you have a single high quality score amongst a whole set of lower scores, so if you'd like to suggest/implement an alternative approach, that'd be great.
  2. I'm not sure it quite falls into any of these categories. HipSTR aligns each read to each candidate haplotype, similar to GATK-HC's approach. For a given read-haplotype alignment, there is no penalty for not using haplotype bases, so in that sense it's somewhat akin to a global alignment with no reference end-gap penalties. But there's the added complication that if the read extends past the haplotype boundaries, the unmatched read bases are scored as perfectly matching. In these cases, this would allow a prefix/suffix of the read to align to the haplotype.
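To make "a global alignment with no reference end-gap penalties" concrete, here is a minimal sketch of that general idea. It is not HipSTR's implementation: the function name `fit_align_score` and the match/mismatch/gap scores are illustrative assumptions, and it does not model the case where the read extends past the haplotype boundaries.

```python
def fit_align_score(read, hap, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `read` against any stretch of `hap`.

    Hypothetical scoring values; haplotype bases left unused before or
    after the read cost nothing, but every read base must be accounted for.
    """
    m = len(hap)
    # Row 0: the read has consumed no bases yet, so skipping any
    # haplotype prefix is free.
    prev = [0] * (m + 1)
    for i in range(1, len(read) + 1):
        # Column 0: read bases with no haplotype base to pair with are gaps.
        curr = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if read[i - 1] == hap[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # pair read base with haplotype base
                prev[j] + gap,      # read base against a gap
                curr[j - 1] + gap,  # haplotype base against a gap
            )
        prev = curr
    # Last row: the whole read is consumed, and skipping any haplotype
    # suffix is free, so take the best cell in the row.
    return max(prev)


print(fit_align_score("ACGT", "TTACGTTT"))
print(fit_align_score("ACGA", "TTACGTTT"))
```

In this toy scoring, a read that sits cleanly inside the haplotype scores the same as it would against an exact copy of itself, while a low-quality tail that mismatches is still charged, because no read base can be dropped from the alignment.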

Personally, I'm not in favor of implementing a completely semi-local approach, as you could confound base quality issues with the observation of reads from candidate haplotypes that have not been properly captured. I think the right path forward would be to improve the base quality trimming functionality, so any improvements you could offer would be much appreciated.
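As a rough illustration of the trimming discussion, the sketch below contrasts an end-scan trimmer of the kind described in this thread with one possible alternative, a running-sum trimmer in the style of BWA's quality trimming. Both functions, their names, and the threshold of 20 are assumptions made for illustration; neither is taken from HipSTR's TrimAlignment().

```python
def trim_ends(quals, min_qual=20):
    """End-scan trimming: drop bases from each end until one reaches min_qual.

    Returns (start, end) so that quals[start:end] is the kept region.
    min_qual is a hypothetical threshold.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return start, end


def trim_3prime_running_sum(quals, min_qual=20):
    """Running-sum trimming of the 3' end (BWA-style, as an assumed alternative).

    Walks in from the 3' end accumulating (min_qual - q) and cuts where the
    sum peaks, so an isolated high-quality base does not stop the trim.
    Returns the index at which to cut: quals[:cut] is kept.
    """
    best_sum, running, cut = 0, 0, len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += min_qual - quals[i]
        if running < 0:
            break  # good bases now outweigh the bad tail
        if running > best_sum:
            best_sum, cut = running, i
    return cut


# A read whose tail is poor except for one isolated high-quality base.
quals = [35, 36, 34, 33, 5, 4, 30, 3, 2, 2]
print(trim_ends(quals))                # end scan stops at the lone Q30
print(trim_3prime_running_sum(quals))  # running sum trims through it
```

On this example the end scan keeps the two low-quality bases that precede the lone Q30 base, while the running-sum version cuts the whole poor tail. Whether that trade-off suits STR genotyping would need to be checked on real data.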