Closed by nh13 6 years ago
Hi Nils,
I've encountered this read quality issue quite frequently myself, especially in homopolymer regions. It really limits our ability to genotype very long homopolymers (>~15 bp) using Illumina data. Regarding your questions:

1) Yes, HipSTR does quality trimming. See the TrimAlignment() function here. The approach is very basic: it scans inward from each end of the read and removes bases until it encounters a sufficiently high base quality score. This obviously breaks down if a single high quality score sits in the middle of a run of lower scores, because the scan stops at that base and keeps the low-quality bases behind it. If you'd like to suggest or implement an alternative approach, that'd be great.

2) I'm not sure it quite falls into any of these categories. HipSTR aligns each read to each candidate haplotype, similar to GATK-HC's approach. For a given read-haplotype alignment, there is no penalty for leaving haplotype bases unused, so in that sense it's somewhat akin to a global alignment of the read with no end-gap penalties on the haplotype. But there's an added complication: if the read extends past the haplotype boundaries, the unmatched read bases are scored as perfectly matching. In those cases, only a prefix or suffix of the read is effectively aligned to the haplotype.

Personally, I'm not in favor of implementing a fully semi-local approach, as you could confound base quality issues with reads that come from candidate haplotypes that have not been properly captured. I think the right path forward is to improve the base quality trimming functionality, so any improvements you could offer would be much appreciated.
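To make the trimming discussion concrete, here's a minimal Python sketch. It is not HipSTR's actual code: `trim_simple` only mirrors the end-scanning behaviour described above, and `trim_running_sum` is one possible alternative (a Kadane-style maximum-scoring segment over quality minus threshold), not something HipSTR implements. The function names and the threshold of 20 are assumptions chosen for illustration.

```python
from typing import List, Tuple


def trim_simple(quals: List[int], min_qual: int = 20) -> Tuple[int, int]:
    """End-scanning trim (illustrative, not HipSTR's implementation).

    Scans inward from each end and stops at the first base whose quality
    is >= min_qual. Returns the half-open interval [start, end) to keep.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < min_qual:
        start += 1
    while end > start and quals[end - 1] < min_qual:
        end -= 1
    return start, end


def trim_running_sum(quals: List[int], min_qual: int = 20) -> Tuple[int, int]:
    """Hypothetical alternative: keep the maximum-scoring contiguous segment.

    Each base scores (quality - min_qual), so an isolated high-quality base
    cannot pay for the low-quality bases surrounding it. Returns the
    half-open interval [start, end) to keep, or (0, 0) if nothing scores > 0.
    """
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, q in enumerate(quals):
        if cur_sum <= 0:
            # A non-positive running sum never helps; restart the segment here.
            cur_sum, cur_start = 0, i
        cur_sum += q - min_qual
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


# A read whose tail is poor except for one stray high-quality base.
quals = [30, 30, 30, 30, 5, 5, 35, 5, 5]
print(trim_simple(quals))       # (0, 7): the scan stops at the stray Q35 base
print(trim_running_sum(quals))  # (0, 4): the low-quality tail is dropped
```

The running-sum version is more aggressive by design, so the threshold would need tuning against real data before drawing any conclusions from it.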
I have reads where, after reading through either the STR or flanking homopolymers, the read quality is extremely poor (likely due to phasing issues in the cluster). I was wondering:
Thanks for any insight you could offer.