tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0

Add functionality to report data about the full haplotype as opposed to just the STR sequence #61

Closed tfwillems closed 5 years ago

tfwillems commented 5 years ago

Under the hood, HipSTR generates haplotypes composed of the upstream and downstream regions flanking the STR as well as the STR sequence itself. Although this is critical for genotyping accuracy, no functionality was available to output this information to the VCF in a form useful to the user. In certain instances, nearby tagging SNPs can be very informative about the haplotype structure and can reveal whether both haplotypes have been observed, even when the STR itself is homozygous.

To address this limitation, we've added functionality that explicitly outputs additional fields to the VCF:

  1. INFO fields LFLANKS and RFLANKS contain the flank sequences when more than one flank is discovered during the assembly process.
  2. FORMAT fields LFGT and RFGT contain the genotypes of the flanking sequences and refer to the sequences in LFLANKS and RFLANKS.
  3. FORMAT fields HQ and PHQ report the genotype posteriors of the unphased and phased haplotypes, respectively. These are analogous to the Q and PQ fields already reported for the STR sequence alone. Together, these fields indicate how confident we are in both sets of haplotypes. Cases in which Q is ~1 but HQ is << 1 indicate that we cannot confidently locally phase the STR alleles with their nearby SNPs.
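As a rough illustration of how a downstream script might consume these fields, here is a minimal Python sketch. It relies on several assumptions that are not confirmed above: that LFLANKS/RFLANKS are comma-separated sequence lists, that LFGT/RFGT hold 0-based indices into those lists separated by `|` or `/`, and that the example record and the `min_q`/`max_hq` thresholds are purely hypothetical. Check the actual VCF header before relying on any of this.

```python
import re


def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-only keys map to True."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def sample_fields(fmt, sample):
    """Pair the FORMAT keys with one sample's colon-separated values."""
    return dict(zip(fmt.split(":"), sample.split(":")))


def flank_genotype(info, fields, seq_key, gt_key):
    """Resolve LFGT/RFGT indices to sequences in LFLANKS/RFLANKS.

    Assumes comma-separated sequences and 0-based indices (not confirmed).
    Returns an empty list when either field is absent.
    """
    seqs = info.get(seq_key)
    gt = fields.get(gt_key)
    if not isinstance(seqs, str) or not gt or "." in gt:
        return []
    alleles = seqs.split(",")
    return [alleles[int(i)] for i in re.split(r"[|/]", gt)]


def str_confident_but_haplotype_not(fields, min_q=0.9, max_hq=0.5):
    """True when Q is high but HQ is low; thresholds are illustrative only."""
    try:
        q, hq = float(fields["Q"]), float(fields["HQ"])
    except (KeyError, ValueError):
        return False
    return q >= min_q and hq <= max_hq


# Hypothetical record: the STR is homozygous, but the left flank carries a SNP.
record = (
    "chr1\t1000\tSTR_1\tACACAC\tACACACAC\t.\t.\t"
    "LFLANKS=GGTCA,GGTTA\t"
    "GT:Q:PQ:LFGT:HQ:PHQ\t"
    "1|1:0.99:0.99:0|1:0.42:0.42"
)
cols = record.split("\t")
info = parse_info(cols[7])
fields = sample_fields(cols[8], cols[9])
left = flank_genotype(info, fields, "LFLANKS", "LFGT")
right = flank_genotype(info, fields, "RFLANKS", "RFGT")
flagged = str_confident_but_haplotype_not(fields)
```

In this made-up record the two left-flank sequences differ even though the STR genotype is homozygous, and the sample would be flagged because Q is high while HQ is low.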

These fields will be output if the option --output-hap-fields is specified, but this option is currently masked from the command line help message pending further testing.
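For scripted runs, the option can simply be appended to an existing invocation. The sketch below builds such a command line in Python. Only --output-hap-fields comes from this issue; the other flags (--bams, --fasta, --regions, --str-vcf) are assumed from HipSTR's standard usage, and the helper name, executable path, and file names are placeholders.

```python
import shlex


def hipstr_command(bams, fasta, regions, out_vcf, hap_fields=True, exe="HipSTR"):
    """Build a HipSTR argument list (hypothetical helper, not part of HipSTR)."""
    args = [
        exe,
        "--bams", ",".join(bams),   # assumed: comma-separated BAM list
        "--fasta", fasta,
        "--regions", regions,
        "--str-vcf", out_vcf,
    ]
    if hap_fields:
        # Hidden option described in this issue.
        args.append("--output-hap-fields")
    return args


cmd = hipstr_command(
    ["sample1.bam", "sample2.bam"], "genome.fa", "str_regions.bed", "calls.vcf.gz"
)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(a) for a in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the paths point at real files.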

In the future, it might be worth expanding this to report the flanking sequences upstream and downstream of the STR as separate VCF records, linked to the STR via some tag such as the STR start position.