tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0

Are panel sequencing or WES use cases for HipSTR? #70

Closed by npatel22526 4 years ago

npatel22526 commented 4 years ago

Hello,

In our experience HipSTR works great with WGS, but has it ever been tested with WES or panel sequencing data? If so, I'm curious to know how well it performs.

Best, Nick

tfwillems commented 4 years ago

Hi @npatel22526,

Apologies for the slow reply regarding your question. Yes, we have tested HipSTR using WES data, and in my opinion it still works quite well. However, there are a few intricacies that make WES calls more challenging than WGS calls:

  1. For large STR indels, WES data can exhibit large capture biases because the capture probes are usually designed based on the reference genome.
  2. WES usually involves a significant PCR amplification step post-capture, which adds substantial stutter noise (reads whose repeat count differs from the true underlying genotype).

HipSTR is designed to deal with issue 2, as it learns a stutter model for each STR locus. Ideally, if you're analyzing WES data, you should jointly genotype a reasonably sized cohort (>20 individuals) so that HipSTR can learn an appropriate stutter model for each locus.

The WES intricacies will also affect several aspects of the HipSTR genotypes, most notably:

  1. AB: Allele bias. The capture bias can skew the ratio of reads supporting each allele, whereas in WGS data we typically observe a 50/50 mix of reads for heterozygous genotypes.
  2. DSTUTTER: Reads with stutter. You will likely observe vastly increased numbers of reads with stutter, particularly for homopolymer repeats.

You'll need to fine-tune your post-genotyping filtering criteria to address these issues.
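As a rough sketch of what such a per-call filter might look like (this is not HipSTR's own filtering script), the Python below checks the stutter fraction and the allele bias for a single sample call. It assumes the call's FORMAT string includes `DP`, `DSTUTTER`, and `AB`, and that `AB` is a log10 p-value where 0 means no bias and more negative values mean more bias. The function names and the thresholds `0.15` and `-2.0` are hypothetical placeholders; for WES data you would likely need to tune them against your own calls.

```python
# Illustrative per-call filter. Field layout and thresholds are assumptions,
# not values taken from HipSTR itself.

def parse_call(format_str, sample_str):
    """Map FORMAT keys to one sample's values for a single VCF record."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def passes_filters(call, max_stutter_frac=0.15, min_allele_bias=-2.0):
    """Return True if the call survives the stutter and allele-bias checks."""
    try:
        depth = int(call["DP"])
        stutter_reads = int(call["DSTUTTER"])
        allele_bias = float(call["AB"])
    except (KeyError, ValueError):
        return False  # missing field or "." value: treat as a no-call
    if depth <= 0:
        return False
    # Fraction of the reads used for genotyping that show stutter.
    if stutter_reads / depth > max_stutter_frac:
        return False
    # AB assumed to be a log10 p-value: more negative means more biased.
    if allele_bias < min_allele_bias:
        return False
    return True


if __name__ == "__main__":
    fmt = "GT:DP:DSTUTTER:AB"
    for sample in ("0|1:40:3:-0.5", "0|1:40:12:-0.5", "0|1:40:3:-3.2"):
        print(sample, passes_filters(parse_call(fmt, sample)))
```

With WES data, one approach is to start from permissive thresholds, look at the distribution of stutter fraction and AB across the cohort, and tighten from there, possibly with separate cutoffs for homopolymers.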

Hope that helps but let me know if you have any additional questions!

npatel22526 commented 4 years ago

Hello,

Thanks for your detailed response, and sorry that I completely missed it. I am mostly interested in mono- and dinucleotide repeats, almost all of which are < 30 bp in length, so I'm hoping to get the best out of our 150 bp reads. I appreciate your suggestions on filtering.

Best, Nick