tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Different positions included as part of STR locus depending on run #80

Closed plavskin closed 3 years ago

plavskin commented 3 years ago

Hi,

Thank you for developing this super useful tool!

I have been running HipSTR on multiple sequencing runs of the same species (so, using the same BED file and reference for all the runs), and I noticed that the part of the genome included in the STR differs for some loci among runs (i.e. the POS and REF allele differ slightly, and thus don't always match what is in the BED file). This also happens (at a smaller number of loci) if I run HipSTR on the same exact data but with different options (e.g. with vs without --def-stutter-model).

Could you provide some insight into this? At what stage in the algorithm does HipSTR decide on the bounds of the STR locus? I am also trying to decide on the best way to reconcile results from multiple sequencing runs which were done with different library prep PCR conditions (so seemingly shouldn't be run through HipSTR together, with the same stutter model, if I understand correctly?)

Thank you!

tfwillems commented 3 years ago

Hi @plavskin,

Thanks for the kind words and hope you're finding HipSTR useful!

The issue you've pointed out is definitely one of the pain-points of comparing HipSTR calls across different runs. Ideally one would perform multi-sample calling to construct a single VCF for all samples of interest. But as you've suggested, the differences in PCR for your different libraries make that approach problematic here as well.

One approach would be to run multi-sample calling using all samples (ignoring the differences in PCR) to generate a crude reference VCF containing alleles for all samples. You could then re-genotype each group of samples with different PCR characteristics using this VCF as input to the --ref-vcf option, which will cause HipSTR to use the VCF alleles as candidate alleles instead of constructing them de novo.

The reason that HipSTR does not just used the exact BED coordinates is that STRs are often slightly incorrectly annotated. As a result, during the initial stages of alignment processing when HipSTR identifies candidate haplotypes, it will slightly expand the STR region to include proximal SNPs and/or indels that are very nearby. If all of these proximal variants don't end up in any of the final genotypes, HipSTR will try to re-shrink back to the original STR BED region prior to reporting the calls. But in instances where they do result in genotype changes, they must be included.

This issue is something that's bothered me for quite some time, but unfortunately I haven't found a parsimonious work-around. Ultimately, if I ever get around to it, I'm hoping to generate a reference VCF from the 1KG high-coverage samples that should capture most STR alleles and will solve this problem (via the --ref-vcf option mentioned above).

Best, Thomas

bharatij commented 3 years ago

Hi Thomas, I am experiencing the same issue as mentioned above. I am wondering if you have generate a reference VCF from the 1KG high-coverage samples. If it is available that will be super helpful. One question, if such reference vcf is used with HipSTR, it will only genotype set of STRs and alleles for that STR in the reference vcf and will not genotype any denovo alleles. So it will be only useful for common STRs expansions and not for denovo expansions, right? Thanks, Bharati