odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

Rare variants with reference scaffold #43

Open JosephLalli opened 1 year ago

JosephLalli commented 1 year ago

Hi there,

I just want to clarify something about the algorithm. When not using a reference panel, does SHAPEIT5 only draw haplotypes from samples that are shared between the scaffold and the main dataset? Or can it draw haplotypes from samples that are present in the scaffold but not in the main dataset?

In other words, if I were trying to phase a VCF with 100 variant calls in samples [A,B,C] onto a scaffold of 50 variant calls from samples [A,B,C,D,E,F,G,H,I], would SHAPEIT5 use the scaffold information from all of [A,B,C,D,E,F,G,H,I] to determine the likely haplotypes? Or would it only use the scaffold records for [A,B,C] and discard the records for [D,E,F,G,H,I]?
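To make the two possibilities concrete, here is a toy sketch of what I mean. It is not SHAPEIT5 code and says nothing about what the tool actually does; the function name `candidate_donors` and the `shared_only` switch are made up for illustration.

```python
# Toy illustration only: hypothetical names, not SHAPEIT5's implementation.
target_samples = ["A", "B", "C"]
scaffold_samples = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]


def candidate_donors(target, scaffold, shared_only):
    """Return the scaffold samples that could contribute conditioning haplotypes.

    shared_only=True  -> only samples present in both target and scaffold.
    shared_only=False -> every scaffold sample, including out-of-sample ones.
    """
    if shared_only:
        target_set = set(target)
        return [s for s in scaffold if s in target_set]
    return list(scaffold)


# Possibility 1: only the overlap [A, B, C] is used.
shared = candidate_donors(target_samples, scaffold_samples, shared_only=True)

# Possibility 2: all nine scaffold samples are used.
everyone = candidate_donors(target_samples, scaffold_samples, shared_only=False)

print(shared)
print(everyone)
```

My question is which of these two donor sets SHAPEIT5 effectively works with.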

Specifically, I'm wondering about this portion of the SHAPEIT5 paper:

> For a specific rare variant, these conditioning haplotypes are chosen so that (1) they belong to samples being locally identical-by-descent (IBD) with the target sample and (2) they are polymorphic at the rare variant (that is, at least a few carry a copy of the minor allele). To comply with the first requirement, SHAPEIT5 uses a positional Burrows-Wheeler transform (PBWT) data structure[23](https://www.nature.com/articles/s41588-023-01415-w#ref-CR23) built on all the scaffold haplotypes at common variants.

Do these scaffold haplotypes need to be present in the input VCF to be considered for this process?
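For reference, this is how I read the two requirements in that passage, written as a toy sketch. It is only my interpretation: brute-force local matching stands in for the PBWT here, and every name (`match_length`, `pick_conditioning`, the haplotype labels, the 0/1 data) is hypothetical rather than taken from SHAPEIT5.

```python
# Toy sketch only: brute-force matching replaces the PBWT, and all names
# and data below are invented for illustration.

def match_length(h1, h2, center):
    """Count consecutive common-variant sites around `center` where h1 == h2."""
    if h1[center] != h2[center]:
        return 0
    left = center
    while left > 0 and h1[left - 1] == h2[left - 1]:
        left -= 1
    right = center
    while right < len(h1) - 1 and h1[right + 1] == h2[right + 1]:
        right += 1
    return right - left + 1


def pick_conditioning(target, scaffold, rare_alleles, center, k):
    """Pick the k scaffold haplotypes with the longest local match to the
    target (a crude proxy for requirement 1, local IBD), then report whether
    the chosen set is polymorphic at the rare variant (requirement 2)."""
    ranked = sorted(
        scaffold,
        key=lambda name: match_length(target, scaffold[name], center),
        reverse=True,
    )
    chosen = ranked[:k]
    carriers = sum(rare_alleles[name] for name in chosen)
    polymorphic = 0 < carriers < len(chosen)
    return chosen, polymorphic


# Scaffold haplotypes at five common variants (0 = ref, 1 = alt).
scaffold = {
    "D_h1": [0, 1, 1, 0, 1],
    "E_h1": [0, 1, 1, 0, 0],
    "F_h1": [1, 0, 0, 1, 1],
}
# Allele carried by each scaffold haplotype at the rare variant.
rare_alleles = {"D_h1": 1, "E_h1": 0, "F_h1": 0}
target = [0, 1, 1, 0, 1]

chosen, polymorphic = pick_conditioning(target, scaffold, rare_alleles, center=2, k=2)
print(chosen, polymorphic)
```

In this sketch the donors D, E and F are all out-of-sample, which is exactly the case I am unsure about.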

This seems particularly important if trying to phase small datasets with singleton alleles, since the singleton algorithm is based on "leverag[ing] IBD sharing patterns between haplotypes...", and learning IBD sharing patterns from out-of-sample haplotypes seems like it could help with that process.

Thanks! -Joe