odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License
89 stars 17 forks source link

whatshap phase set variants #39

Open Ahhgust opened 3 years ago

Ahhgust commented 3 years ago

Hi! First, let me thank you for releasing and maintaining shapeit4, and with setting aside time to contend with what usually amounts to user error on part of the community (which I will likely be guilty of in a second!).

I've been curious about using both statistical and physical phasing to get at the phase of rare variants. Shapeit4 seems to have some... features (that include the inducement of genotyping error from phasing) that caused me to write my own wrapper and clean-up script for it, and in so doing I believe I found a bug in how the phase set genotypes are treated. From what I gather the "known" phase of any genotype in a phase-set is only valid within that set; it may not be valid between sets. However in the few thousand phase-sets I've assessed every single one has the same orientation pre-shapeit4 as well as post shapeit4. This to me seems a bit wrong. At a high level, phasing should seek some path in the possible maternal/paternal orientation of genotypes (in the absence of a phase set), or in a collection of locally phased genotypes from some phase set. Instead what seems to be happening is that the orientation from the phase-set is treated as ground truth for the entire chromosome, which should have the effect of inducing switch errors at the phase-set boundaries as no information (other than that provided by LD) is available to say how the two phase-sets are phased to each other.

Said in another way (in ascii art), if the post whatshap pre shapeit4 vcf file for a single person with two phase sets (PS1 and PS2) is set up as:

                (PS1) (PS2)

HaplotypeA 1 1 0 0 HaplotypeB 0 0 1 1

Then if the phaseset error rate is 0, one would expect that statistical phasing would choose between the 1100 HaplotypeA and the 1111 HaplotypeA as the phase-sets are only stating that the first two sites are in-phase and sites 3 and 4 are in phase (but says nothing about sites 2 and 3, for example). What seems to be happening instead is that only the sites not in phase sets are being phased.

My apologies if this is in error! And thanks for your continued support for the program! -August

odelaneau commented 3 years ago

Hi August,

I'm not 100% sure to understand your problem. But what i can confirm is that S4 is supposed to phase distinct PSs relative to each other, so in your particular case it should indeed chooses between 1100 and 1111 for hapA. Happy to discuss this or any evidence that this is not what it is doing.

Best,

Ahhgust commented 3 years ago

Thanks for getting back to me! For me the issue is pretty simple-- I have whole exome data that I would like to phase. I performed genotyping (GATK) and then physical phasing (whatshap) followed by statistical phasing (shapeit4) with a reference panel (1000 Genomes). I'm concerned about retaining rare variants in particular so the statistical phasing is performed on each individual one at a time (so I minimally drop infinite sites violations), and I wrote a little wrapper script to add back in the sites that were dropped (not found in the 1000 Genomes project). Some dropped sites include sites that were physically phased, so I included an approach to use parsimony (wrt to the phaseset information) to optionally swap the polarity of the genotypes that were phased by whatshap and were dropped by shapeit4 (as the two phases need not correspond). However, I found that the parsimony routine never was used, and when I checked the two phasings (physical and statistical) I do not have a single case where they disagree (e.g., 1|0 vs 0|1), which brings us to this conversation. I'm using bcftools merge to merge the pre-shapeit4 phasing and the post-, but I handchecked dozens of sites and it doesn't appear that bcftools is the source either.

I'm using shapeit4 version 4.1.3 with the following flags:

--map ./GeneticMaps/chr$chrom.b37.gmap.gz --region $chrom --sequencing --seed $chrom --mcmc-iterations 10b,1p,1b,1p,1b,1p,1b,1p,10m --use-PS 0.0001

As well as a scaffold file (bcf from whatshap), a reference panel file (-H, 1000 Genomes genotypes) as well as the input bcf file.

If this is a user error, I'd love to know what mistake I'm making! (That's certainly the easiest and most likely answer!)

Thanks! -August

Ahhgust commented 3 years ago

... And I think I see my error. I'm guessing that including the whatshap calls as a scaffold would generate this behavior? My apologies!