odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License

Phasing with whatshap phased variants #17

Open weishwu opened 4 years ago

weishwu commented 4 years ago

I've phased variants from about 600 samples using whatshap with a pedigree file. About 100 samples in this file are in trios, so the majority of the variants from these samples were phased by whatshap. But the remaining 500 samples have poor phasing due to the lack of trio information (the parents were not sequenced).

I'm trying to use Shapeit4 to do statistical phasing in order to improve the phasing of these 500 non-trio samples. All these 600 samples are from a small isolated village so I'd like to see if the haplotypes resolved from the 100 trio-samples could improve the phasing of the non-trio samples via statistical phasing. The basic assumption is that haplotypes are broadly shared among the samples.

How should I set this up with Shapeit4? Could I just use the whatshap-phased vcf file as the input for Shapeit4, and will Shapeit4 use the phasing information automatically? Or should I prepare a scaffold vcf file using the phased variants from the 100 trio samples and provide it to Shapeit4 using the "--scaffold" parameter? If the latter, how exactly should I prepare this scaffold file? Should I prepare a "scaffold" file or a "reference panel" using these 100 trio samples?

Thanks.

odelaneau commented 4 years ago

Hi,

I'd suggest phasing all samples together (trios or not) using the partially resolved haplotypes as a scaffold (see documentation). For the scaffold, you need an additional vcf/bcf file (specified with --scaffold) with phased hets as 0|1 or 1|0 and unphased variants (e.g. triple hets) as 0/1. PS: please use the latest version, or you might get an "underflow" error.

Best,

weishwu commented 4 years ago

Thanks. Then the scaffold vcf (containing the 100 trio samples) will be a subset of the main input vcf (containing the 100 trio samples and the 500 non-trio samples). Is this OK?

odelaneau commented 4 years ago

Yes. But you need to make sure that only hets phased thanks to trio info are specified as phased (i.e. "|"), not those phased using sequencing reads.
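To illustrate the rule above, here is a minimal Python sketch of how a genotype string could be rewritten for the scaffold file. The function name `scaffold_genotype` and the `phased_by_pedigree` flag are hypothetical; how you decide whether a given het was phased from the pedigree or from reads depends on how whatshap was run and is not covered here.

```python
def scaffold_genotype(gt: str, phased_by_pedigree: bool) -> str:
    """Return the genotype string to write into the scaffold VCF.

    gt: a VCF GT value such as "0|1", "1|0", "0/1" or "1|1".
    phased_by_pedigree: hypothetical flag supplied by the caller; True only
        if this het was phased thanks to trio information.
    """
    sep = "|" if "|" in gt else "/"
    alleles = gt.split(sep)
    if len(alleles) != 2:
        return gt  # missing or haploid call: leave untouched
    a, b = alleles
    is_het = a != b and "." not in (a, b)
    if sep == "|" and is_het and not phased_by_pedigree:
        # Phase did not come from the pedigree (e.g. read-based):
        # write the het as unphased so it is not used as a scaffold.
        return "/".join(sorted((a, b), key=int))
    return gt


print(scaffold_genotype("1|0", True))   # pedigree-phased het is kept as phased
print(scaffold_genotype("1|0", False))  # read-phased het is written as unphased
```

Homozygous and already-unphased genotypes pass through unchanged, since only hets carry phase information.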

weishwu commented 4 years ago

Got it. Thanks!

odelaneau commented 4 years ago

FYI, shapeit4 will treat as scaffolded every genotype that is specified as phased (using "|"). This file therefore needs to contain only genotypes that have been phased thanks to family information. In effect, these genotypes are all in a single fixed phase set.

The main genotype file may contain phase sets, i.e. groups of genotypes that are phased relative to one another. In this case, shapeit4 may correct some of this phasing information when it is too discordant with what you get from statistical (or LD-based) phasing.
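As a rough illustration of what "phase sets" look like in practice, the Python sketch below groups one sample's phased genotypes by the PS value in the VCF FORMAT fields (PS is the standard VCF phase-set key). The function name and the input tuples are made up for the example; real files would be read with a VCF parser.

```python
from collections import defaultdict


def group_by_phase_set(records):
    """Group phased genotypes by their PS (phase set) value.

    records: iterable of (pos, format_string, sample_string) tuples,
        e.g. (100, "GT:PS", "0|1:100"). Illustrative input only.
    Returns a dict mapping PS value -> list of (pos, GT).
    """
    groups = defaultdict(list)
    for pos, fmt, sample in records:
        fields = dict(zip(fmt.split(":"), sample.split(":")))
        gt = fields.get("GT", ".")
        ps = fields.get("PS", ".")
        # Only phased genotypes with a PS value belong to a phase set.
        if "|" in gt and ps != ".":
            groups[ps].append((pos, gt))
    return dict(groups)


example = [
    (100, "GT:PS", "0|1:100"),
    (150, "GT:PS", "1|0:100"),
    (300, "GT:PS", "0/1:."),
    (400, "GT:PS", "0|1:400"),
]
print(group_by_phase_set(example))
```

Genotypes within one phase set are phased relative to each other only; nothing is implied about the phase between different sets.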

Unfortunately, shapeit4 cannot read both layers of information from a single file. Hope this makes sense, and happy to give more details.

weishwu commented 4 years ago

Thanks for the explanation!

weishwu commented 4 years ago

I used only the pedigree-phased trios for the --scaffold parameter, but for the main input vcf I tried two files. Both files contain the variants of one trio; the difference is that the variants are unphased in file A, while file B contains the same variants phased by whatshap with BAM files but without a pedigree (i.e. read-based phasing). I expected read-based phasing to help Shapeit4 phase more variants, but file A turned out to produce more phased variants. I also looked at the fraction of phased variants that agreed with their pedigree-based phasing: file A (unphased) had a slightly better validation rate (90%) than file B (pre-phased by whatshap based on reads; 87%). Why does this happen? Does this mean I should not do read-based pre-phasing before running Shapeit4?
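For reference, one common way to compare a phasing against a pedigree-based truth set is the switch error rate, which does not depend on which haplotype is listed first. The Python sketch below is only an illustration of that metric under simple assumptions (one sample, biallelic sites, genotypes given in position order); it is not necessarily how the 90% and 87% figures above were computed, and the function name is made up.

```python
def switch_error_rate(test_gts, truth_gts):
    """Switch error rate between two phasings of the same sample.

    test_gts, truth_gts: equal-length lists of GT strings in position order.
    Only sites that are phased hets in both lists, with the same pair of
    alleles, are compared. Returns None if fewer than two such sites exist.
    """
    orientations = []
    for test, truth in zip(test_gts, truth_gts):
        if "|" not in test or "|" not in truth:
            continue  # unphased in at least one call set
        t = test.split("|")
        r = truth.split("|")
        if t[0] == t[1]:
            continue  # homozygous: carries no phase information
        if t == r:
            orientations.append(True)    # same haplotype order as truth
        elif t == r[::-1]:
            orientations.append(False)   # flipped relative to truth
        # otherwise the genotypes disagree; skip the site
    if len(orientations) < 2:
        return None
    # A switch is a change of orientation between consecutive compared hets.
    switches = sum(
        1 for prev, cur in zip(orientations, orientations[1:]) if prev != cur
    )
    return switches / (len(orientations) - 1)


test = ["0|1", "1|0", "0|1", "0|1"]
truth = ["0|1", "1|0", "1|0", "1|0"]
print(switch_error_rate(test, truth))  # one switch among three adjacent pairs
```

A phasing that is entirely flipped relative to the truth scores 0 switches here, whereas a site-by-site comparison against P|M genotypes would count every site as a mismatch.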

weishwu commented 4 years ago

The variants used by --scaffold were phased by whatshap with a pedigree file, and they always have the paternal allele in front of the maternal allele (P|M). I ran Shapeit4 with this scaffold and got all the non-trio samples phased. I assume there is no way for Shapeit4 to know how to assign paternal and maternal haplotypes in those samples. If so, how can I combine the phased variants across different chromosomes? Is it possible to run all chromosomes together to get this resolved (since the paternal/maternal separation in the scaffold is consistent between chromosomes)?