odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License

Phasing with relateds as scaffold #60

Closed: psnavais closed this issue 10 months ago

psnavais commented 10 months ago

Hi!

We have a large dataset of related samples (>40K trios, >20K duos) and many more samples that are unrelated. So far, we have been phasing the related samples using the --pedigree flag. When it comes to phasing the remaining unrelated samples (using the previously phased related samples as a scaffold), I have two questions:

  1. Do we need to include the related samples in this set as well?
  2. If so, do we have to include the --pedigree flag as well?

I will be phasing in chunks. I imagine this doesn't affect the "pipeline", except for the ligate step.

Huge thanks for the time and for the neat documentation.

Best, Pol

srubinacci commented 10 months ago

Hi,

I'm not sure I understand what you mean by "using the previously phased related samples as scaffold". Do you mean a reference panel?

In general, if possible, I would phase everyone together rather than proceeding in two steps. The more samples you phase together, the better the phasing. You just need to specify the parental relationships in the pedigree file for the samples that have them.
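A minimal sketch of building such a pedigree file for a mixed cohort of trios, duos and unrelated samples. It assumes the three-column layout (offspring, father, mother) with `NA` standing in for a missing parent, which should be checked against the SHAPEIT5 documentation. The sample IDs and the `pedigree_lines` helper are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# (child, father, mother); a parent that was not genotyped is None.
Family = Tuple[str, Optional[str], Optional[str]]


def pedigree_lines(families: Iterable[Family], missing: str = "NA") -> List[str]:
    """Return one tab-separated line per child that has at least one parent.

    Assumed layout: offspring, father, mother, with `missing` for an
    absent parent. Unrelated samples get no line at all.
    """
    lines = []
    for child, father, mother in families:
        if father is None and mother is None:
            continue  # unrelated sample: nothing to declare
        lines.append("\t".join([child, father or missing, mother or missing]))
    return lines


# Hypothetical sample IDs: one trio, one mother-child duo, one unrelated sample.
families = [
    ("kid1", "dad1", "mom1"),
    ("kid2", None, "mom2"),
    ("solo1", None, None),
]

print("\n".join(pedigree_lines(families)))
```

Unrelated samples stay in the input genotype file but do not appear in the pedigree file, so the same file can be passed unchanged when everyone is phased together.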

Hope this helps

Simone

psnavais commented 10 months ago

That was helpful, thanks!

I got confused by the definition of scaffold. From the documentation I understood that if you had a good set of haplotypes (in our case, those coming from related individuals), then you can use these as a scaffold to phase other samples.

srubinacci commented 10 months ago

They are related terms, and it's perfectly normal to mix them up. Haplotype phasing iteratively phases each sample, refining the current haplotype estimates using those of the other samples in the dataset. In your case, with a large sample size (>40K trios, >20K duos, +X unrelated), you want the software to phase all samples together to get the most out of the phasing. Adding trio/duo information further improves the phasing of the affected samples and, as a consequence, the phasing of all other samples in the dataset.

Going back to the terminology: a scaffold is a subset of variants with accurate phasing. Using a scaffold lets you phase the remaining variants (for the same samples) with the scaffold as a backbone. This is the idea behind phase_common and phase_rare.
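As a toy illustration of "subset of variants, same samples", the sketch below splits a variant list into a scaffold set and a remaining set by minor allele frequency. The variant IDs, the frequencies and the 0.001 cutoff are assumptions for illustration only, not values taken from this thread.

```python
from typing import Dict, List, Tuple


def split_by_maf(
    maf_by_variant: Dict[str, float], threshold: float = 0.001
) -> Tuple[List[str], List[str]]:
    """Split variants into (scaffold, remaining) by minor allele frequency.

    Variants at or above `threshold` form the scaffold (phased first);
    the rest are phased afterwards onto that backbone.
    """
    scaffold = [v for v, maf in maf_by_variant.items() if maf >= threshold]
    remaining = [v for v, maf in maf_by_variant.items() if maf < threshold]
    return scaffold, remaining


# Hypothetical variants and frequencies.
variants = {
    "var_a": 0.21,
    "var_b": 0.004,
    "var_c": 0.0004,
    "var_d": 0.00002,
}

scaffold, remaining = split_by_maf(variants)
print("scaffold:", scaffold)
print("remaining:", remaining)
```

Both sets cover exactly the same samples; only the variants differ, which is what separates a scaffold from a reference panel.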

A reference panel is usually an external set of haplotypes that you provide to the software to help with the phasing. Reference-based phasing is typically used when your sample size is small or when you have reliable phased data at hand (e.g. from the same population).

In your case, you don't seem to need reference-based phasing. You can still use the scaffold approach to phase common and rare variants in two steps (as suggested in the tutorial for large sample sizes).
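A rough sketch of how the two-step workflow might be scripted, with the same pedigree file passed at every step. The binary names, flags, file names and regions below are assumptions based on the SHAPEIT5 tutorials and should be verified against the documentation before use. The code only builds and prints the command lines; it does not run them.

```python
import shlex
from typing import List

PEDIGREE = "cohort.fam"  # hypothetical pedigree file name


def phase_common_cmd(bcf: str, region: str, out: str, threads: int = 8) -> List[str]:
    """Step 1: phase common variants in one chunk (these form the scaffold)."""
    return ["SHAPEIT5_phase_common", "--input", bcf, "--pedigree", PEDIGREE,
            "--filter-maf", "0.001", "--region", region,
            "--output", out, "--thread", str(threads)]


def ligate_cmd(chunk_list: str, out: str, threads: int = 8) -> List[str]:
    """Step 2: ligate the phased common-variant chunks into one scaffold."""
    return ["SHAPEIT5_ligate", "--input", chunk_list, "--pedigree", PEDIGREE,
            "--output", out, "--thread", str(threads), "--index"]


def phase_rare_cmd(bcf: str, scaffold: str, input_region: str,
                   scaffold_region: str, out: str, threads: int = 8) -> List[str]:
    """Step 3: phase rare variants onto the scaffold for the same samples."""
    return ["SHAPEIT5_phase_rare", "--input", bcf, "--scaffold", scaffold,
            "--pedigree", PEDIGREE, "--input-region", input_region,
            "--scaffold-region", scaffold_region,
            "--output", out, "--thread", str(threads)]


# Hypothetical file names and regions.
commands = [
    phase_common_cmd("cohort.chr20.bcf", "20:1-5000000", "common.chunk1.bcf"),
    ligate_cmd("common.chunks.txt", "scaffold.chr20.bcf"),
    phase_rare_cmd("cohort.chr20.bcf", "scaffold.chr20.bcf",
                   "20:1-2000000", "20:1-2500000", "rare.chunk1.bcf"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Related and unrelated samples go through the same commands together; the pedigree file only needs entries for the children in trios and duos.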

Hope this helps,

Simone