rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

How to run pseudo-haploid and haploid #55

Open MartinGiap opened 2 years ago

MartinGiap commented 2 years ago

I saw in the paper that you ran STITCH with 40 rounds, where the first 38 rounds were in pseudo-haploid mode and the final 2 rounds were in diploid mode. I want to ask how to configure STITCH to run an analysis with this combined method?

By the way, I also want to ask about the input. I want to run a GWAS in humans, where both the number of SNPs and the number of samples are huge. If we split our samples and chromosomes into multiple sample subsets and DNA segments and then run these patches separately, does that change the accuracy compared to running everything at once?

rwdavies commented 2 years ago

Hi,

First, apologies, I missed several Issues opened in mid-summer while I was on vacation and didn't see the emails when I came back.

Assuming this issue is still one you're looking at: in general, for humans I would recommend using the new method QUILT that colleagues and I released earlier this year. For nearly all human populations, using a reference panel like 1000 Genomes, QUILT should substantially outperform STITCH, be faster, and not be subject to per-batch biases the way STITCH would be.

If you are going to run STITCH and you run the patches (subsets of samples) separately, it does change the accuracy, though the difference might be small, depending on sample size. As sample size gets larger, the accuracy of STITCH should improve towards a limit. What I would be worried about is the potential for batch effects. Although there are heuristics inside STITCH to try to reach a global maximum, these are not perfect, and in practice STITCH will only reach local maximum solutions. This means there is run-to-run variability, on top of the variability in the samples that make up a run, and both will contribute to the inference of different ancestral haplotypes and hence different imputed results. So by splitting samples into batches, there is the potential to systematically impute variants differently between the batches, due only to batch membership, which can lead to inflation in a GWAS or similar setting. This sort of problem won't happen with QUILT, because each sample is imputed independently of the others using only the reference panel.

As for how to configure STITCH to run in that particular configuration, you would set switchModelIteration = 39 and method = diploid; that runs iterations 1-38 in pseudoHaploid mode, then switches at iteration 39 to run the final two iterations in the slower diploid mode.
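For concreteness, a minimal sketch of such a call in R is below, following the configuration described above. The chromosome, file paths, K, and nGen values are placeholders and not part of the original exchange; only niterations, method, and switchModelIteration reflect the setup discussed here.

```R
library("STITCH")

## Sketch of a STITCH run that uses pseudoHaploid for most iterations,
## then switches to the diploid model for the final iterations.
## All file paths and the K / nGen values are placeholders for illustration.
STITCH(
    chr = "chr20",                   ## region to impute (placeholder)
    bamlist = "bamlist.txt",         ## text file with one BAM path per line (placeholder)
    posfile = "pos.chr20.txt",       ## SNP positions file (placeholder)
    outputdir = "./stitch_output/",
    K = 40,                          ## number of ancestral haplotypes (placeholder)
    nGen = 100,                      ## generations since founding (placeholder)
    niterations = 40,                ## 40 EM iterations in total
    method = "diploid",              ## as described in the reply above
    switchModelIteration = 39        ## iterations 1-38 pseudoHaploid, 39-40 diploid
)
```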

Hope that helps, Robbie