rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

How to run pseudo-haploid and haploid #55

Open MartinGiap opened 2 years ago

MartinGiap commented 2 years ago

I saw in the paper that you ran STITCH with 40 rounds, where the first 38 rounds were in pseudo-haploid mode and the final 2 rounds were in diploid mode. I want to ask how to configure STITCH to run an analysis with this combined method?

By the way, I also want to ask about the input. I want to run a GWAS in humans, where both the number of SNPs and the number of samples are huge. If we split our samples and chromosomes into multiple sample subsets and DNA segments and then run these patches separately, does that change the accuracy compared to running everything at once?

rwdavies commented 2 years ago

Hi,

First, apologies, I missed several Issues opened in mid-summer while I was on vacation and didn't see the emails when I came back.

Assuming this issue is still one you're looking at: in general, for humans I would recommend using the new method QUILT that colleagues and I released earlier this year. For nearly all human populations, using a reference panel like 1000 Genomes, QUILT should substantially outperform STITCH, be faster, and not be subject to per-batch biases the way STITCH would be.

If you are going to run STITCH and you run the patches (subsets of samples) separately, it does change the accuracy, though the difference might be small, depending on sample size. As sample size gets larger, the accuracy of STITCH should improve towards a limit. What I would be worried about is the potential for batch effects. Although there are heuristics inside STITCH to try to reach a global maximum, these are not perfect, and in practice STITCH will only reach local maximum solutions. This means there is run-to-run variability, on top of the variability in the samples that make up a run, and both will contribute to the inference of different ancestral haplotypes and hence different imputed results. So by splitting samples into batches, there is the potential to systematically impute variants differently between the batches, due only to batch membership, which can lead to inflation in a GWAS or similar setting. This sort of problem won't happen with QUILT, because each sample is imputed independently of the others using only the reference panel.

As for how to configure STITCH to run in that particular configuration, you would set switchModelIteration = 39 and method = diploid; that runs iterations 1-38 in pseudoHaploid mode, then switches at iteration 39 to run the final two iterations in the slower diploid mode.
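For concreteness, a minimal sketch of such a call in R is below, following the configuration described above. The chromosome, file paths, K, and nGen values are placeholders and not part of the original exchange; only niterations, method, and switchModelIteration reflect the setup discussed here.

```R
library("STITCH")

## Sketch of a STITCH run that uses pseudoHaploid for most iterations,
## then switches to the diploid model for the final iterations.
## All file paths and the K / nGen values are placeholders for illustration.
STITCH(
    chr = "chr20",                   ## region to impute (placeholder)
    bamlist = "bamlist.txt",         ## text file with one BAM path per line (placeholder)
    posfile = "pos.chr20.txt",       ## SNP positions file (placeholder)
    outputdir = "./stitch_output/",
    K = 40,                          ## number of ancestral haplotypes (placeholder)
    nGen = 100,                      ## generations since founding (placeholder)
    niterations = 40,                ## 40 EM iterations in total
    method = "diploid",              ## as described in the reply above
    switchModelIteration = 39        ## iterations 1-38 pseudoHaploid, 39-40 diploid
)
```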

Hope that helps, Robbie