rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
73 stars 19 forks source link

Some curious observations on STITCH results #81

Open saulpierotti opened 1 year ago

saulpierotti commented 1 year ago

I have some curious observation that maybe you can help me with. I am using STITCH to impute ~1600 fish samples at variable levels of coverage (0.2x to 3x). These samples are F2 individuals from a complex cross setup of from 8 inbred lines (one line crossed to all the others, and some of the remaining lines crossed to each other). The number of individuals per cross varies but is in the n=100 range. I used stitch with 8, 16, 32 ancestral haplotypes (K), and the info score looks better the higher the value of K (see plot). I do not have an external validation.

image

I also run a GWAS on the imputed dosage for each round of imputation, and I noticed that in many loci what is a single peak in the 16 haplotypes model becames 2 peaks in the 8 haplotypes model (as an example see images - first one with K=16, second with K=8).

image image

Do you have any idea how can this happen as a result of the way STITCH imputes the dosages? I also imputed the samples with a custom HMM and the results agree more with the K=16 model overall.

Another curious observation is the following - I was plotting the relationship between imputed dosage and phenotype for a specific locus, split up by the F2 cross membership of the samples. I noticed that at some loci the association signal is very strong, but the actual dosage is almost 0 for all the samples. Digging in, I realized that this is due to the fact that the dosages are all almost 0 but not quite, and the small nudges from 0 that they have perfectly copy the pattern that I observe at other loci in the 0-2 dosage range, but this time in a tiny ~0-0.2 range. Do you have any guess on how can this happen, and how can I avoid such cases from giving a spurious association signal?

I suspected that maybe some haplotype has a non-0 allele at the SNP, and it has a non-0 likelihood and so nudges the dosages a bit in a way that is correlated to the status of some SNP with a true association signal. The first image shows the genotype to phenotype relationship in the 0-2 scale, the second image the same thing for the same locus but zooming in in the genotype axis.

image image

rwdavies commented 1 year ago

Nice writeup and great plots!

First small question, do you know the recombination rate and landscape for this species? Is it similar to mice and humans? There are some heuristics internally which are based on a recombination rate landscape like seen in mice and humans. If things are different (in particular if there are a higher density of recombination hotspots), changing some settings might be useful.

Second question, are the results stable? If you re-impute using K=8 a few times in the same region (and the other K's), do you see similar results? Same with K=16. STITCH has an option "S" to average across multiple runs internally. I hope it still works, I can't quite remember if it's fully functional. I admit I don't normally use it as the performance gains are normally very small.

Third question, can you try re-imputing one of the regions but turn on plot_shuffle_haplotype_attempts? It's a bit slow, but this plot can help determine how well the internal heuristics are working.

The comment about how you see changes in the ~0-0.2 range is interesting. If it's a perfect copy, I wonder if there is something like a duplication (or multiple fold change) in coverage. Do you see anything different about the depth of coverage across all samples at this SNP? Otherwise, in general, variants that impute well, but the dosage is so low suggests to me that two ancestral haplotypes are inaccurately collapsed together. This suggests the internal heuristics aren't working optimally. Another explanation might be if one or more of your original 8 inbred lines weren't actually fully inbred in some genomic locations?

I'm a bit surprised your p-value plots aren't block-ier, given the recent founding. This does make me wonder if recombination rate is high in this species? Your final plot also suggests a weird absence of blocks. You have a primary signal around ~19 Mbp, which then decays over the rest of the chromosome. But you lose signal in some large chunks, around ~15 Mbp. I wonder why that is, maybe the haplotype that the signal is on becomes recently IBD with another founding haplotype so the signal is attenuated? Hopefully not something like an uncorrect genome assembly problem.

Note that you can also impute / get out the imputed founder haplotype using output_haplotype_dosages, which adds an additional entry to the VCF with a vector of posterior expected number of ancestral haplotype copies for each haplotype. Could be another way / a cleaner way of seeing the same signal / confirming the background.

saulpierotti commented 1 year ago

Thanks a lot for the very quick response! I will think about those points and come back here.

saulpierotti commented 1 year ago

I run STITCH on a small region with K=8 to see haplotype shuffling - If I understand the plots correctly it seems that the recombination rate is much higher than the expected alphaMat 21 all s 1 alphaMat 21 normalized s 1 hapSum 21 s 1 hapSum_log 21 s 1 metricsForPostImputationQC 21 sample metricsForPostImputationQCChromosomeWide 21 sample r2 21 goodonly shuffleHaplotypes 4 21 shuffleHaplotypes 8 21 shuffleHaplotypes 12 21 shuffleHaplotypes 16 21

rwdavies commented 1 year ago

So the second one which is the ancestral haplotype frequencies looks pretty good.

There are a lot of INFO = 1 SNPs (I assume monomorphic alt) which you can remove as they just waste time in imputation. There are also a lot of likely false positives which you could probably clean up in a two round imputation (like in the original paper, re-impute using only SNPs at INFO > 0.4 or similar).

I think the heuristics are working OK but they could do better. You can try setting shuffle_bin_radius lower like 1000 or 200 and seeing if they look a bit better. Basically what you want to see in the last set of plots you sent is long runs of continuous colours (these are the posterior probabilities of copying from the ancestral haplotypes). You don't want to see abrupt changes in many samples at the same location, this likely indicates an ancestral haplotype switch error. By setting this value lower STITCH should try to smooth over few points and look for such changes more aggresively. But it's more of an art than a science. Something ideally I'd fix in a future STITCH with a different statistical model