rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Output for ancestral haplotype probabilities? #89

Open jamonterotena opened 7 months ago

jamonterotena commented 7 months ago

Hi,

I have 16-way multiparental populations with 2 RILs per population. The aim is to reconstruct the haploblocks from the 16 founders along the genomes of the RILs. Therefore, I'm not so interested in obtaining imputed genotypes, but rather in the output dosages or haplotype probabilities. My question is whether, and where, I can find some kind of output containing the ancestry probabilities. I checked the output of the example and couldn't find this.

Best regards, Jose

rwdavies commented 7 months ago

I believe you're interested in --output_haplotype_dosages=TRUE. For example, if I run this using the example data from the README page (with K = 4, i.e. 4 ancestral haplotypes):

./STITCH.R --chr=chr19 --bamlist=bamlist.txt --posfile=pos.txt --genfile=gen.txt --outputdir=./ --K=4 --nGen=100 --nCores=1 --output_haplotype_dosages=TRUE

I get a VCF with entries like

GT:GP:DS:HD     ./.:0.141,0.858,0.000:0.859:0.000,0.000,1.031,0.969 

where HD is defined as

##FORMAT=<ID=HD,Number=4,Type=Float,Description="Ancestral haplotype dosages for one through K haplotypes">

So, for instance, for the sample above at that SNP, the ancestral haplotype dosage for the 4 ancestral haps is 0.000,0.000,1.031,0.969, i.e. it thinks there are 1.031 copies of ancestral hap 3 and 0.969 copies of ancestral hap 4, i.e. it thinks the sample is heterozygous for ancestral haps 3 and 4.
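
For reference, here is a minimal R sketch of how one might pull the HD field out of the VCF and rank the founders per SNP (not part of STITCH; the file name is hypothetical, and it assumes the FORMAT field is GT:GP:DS:HD as above):

vcf <- read.table("stitch.output.vcf.gz", comment.char = "#", stringsAsFactors = FALSE)
# column 10 holds the first sample; with FORMAT GT:GP:DS:HD, HD is field 4
hd_string <- sapply(strsplit(vcf[, 10], ":"), `[`, 4)
# one row per SNP, one column per ancestral haplotype
hd <- do.call(rbind, lapply(strsplit(hd_string, ","), as.numeric))
# per SNP, the two ancestral haplotypes with the largest dosage
top2 <- t(apply(hd, 1, function(x) order(x, decreasing = TRUE)[1:2]))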

Let me know if this is what you want, or if you wanted something else. Thanks, Robbie

jamonterotena commented 7 months ago

Thank you for your quick, great explanation :)

I think that's indeed what I'm looking for. It would, however, be hard to postprocess the VCF to simply obtain the most likely founder for every SNP/offspring, something like:

offspring1 3 3 5 16 16 16
offspring2 3 3 5 5 5 5 5

If you have ever tried something like that, could you please share any script for it?

[Attached figure: from Scott et al., 2016]

I would like to emulate what Scott et al., 2016, did with STITCH to produce their figure (c).


rmott commented 7 months ago

Hi, I'm one of the authors on the paper with the figure you show. I did not create the figure but I think it was done as follows.

First, there is a Perl script, vcf2df.pl, to extract the haplotype dosages from the STITCH VCF, available from https://github.com/michaelfscott/DIVERSE_MAGIC_WHEAT

Second, I think the plots are examples of stacked density charts, which can be produced as described here: https://r-graph-gallery.com/135-stacked-density-graph.html
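
As a minimal sketch (with simulated data; in practice, position and founder would come from the extracted dosages), such a chart in R/ggplot2 could look like:

library(ggplot2)
set.seed(1)
# hypothetical input: the most likely founder at each sampled position
df <- data.frame(position = runif(3000, 0, 1e6),
                 founder = factor(sample(1:16, 3000, replace = TRUE)))
ggplot(df, aes(x = position, fill = founder)) +
  geom_density(position = "fill", colour = NA) +
  labs(x = "position (bp)", y = "founder proportion")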

Richard

jamonterotena commented 7 months ago

Thank you Richard, I will follow your instructions. Have a good day.

Jose

rmott commented 7 months ago

Jose, you could also ask Mike Scott (Michael.Scott@uea.ac.uk) exactly how he made the figure

jamonterotena commented 6 months ago

Hello again,

I followed your guidelines and generated files with the founder of highest dosage at every SNP.

The problem is that I am observing unexpectedly large numbers of recombination events, despite setting low expRate values. Can you think of any approach that could help ameliorate this?

I noticed the option outputHaplotypeProbabilities. Do you think this would work better for haplotype reconstruction?

Best regards & thanks for the help provided, Jose

rwdavies commented 6 months ago

The method can struggle and "get stuck" in local minima that have artifactual recombinations between ancestral haps. The method does try to find and resolve these, but it's not a fully rigorous approach.

One approach I use to evaluate this is to turn plot_shuffle_haplotype_attempts on, to get a representation of what the use of ancestral haplotypes looks like, on a per-sample level, for some of the samples. Feel free to post an example back here.

Another approach is to increase the total number of iterations, and to increase the number of heuristic iterations used to find and recover these ancestral haplotype recombination errors (shuffleHaplotypeIterations). So e.g. set 60 iterations and set shuffleHaplotypeIterations = c(4,8,12,16,20,24), or similar. Maybe also disable splitReadIterations.
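
As a sketch, such a run could look like the following in R (paths, chromosome, and nGen are illustrative, not recommendations for your data):

library("STITCH")
STITCH(
    chr = "chrC09",
    bamlist = "bamlist.txt",
    posfile = "pos.txt",
    outputdir = "./",
    K = 16,
    nGen = 100,
    niterations = 60,                                 # more total iterations
    shuffleHaplotypeIterations = c(4, 8, 12, 16, 20, 24),
    splitReadIterations = NA,                         # disable read splitting
    plot_shuffle_haplotype_attempts = TRUE            # per-sample shuffle plots
)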

To confirm, are you using K=16 for your run of STITCH? What's your total N? What does the plot of hapsum look like, does it seem like it's using all K=16?

Robbie

jamonterotena commented 6 months ago

> One approach I use to evaluate this is to turn plot_shuffle_haplotype_attempts on, to get a representation of what the use of ancestral haplotypes looks like, on a per-sample level, for some of the samples. Feel free to post an example back here.

[Attached plot: shuffleHaplotypes 8 chrC09]

Looks really messy. Notice how the expected rate is always far below the calculated rate.

> Another approach is to increase the total number of iterations, and to increase the number of heuristic iterations used to find and recover these ancestral haplotype recombination errors (shuffleHaplotypeIterations). So e.g. set 60 iterations and set shuffleHaplotypeIterations = c(4,8,12,16,20,24), or similar. Maybe also disable splitReadIterations.

Why not shuffleHaplotypeIterations = c(4, 8, 12, 16, 20, 24, 28, 32, 36, ..., 60)? Wouldn't that improve the fitting?

> To confirm, are you using K=16 for your run of STITCH? What's your total N? What does the plot of hapsum look like, does it seem like it's using all K=16?

Yes, I'm using 16 founders. I'm not sure what N means, but this was printed:

slurm-21580929_4.out:[2023-11-28 13:36:03] Split reads, average N=12 (0.008 %)
slurm-21580929_5.out:[2023-11-28 13:37:16] Split reads, average N=10 (0.007 %)
slurm-21580929_6.out:[2023-11-28 13:36:03] Split reads, average N=8 (0.006 %)
slurm-21580929_7.out:[2023-11-28 13:41:50] Split reads, average N=14 (0.008 %)
slurm-21580929_8.out:[2023-11-28 13:33:06] Split reads, average N=9 (0.007 %)
slurm-21580929_9.out:[2023-11-28 13:42:03] Split reads, average N=11 (0.007 %)
slurm-21580930_10.out:[2023-11-28 13:55:31] Split reads, average N=10 (0.007 %)
slurm-21580930_12.out:[2023-11-28 14:08:15] Split reads, average N=16 (0.009 %)
slurm-21580930_14.out:[2023-11-28 14:15:45] Split reads, average N=20 (0.008 %)
slurm-21580930_15.out:[2023-11-28 14:02:56] Split reads, average N=15 (0.011 %)

If N means coverage, then something else is going on, because both the founders and the RILs have high read coverage. If you mean N as in number of individuals, every run takes 16 founders and 2 additional RILs.

[Attached plot: hapSum_log chrC09 s 1]

If colors represent founders, I couldn't see more than 10 founders at the same X-axis position, so perhaps it's skipping founders? It's hard to tell because lines tend to overlap. I can double-check this.

Jose

rwdavies commented 5 months ago

By total N, I meant your total sample size. The output you pasted above, e.g.

slurm-21580930_14.out:[2023-11-28 14:15:45] Split reads, average N=20 (0.008 %)

is in reference to splitReadIterations, which you can in fact just disable (it would be more useful if you were using long reads, like ONT or PacBio). The line above says that 20 reads were split, representing 0.008% of the reads for that sample.

When you say "every run takes 16 founders and 2 additional RILs", how many RILs do you have in total? The ideal way to run STITCH would be with all samples at the same time (i.e. all RILs).

I agree the first plot looks far messier than I would like to see (though the interpretation of this depends on N). There are multiple samples that recombine at the same location. It certainly suggests the heuristics could use some work.

For shuffleHaplotypeIterations, I would recommend turning those off for the last 10 or 20 or so iterations: that process (the haplotype shuffling) also adds a lot of noise, to help find good global results, so running the last 10 or 20 iterations without it helps get rid of that noise.

There should also be a hapSum plot that looks much blockier, a rectangle full of colours?

jamonterotena commented 5 months ago

Hi @rwdavies,

I tried some of the changes you suggested last week, like nIterations = 100, shuffleHaplotypeIterations = seq(4, nIterations - 20, 4) and disabling splitReadIterations, besides some others like increasing maxRate or shuffle_bin_radius. Despite sometimes getting plots with fewer shuffles, it's really hard to reconstruct a realistic mosaic of the haplotypes with the highest dosage, as there are still too many short haploblocks.

Regarding the families, I have 26 of them; in each one, Brassica napus plants developed from 16 inbred founders were crossed through four generations, and 2 RILs were developed by selfing the same individual from the last generation. Therefore, I'm combining the 2 RILs with the 16 founders of the same family for each STITCH run, since I understand that adding unrelated individuals from other families might confuse the program. Coming back to your question, would it be N=18?

Here's the hapSum plot.

[Attached plot: hapSum chrA03 s 1]

Best regards, Jose

jamonterotena commented 5 months ago

Hi again,

I just realized that some misconceptions might have been holding me back.

I have high-coverage data for all 18 samples. However, I was only adding the founders' high-coverage data to the genfile. Also, I was assuming that K = 'number of founders' should match the number of individuals in the genfile. Perhaps using the RILs' high-coverage data would improve haplotype reconstruction.

When I get the haplotype dosages, the founders are numbered from 1 to 16. Do these numbers refer in any way to the actual samples used? In other words, is the genfile data used for more precise imputation, or as the founders for reconstructing haplotypes among the RILs?

Also, given that I have 26 families, does it make sense to use S=26 with K=16?

Sorry if the questions sound stupid. I hope you can shed some light on this.

Best regards, Jose

GoliczGenomeLab commented 5 months ago

Hi again,

When I added only the genotypes of the RILs and no information about the founders, the number of haplotypes was reasonably OK. I think the problem was that I thought STITCH took the samples in the gen file as founders to impute from, but apparently they need to be supplied as a reference. We could consider the issue resolved here.

I started running with reference files (hap, legend, and samples). I'm using a single VCF file for all founders, splitting them by population (16-way MPPs) in the samples file, and running STITCH with a different gen file for each pair of RILs in each population. I'll post a new issue about an error related to running STITCH with reference files.

Best regards & thank you very much for your help, Jose

rwdavies commented 5 months ago

Hi,

OK taking a step back, and re-reading your original message, it's probably good to go back over some things.

The most beneficial use case for STITCH is when you have lots (hundreds to thousands) of low-coverage (<2x) samples without any knowledge of how the samples are related. STITCH then does a pretty good job of imputing.

Here you have 16 inbred founders at high coverage, and for each of 26 populations you have 2 new RILs, with very few intervening generations, and those samples are also at high coverage, and you want to paint those samples as mosaics of the founders, correct?

@rmott might have suggestions, as this is much more up his alley and is a classic problem. STITCH can do this too, though you want to turn off any EM in STITCH for this.

You'd want something like this example: https://github.com/rwdavies/STITCH/blob/master/examples/examples.R#L432. You'd supply the 16 founders as haps in the reference haplotype file (a 16-column file) and set K = 16; you can skip the reference sample file, as that's not needed. Then set niterations = 1 to turn off the EM, which requires you to also set shuffleHaplotypeIterations = NA and refillIterations = NA. You don't need to provide a genfile, though if you want, you can put in the samples you're trying to "impute" (your 2 * 26 RILs), and then the corresponding bam list would have the same samples (the genfile only prints out estimates of accuracy as STITCH progresses; it doesn't affect imputation otherwise). Per our earlier discussion, you'd also want to turn on --output_haplotype_dosages=TRUE.
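
As a sketch (file names hypothetical, nGen illustrative; see the linked examples.R for a working version), that configuration could look like:

library("STITCH")
STITCH(
    chr = "chrA03",
    bamlist = "bamlist.txt",                       # bams for the 2 * 26 RILs
    posfile = "pos.txt",
    outputdir = "./",
    K = 16,
    nGen = 100,
    reference_haplotype_file = "founders.hap.gz",  # 16 columns, one per founder
    reference_legend_file = "founders.legend.gz",
    niterations = 1,                               # turn off the EM
    shuffleHaplotypeIterations = NA,
    refillIterations = NA,
    output_haplotype_dosages = TRUE
)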

Hope that helps, let me know if that makes sense and is useful?

Best, Robbie

P.s. Sorry for my slow reply

jamonterotena commented 5 months ago

Hi,

Thank you for your suggestions.

Here you have 16 inbred founders at high coverage, and for each of 26 populations, have 2 new RIL, with very few intervening generations, and those samples are also at high coverage, and want to paint those samples as mosaics of the founders, correct?

That's correct. Interestingly, I read that Scott et al., 2021, who had a very similar population structure, turned off iterations as you suggested. Besides, they also turned off reference_shuffleHaplotypeIterations. Would you also advise that?

Scott, M. F., Fradgley, N., Bentley, A. R., Brabbs, T., Corke, F., Gardner, K. A., Horsnell, R., Howell, P., Ladejobi, O., Mackay, I. J., Mott, R., & Cockram, J. (2021). Limited haplotype diversity underlies polygenic trait architecture across 70 years of wheat breeding. Genome biology, 22(1), 137. https://doi.org/10.1186/s13059-021-02354-7

I changed the method. Now I called variants jointly among the 16 founders of one family and then used the output VCF files to produce the reference files. I applied exactly the parameters you proposed to STITCH and ran it for this family on the bam files of the RILs (without a genfile). I will keep you updated :)

All best, Jose

jamonterotena commented 5 months ago

Hi again,

After implementing the changes, the plots are beginning to look much better, and the recombination patterns make more sense.

I have one question. I realized that when I provide the sample file, with the population containing 16 individuals, it raises an error if the haplotype file contains 16 columns, as it requires it to have 32. In that case, K must be 32, or the same error is raised. Is there any way of specifying 16 founders and 16 reference samples together, something like declaring the samples homozygous? Is there any advantage to using the samples?

Thanks again, Jose

rwdavies commented 5 months ago

Sorry, the way it's done currently, it assumes that reference haplotypes come from "individuals" that are diploid, whereas you have diploid homozygous chromosomes, so "individuals" isn't quite right; you want to just give the haplotypes once.

You shouldn't need to give a sample file, though, right? That's only needed if you want to do things like exclude populations on the fly, for instance if you were using a 1000 Genomes (humans) dataset, and prepared it once, but wanted to make separate panels for certain subpopulations.
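
For illustration (made-up values, assuming the IMPUTE2-style layout these reference files follow), a single site would then be one row in the legend file

id position a0 a1
chrA03:1001 1001 A G

and one matching row in the hap file, with one 0/1 column per founder haplotype, i.e. the 16 haplotypes given once rather than as 32 columns:

0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0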

Also, sorry for my slow reply