Samples with a large amount of missing sites after imputation

GuyReeves commented 2 years ago

HI

I have been running Stitch on a 5838 sample dataset that is an 11 generation drosophila pedigree, which all originate from 4 founder individuals . While 66% of the samples have < 2% missing sites the remaining 33% have higher missing rates up to 50%. High missing rate appears to be clustered in family trios (but not all siblings have high high missing rates). Does not appear correlated mean genome coverage or depth (see graph).
Do you have any suggestions how to reduce this amount of missing data without compromising quality of imputation too much (the mendelian error rate in most families is low, again high values are again focused particular families).

Thanks

Guy

STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "sample_list5838.txt" , posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 5, nGen = 11, nCores = 10, outputSNPBlockSize = 1000, gridWindowSize = 10000, outputInputInVCFFormat = TRUE, regenerateInput = FALSE, originalRegionName = "chr2L", regenerateInputWithDefaultValues = TRUE)

Ps I have I think phased the four founders would trying QUILT have advantages?

rwdavies commented 2 years ago

Hi Guy,

If there are 4 founders, then the max number of ancestral haplotypes could be is 8, so I would recommend increasing K to that. It looks like you have enough coverage and samples to support it.

Setting gridWindowSize to the default value (I think NA) would make things a bit more accurate but slower.

Related, the recombination rate for Drosophila looks to be a bit higher than humans / mice (what default parameters are set for) (based on a 30 second internet search, e.g. https://petrov.stanford.edu/cgi-bin/recombination-rates_updateR5.pl ). This might have a small difference, but I'd recommend changing expRate to 4.0. Also, you might want to decrease shuffle_bin_radius to something like 500 (this is a smoothing parameter that helps find and repair ancestral haplotypes. Again the default value is set for humans and mice. Here with a higher Ne and higher recombination rate, for this heuristic, you'll want more smoothing as the recombinations take place faster).

STITCH can take in reference haplotypes, and initialize the EM with that. That sounds like a good option for you if you have a good guess for the founding haplotypes. Take a look at the reference_haplotype_file and reference_legend_file options. If you think your inferred founder haplotypes contain no phase switch errors, you can either set niterations to 1 (to impute using only information from the reference haplotypes), or generally to something low like 5 or 10 (which would use the founder haplotypes to initialize, then proceed normally) (you'd probably want to turn off shuffleHaplotypeIterations here). If you think your inferred founder haplotypes might contain phase switch errors, you can set niterations to something like 20, then keep a few shuffleHaplotypeIterations e.g. 4, 8, 12. shuffleHaplotypeIterations looks for places where breaking and re-setting ancestral haplotypes improves the likelihood, i.e. here phase switch errors among founders. I would recommend disabling splitReadIterations and refillIterations if you go with reference haplotypes, especially if you're reasonably sure you have all the founder haplotypes, as they won't be necessary.

Best, Robbie

GuyReeves commented 2 years ago

Dear Robbie

Thanks so much for this. I have tried to implement it. Do I get from this error that the reference legend must be gzip format, is this the only file needed to be in this format? I cannot see that the compressed requirement is in the documentation.

Thanks again

Cheers

Guy

Loading required package: rrbgen [2021-12-08 22:10:04] Running STITCH(chr = chr2L, nGen = 11, posfile = posChr2Lnoindel.txt, K = 5, S = 1, outputdir = /home/reeves/diary_nov_stitch/5838/, nStarts = , tempdir = /tmp/RtmpvjCrQl, bamlist = , cramlist = , sampleNames_file = sample_list5838.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = 1, chrEnd = 23506264, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 1, shuffleHaplotypeIterations = NA, splitReadIterations = FALSE, nCores = 10, expRate = 0.4, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = FALSE, originalRegionName = chr2L, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = NA, downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = /home/reeves/diary_nov_stitch/ref_file/ref_haplotype.txt, reference_legend_file = /home/reeves/diary_nov_stitch/ref_file/ref_legend.txt, reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = TRUE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000) [2021-12-08 22:10:05] Program start [2021-12-08 22:10:05] Get and validate pos and gen [2021-12-08 22:10:05] Done get and validate pos and gen [2021-12-08 22:10:05] Copying files onto tempdir [2021-12-09 00:30:40] Done copying files onto tempdir [2021-12-09 00:30:40] Generate allele count [2021-12-09 01:05:13] Quantiles across SNPs of per-sample depth of coverage [2021-12-09 01:05:13] 5% 25% 50% 75% 95% [2021-12-09 01:05:13] 1.162 1.591 1.953 2.407 3.266 [2021-12-09 01:05:13] Done generating allele count [2021-12-09 01:05:13] Outputting will be done in 232 blocks with on average 998.4 SNPs in them [2021-12-09 01:05:13] Begin parameter initialization [2021-12-09 01:05:14] Begin initializing paramters using reference haplotypes [2021-12-09 01:05:14] Begin get haplotypes from reference [2021-12-09 01:05:14] Load and validate reference legend header [2021-12-09 01:05:15] Load and validate reference legend

gzip: /home/reeves/diary_nov_stitch/ref_file/ref_legend.txt: not in gzip format Error in load_reference_legend(legend_header, reference_legend_file, regionStart, : No SNPs were loaded from the reference_legend_file:/home/reeves/diary_nov_stitch/ref_file/ref_legend.txt for region with regionStart=NA, regionEnd=NA and buffer=NA. Perhaps the reference legend file does not span the region requested? Calls: STITCH ... get_haplotypes_from_reference -> load_reference_legend Execution halted

rwdavies commented 2 years ago

Apologies, I'll add that to the documentation

rwdavies commented 2 years ago

I believe the haplotype, legend and recombination map must be gzipped. I can't remember for which it is optional or not

GuyReeves commented 2 years ago

Hi Robbie

I am getting this error

[2021-12-10 00:12:44] Load and validate reference legend header Error in validate_legend_header(legend_header) : Cannot find 'position' column in reference_legend_file Calls: STITCH ... get_haplotypes_from_reference -> validate_legend_header Execution halted

But as far as I can tell the _reference_legendfile is correctly formatted ref_legend_chr2L.txt.gz

With spaces being the column delineators.

Or have a misunderstood something about the IMPUTE format.

Are there example files I could look at and have missed?

Thanks

Guy

GuyReeves commented 2 years ago

I think the problem was that in the reference_legend_file and the haplotype file I had some INDELS (length is greater than 1bp) when I got rid of them from both files the error disappeared

rwdavies commented 2 years ago

Hi,

For the first error, Cannot find 'position' column in reference_legend_file, the program is complaining because the reference legend file doesn't have a header. Sorry for the inconsistency between the pos file and this file, this file format (the reference legend and header) exactly matches a format used by a different imputation program (IMPUTE2).

If you run through the examples, human example 2 uses a reference panel https://github.com/rwdavies/STITCH/blob/master/examples/examples.R This uses the data in this file STITCH_human_reference_example_2018_07_11.tgz from this link https://www.well.ox.ac.uk/~rwdavies/ancillary/STITCH_human_reference_example_2018_07_11.tgz where the reference legend file looks like this

RT4789ML-CBLG:Downloads robert davies$ gunzip -c 1000GP_Phase3_20.1000000.1100000.legend.gz | head 
id position a0 a1 TYPE AFR AMR EAS EUR SAS ALL
rs192365695:990565:T:C 990565 T C Biallelic_SNP 0 0 0.00297619047619048 0 0 0.000599041533546326
rs117993944:990887:G:C 990887 G C Biallelic_SNP 0 0.0907780979827089 0 0.0198807157057654 0.00511247443762781 0.0175718849840256
rs192998546:991051:G:C 991051 G C Biallelic_SNP 0 0 0.00198412698412698 0 0 0.000399361022364217

where the important columns are position, a0 and a1, and the rest ignored

Separately, yes, it is true that STITCH cannot accomadate indels properly, so they are not allowed in the input files.

Glad (I think) things are working properly for now. I'll see if I can update the README to make this clearer

Best, Robbie

rwdavies commented 2 years ago

Hey,

To make things clearer for future users, I updated the README page. Can you please take a look, and let me know if that would have answered the questions you had when you were working through this?

Thanks Robbie

GuyReeves commented 2 years ago

First of all I would like to say that I am embarrassed that I forgot the header to the .gz reference_legend_file (rsID position a0 a1), I don't know how that happened I thought I had checked and rechecked that.

I read what you wrote in the the README page and it is very clear, thank you very much.

I just want to check even if you are using niterations==1 and nhaps == K,that the output ,vcf should still be considered unphased?

Thanks Guy

GuyReeves commented 2 years ago

HI I wanted to say that I ran settings that you suggested (and using my 8 ref haplotypes ) and it really fixed sites that could not be confidently imputed (see graph above). I also get >98% of my trios having a mendelian error rate below 1%.. So that is great.

I have one remaining question is there a way to get out phased haploytypes in the the vcf? .

Thanks alot

Guy

STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "sample500_list5838.txt" , posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 8, nGen = 11, nCores = 10, originalRegionName = "chr2L", regenerateInput = FALSE, regenerateInputWithDefaultValues = TRUE, expRate = 0.4, niterations = 1, splitReadIterations = FALSE, refillIterations = NA, shuffleHaplotypeIterations = NA, reference_haplotype_file = "ref_haplotype_NP_only_chr2L.txt" , reference_legend_file = "ref_legend_SNP_only_chr2L.txt.gz" )

rwdavies commented 2 years ago

Great! Sorry I didn't build a phaser into STITCH. Probably Beagle or IMPUTE2 would be your best bet on the output. If I had time I'd love to write a new version of STITCH, if and when I do, I'll definitely include phasing.

rwdavies / STITCH

Samples with a large amount of missing sites after imputation #62