Assessing imputation accuracy

kiran-lee commented 8 months ago

Do you have any suggestion for estimating imputation performance, beyond the plots produced? As stated in your README, "(Plots) are not meant to substitute for a more in depth investigation of imputation performance in your setting".

I have imputed ~1900 samples (~3x coverage on average, per sample) aligned to a reference genome for a recently bottlenecked non-model organism. We do not have a high coverage reference panel.

As far as I am aware, most imputation softwares use a “reference panel” of high coverage genomes, which is used to benchmark imputation success. This involves downsampling a panel of high coverage samples, performing imputation and then assessing concordance of original and downsampled genotypes.

I am thinking to downsample all 1900 samples to eg. 1x and 0.1x coverage using the downsampleToCov = argument, imputing and then testing concordance with the original, albeit not-so-great-coverage-to-start-with samples using, for example: https://github.com/LindoNkambule/VCFCompare

I attach the plots for a chromosome (and the parameters used), which I think has generally run well, though I am yet to optimise k and nGen parameter settings.

hapSum_log ENA|OU383783|OU383783 1_RagTag s 1 hapSum ENA|OU383783|OU383783 1_RagTag s 1 metricsForPostImputationQC ENA|OU383783|OU383783 1_RagTag sample metricsForPostImputationQCChromosomeWide ENA|OU383783|OU383783 1_RagTag sample r2 ENA|OU383783|OU383783 1_RagTag goodonly

STITCH(tempdir = tempdir(), chr="ENA|OU383783|OU383783.1_RagTag", bamlist="bamlistrg.txt", posfile="ENA|OU383783.txt", sampleNames_file="bamlist.txt", outputdir= paste0(getwd(), "/"), K=4, nGen=100, nCores=1) [2024-02-28 13:55:38] Running STITCH(chr = ENA|OU383783|OU383783.1_RagTag, nGen = 100, posfile = ENA|OU383783.txt, K = 4, S = 1, outputdir = /mnt/fastdata/bop21kgl/RawData/allplates/, nStarts = , tempdir = /scratch/3496373/RtmpiFUn7y, bamlist = bamlistrg.txt, cramlist = , sampleNames_file = bamlist.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)

Thank you for providing and supporting STITCH.

Kiran

Zilong-Li commented 8 months ago

Hey Kiran,

You can take a look at the vcfcomp function in https://github.com/Zilong-Li/vcfppR. You need to prepare 3 files, i.e imputed vcf, true vcf and a tsv file of allele frequency https://github.com/Zilong-Li/lcWGS-imputation-workflow/blob/main/config/af.tsv.

Then you can calculate correlation (r2) between the imputed dosage and the truth genotypes via

vcfcomp(test = "imputed.vcf.gz", truth = "true.vcf.gz", af = "af.tsv", stats = "r2")

You need to install the latest vcfppR by remotes::install_github("Zilong-Li/vcfppR"), since it is not yet on CRAN.

Hope it helps.

kiran-lee commented 8 months ago

Hi Zilong-Li,

Is the vcfcomp correlation (r2) different from the plot produced in STITCH ? It seems like they report the same r2.

Zilong-Li commented 8 months ago

Ah, sorry. I thought you have high coverage sequencing samples that can be used as true genotypes. If you only have maximum 3x samples, then you can only filter the variants based on INFO score (say >0.8) , Minor Allele Frequency (>0.05) and so on.

kiran-lee commented 8 months ago

Unfortunately we do not have high coverage samples. This might be a good compromise to not having a high coverage reference panel? : 1) downsample the 25 best coverage (mean >8x) samples to 1x 2) impute these downsamples samples with all the other ~1900 samples for a single chromosome 3) compare r2 of the imputed downsampled samples to their "true" genotypes at 8x mean coverage

rwdavies / STITCH

Assessing imputation accuracy #95