Closed kiran-lee closed 4 months ago
Hey Kiran,
You can take a look at the vcfcomp
function in https://github.com/Zilong-Li/vcfppR. You need to prepare 3 files, i.e imputed vcf, true vcf and a tsv file of allele frequency https://github.com/Zilong-Li/lcWGS-imputation-workflow/blob/main/config/af.tsv.
Then you can calculate correlation (r2) between the imputed dosage and the truth genotypes via
vcfcomp(test = "imputed.vcf.gz", truth = "true.vcf.gz", af = "af.tsv", stats = "r2")
You need to install the latest vcfppR by remotes::install_github("Zilong-Li/vcfppR")
, since it is not yet on CRAN.
Hope it helps.
Hi Zilong-Li,
Is the vcfcomp correlation (r2) different from the plot produced in STITCH ? It seems like they report the same r2.
Ah, sorry. I thought you have high coverage sequencing samples that can be used as true genotypes. If you only have maximum 3x samples, then you can only filter the variants based on INFO score (say >0.8) , Minor Allele Frequency (>0.05) and so on.
Unfortunately we do not have high coverage samples. This might be a good compromise to not having a high coverage reference panel? : 1) downsample the 25 best coverage (mean >8x) samples to 1x 2) impute these downsamples samples with all the other ~1900 samples for a single chromosome 3) compare r2 of the imputed downsampled samples to their "true" genotypes at 8x mean coverage
Do you have any suggestion for estimating imputation performance, beyond the plots produced? As stated in your README, "(Plots) are not meant to substitute for a more in depth investigation of imputation performance in your setting".
I have imputed ~1900 samples (~3x coverage on average, per sample) aligned to a reference genome for a recently bottlenecked non-model organism. We do not have a high coverage reference panel.
As far as I am aware, most imputation softwares use a “reference panel” of high coverage genomes, which is used to benchmark imputation success. This involves downsampling a panel of high coverage samples, performing imputation and then assessing concordance of original and downsampled genotypes.
I am thinking to downsample all 1900 samples to eg. 1x and 0.1x coverage using the downsampleToCov = argument, imputing and then testing concordance with the original, albeit not-so-great-coverage-to-start-with samples using, for example: https://github.com/LindoNkambule/VCFCompare
I attach the plots for a chromosome (and the parameters used), which I think has generally run well, though I am yet to optimise k and nGen parameter settings.
STITCH(tempdir = tempdir(), chr="ENA|OU383783|OU383783.1_RagTag", bamlist="bamlistrg.txt", posfile="ENA|OU383783.txt", sampleNames_file="bamlist.txt", outputdir= paste0(getwd(), "/"), K=4, nGen=100, nCores=1) [2024-02-28 13:55:38] Running STITCH(chr = ENA|OU383783|OU383783.1_RagTag, nGen = 100, posfile = ENA|OU383783.txt, K = 4, S = 1, outputdir = /mnt/fastdata/bop21kgl/RawData/allplates/, nStarts = , tempdir = /scratch/3496373/RtmpiFUn7y, bamlist = bamlistrg.txt, cramlist = , sampleNames_file = bamlist.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE)
Thank you for providing and supporting STITCH.
Kiran