rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
73 stars 19 forks source link

Is there a way to avoid having to Load and convert BAMs..... each run #60

Open GuyReeves opened 2 years ago

GuyReeves commented 2 years ago

HI

I was wondering if there is a way to avoid having to Load and convert BAMs each run which as I understand it creates the files in the /input and /Rdata folders. The reason I ask is if I want to try different values of "K" can't I just skip the loading / generating steps?

Thanks

Guy

rwdavies commented 2 years ago

Hi Guy,

Yup, see option --regenerateInput, in particular, you can set it to FALSE if you're using the same region name (chr, regionStart, regionEnd), and it will use the processed input files in input/ (and I believe the sample names from a file in RData/). It will over-write output otherwise, so one recommendation I would have would be to rsync the previous outputdir to a new location, then run your new K there, then rsync back, probably to some separate master folder, the output you care about (e.g. the VCF).

Best, Robbie

GuyReeves commented 2 years ago

Thanks that is great and very useful.

I read in other posts (see quote below) there is advice about how to select an optimal value for K, can you please point me to this as I cannot find it. Thanks Guy

" I've run the program, varying K and number of generations as suggested, and am now evaluating the output. It seems that the mean score, and number of sites with scores > 0.4 increase as K increases from"

rwdavies commented 2 years ago

I have some information on the main page which is probably a good place to start https://github.com/rwdavies/STITCH#note-on-the-selection-of-k-and-ngen

GuyReeves commented 2 years ago

Hi

I am trying to use --regenerateInput

Was a bit confused about settings think I have it now

library(STITCH) STITCH(tempdir = tempdir(), chrStart = 1, chrEnd = 23506264, chr = "chr2L", sampleNames_file = "sample_list5838.txt" , posfile = "posChr2Lnoindel.txt", outputdir = paste0(getwd(), "/"), K = 5, nGen = 11, nCores = 10, outputSNPBlockSize = 1000, gridWindowSize = 10000, outputInputInVCFFormat = TRUE, regenerateInput = FALSE, originalRegionName = "chr2L", regenerateInputWithDefaultValues = TRUE)

rwdavies commented 2 years ago

OK so is it working as expected now? If not clear I can make a demonstration test case

GuyReeves commented 2 years ago

if I include outputInputInVCFFormat = TRUE. this means that only the inputVCF file will be generated i.e. there is no imputation. I need to leave this out if want imputation iterations to occur correct?

Thanks

Guy

Loading required package: parallel Loading required package: rrbgen [2021-12-01 12:20:25] Running STITCH(chr = chr2L, nGen = 11, posfile = posChr2Lnoindel.txt, K = 5, S = 1, outputdir = /home/reeves/diary_nov_stitch/k_5list5838/, nStarts = , tempdir = /tmp/Rtmp73QwX2, bamlist = , cramlist = , sampleNames_file = sample_list5838.txt, reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = TRUE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = 1, chrEnd = 23506264, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 10, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = FALSE, originalRegionName = chr2L, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 1000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = TRUE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = 10000, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000) [2021-12-01 12:20:25] Program start [2021-12-01 12:20:25] Get and validate pos and gen [2021-12-01 12:20:26] Done get and validate pos and gen [2021-12-01 12:20:26] Copying files onto tempdir [2021-12-01 13:32:18] Done copying files onto tempdir [2021-12-01 13:32:18] Generate allele count [2021-12-01 14:17:18] Quantiles across SNPs of per-sample depth of coverage [2021-12-01 14:17:18] 5% 25% 50% 75% 95% [2021-12-01 14:17:18] 1.162 1.591 1.953 2.407 3.266 [2021-12-01 14:17:18] Done generating allele count [2021-12-01 14:17:18] Prepare data to use to build vcf from input [2021-12-01 14:32:57] Write block of VCF for samples:1-584 [2021-12-01 14:36:58] Write block of VCF for samples:585-1168 [2021-12-01 14:40:26] Write block of VCF for samples:1169-1752 [2021-12-01 14:43:38] Write block of VCF for samples:1753-2335 [2021-12-01 14:47:06] Write block of VCF for samples:2336-2919 [2021-12-01 14:50:51] Write block of VCF for samples:2920-3503 [2021-12-01 14:54:59] Write block of VCF for samples:3504-4086 [2021-12-01 15:02:26] Write block of VCF for samples:4087-4670 [2021-12-01 15:15:16] Write block of VCF for samples:4671-5254 [2021-12-01 15:56:43] Write block of VCF for samples:5255-5838 [2021-12-01 16:31:53] Build VCF from input [2021-12-01 16:39:16] Done building VCF from input

rwdavies commented 2 years ago

Yes, that is to just make an output VCF that only contains values informed by the input information, with no imputation. Useful in particular if you want to compare results vs say Beagle which takes VCF input. I think that was why I wrote that option

GuyReeves commented 2 years ago

Hi Can I check that if I run with --regenerateInput =' False the files in the input/ and RData/ folders do not use any parameters specified in the command, and that these two folders can be reused in any other combination of commands including e.g. -reference_haplotype_file...

Thanks Guy