rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
74 stars 19 forks source link

Reproducible Error for some data #16

Closed suestring7 closed 5 years ago

suestring7 commented 5 years ago

Hi Rob,

Recently I encountered the following error for some of my data, which seems to generate an NA that cause the problem?

Error in if (sum(how_many_cols_below_0) > 0) { : missing value where TRUE/FALSE needed Calls: STITCH -> completeSampleIteration -> calculate_updates Execution halted

Do you have some thoughts on what may happen to these data?

I am running with STITCH Version: 1.5.5

Thank you for any help, Yuan

rwdavies commented 5 years ago

Yuck, I haven't seen a problem like this in a while. It's not immediately clear what would cause this, although this reminds me of previous problems where high coverage causing C++ overflow manifest around this stage, so that would be my first thought. Can you share the screen output log? And a description of the situation, e.g., is this uniform low coverage WGS, what's the total N, what's average coverage, are there higher coverage regions or samples.

suestring7 commented 5 years ago

Actually, I've got two types of error, I thought the first one should be reproducible... And the second one is just out of memory kind of thing? But this morning, I found one of the data with the first error just completes its run without halted...(It would be the third or fourth try.) So it might not be that reproducible.

Sample Error log data id
Error in if (sum(how_many_cols_below_0) > 0) { :  missing value where TRUE/FALSE neededCalls: STITCH -> completeSampleIteration -> calculate_updatesExecution halted 19.01 2.01 8b.01 9.01
Error in forwardBackwardDiploid(sampleReads = sampleReads, nReads = as.integer(length(sampleReads)),  :  std::bad_allocError in check_mclapply_OK(single_iteration_results) :  An error occured during STITCH. The first such error is aboveCalls: STITCH -> completeSampleIteration -> check_mclapply_OKIn addition: Warning message:In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration,  :  scheduled cores 17, 28, 10, 20, 13, 11, 25 encountered errors in user code, all values of the jobs will be affectedExecution halted 4.01 4.3 6.01 9.05

A sample of full screen output:

[2019-05-13 21:37:00] Running STITCH(chr = Chr9, nGen = 120, posfile = ../Imputation/pos-chr9.01.txt, K = 8, outputdir = chr9t.01, tempdir = NA, bamlist = realbamlist.txt, cramlist = , sampleNames_file = , reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 32, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 10, output_filename = NULL, initial_min_hapProb = 0.4, initial_max_hapProb = 0.6, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = TRUE) [2019-05-13 21:37:00] Program start [2019-05-13 21:37:00] Get and validate pos and gen [2019-05-13 21:37:10] Done get and validate pos and gen [2019-05-13 21:37:10] Get BAM sample names [2019-05-13 21:37:11] Done getting BAM sample names [2019-05-13 21:37:11] Generate inputs [2019-05-13 21:38:28] Load and convert BAM 100 of 385 ... ... ( I omit some downsample log here) [2019-05-13 22:17:48] downsample sample 20296 - 4250 of 296848 reads removed [2019-05-13 22:17:49] Done generating inputs [2019-05-13 22:17:49] Copying files onto tempdir [2019-05-13 22:22:44] Done copying files onto tempdir [2019-05-13 22:22:44] Generate allele count [2019-05-13 22:24:45] Quantiles across SNPs of per-sample depth of coverage [2019-05-13 22:24:45] 5% 25% 50% 75% 95% [2019-05-13 22:24:45] 0.683 0.927 1.049 1.168 1.371 [2019-05-13 22:24:45] Done generating allele count [2019-05-13 22:24:53] Outputting will be done in 195 blocks with on average 9981.1 SNPs in them [2019-05-13 22:24:53] Begin parameter initialization [2019-05-13 22:24:53] Done parameter initialization [2019-05-13 22:24:53] Begin EM [2019-05-13 22:24:53] Number of samples: 385 [2019-05-13 22:24:53] Number of SNPs: 1946313 [2019-05-13 22:24:53] Start of iteration 1 [2019-05-13 22:28:54] Start of iteration 2 [2019-05-13 22:32:44] Start of iteration 3 [2019-05-13 22:38:13] Start of iteration 4 [2019-05-13 22:42:12] Shuffle haplotypes - Iteration 4 - change 2888 intervals out of 5912 considered [2019-05-13 22:42:21] Start of iteration 5 [2019-05-13 22:46:12] Start of iteration 6 [2019-05-13 22:50:02] Iteration - 6 - refill infrequently used haplotypes [2019-05-13 22:55:12] Refill infrequently used haplotypes - 24% of regions replaced [2019-05-13 22:55:12] Start of iteration 7 [2019-05-13 23:00:50] Start of iteration 8 [2019-05-13 23:04:45] Shuffle haplotypes - Iteration 8 - change 2026 intervals out of 5735 considered [2019-05-13 23:04:53] Start of iteration 9 [2019-05-13 23:08:51] Start of iteration 10 [2019-05-13 23:12:59] Iteration - 10 - refill infrequently used haplotypes [2019-05-13 23:16:14] Refill infrequently used haplotypes - 19.5% of regions replaced [2019-05-13 23:16:14] Start of iteration 11 Error in if (sum(how_many_cols_below_0) > 0) { : missing value where TRUE/FALSE needed Calls: STITCH -> completeSampleIteration -> calculate_updates Execution halted

N=385, but the average coverage histogram of the samples: image

rwdavies commented 5 years ago

Thanks for the additional information. It's still not 100% clear to me what the cause of the first problem is, but I now suspect a few things, listed in descending order of probability. I think it's probably the first or second option Option 1), most probable, out of RAM, manifesting in this weird error. Internally, in the standard diploid method, when running the HMM, there are matrices of size nSNPs K K, which here would be (1946313 8 8) * 8 / 1024 / 1024 ~= 950 MB each. There are a few such matrices, and with nCores = 32, each of them would be copied 32 times. So this could easily use >100GB of RAM. Is that supported on this machine? The way to tell if this is to try re-running this in windows (e.g. 5 mbp) and see if the problem goes away. Alternatively, try giving the machine more RAM, or running with fewer cores. If this is the case, I need to work to better capture this error, and print a better error message. Option 2), something in refilling infrequency used haplotypes is going wrong here. You could try setting refillIterations = NA to see if this consistently fixes the problem (if this does fix the problem, there is a bug in the code which I would then hope to isolate and fix) Option 3), the coverage at some loci, while being reasonable overall, still causes overflow problems. You could try setting downsampleToCov to a lower value, like 20. This feels like an unlikely cause of the problem though

suestring7 commented 5 years ago

Hi Rob,

Thank you for the detailed list. I am not quite sure what do you mean by windows in the first option? Is that the gridWindowSize or something else? I am trying the other two options.

rwdavies commented 5 years ago

No, for the first option, I mean running in windows / regions / chunks of the genome, i.e. instead of running an entire chromosome at once, run once from chr9, 1-5,000,000bp with a 250,000bp buffer, then run chr9 5,000,001-10,000,000 with 250,000bp buffer, etc

suestring7 commented 5 years ago

I see. But would that affect the result? I thought the imputation based on the context.

rwdavies commented 5 years ago

If buffer exceeds the distance over which you expect LD to deteriorate, you should be OK. See this answer for some more details https://github.com/rwdavies/STITCH/issues/2#issuecomment-290164254

suestring7 commented 5 years ago

Cool, I see what buffer mean here.

suestring7 commented 5 years ago

Hi Rob,

I've tried Option 1 and 2, both works. (But I've only tried 1 for the first 5Mb... Not sure if anything could be wrong for the sequence left. And still some data had bad_alloc errors for 2 which I could solve by giving more memory to it) Also, what does the refillIteration do... Would it affect the result by putting NA instead of the default?

Thank you, Yuan

rwdavies commented 5 years ago

The "refillIteration" is a heuristic which tries to identify infrequently used ancestral haplotypes and suggest alternate haplotypes that might be more frequently used, in an effort to lead to fewer jumps among sample haplotypes between ancestral haplotypes. I don't think it's very effective, I remember it having some impact when I originally developed STITCH, but more recently in testing, it's not added much. It's fine to disable

Good in general though? Sounds like usinng buffer / adding RAM leads to more consistent run throughs?

suestring7 commented 5 years ago

Yes! Thanks a lot, Rob. I now use the pipeline with "refillIteration=NA" and adding RAM/lower threads. It's quite stable now.