error in priorSum_m[, s]: incorrect number of dimensions

pdimens commented 2 weeks ago

Hello, I've been running stitch a bunch and this is the first time I'm seeing this particular error:

...
[2024-06-25 09:56:53] Done generating inputs
[2024-06-25 09:56:53] Copying files onto tempdir
[2024-06-25 09:56:57] Done copying files onto tempdir
[2024-06-25 09:56:57] Generate allele count
[2024-06-25 09:57:04] Quantiles across SNPs of per-sample depth of coverage
[2024-06-25 09:57:04]     5%   25%   50%   75%   95%
[2024-06-25 09:57:04]  0.401 0.896 1.086 1.216 1.442
[2024-06-25 09:57:04] Done generating allele count
[2024-06-25 09:57:04] Outputting will be done in 21 blocks with on average 9833.9 SNPs in them
[2024-06-25 09:57:04] Begin parameter initialization
[2024-06-25 09:57:07] Done parameter initialization
[2024-06-25 09:57:07] Start EM
[2024-06-25 09:57:07] Number of samples: 384
[2024-06-25 09:57:07] Number of SNPs: 206512
[2024-06-25 09:57:07] Start of iteration 1
[2024-06-25 10:31:12] Start of iteration 2
[2024-06-25 11:07:39] Start of iteration 3
[2024-06-25 11:42:08] Start of iteration 4
Error in priorSum_m[, s] : incorrect number of dimensions
Calls: do.call ... <Anonymous> -> completeSampleIteration -> calculate_updates
In addition: Warning message:
In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration,  :
  scheduled cores 2, 12 did not deliver results, all values of the jobs will be affected
Execution halted

Is this something you have seen before?

For context, this is how STITCH was run:

[2024-06-25 09:55:38] Running STITCH(chr = CM031720.1, nGen = 100, posfile = Impute/workflow/input/stitch/CM031720.1.stitch, K = 35, S = 10, outputdir = /local/workdir/pd348/shad_haplotag/Impute/modeldiploid_usebxTrue_bxlimit100000_k35_s10_ngen100/contigs/CM031720.1, nStarts = , tempdir = /local/workdir/pd348/shad_haplotag/Impute/modeldiploid_usebxTrue_bxlimit100000_k35_s10_ngen100/contigs/CM031720.1/tmp, bamlist = Impute/workflow/input/samples.list, cramlist = , sampleNames_file = , reference = , genfile = , method = diploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = NA, nCores = 20, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = 39, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = CM031720.1.vcf.gz, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 1e+05)

rwdavies commented 2 weeks ago

Offhand, no, I don't think I've seen this before. Did you try re-running it and see if it worked? Does it seem to be happening consistently for a given region?

I often see weird errors driven by running out of memory in a multicore job but those error messages look different normally. Any chance that could be an issue here?

pdimens commented 1 week ago

Still not sure what caused this, but my guess is something to do with the container it ran in.

pdimens commented 1 week ago

Nope, I'm getting this error consistently for these data. Is there some kind of debugging information I can provide to help diagnose this?

Zilong-Li commented 1 week ago

Hi,

For debugging, please first show the sessionInfo().

I would also suggest

turn off multithreading, set nCores=1
use default options as possible as you can
try different regions

I am guessing there probably are NAs in priorSum_m returned in HMM for your data after running few iterations.

pdimens commented 1 week ago

@Zilong-Li here is the session info. In the meantime, I'm running it with a single core. Data's big, might take a while.

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Rocky Linux 8.6 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /lustre1/home2/nt246_0001/pd348/.snakemake/conda/cc0299658c70184c6a34b389b25ac216_/lib/libopenblasp-r0.3.27.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] STITCH_1.6.8 rrbgen_0.0.6

loaded via a namespace (and not attached):
[1] compiler_4.2.3 Rcpp_1.0.12

rwdavies commented 5 days ago

Can you re-try setting S=1, while S > 1 should be supported, perhaps there are some weird edge case going on?

Maybe after that, can you re-try with the latest version of STITCH, just in case that fixes it?

Since it crashed on iteration 4, and that is when the "shuffle haplotype" heuristic comes on, you might also want to change this shuffleHaplotypeIterations = c(4, 8, 12, 16) to something like shuffleHaplotypeIterations = c(8, 12, 16) and see what happens

rwdavies / STITCH

error in priorSum_m[, s]: incorrect number of dimensions #98