rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
76 stars 17 forks source link

STITCH crashed #84

Open rmott opened 1 year ago

rmott commented 1 year ago

I've just run stitch on 474 recombinant inbred arabidopsis genomes. As test case I am using just the first 1Mb of cars (about 11k SNP sites), with K=19 (the number of founders of the population), without a reference panel. It crashes after 26 iterations with a message:

error: Mat::operator(): index out of bounds

Here is the output log:

source("stitch.R") [2023-07-12 15:05:11] Running STITCH(chr = 1, nGen = 7, posfile = chr1.pos.txt, K = 19, S = 1, outputdir = ./tair10.STITCH/, nStarts = , tempdir = NA, bamlist = bamfiles.markdup.reduced.list, cramlist = , sampleNames_file = , reference = , genfile = , method = pseudoH aploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = 1, regionEnd = 1000000, buffer = 1000, maxDifferenceBetweenReads = 1000, maxEmissi onMatrixDifference = 10000000000, alphaMatThreshold = 0.0001, emissionThreshold = 0.0001, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 5, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = NA, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, genetic_map_file = , reference_haplotype_file = , reference_legend_file = , referen ce_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FAL SE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE, use_bx_tag = TRUE, bxTagUpperLimit = 50000) [2023-07-12 15:05:11] Program start [2023-07-12 15:05:11] Get and validate pos and gen [2023-07-12 15:05:12] Done get and validate pos and gen [2023-07-12 15:05:12] There are 0 variants in the left buffer region -999 <= position < 1 [2023-07-12 15:05:12] There are 11406 variants in the central region 1 <= position <= 1000000 [2023-07-12 15:05:12] There are 15 variants in the right buffer region 1000000 < position <= 1001000 [2023-07-12 15:05:12] Get BAM sample names [2023-07-12 15:05:13] Done getting BAM sample names [2023-07-12 15:05:13] Generate inputs [2023-07-12 15:05:17] Load and convert BAM 100 of 474 [2023-07-12 15:05:20] downsample sample A77622_1 - 158 of 26132 reads removed [2023-07-12 15:05:21] downsample sample A77910_1 - 84 of 20825 reads removed [2023-07-12 15:05:22] downsample sample A77529_1 - 209 of 30861 reads removed [2023-07-12 15:05:22] Load and convert BAM 200 of 474 [2023-07-12 15:05:23] downsample sample A77530_1 - 138 of 24400 reads removed [2023-07-12 15:05:27] downsample sample A77916_1 - 95 of 24217 reads removed [2023-07-12 15:05:28] Load and convert BAM 300 of 474 [2023-07-12 15:05:30] downsample sample A77536_1 - 8 of 30203 reads removed [2023-07-12 15:05:35] Load and convert BAM 400 of 474 [2023-07-12 15:05:43] downsample sample A77740_1 - 147 of 28398 reads removed [2023-07-12 15:05:43] downsample sample A77930_1 - 62 of 48928 reads removed [2023-07-12 15:05:51] downsample sample A77937_1 - 189 of 30677 reads removed [2023-07-12 15:05:52] downsample sample A77752_1 - 75 of 22978 reads removed [2023-07-12 15:05:55] downsample sample A77660_1 - 5 of 25248 reads removed [2023-07-12 15:05:58] downsample sample A77759_1 - 71 of 21445 reads removed [2023-07-12 15:06:01] downsample sample A77950_1 - 112 of 24604 reads removed [2023-07-12 15:06:04] downsample sample A77669_1 - 133 of 24651 reads removed [2023-07-12 15:06:20] downsample sample A77972_1 - 12 of 25401 reads removed [2023-07-12 15:06:23] downsample sample A77690_1 - 7 of 27502 reads removed [2023-07-12 15:06:28] downsample sample A77696_1 - 50 of 21002 reads removed [2023-07-12 15:06:29] downsample sample A77794_1 - 119 of 22716 reads removed [2023-07-12 15:06:31] downsample sample A77890_1 - 75 of 19710 reads removed [2023-07-12 15:06:33] downsample sample A77986_1 - 126 of 25146 reads removed [2023-07-12 15:06:37] downsample sample A77991_1 - 111 of 24507 reads removed [2023-07-12 15:06:38] downsample sample A77614_1 - 79 of 19718 reads removed [2023-07-12 15:06:41] downsample sample A77901_1 - 109 of 22712 reads removed [2023-07-12 15:06:44] Done generating inputs [2023-07-12 15:06:44] Copying files onto tempdir [2023-07-12 15:06:53] Done copying files onto tempdir [2023-07-12 15:06:53] Generate allele count [2023-07-12 15:07:02] Quantiles across SNPs of per-sample depth of coverage [2023-07-12 15:07:02] 5% 25% 50% 75% 95% [2023-07-12 15:07:02] 3.816 5.908 7.569 9.315 12.293 [2023-07-12 15:07:02] Done generating allele count [2023-07-12 15:07:02] Outputting will be done in 2 blocks with on average 5703 SNPs in them [2023-07-12 15:07:02] Begin parameter initialization [2023-07-12 15:07:02] Done parameter initialization [2023-07-12 15:07:02] Initialize readProbs [2023-07-12 15:07:11] Done initializing readProbs [2023-07-12 15:07:11] Start EM [2023-07-12 15:07:11] Number of samples: 474 [2023-07-12 15:07:11] Number of SNPs: 11421 [2023-07-12 15:07:11] Start of iteration 1 [2023-07-12 15:07:27] Start of iteration 2 [2023-07-12 15:07:43] Start of iteration 3 [2023-07-12 15:07:59] Start of iteration 4 [2023-07-12 15:08:16] Shuffle haplotypes - Iteration 4 - change on average 11 intervals out of 11 considered [2023-07-12 15:08:16] Start of iteration 5 [2023-07-12 15:08:33] Start of iteration 6 [2023-07-12 15:08:49] Iteration - 6 - refill infrequently used haplotypes [2023-07-12 15:08:49] Refill infrequently used haplotypes - on average, 29.6% of regions replaced [2023-07-12 15:08:49] Start of iteration 7 [2023-07-12 15:09:05] Start of iteration 8 [2023-07-12 15:09:22] Shuffle haplotypes - Iteration 8 - change on average 17 intervals out of 17 considered [2023-07-12 15:09:22] Start of iteration 9 [2023-07-12 15:09:39] Start of iteration 10 [2023-07-12 15:09:55] Iteration - 10 - refill infrequently used haplotypes [2023-07-12 15:09:55] Refill infrequently used haplotypes - on average, 30.3% of regions replaced [2023-07-12 15:09:55] Start of iteration 11 [2023-07-12 15:10:12] Start of iteration 12 [2023-07-12 15:10:29] Shuffle haplotypes - Iteration 12 - change on average 19 intervals out of 20 considered [2023-07-12 15:10:29] Start of iteration 13 [2023-07-12 15:10:45] Start of iteration 14 [2023-07-12 15:11:01] Iteration - 14 - refill infrequently used haplotypes [2023-07-12 15:11:01] Refill infrequently used haplotypes - on average, 26.3% of regions replaced [2023-07-12 15:11:01] Start of iteration 15 [2023-07-12 15:11:17] Start of iteration 16 [2023-07-12 15:11:35] Shuffle haplotypes - Iteration 16 - change on average 19 intervals out of 21 considered [2023-07-12 15:11:35] Start of iteration 17 [2023-07-12 15:11:51] Start of iteration 18 [2023-07-12 15:12:07] Iteration - 18 - refill infrequently used haplotypes [2023-07-12 15:12:07] Refill infrequently used haplotypes - on average, 27.2% of regions replaced [2023-07-12 15:12:07] Start of iteration 19 [2023-07-12 15:12:25] Start of iteration 20 [2023-07-12 15:12:43] Start of iteration 21 [2023-07-12 15:13:00] Start of iteration 22 [2023-07-12 15:13:17] Start of iteration 23 [2023-07-12 15:13:34] Start of iteration 24 [2023-07-12 15:13:51] Start of iteration 25 [2023-07-12 15:17:11] Split reads, average N=11 (0.045 %) [2023-07-12 15:17:11] Start of iteration 26

error: Mat::operator(): index out of bounds

error: Mat::operator(): index out of bounds

error: Mat::operator(): index out of bounds

error: Mat::operator(): index out of bounds

error: Mat::operator(): index out of bounds [2023-07-12 15:17:11] Error in forwardBackwardHaploid(sampleReads = sampleReads, eHapsCurrent_tc = eHapsCurrent_tc, : Mat::operator(): index out of bounds

Error in check_mclapply_OK(single_iteration_results) : An error occured during STITCH. The first such error is above In addition: Warning message: In mclapply(sampleRanges, mc.cores = nCores, FUN = subset_of_complete_iteration, : all scheduled cores encountered errors in user code

rwdavies commented 1 year ago

Hey Richard!

As a general comment, I'm not sure I'd advise running pseudoHaploid anymore. I think for most instances, including here, diploid is probably the best bet.

The error looks legit, but is happening at iteration 26, which is just after iteration 25, which is the default iteration for splitReadIterations. My suggestion would be to try and turn this off (possibly by setting to NA, or -1, I can't remember). What this tries to do (splitReadIterations) is detect long reads that span ancestral recombination breakpoints and split them, for an overall better model fit. But I feel it is more hassle than it's worth. It's fine to disable it.

Best, Robbie

rmott commented 1 year ago

Hi Robbie I ran haploid mode because K is quite large, but happy to do it in diploid mode(although these are inbred lines).

I’ll try disabling the splitReadIterations too.

Richard

From: rwdavies @.> Date: Wednesday, 12 July 2023 at 15:32 To: rwdavies/STITCH @.> Cc: Mott, Richard @.>, Author @.> Subject: Re: [rwdavies/STITCH] STITCH crashed (Issue #84)

⚠ Caution: External sender

Hey Richard!

As a general comment, I'm not sure I'd advise running pseudoHaploid anymore. I think for most instances, including here, diploid is probably the best bet.

The error looks legit, but is happening at iteration 26, which is just after iteration 25, which is the default iteration for splitReadIterations. My suggestion would be to try and turn this off (possibly by setting to NA, or -1, I can't remember). What this tries to do (splitReadIterations) is detect long reads that span ancestral recombination breakpoints and split them, for an overall better model fit. But I feel it is more hassle than it's worth. It's fine to disable it.

Best, Robbie

— Reply to this email directly, view it on GitHubhttps://github.com/rwdavies/STITCH/issues/84#issuecomment-1632641102, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADK6MXGRWQH26UKSMLXJMTXP2YQPANCNFSM6AAAAAA2HSPAUY. You are receiving this because you authored the thread.Message ID: @.***>

rwdavies commented 1 year ago

For truly haploid data, diploid isn't so bad. What about "diploid-inbred"? That outputs diploid but assumes the samples are haploid (I think this is what you want to do)

p.s. if you want to test it out, you might want to turn on plot_shuffle_haplotype_attempts, then send me the plots, we can discuss. the default parameters are tuned for human / mice, and I'm not sure about Ne / recomb rate for arabidopsis, you might want to e.g. set shuffle_bin_radius lower, we can discuss

Robbie

rmott commented 1 year ago

Well, changing the two parameters has solved the crash problem – it’s now at iteration 35.

I’d bedelighted to chat – maybe I could come and see you in your office as I still live in Oxford!

From: rwdavies @.> Date: Wednesday, 12 July 2023 at 15:45 To: rwdavies/STITCH @.> Cc: Mott, Richard @.>, Author @.> Subject: Re: [rwdavies/STITCH] STITCH crashed (Issue #84)

⚠ Caution: External sender

For truly haploid data, diploid isn't so bad. What about "diploid-inbred"? That outputs diploid but assumes the samples are haploid (I think this is what you want to do)

p.s. if you want to test it out, you might want to turn on plot_shuffle_haplotype_attempts, then send me the plots, we can discuss. the default parameters are tuned for human / mice, and I'm not sure about Ne / recomb rate for arabidopsis, you might want to e.g. set shuffle_bin_radius lower, we can discuss

Robbie

— Reply to this email directly, view it on GitHubhttps://github.com/rwdavies/STITCH/issues/84#issuecomment-1632663630, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADK6MW3ZWP2OUAEEW4OXTTXP22CJANCNFSM6AAAAAA2HSPAUY. You are receiving this because you authored the thread.Message ID: @.***>

rwdavies commented 1 year ago

Sure, do you want to come grab lunch at Somerville some day perhaps? I'm free quite often now during the summer

rmott commented 1 year ago

Yes, maybe next week?

Stitch crashed by the way, but over something probably trivial:

[2023-07-12 15:58:02] Start of iteration 32 [2023-07-12 15:58:25] Start of iteration 33 [2023-07-12 15:58:48] Start of iteration 34 [2023-07-12 15:59:12] Start of iteration 35 [2023-07-12 15:59:35] Start of iteration 36 [2023-07-12 15:59:58] Start of iteration 37 [2023-07-12 16:00:21] Start of iteration 38 [2023-07-12 16:00:44] Start of iteration 39 [2023-07-12 16:01:08] Start of iteration 40 [2023-07-12 16:01:23] End EM [2023-07-12 16:01:23] Begin making and writing output file [2023-07-12 16:01:23] Determine reads in output blocks [2023-07-12 16:01:28] Done determining reads in output blocks [2023-07-12 16:01:28] Initialize output file [2023-07-12 16:01:28] Done initializing output file [2023-07-12 16:01:28] Loop over and write output file [2023-07-12 16:01:28] Making output piece 1 / 2 [2023-07-12 16:01:48] Making output piece 2 / 2 [2023-07-12 16:01:54] Done looping over and writing output file [2023-07-12 16:01:54] bgzip output file and move to final location bgzip: invalid option -- '-' bgzip: invalid option -- 't'
Usage: bgzip [options] [file] ...
Options: -c write on standard output, keep original files unchanged
-d decompress
-f overwrite files without asking
-b INT decompress at virtual file pointer INT
-s INT decompress INT bytes in the uncompressed file
-h give this help
mv: cannot stat ‘./tair10.STITCH//stitch.1.1.1000000.vcf.gz.building.vcf.gz’: No such file or directory
[2023-07-12 16:01:54] Done making and writing output file
[2023-07-12 16:01:54] Remove buffer region from output
[2023-07-12 16:01:54] Save RData objects to disk
[2023-07-12 16:01:55] Make metrics plot
Error in .External2(C_X11, paste0("jpeg::", quality, ":", filename), g$width, :
unable to start device JPEG
In addition: Warning message:
In jpeg(file.path(outputdir, "plots", paste0("metricsForPostImputationQC.", :

unable to open connection to X11 display ''

From: rwdavies @.> Date: Wednesday, 12 July 2023 at 16:02 To: rwdavies/STITCH @.> Cc: Mott, Richard @.>, Author @.> Subject: Re: [rwdavies/STITCH] STITCH crashed (Issue #84)

⚠ Caution: External sender

Sure, do you want to come grab lunch at Somerville some day perhaps? I'm free quite often now during the summer

— Reply to this email directly, view it on GitHubhttps://github.com/rwdavies/STITCH/issues/84#issuecomment-1632708296, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADK6MUOAYHYPEYZOV4OMPDXP24AHANCNFSM6AAAAAA2HSPAUY. You are receiving this because you authored the thread.Message ID: @.***>

rwdavies commented 1 year ago

how old is your bgzip? maybe update? bgzip is the easiest thing to install since sliced bread

this bit "unable to start device JPEG " is on you, google search for how to fix

you can potentially get around it using plotAfterImputation = FALSE I think, but plots are nice!

next week for lunch is fine, maybe email my gmail

rmott commented 1 year ago

Will do

Tuesday or Thursday next week work for me

From: rwdavies @.> Date: Wednesday, 12 July 2023 at 16:06 To: rwdavies/STITCH @.> Cc: Mott, Richard @.>, Author @.> Subject: Re: [rwdavies/STITCH] STITCH crashed (Issue #84)

⚠ Caution: External sender

how old is your bgzip? maybe update? bgzip is the easiest thing to install since sliced bread

this bit "unable to start device JPEG " is on you, google search for how to fix

you can potentially get around it using plotAfterImputation = FALSE I think, but plots are nice!

next week for lunch is fine, maybe email my gmail

— Reply to this email directly, view it on GitHubhttps://github.com/rwdavies/STITCH/issues/84#issuecomment-1632716980, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADK6MRE4LJB6DRIUD6447LXP24QXANCNFSM6AAAAAA2HSPAUY. You are receiving this because you authored the thread.Message ID: @.***>

rmott commented 1 year ago

I’m probably being stupid but from where do I install bgzip ? is it

https://github.com/samtools/htslib/blob/develop/README.md

?

From: rwdavies @.> Date: Wednesday, 12 July 2023 at 16:06 To: rwdavies/STITCH @.> Cc: Mott, Richard @.>, Author @.> Subject: Re: [rwdavies/STITCH] STITCH crashed (Issue #84)

⚠ Caution: External sender

how old is your bgzip? maybe update? bgzip is the easiest thing to install since sliced bread

this bit "unable to start device JPEG " is on you, google search for how to fix

you can potentially get around it using plotAfterImputation = FALSE I think, but plots are nice!

next week for lunch is fine, maybe email my gmail

— Reply to this email directly, view it on GitHubhttps://github.com/rwdavies/STITCH/issues/84#issuecomment-1632716980, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AADK6MRE4LJB6DRIUD6447LXP24QXANCNFSM6AAAAAA2HSPAUY. You are receiving this because you authored the thread.Message ID: @.***>

rwdavies commented 1 year ago

http://www.htslib.org/download/

should be in htslib I think

rwdavies commented 1 year ago

I think also possibly

./scripts/install-dependencies.sh bgzip

should install a copy and symlink it to your STITCH github home dir