rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0
74 stars 19 forks source link

"split reads" iteration causes error: Mat::operator(): index out of bounds #32

Closed TeresaPegan closed 1 year ago

TeresaPegan commented 4 years ago

Hello, Today I've run into an error that consistently occurs at iteration 26 for one of my chromosomes. It seems likely to be caused by something happening in iteration 25, which is the default split reads iteration. When I tried setting splitReadIterations = NULL, STITCH ran with no errors.

Do you have any ideas about what might trigger this error? (See details below).

Also, is setting splitReadIterations = NULL an ok way to circumvent the problem, or do you think skipping the split reads iteration could significantly detract from my results? I don't have high coverage results to compare my output to, but I did find that the average INFO score from the run where I set splitReadIterations = NULL was 0.93.

Thanks again! :)

My input code:

STITCH(tempdir = tempdir(), chr = "MDLI01000001.1", bamlist = "bams.txt", posfile = "posfile.txt", outputdir = paste0(getwd(), "/", "stitchout"), K = 40, nGen = 15000, nCores = 1, switchModelIteration = 39, method="pseudoHaploid", splitReadIterations=NULL)

Detailed output:

[2020-02-29 09:52:06] Running STITCH(chr = MDLI01000001.1, nGen = 15000, posfile = posfile.txt, K = 40, S = 1, outputdir = /Users/tmpegan/Dropbox/WGLC_Plate_1/Setophaga_aligned/stitchout, nStarts = , tempdir = /var/folders/t6/4r8ccc2j7klb3sf2v6vqk6m02yxr_g/T//RtmpuG19oU, bamlist = bams.txt, cramlist = , sampleNames_file = , reference = , genfile = , method = pseudoHaploid, output_format = bgvcf, B_bit_prob = 16, outputInputInVCFFormat = FALSE, downsampleToCov = 50, downsampleFraction = 1, readAware = TRUE, chrStart = NA, chrEnd = NA, regionStart = NA, regionEnd = NA, buffer = NA, maxDifferenceBetweenReads = 1000, maxEmissionMatrixDifference = 1e+10, alphaMatThreshold = 1e-04, emissionThreshold = 1e-04, iSizeUpperLimit = 600, bqFilter = 17, niterations = 40, shuffleHaplotypeIterations = c(4, 8, 12, 16), splitReadIterations = 25, nCores = 1, expRate = 0.5, maxRate = 100, minRate = 0.1, Jmax = 1000, regenerateInput = TRUE, originalRegionName = NA, keepInterimFiles = FALSE, keepTempDir = FALSE, outputHaplotypeProbabilities = FALSE, switchModelIteration = 39, generateInputOnly = FALSE, restartIterations = NA, refillIterations = c(6, 10, 14, 18), downsampleSamples = 1, downsampleSamplesKeepList = NA, subsetSNPsfile = NA, useSoftClippedBases = FALSE, outputBlockSize = 1000, outputSNPBlockSize = 10000, inputBundleBlockSize = NA, reference_haplotype_file = , reference_legend_file = , reference_sample_file = , reference_populations = NA, reference_phred = 20, reference_iterations = 40, reference_shuffleHaplotypeIterations = c(4, 8, 12, 16), output_filename = NULL, initial_min_hapProb = 0.2, initial_max_hapProb = 0.8, regenerateInputWithDefaultValues = FALSE, plotHapSumDuringIterations = FALSE, plot_shuffle_haplotype_attempts = FALSE, plotAfterImputation = TRUE, save_sampleReadsInfo = FALSE, gridWindowSize = NA, shuffle_bin_nSNPs = NULL, shuffle_bin_radius = 5000, keepSampleReadsInRAM = FALSE, useTempdirWhileWriting = FALSE, output_haplotype_dosages = FALSE) [2020-02-29 09:52:06] Program start [2020-02-29 09:52:06] Get and validate pos and gen [2020-02-29 09:52:06] Done get and validate pos and gen [2020-02-29 09:52:06] Get BAM sample names [2020-02-29 09:52:12] Done getting BAM sample names [2020-02-29 09:52:12] Generate inputs [2020-02-29 09:52:30] Done generating inputs [2020-02-29 09:52:30] Copying files onto tempdir [2020-02-29 09:52:31] Done copying files onto tempdir [2020-02-29 09:52:31] Generate allele count [2020-02-29 09:52:33] Quantiles across SNPs of per-sample depth of coverage [2020-02-29 09:52:33] 5% 25% 50% 75% 95% [2020-02-29 09:52:33] 0.333 0.583 0.750 0.999 1.333 [2020-02-29 09:52:33] Done generating allele count [2020-02-29 09:52:33] Outputting will be done in 18 blocks with on average 9780.3 SNPs in them [2020-02-29 09:52:33] Begin parameter initialization [2020-02-29 09:52:34] Done parameter initialization [2020-02-29 09:52:34] Initialize readProbs [2020-02-29 09:52:37] Done initializing readProbs [2020-02-29 09:52:37] Start EM [2020-02-29 09:52:37] Number of samples: 12 [2020-02-29 09:52:37] Number of SNPs: 176045 [2020-02-29 09:52:37] Start of iteration 1 [2020-02-29 09:52:48] Start of iteration 2 [2020-02-29 09:52:59] Start of iteration 3 [2020-02-29 09:53:11] Start of iteration 4 [2020-02-29 09:53:25] Shuffle haplotypes - Iteration 4 - change on average 530 intervals out of 530 considered [2020-02-29 09:53:25] Start of iteration 5 [2020-02-29 09:53:36] Start of iteration 6 [2020-02-29 09:53:47] Iteration - 6 - refill infrequently used haplotypes [2020-02-29 09:53:49] Refill infrequently used haplotypes - on average, 58.7% of regions replaced [2020-02-29 09:53:49] Start of iteration 7 [2020-02-29 09:54:03] Start of iteration 8 [2020-02-29 09:54:17] Shuffle haplotypes - Iteration 8 - change on average 485 intervals out of 602 considered [2020-02-29 09:54:17] Start of iteration 9 [2020-02-29 09:54:29] Start of iteration 10 [2020-02-29 09:54:40] Iteration - 10 - refill infrequently used haplotypes [2020-02-29 09:54:42] Refill infrequently used haplotypes - on average, 46.4% of regions replaced [2020-02-29 09:54:42] Start of iteration 11 [2020-02-29 09:54:55] Start of iteration 12 [2020-02-29 09:55:10] Shuffle haplotypes - Iteration 12 - change on average 473 intervals out of 635 considered [2020-02-29 09:55:10] Start of iteration 13 [2020-02-29 09:55:21] Start of iteration 14 [2020-02-29 09:55:32] Iteration - 14 - refill infrequently used haplotypes [2020-02-29 09:55:34] Refill infrequently used haplotypes - on average, 38% of regions replaced [2020-02-29 09:55:34] Start of iteration 15 [2020-02-29 09:55:48] Start of iteration 16 [2020-02-29 09:56:03] Shuffle haplotypes - Iteration 16 - change on average 544 intervals out of 628 considered [2020-02-29 09:56:03] Start of iteration 17 [2020-02-29 09:56:14] Start of iteration 18 [2020-02-29 09:56:25] Iteration - 18 - refill infrequently used haplotypes [2020-02-29 09:56:28] Refill infrequently used haplotypes - on average, 38.4% of regions replaced [2020-02-29 09:56:29] Start of iteration 19 [2020-02-29 09:56:40] Start of iteration 20 [2020-02-29 09:56:52] Start of iteration 21 [2020-02-29 09:57:04] Start of iteration 22 [2020-02-29 09:57:16] Start of iteration 23 [2020-02-29 09:57:27] Start of iteration 24 [2020-02-29 09:57:39] Start of iteration 25 [2020-02-29 09:59:08] Split reads, average N=11 (0.026 %) [2020-02-29 09:59:08] Start of iteration 26 error: Mat::operator(): index out of bounds Error in forwardBackwardHaploid(sampleReads = sampleReads, eHapsCurrent_tc = eHapsCurrent_tc, : Mat::operator(): index out of bounds

rwdavies commented 4 years ago

Hi, apologies that I never answered this! Term was very busy and this seems to have slipped through

This is almost certainly a bug with STITCH. The read splitting code is very old and not as well tested as other parts of the code base that have been re-worked or developed. You are fine to turn this off, in practice, it only really matters with very long reads and very high heterozygosity.

If you are able and willing, it would be useful if you're able to send me the data that got fed into this, so I can re-run the read splitting to understand the problem and fix it for good (and build a test for when I properly re-write the code). If you're able to install locally (i.e. clone the github repo and install), you could modify the R code so that just after the function declaration https://github.com/rwdavies/STITCH/blob/master/STITCH/R/heuristics.R#L1161 everything input to the function is saved

save(gammaK_t, eHapsCurrent_t, K, L, iSample, sampleReads, tempdir, regionName, grid, verbose, method, pRgivenH1_m, pRgivenH2_m, srp, file = file.path(tempdir, paste0("temp.", iSample, ".RData")))

and could send to me to investigate (assuming you are allowed to do that with your data)

Glad to hear your info score is 0.9 otherwise, that sounds quite promising

rwdavies commented 4 years ago

PS you might need to initialize tempdir to something that won't get destroyed when R closes, like file.path(outputdir, "temp")

TeresaPegan commented 4 years ago

Hi, I can send you the data. However, I was never able to get STITCH to install using the github repo, only using conda.

I thought I might be able to get this working by re-running the code for just the "findRecombinedReadsPerSample" function in my R environment, after loading the library, but before using STITCH on my data. See the code below. However, it does not seem to be working, or at least it does not produce the .Rdata file. Do you know if I should be able get it to save the data into a .Rdata file this way? If so, do you know I could change the code to make it work? Thanks!

library("STITCH")
findRecombinedReadsPerSample <- function(
    gammaK_t,
    eHapsCurrent_t,
    K,
    L,
    iSample,
    sampleReads,
    tempdir,
    regionName,
    grid,
    verbose = TRUE,
    method = "diploid",
    pRgivenH1_m = NULL,
    pRgivenH2_m = NULL,
    srp = NULL
) { save(gammaK_t, eHapsCurrent_t, K, L, iSample, sampleReads, tempdir, regionName, grid, verbose, method, pRgivenH1_m, pRgivenH2_m, srp, file = "/Users/tmpegan/TEST.Rdata"   )     
    K <- dim(eHapsCurrent_t)[1]
    ## needs a full run
    ## only do for some - need at least 3 SNPs to consider
    w <- get_reads_worse_than_50_50(
        sampleReads = sampleReads,
        eHapsCurrent_t = eHapsCurrent_t,
        K = K
    )
    w <- w[w != 1 & w != length(w)]
    count <- 0
    if (length(w) > 0) {
        for (w1 in w) {
            out <- split_a_read(
                sampleReads = sampleReads,
                read_to_split = w1,
                gammaK_t = gammaK_t,
                L = L,
                eHapsCurrent_t = eHapsCurrent_t,
                K = K,
                grid = grid,
                method = method,
                pRgivenH1_m = pRgivenH1_m,
                pRgivenH2_m = pRgivenH2_m,
                srp = srp
            )
            sampleReads <- out$sampleReads
            pRgivenH1_m <- out$pRgivenH1_m
            pRgivenH2_m <- out$pRgivenH2_m
            srp <- srp
            count <- count + as.integer(out$did_split)
        } # end of loop on reads
        new_order <- order(unlist(lapply(sampleReads,function(x) x[[2]])))        
        sampleReads <- sampleReads[new_order]
        save(sampleReads, file = file_sampleReads(tempdir, iSample, regionName), compress = FALSE)
        if (verbose) {
            print_message(paste0(
                "sample ", iSample, " readsSplit ", count, " readsTotal ", length(sampleReads)
            ))
        }
        if (method == "pseudoHaploid") {
            ## randomize those of split reads
            srp <- srp[new_order]
            pRgivenH1_m <- pRgivenH1_m[new_order, , drop = FALSE]
            pRgivenH2_m <- pRgivenH2_m[new_order, , drop = FALSE]
            save(
                srp, pRgivenH1_m, pRgivenH2_m,
                file = file_sampleProbs(tempdir, iSample, regionName)
            )
        }  
        }
    return(
        list(
            readsSplit = count,
            readsTotal = length(sampleReads)

        )

    )

}

STITCH(tempdir = tempdir(), chr = "MDLI01000001.1", bamlist = "S_coronata/S_coronata_all_RG.txt", posfile = "S_coronata/S_coronata_MDLI01000001.1.txt", outputdir = paste0(getwd(), "/", "stitchout_15k40"), K = 40, nGen = 15000, nCores = 1, switchModelIteration = 39, method="pseudoHaploid")
rwdavies commented 4 years ago

This approach should work, if you do something like the following. With the code above, the STITCH command still works off the library version, which uses the library version of findRecombinedReadsPerSample. The below will put everything in the global environment

So steps are Clone the repo somewhere locally Modify the file above in the way shown by the code you copied (i.e. adding the "save" bit) Manually load the R files after the library command in R into your session, using code like the below, copied below for reference

https://github.com/rwdavies/STITCH/blob/master/STITCH/tests/testthat/test-acceptance-one.R#L3

    library("testthat"); library("STITCH"); library("rrbgen")
    dir <- "~/proj/STITCH/" ## i.e. change this to global path where you've downloaded the repo, which now includes the manually modified version of the findRecombinedReadsPerSample
    setwd(paste0(dir, "/STITCH/R"))
    a <- dir(pattern = "*R")
    b <- grep("~", a)
    if (length(b) > 0) {
        a <- a[-b]
    }
    o <- sapply(a, source)
    setwd(dir)
    Sys.setenv(PATH = paste0(getwd(), ":", Sys.getenv("PATH")))

## then something like your 
STITCH(tempdir = tempdir(), chr = "MDLI01000001.1", bamlist = "S_coronata/S_coronata_all_RG.txt", posfile = "S_coronata/S_coronata_MDLI01000001.1.txt", outputdir = paste0(getwd(), "/", "stitchout_15k40"), K = 40, nGen = 15000, nCores = 1, switchModelIteration = 39, method="pseudoHaploid")