rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

Error in check_mclapply_OK(complete_set_of_results) #15

Open SABiagini opened 2 years ago

SABiagini commented 2 years ago

Hi,

I'm running QUILT using a scheduler that allows me to send one job per chromosome, parallelizing the run of each one of the windows provided. In other words, each chromosome is a job with multiple tasks (windows) that run QUILT in parallel.

I many cases things work easily, but sometimes some of the tasks fail and the error I see from QUILT is:

Error in check_mclapply_OK(complete_set_of_results) : An error occured during QUILT. The first such error is above Calls: QUILT -> check_mclapply_OK In addition: Warning message: In mclapply(1:length(sampleRanges), mc.cores = nCores, function(iCore) { : scheduled cores 2, 3 did not deliver results, all values of the jobs will be affected Execution halted

Each task is using 4 cores (4Gb each, i.e.,16Gb per task). I don't think is a memory related problem since bigger chromosomes ran smoothly with even less memory per core and the error I'm reporting is from smaller chromosomes with even less tasks per job.

I also noticed that this is a random behavior and that sometimes the same job does not generate the problem.

I wonder whether this is a QUILT related issue or something else.

Could you please help me with this?

Thanks

rwdavies commented 2 years ago

It definitely could be a QUILT related problem.

To double check, does your scheduler output error information? To see if it's run out of RAM

If you're parallelizing anyway, can you decrease to using 1 core per job? Then run using /usr/bin/time -vvv QUILT.R --etc and double check RAM usage

What's the nature of the data, is it uniform low-coverage, or something else? Just wondering if the intermittent error relates to either heuristics, or overflow issues. For instance I've sometimes seen problems with GBS type data with a lot of nearby SNPs (say 20 within 100bp) with lots of coverage (40X) that aren't downsampled aggressively enough (see option downsampleToCov)

SABiagini commented 2 years ago

Hi Robert, thank you for the answer and sorry for my late response.

To answer your questions:

However, after talking to the IT service, I think I understood that maybe the problem could be related to the way their system works; I assumed I had 4GB/core but they actually said that it's ~4GB/core. This means that when I was running a job on 4 cores with 4GB each (16GB in total per job), I was actually giving less than I expected. So, maybe what happened is that QUILT expected 4GB per core but some of them received less.

I don't know if this makes any sense, I'm just trying to figure it out!

However, if you think this is not a possible explanation and you have some suggestion, please let me know what I can do.

Thank you very much.

rwdavies commented 2 years ago

OK interesting, thanks. So unclear on whether it's RAM related then? The scheduler doesn't store whether a particular job runs out of RAM, but when you checked memory usage in general it seemed OK?

About the 5000 samples, QUILT imputes samples independently, so you can split into smaller batches and run those independently, then merge the VCFs afterwards.

SABiagini commented 2 years ago

When I check the memory usage after a run it seems ok. However, I found that another error is reported together with the one I mentioned at the beginning:

[2022-04-07 06:18:01] i_gibbs=1, i_it = 2 full
[2022-04-07 06:18:01] i_gibbs=2, i_it = 3 small gibbs
[2022-04-07 06:18:02] There are 15 out of 53 regions that have been flipped by consensus
[2022-04-07 06:18:02] Phasing it
[2022-04-07 06:18:02] i_gibbs=8, i_it = 1 small gibbs

 *** caught bus error ***
address 0x14bf3e0bd7a0, cause 'non-existent physical address'

Traceback:
 1: rcpp_forwardBackwardGibbsNIPT(sampleReads = sampleReads, priorCurrent_m = small_priorCurrent_m,     alphaMatCurrent_tc = small_alphaMatCurrent_tc, eHapsCurrent_tc = small_eHapsCurrent_tc,     transMatRate_tc_H = small_transMatRate_tc_H, hapMatcher = hapMatcher,     distinctHapsIE = distinctHapsIE, rhb_t = rhb_t, ref_error = ref_error,     which_haps_to_use = which_haps_to_use, ff = 0, maxDifferenceBetweenReads = maxDifferenceBetweenReads,     Jmax = Jmax, maxEmissionMatrixDifference = maxEmissionMatrixDifference,     run_fb_grid_offset = FALSE, blocks_for_output = array(0,         c(1, 1)), grid = grid, pass_in_alphaBeta = TRUE, alphaHat_t1 = alphaHat_t1,     alphaHat_t2 = alphaHat_t2, alphaHat_t3 = alphaHat_t3, betaHat_t1 = betaHat_t1,     betaHat_t2 = betaHat_t2, betaHat_t3 = betaHat_t3, eMatGrid_t1 = eMatGrid_t1,     eMatGrid_t2 = eMatGrid_t2, eMatGrid_t3 = eMatGrid_t3, gammaMT_t_local = gammaMT_t_local,     gammaMU_t_local = gammaMU_t_local, gammaP_t_local = gammaP_t_local,     hapSum_tc = array(0, c(1, 1, 1)), snp_start_1_based = -1,     snp_end_1_based = -1, generate_fb_snp_offsets = FALSE, suppressOutput = suppressOutput,     n_gibbs_burn_in_its = n_gibbs_burn_in_its, n_gibbs_sample_its = n_gibbs_sample_its,     n_gibbs_starts = n_gibbs_starts, double_list_of_starting_read_labels = double_list_of_starting_read_labels,     prev_list_of_alphaBetaBlocks = as.list(c(1, 2)), i_snp_block_for_alpha_beta = -1,     haploid_gibbs_equal_weighting = TRUE, gibbs_initialize_iteratively = gibbs_initialize_iteratively,     gibbs_initialize_at_first_read = gibbs_initialize_at_first_read,     do_block_resampling = FALSE, perform_block_gibbs = perform_block_gibbs,     seed_vector = 0, update_hapSum = FALSE, class_sum_cutoff = 0.06,     record_read_set = TRUE, wif0 = wif0, grid_has_read = grid_has_read,     shuffle_bin_radius = shuffle_bin_radius, L_grid = L_grid,     block_gibbs_iterations = block_gibbs_iterations, rescale_eMatRead_t = rescale_eMatRead_t,     smooth_cm = smooth_cm, param_list = param_list, use_smooth_cm_in_block_gibbs = use_smooth_cm_in_block_gibbs,     block_gibbs_quantile_prob = block_gibbs_quantile_prob, use_small_eHapsCurrent_tc = use_small_eHapsCurrent_tc)
 2: impute_one_sample(distinctHapsIE = distinctHapsIE, hapMatcher = hapMatcher,     rhb_t = rhb_t, ref_error = ref_error, nSNPs = nSNPs, sampleReads = sampleReads,     small_eHapsCurrent_tc = small_eHapsCurrent_tc, small_transMatRate_tc_H = small_transMatRate_tc_H,     alphaHat_t1 = alphaHat_t1, betaHat_t1 = betaHat_t1, eMatGrid_t1 = eMatGrid_t1,     alphaHat_t2 = alphaHat_t2, betaHat_t2 = betaHat_t2, eMatGrid_t2 = eMatGrid_t2,     alphaHat_t3 = alphaHat_t3, betaHat_t3 = betaHat_t3, eMatGrid_t3 = eMatGrid_t3,     gammaMT_t_local = gammaMT_t_local, gammaMU_t_local = gammaMU_t_local,     gammaP_t_local = gammaP_t_local, small_alphaMatCurrent_tc = small_alphaMatCurrent_tc,     small_priorCurrent_m = small_priorCurrent_m, smooth_cm = smooth_cm,     which_haps_to_use = which_haps_to_use, n_gibbs_starts = n_gibbs_starts,     n_gibbs_burn_in_its = n_gibbs_burn_in_its, n_gibbs_sample_its = 1,     double_list_of_starting_read_labels = double_list_of_starting_read_labels,     block_gibbs_iterations = block_gibbs_iterations, perform_block_gibbs = TRUE,     make_plots = make_plots, wif0 = wif0, grid_has_read = grid_has_read,     plot_description = paste0("it", i_it, ".gibbs"), ancAlleleFreqAll = ancAlleleFreqAll,     grid = grid, L_grid = L_grid, L = L, inRegion2 = inRegion2,     cM_grid = cM_grid, outplotprefix = outplotprefix, have_truth_haplotypes = have_truth_haplotypes,     truth_haps = truth_haps, truth_labels = truth_labels, uncertain_truth_labels = uncertain_truth_labels,     verbose = FALSE, maxEmissionMatrixDifference = 10000000000000000159028911097599180468360808563945281389781327557747838772170381060813469985856815104,     return_p_store = FALSE, return_extra = FALSE, return_genProbs = return_genProbs,     return_hapProbs = return_hapProbs, return_gibbs_block_output = FALSE,     gibbs_initialize_iteratively = gibbs_initialize_iteratively,     gibbs_initialize_at_first_read = FALSE, maxDifferenceBetweenReads = maxDifferenceBetweenReads,     rescale_eMatRead_t = TRUE, rescale_eMatGrid_t = FALSE, Jmax = 10000,     suppressOutput = suppressOutput, shuffle_bin_radius = shuffle_bin_radius,     make_plots_block_gibbs = make_plots_block_gibbs, sample_name = sample_name,     regionStart = regionStart, regionEnd = regionEnd, buffer = buffer,     use_small_eHapsCurrent_tc = use_small_eHapsCurrent_tc)
 3: get_and_impute_one_sample(rhb_t = rhb_t, outputdir = outputdir,     nGibbsSamples = nGibbsSamples, n_seek_its = n_seek_its, full_alphaHat_t = full_alphaHat_t,     full_betaHat_t = full_betaHat_t, full_gamma_t = full_gamma_t,     full_gammaSmall_t = full_gammaSmall_t, full_gammaSmall_cols_to_get = full_gammaSmall_cols_to_get,     full_transMatRate_t_H = full_transMatRate_t_H, small_transMatRate_tc_H = small_transMatRate_tc_H,     alphaHat_t1 = alphaHat_t1, betaHat_t1 = betaHat_t1, eMatGrid_t1 = eMatGrid_t1,     alphaHat_t2 = alphaHat_t2, betaHat_t2 = betaHat_t2, eMatGrid_t2 = eMatGrid_t2,     alphaHat_t3 = alphaHat_t3, betaHat_t3 = betaHat_t3, eMatGrid_t3 = eMatGrid_t3,     gammaMT_t_local = gammaMT_t_local, gammaMU_t_local = gammaMU_t_local,     gammaP_t_local = gammaP_t_local, small_alphaMatCurrent_tc = small_alphaMatCurrent_tc,     small_priorCurrent_m = small_priorCurrent_m, small_eHapsCurrent_tc = small_eHapsCurrent_tc,     bam_files = bam_files, L = L, pos = pos, chr = chr, tempdir = tempdir,     regionName = regionName, regionStart = regionStart, regionEnd = regionEnd,     buffer = buffer, verbose = verbose, gen = gen, phase = phase,     iSample = iSample, grid = grid, ancAlleleFreqAll = ancAlleleFreqAll,     L_grid = L_grid, shuffle_bin_radius = shuffle_bin_radius,     Ksubset = Ksubset, Knew = Knew, K_top_matches = K_top_matches,     heuristic_match_thin = heuristic_match_thin, record_interim_dosages = record_interim_dosages,     have_truth_haplotypes = have_truth_haplotypes, bqFilter = bqFilter,     record_read_label_usage = record_read_label_usage, sampleNames = sampleNames,     smooth_cm = smooth_cm, iSizeUpperLimit = iSizeUpperLimit,     maxDifferenceBetweenReads = maxDifferenceBetweenReads, make_plots = make_plots,     ref_error = ref_error, distinctHapsB = distinctHapsB, distinctHapsIE = distinctHapsIE,     hapMatcher = hapMatcher, eMatDH_special_grid_which = eMatDH_special_grid_which,     eMatDH_special_values_list = eMatDH_special_values_list,     inRegion2 = inRegion2, cM_grid = cM_grid, af = af, use_bx_tag = use_bx_tag,     bxTagUpperLimit = bxTagUpperLimit, addOptimalHapsToVCF = addOptimalHapsToVCF,     make_plots_block_gibbs = make_plots_block_gibbs, estimate_bq_using_truth_read_labels = estimate_bq_using_truth_read_labels,     chrStart = chrStart, chrEnd = chrEnd, gamma_physically_closest_to = gamma_physically_closest_to,     hla_run = hla_run, downsampleToCov = downsampleToCov, minGLValue = minGLValue,     minimum_number_of_sample_reads = minimum_number_of_sample_reads,     print_extra_timing_information = print_extra_timing_information,     n_gibbs_burn_in_its = n_gibbs_burn_in_its, block_gibbs_iterations = block_gibbs_iterations,     plot_per_sample_likelihoods = plot_per_sample_likelihoods,     use_small_eHapsCurrent_tc = use_small_eHapsCurrent_tc, output_gt_phased_genotypes = output_gt_phased_genotypes)
 4: FUN(X[[i]], ...)
 5: lapply(X = S, FUN = FUN, ...)
 6: doTryCatch(return(expr), name, parentenv, handler)
 7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 8: tryCatchList(expr, classes, parentenv, handlers)
 9: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
10: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
11: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
12: FUN(X[[i]], ...)
13: lapply(seq_len(cores), inner.do)
14: mclapply(1:length(sampleRanges), mc.cores = nCores, function(iCore) {    sampleRange <- sampleRanges[[iCore]]    hweCount <- array(0, c(nSNPs, 3))    infoCount <- array(0, c(nSNPs, 2))    afCount <- array(0, nSNPs)    alleleCount <- array(0, c(nSNPs, 2))    K <- nrow(rhb_t)    full_alphaHat_t <- array(0, c(K, nGrids))    full_betaHat_t <- array(0, c(1, 1))    if (make_plots) {        full_gamma_t <- array(0, c(K, nGrids))    }    else {        full_gamma_t <- array(0, c(1, 1))    }    iSample <- 1    ww <- seq(1, nGrids, length.out = max(1, round(heuristic_match_thin *         nGrids)))    full_gammaSmall_cols_to_get <- array(-1, nGrids)    full_gammaSmall_cols_to_get[ww] <- 0:(length(ww) - 1)    full_gammaSmall_t <- array(0, c(K, length(ww)))    K <- Ksubset    S <- 1    alphaHat_t1 <- array(0, c(K, nGrids))    betaHat_t1 <- array(0, c(K, nGrids))    eMatGrid_t1 <- array(0, c(K, nGrids))    alphaHat_t2 <- array(0, c(K, nGrids))    betaHat_t2 <- array(0, c(K, nGrids))    eMatGrid_t2 <- array(0, c(K, nGrids))    alphaHat_t3 <- array(0, c(K, nGrids))    betaHat_t3 <- array(0, c(K, nGrids))    eMatGrid_t3 <- array(0, c(K, nGrids))    gammaMT_t_local <- array(0, c(K, nGrids))    gammaMU_t_local <- array(0, c(K, nGrids))    gammaP_t_local <- array(0, c(K, nGrids))    small_priorCurrent_m <- array(1/K, c(K, S))    small_alphaMatCurrent_tc <- array(1/K, c(K, nGrids - 1, S))    if (use_small_eHapsCurrent_tc) {        small_eHapsCurrent_tc <- array(0, c(K, nSNPs, S))    }    else {        small_eHapsCurrent_tc <- array(0, c(1, 1, 1))    }    results_across_samples <- as.list(sampleRange[2] - sampleRange[1] +         1)    for (iSample in sampleRange[1]:sampleRange[2]) {        print_message(paste0("Imputing sample: ", iSample))        out <- get_and_impute_one_sample(rhb_t = rhb_t, outputdir = outputdir,             nGibbsSamples = nGibbsSamples, n_seek_its = n_seek_its,             full_alphaHat_t = full_alphaHat_t, full_betaHat_t = full_betaHat_t,             full_gamma_t = full_gamma_t, full_gammaSmall_t = full_gammaSmall_t,             full_gammaSmall_cols_to_get = full_gammaSmall_cols_to_get,             full_transMatRate_t_H = full_transMatRate_t_H, small_transMatRate_tc_H = small_transMatRate_tc_H,             alphaHat_t1 = alphaHat_t1, betaHat_t1 = betaHat_t1,             eMatGrid_t1 = eMatGrid_t1, alphaHat_t2 = alphaHat_t2,             betaHat_t2 = betaHat_t2, eMatGrid_t2 = eMatGrid_t2,             alphaHat_t3 = alphaHat_t3, betaHat_t3 = betaHat_t3,             eMatGrid_t3 = eMatGrid_t3, gammaMT_t_local = gammaMT_t_local,             gammaMU_t_local = gammaMU_t_local, gammaP_t_local = gammaP_t_local,             small_alphaMatCurrent_tc = small_alphaMatCurrent_tc,             small_priorCurrent_m = small_priorCurrent_m, small_eHapsCurrent_tc = small_eHapsCurrent_tc,             bam_files = bam_files, L = L, pos = pos, chr = chr,             tempdir = tempdir, regionName = regionName, regionStart = regionStart,             regionEnd = regionEnd, buffer = buffer, verbose = verbose,             gen = gen, phase = phase, iSample = iSample, grid = grid,             ancAlleleFreqAll = ancAlleleFreqAll, L_grid = L_grid,             shuffle_bin_radius = shuffle_bin_radius, Ksubset = Ksubset,             Knew = Knew, K_top_matches = K_top_matches, heuristic_match_thin = heuristic_match_thin,             record_interim_dosages = record_interim_dosages,             have_truth_haplotypes = have_truth_haplotypes, bqFilter = bqFilter,             record_read_label_usage = record_read_label_usage,             sampleNames = sampleNames, smooth_cm = smooth_cm,             iSizeUpperLimit = iSizeUpperLimit, maxDifferenceBetweenReads = maxDifferenceBetweenReads,             make_plots = make_plots, ref_error = ref_error, distinctHapsB = distinctHapsB,             distinctHapsIE = distinctHapsIE, hapMatcher = hapMatcher,             eMatDH_special_grid_which = eMatDH_special_grid_which,             eMatDH_special_values_list = eMatDH_special_values_list,             inRegion2 = inRegion2, cM_grid = cM_grid, af = af,             use_bx_tag = use_bx_tag, bxTagUpperLimit = bxTagUpperLimit,             addOptimalHapsToVCF = addOptimalHapsToVCF, make_plots_block_gibbs = make_plots_block_gibbs,             estimate_bq_using_truth_read_labels = estimate_bq_using_truth_read_labels,             chrStart = chrStart, chrEnd = chrEnd, gamma_physically_closest_to = gamma_physically_closest_to,             hla_run = hla_run, downsampleToCov = downsampleToCov,             minGLValue = minGLValue, minimum_number_of_sample_reads = minimum_number_of_sample_reads,             print_extra_timing_information = print_extra_timing_information,             n_gibbs_burn_in_its = n_gibbs_burn_in_its, block_gibbs_iterations = block_gibbs_iterations,             plot_per_sample_likelihoods = plot_per_sample_likelihoods,             use_small_eHapsCurrent_tc = use_small_eHapsCurrent_tc,             output_gt_phased_genotypes = output_gt_phased_genotypes)        if (out[["sample_was_imputed"]]) {            infoCount[, 1] <- infoCount[, 1, drop = FALSE] +                 out[["eij"]]            infoCount[, 2] <- infoCount[, 2, drop = FALSE] +                 (out[["fij"]] - out[["eij"]]^2)            afCount <- afCount + (out[["eij"]])/2            hweCount[out[["max_gen"]]] <- hweCount[out[["max_gen"]]] +                 1            alleleCount <- alleleCount + out[["per_sample_alleleCount"]]            out[["eij"]] <- NULL            out[["fij"]] <- NULL            out[["per_sample_alleleCount"]] <- NULL            out[["max_gen"]] <- NULL        }        results_across_samples[[iSample - sampleRange[1] + 1]] <- out        rm(out)        if (as.numeric(K) * as.numeric(nSNPs) > (1000000)) {            for (i in 1:5) {                gc(reset = TRUE)            }        }    }    return(list(results_across_samples = results_across_samples,         infoCount = infoCount, afCount = afCount, hweCount = hweCount,         alleleCount = alleleCount))})
15: QUILT(outputdir = opt$outputdir, chr = opt$chr, regionStart = opt$regionStart,     regionEnd = opt$regionEnd, buffer = opt$buffer, bamlist = opt$bamlist,     cramlist = opt$cramlist, sampleNames_file = opt$sampleNames_file,     reference = opt$reference, nCores = opt$nCores, nGibbsSamples = opt$nGibbsSamples,     n_seek_its = opt$n_seek_its, Ksubset = opt$Ksubset, Knew = opt$Knew,     K_top_matches = opt$K_top_matches, output_gt_phased_genotypes = opt$output_gt_phased_genotypes,     heuristic_match_thin = opt$heuristic_match_thin, output_filename = opt$output_filename,     RData_objects_to_save = eval(parse(text = opt$RData_objects_to_save)),     output_RData_filename = opt$output_RData_filename, prepared_reference_filename = opt$prepared_reference_filename,     save_prepared_reference = opt$save_prepared_reference, tempdir = opt$tempdir,     bqFilter = opt$bqFilter, panel_size = opt$panel_size, posfile = opt$posfile,     genfile = opt$genfile, phasefile = opt$phasefile, maxDifferenceBetweenReads = opt$maxDifferenceBetweenReads,     make_plots = opt$make_plots, verbose = opt$verbose, shuffle_bin_radius = opt$shuffle_bin_radius,     iSizeUpperLimit = opt$iSizeUpperLimit, record_read_label_usage = opt$record_read_label_usage,     record_interim_dosages = opt$record_interim_dosages, use_bx_tag = opt$use_bx_tag,     bxTagUpperLimit = opt$bxTagUpperLimit, addOptimalHapsToVCF = opt$addOptimalHapsToVCF,     estimate_bq_using_truth_read_labels = opt$estimate_bq_using_truth_read_labels,     override_default_params_for_small_ref_panel = opt$override_default_params_for_small_ref_panel,     gamma_physically_closest_to = opt$gamma_physically_closest_to,     seed = opt$seed, hla_run = opt$hla_run, downsampleToCov = opt$downsampleToCov,     minGLValue = opt$minGLValue, minimum_number_of_sample_reads = opt$minimum_number_of_sample_reads,     nGen = opt$nGen, reference_haplotype_file = opt$reference_haplotype_file,     reference_legend_file = opt$reference_legend_file, reference_sample_file = opt$reference_sample_file,     reference_populations = opt$reference_populations, reference_phred = opt$reference_phred,     reference_exclude_samplelist_file = opt$reference_exclude_samplelist_file,     region_exclude_file = opt$region_exclude_file, genetic_map_file = opt$genetic_map_file,     nMaxDH = opt$nMaxDH, make_fake_vcf_with_sites_list = opt$make_fake_vcf_with_sites_list,     output_sites_filename = opt$output_sites_filename, expRate = opt$expRate,     maxRate = opt$maxRate, minRate = opt$minRate, print_extra_timing_information = opt$print_extra_timing_information,     block_gibbs_iterations = eval(parse(text = opt$block_gibbs_iterations)),     n_gibbs_burn_in_its = opt$n_gibbs_burn_in_its, plot_per_sample_likelihoods = opt$plot_per_sample_likelihoods,     use_small_eHapsCurrent_tc = opt$use_small_eHapsCurrent_tc)
An irrecoverable exception occurred. R is aborting now ...
[2022-04-07 06:18:03] i_gibbs=1, i_it = 1 full
[2022-04-07 06:18:04] i_gibbs=4, i_it = 3 full
[2022-04-07 06:18:05] i_gibbs=2, i_it = 2 full
[2022-04-07 06:18:05] i_gibbs=3, i_it = 2 small gibbs
[2022-04-07 06:18:05] i_gibbs=5, i_it = 3 small gibbs

I've found that working on a shared computing environment might create this issue, since mclapply sometimes fails on shared linux systems, when using n.cores, to automatically determine the number of cores to use.

If this is true, it would explain why this issue is so random. In fact, when I repeat the same exact job twice, most of the times it doesn't fail. So, the only possible explanation is some sort of internal conflict on a shared environment. Maybe, I should try occupying the entire node, it's the only way I'm thinking to prevent people to "steal" resources from my job.

I also found that it could be solved by strictly defining the number of cores to use using mclapply mc.cores function, but I have no access to that internal option of the software and I don't know how QUILT communicate to mclapply what number of cores has been chosen.

Regarding your suggestion of splitting the 5000 samples in small batches, I already did it to fix these problems related to some windows. Actually, I'm working with >75K samples that I've already split in smaller batches of 5K. The system I work on has not so many resources and I also have a limited amount of time to work on it, so I designed a pipeline that parallelize the run of the 22 autosomes for the 5000 samples. So far, it allowed me to finsh a batch in 3 days (which is also the walltime limit I have on the system I use). But thanks for the suggestion.

SABiagini commented 2 years ago

Hi Robert, sorry to bother you but I rerun QUILT on just a couple of windows that previously crashed. This time I used a reduced batch of 500 samples, but I still had problems; this error keeps popping out:

Error in check_mclapply_OK(complete_set_of_results) : 
  An error occured during QUILT. The first such error is above
Calls: QUILT -> check_mclapply_OK
In addition: Warning message:
In mclapply(1:length(sampleRanges), mc.cores = nCores, function(iCore) { :
  scheduled cores 2, 19, 23 did not deliver results, all values of the jobs will be affected
Execution halted

I've read that mclapply() is known to be unstable with code that multi-thread and, since I'm using multi-threading within QUILT, I wonder whether this might be the problem. Normally, I use an nCore=4 which should not be so intensive and avoiding it would be not time efficient.

Do you have any suggestion?

Thanks

rwdavies commented 2 years ago

If you use nCores = 1, mclapply will just fall back to lapply, which is more stable

I have definitely in the past seen un-stable bugs in QUILT though, e.g. jobs that run fine once then fail if re-run seemingly exactly the same way, related to randomness in the read assignment. I haven't seen any recently though, and they mostly have been due to underflow / overflow in the past (hence me asking if the coverage was truly uniform, or would have weird high spikes like with GBS). It's possible there are more weird edge cases out there. The best way to sort these out is to set a seed (e.g. using the seed option in QUILT.R), using 1 core, and see if you can see consistently see a failure with one seed and consistently not see a failure with another seed.

One other thing, this line

  scheduled cores 2, 19, 23 did not deliver results, all values of the jobs will be affected

suggests you were using >=23 cores though, far more than 4?

SABiagini commented 2 years ago

Indeed, I was using more cores just because I needed to speed up the process, but eventually I see it was not a good choice. My pipeline is however designed with nCore=4 that sometimes also gives the same issue we've been discussing about. However, since the multicore option seems off the table, I guess I'll try with the nCore=1 hoping it won't take too long. Thank you very much for the help!

rwdavies commented 2 years ago

There's nothing inherently wrong with using >1 cores, I do it all the time, but normally for testing (when I want the wall clock time to be low). IMO for debugging 1 core is usually better (to isolate the problem), as well as for production jobs (which, 75000 samples definitely qualifies as!).