rwdavies / STITCH

STITCH - Sequencing To Imputation Through Constructing Haplotypes
http://www.nature.com/ng/journal/v48/n8/abs/ng.3594.html
GNU General Public License v3.0

Termination error seems to happen randomly #5

Open suestring7 opened 6 years ago

suestring7 commented 6 years ago

Hi, I am trying to use STITCH to impute missing genotypes, but I encountered some problems when running on the cluster with some large founder size numbers. The example works fine, and so do some small founder sizes. I am running STITCH on Peromyscus (a kind of mouse) data with nGen=80 and K=4n (where n varies from 1 to 20) to find the best founder size.

The problem is that the program fails mostly with large founder sizes, and sometimes rerunning the program solves the problem (which is weird). Below is some detailed information about how I ran the program.

The script

The bash command that I used to run the program is:

    STITCH.R --chr=C0000000010 --bamlist=bamlist_colony_noS46.txt --posfile=pos_colony_noS46.txt --outputdir=./noS46_$SGE_TASK_ID --K=$((4*$SGE_TASK_ID)) --nGen=80 --nCores=$CORES

where $SGE_TASK_ID is the task ID, ranging from 1 to 20, and $CORES is the assigned number of cores.

The error message:

Sometimes

    caught segfault address 0x25a373748, cause 'memory not mapped'

    Traceback:
     1: .Call("STITCH_forwardBackwardDiploid", PACKAGE = "STITCH", sampleReads, nReads, pi, transMatRate_t, alphaMat_t, eHaps_t, maxDifferenceBetweenReads, whatToReturn, Jmax, suppressOutput)
     2: forwardBackwardDiploid(sampleReads = sampleReads, nReads = as.integer(length(sampleReads)), pi = priorCurrent, eHaps_t = eHapsCurrent_t, alphaMat_t = alphaMatCurrent_t, transMatRate_t = transMatRate_t, Jmax = Jmax, maxDifferenceBetweenReads = maxDifferenceBetweenReads, whatToReturn = whatToReturn, suppressOutput = as.integer(1))
     3: FUN(X[[i]], ...)
     4: lapply(X = S, FUN = FUN, ...)
     5: doTryCatch(return(expr), name, parentenv, handler)
     6: tryCatchOne(expr, names, parentenv, handlers[[1L]])
     7: tryCatchList(expr, classes, parentenv, handlers)
     8: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
     9: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
    10: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
    11: FUN(X[[i]], ...)
    12: lapply(seq_len(cores), inner.do)
    13: mclapply(x3, mc.cores = nCores, tempdir = tempdir, chr = chr, K = K, T = T, priorCurrent = priorCurrent, eHapsCurrent_t = eHapsCurrent_t, alphaMatCurrent_t = alphaMatCurrent_t, sigmaCurrent = sigmaCurrent, maxDifferenceBetweenReads = maxDifferenceBetweenReads, whatToReturn = whatToReturn, Jmax = Jmax, highCovInLow = highCovInLow, iteration = iteration, method = method, nsplit = nsplit, expRate = expRate, minRate = minRate, maxRate = maxRate, gen = gen, outputdir = outputdir, pseudoHaploidModel = pseudoHaploidModel, outputHaplotypeProbabilities = outputHaplotypeProbabilities, switchModelIteration = switchModelIteration, regionName = regionName, restartIterations = restartIterations, refillIterations = refillIterations, hapSumCurrentL = hapSumCurrentL, outputBlockSize = outputBlockSize, bundling_info = bundling_info, transMatRate_t = transMatRate_t, x3 = x3, FUN = subset_of_complete_iteration, N = N, shuffleHaplotypeIterations = shuffleHaplotypeIterations, niterations = niterations, L = L, samples_with_phase = samples_with_phase, nbreaks = nbreaks, breaks = breaks)
    14: completeSampleIteration(N = N, tempdir = tempdir, chr = chr, K = K, T = T, priorCurrent = priorCurrent, eHapsCurrent = eHapsCurrent, alphaMatCurrent = alphaMatCurrent, sigmaCurrent = sigmaCurrent, maxDifferenceBetweenReads = maxDifferenceBetweenReads, whatToReturn = whatToReturn, Jmax = Jmax, highCovInLow = highCovInLow, iteration = iteration, method = method, expRate = expRate, minRate = minRate, maxRate = maxRate, niterations = niterations, splitReadIterations = splitReadIterations, shuffleHaplotypeIterations = shuffleHaplotypeIterations, nCores = nCores, L = L, nGen = nGen, emissionThreshold = emissionThreshold, alphaMatThreshold = alphaMatThreshold, gen = gen, outputdir = outputdir, environment = environment, pseudoHaploidModel = pseudoHaploidModel, outputHaplotypeProbabilities = outputHaplotypeProbabilities, switchModelIteration = switchModelIteration, regionName = regionName, restartIterations = restartIterations, refillIterations = refillIterations, hapSumCurrent = hapSumCurrent, outputBlockSize = outputBlockSize, bundling_info = bundling_info, alleleCount = alleleCount, phase = phase, samples_with_phase = samples_with_phase)

More commonly

    Error in completeSampleIteration(N = N, tempdir = tempdir, chr = chr,  :
      An error occured during STITCH. The first such error is above
    Calls: STITCH -> completeSampleIteration
    In addition: Warning message:
    In mclapply(x3, mc.cores = nCores, tempdir = tempdir, chr = chr,  :
      scheduled cores 39, 44, 43, 25, 21, 15, 12, 6, 2, 19, 24, 45, 31, 23, 4, 42, 17, 41, 20, 9, 40 encountered errors in user code, all values of the jobs will be affected
    Execution halted

The environment that I ran the program

For the R environment

    sessionInfo()
    R version 3.3.2 (2016-10-31)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: CentOS release 6.8 (Final)

    locale:
     [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
     [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
     [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915
     [7] LC_PAPER=en_US.iso885915       LC_NAME=C
     [9] LC_ADDRESS=C                   LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    packageVersion("STITCH")
    [1] '1.3.6'

For bcftools version

    bcftools --version
    bcftools 1.3
    Using htslib 1.3
    Copyright (C) 2015 Genome Research Ltd.

I tried it on both AMD and Intel CPUs, with OpenMP handling the multiprocessing... It doesn't seem to be related to that...

The result right now

    tree | grep -E 'stitch|noS'
    |-- bamlist_colony_noS46.txt
    |-- noS46_1
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_10
    |-- noS46_11
    |   `-- stitch.C0000000010.vcf.gz    (rerun worked)
    |-- noS46_12
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_13
    |-- noS46_14
    |-- noS46_15
    |   `-- stitch.C0000000010.vcf.gz    (rerun worked)
    |-- noS46_16
    |-- noS46_17
    |-- noS46_18
    |-- noS46_19
    |-- noS46_2
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_20
    |-- noS46_3
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_4
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_5
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_6
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_7
    |   `-- stitch.C0000000010.vcf.gz
    |-- noS46_8
    |-- noS46_9
    |   `-- stitch.C0000000010.vcf.gz    (rerun worked)
    |-- pos_colony_noS46.txt

noS46_ID is the format of the subfolders, and stitch.C0000000010.vcf.gz is the output. As you can see, a few IDs didn't generate the file and the program just terminated...

Thank you for any helpful ideas!

rwdavies commented 6 years ago

Hey, thanks for the message and sorry for the errors. Two potential sources of semi-reproducible error in the C++ code I've seen before are 1) long reads / high depth causing underflow issues and 2) running out of memory for high K causing weird crashes of the R multicore mclapply function.

For 1, is this short read data, and are there any samples with high coverage, or if this is GBS or similar, are there any sites with really high coverage? There should be downsampling to prevent any sites from having > 50 X (default, controlled by parameter --downsampleToCov), but maybe for different computational environments this isn't aggressive enough.
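If it's useful, here is a rough sketch of how you could flag sites where any sample is unusually deep, assuming samtools is on your PATH and that the posfile has chromosome and position in its first two columns (the 50X cutoff below just mirrors the default mentioned above; file names are taken from your run):

    # Build a BED of the candidate sites from the posfile (chr, pos -> 0-based interval)
    awk -v OFS='\t' '{print $1, $2 - 1, $2}' pos_colony_noS46.txt > sites.bed
    # samtools depth prints one depth column per BAM; report any site where
    # some sample exceeds 50X
    samtools depth -a -b sites.bed -f bamlist_colony_noS46.txt | \
        awk '{ for (i = 3; i <= NF; i++) if ($i > 50) { print $1, $2; break } }' > high_coverage_sites.txt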

For 2, it looks like it's starting to crash with K > 30, which might push the RAM higher than standard cluster limits. Can you check qacct -j and see what the "failed" and "maxvmem" fields are for the failing jobs? Do you have a larger-RAM interactive environment where you could try, e.g., the task 10 (K=40) option a few times and see if it proceeds without error?
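Something along these lines is what I have in mind (the job ID is a placeholder for one of your failing array jobs):

    # Pull the accounting fields of interest; for an array job, qacct prints
    # one record per task, so grep keeps the per-task lines together
    qacct -j 1234567 | grep -E 'taskid|exit_status|failed|maxvmem'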

One last thing: when you say that $CORES is the assigned core number, what do you mean? The number of cores the job has been assigned? For example, in the past I've run STITCH through SGE using something like qsub -pe smp 4 together with --nCores=4.
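A minimal submission sketch of what I mean (the job name and output directory are placeholders, not from your setup):

    #!/bin/bash
    #$ -N stitch_example
    #$ -pe smp 4
    #$ -cwd

    # SGE sets $NSLOTS to the number of granted slots, so the two stay in sync
    STITCH.R --chr=C0000000010 \
             --bamlist=bamlist_colony_noS46.txt \
             --posfile=pos_colony_noS46.txt \
             --outputdir=./stitch_K8_test \
             --K=8 --nGen=80 --nCores=$NSLOTS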

rwdavies commented 6 years ago

Hi @daisyme, were you able to resolve your reproducibility issues?

suestring7 commented 6 years ago

Hi @rwdavies , I am sorry for the late reply. I was on a trip and didn't get a chance to check it.

For 1, this is short read data and the highest coverage is only about 10 X. But maybe I could try to use the downsample parameter.

For 2, I've assigned 512 GB of memory for each core, and monitored the process by ssh'ing to the node directly and using the top command to watch the real-time usage, which all looked fine.

The qacct command just didn't work; I only got some other people's job data from 2014... Not sure what's wrong.

I tried running other data with K=8 in another experiment and it worked just fine...

    #!/bin/bash
    #$ -N STITCH_test_less46
    #$ -q bio,adl,abio,pub64,free64
    #$ -pe openmp 8-32
    #$ -ckpt blcr
    #$ -R y
    #$ -t 1-5

    module load R/3.3.2
    module load bcftools
    mkdir ./$((2**${SGE_TASK_ID}))S46
    STITCH.R --chr=C0000000010 --bamlist=bamlist_colony_$((2**${SGE_TASK_ID}))S46.txt --posfile=pos_colony_noS46.txt --outputdir=./$((2**${SGE_TASK_ID}))S46 --K=8 --nGen=120 --nCores=$CORES

And you are correct about $CORES: it's just the number of assigned cores, which I requested using -pe openmp 8-32.

suestring7 commented 6 years ago

I just came up with a possible explanation for the problem with the qacct command... (which doesn't actually help...)

In my working environment, the people who support the HPC seem to update the qacct log only after a certain time period. Also, the total number of jobs has exceeded the maximum, which causes job IDs to be reused. I tried several different job IDs and the qacct -j -b YYMMDDhhmm command, and found the period might be about two months, since I could find a job that ran in early July but not one from late July.

So the conclusion is that I might not be able to get the data from that command until about a month from now... I will see if there's any log file in SGE that I can just look into.

suestring7 commented 6 years ago

And more news... I got the task 8 (K=32) run to finish successfully after rerunning all the failed parameters with the same scripts (which means the other parameters failed again)...

(Maybe I should just write a script to resubmit them until they succeed? I am quite confused by this...)
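Something like this rough sketch is what I'm imagining, if I go that way (the task ID, output path and submit script name are placeholders, and I haven't tested it):

    #!/bin/bash
    # Resubmit one array task until its expected output VCF appears
    TASK=20
    OUT=./noS46_${TASK}/stitch.C0000000010.vcf.gz
    while [ ! -e "$OUT" ]; do
        # -sync y makes qsub wait for the task to finish before the loop re-checks
        qsub -sync y -t ${TASK} run_stitch.sh
    done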

suestring7 commented 6 years ago

I found the SGE accounting file (accounting), but I still couldn't find any recent job info in it...

rwdavies commented 6 years ago

Hmm, a shame that qacct isn't working in your setup. Maybe it would be worthwhile to repeat the initial set of jobs, over the large range of K that didn't work for you, but prefixing runs with /usr/bin/time -v, so that at least for successful jobs you can see the RAM usage and whether it is approaching your limit (512 GB, per node not per core, right?) around when the jobs start to fail. Also, how many SNPs are you running? If it is RAM, and there are a lot of SNPs, you could chunk the genome into smaller sub-chromosome chunks, or, if you are using an external list of SNPs, use something like the GATK or samtools to get an allele frequency estimate from the pileup and then only run those SNPs that fall within a desirable frequency range (like 1-99%), as sketched below. Also, if you chunk the genome into smaller chunks, maybe you could run each with only 1 core, hopefully mitigating any issues with the cluster environment and multicore jobs?
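As a sketch of the allele-frequency pre-filter idea (this assumes a recent bcftools with the mpileup subcommand, a reference FASTA here called ref.fa, and your bamlist; the frequency cutoff is only illustrative):

    # Call candidate variants on the contig and keep biallelic SNPs with minor
    # allele frequency above 1%
    bcftools mpileup -f ref.fa -b bamlist_colony_noS46.txt -r C0000000010 -Ou | \
        bcftools call -mv -Ou | \
        bcftools view -m2 -M2 -v snps -q 0.01:minor -Oz -o candidate_sites.vcf.gz
    # Rebuild the posfile (chr, pos, ref, alt) from the surviving sites
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' candidate_sites.vcf.gz > pos_filtered.txt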

suestring7 commented 6 years ago

Thanks!

Hmmm, the limit is 512 GB per node, if I didn't read my HPC's documentation wrong... And the SNP count is 216,020. I only ran SNPs with frequency between 5% and 95%...

I will try the chunking idea later...

suestring7 commented 6 years ago

Hi, @rwdavies

When I used /usr/bin/time -v, I got this:

     Command being timed: "STITCH.R --chr=C0000000010 --bamlist=bamlist_colony_noS46.txt --posfile=pos_colony_noS46.txt --outputdir=./noS46_20 --K=80 --nGen=80 --nCores=40"
    User time (seconds): 19528.45
    System time (seconds): 7969.93
    Percent of CPU this job got: 1207%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 37:56.68
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    **Maximum resident set size (kbytes): 67350768**
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 89467
    Minor (reclaiming a frame) page faults: 1306334144
    Voluntary context switches: 340316
    Involuntary context switches: 145845
    Swaps: 0
    File system inputs: 7532208
    File system outputs: 227560
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 1

The bold line is what I take to be the maximum memory use of the program, and it is about 64.23 GB... So what do you think?

rwdavies commented 6 years ago

So quite high RAM usage, which makes sense given there are 216,020 SNPs. Also, the nodes have 512 GB of RAM, but how much is that per core? Just wondering: if it's something like 4 GB / core, then 8 cores might not be enough (32 GB < 64.23 GB), depending on how much RAM the 8-core vs 40-core version would use.

One other thought: it might be worthwhile to re-install Rcpp, RcppArmadillo and STITCH, in that order, in case anything weird happened with the compiler settings for those libraries between different package installations.
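Roughly like this, where the STITCH tarball is whichever release you are running (1.3.6 according to your sessionInfo) and the CRAN mirror is just an example:

    # Reinstall in dependency order so everything is built with the same toolchain
    R -e 'install.packages("Rcpp", repos = "https://cloud.r-project.org")'
    R -e 'install.packages("RcppArmadillo", repos = "https://cloud.r-project.org")'
    R CMD INSTALL STITCH_1.3.6.tar.gz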

I'm also happy at this point, if you are willing and able, for you to send me the data for one chromosome (the "input" RData files in ./input/ and the posfile); I can try to run the data many times on my end and see if I can generate a reproducible error.

Also, in general, how many iterations does it get through before it crashes?

suestring7 commented 6 years ago

Well, I give up! The link to my files is here: https://hpc.oit.uci.edu/~ytao7/notWork.tar.gz

I cut the contig that I want out of the BAM files, and tried cutting the SNP list into smaller files. Neither of them worked out.

The compilers come from the HPC modules, which I would trust... Hmmm, I will try the reinstall as a last resort if nothing else resolves this issue. :'(

It only got through one iteration... Sorry, there were too many error files and I deleted a lot of them, so I'm not sure how far the other tries got.

Thanks.

rwdavies commented 6 years ago

Hey, sorry got delayed a few days on this. I tried running your data for K=4,8,12,16,20, each 5 times, using 4 cores on Univa Grid Engine Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz with 256 GB RAM / node (16 GB RAM / core). The K=4, 8 and 12 each ran fine, while 2/5 K=16 ran OK with the other 3 dying from what I assume is lack of RAM. All K=20 died. I will try re-running this to confirm the K=16 and K=20 died of RAM, as I left this too late and the job entries are no longer visible using qstat, however the error messages in the log files contain no R errors and just end with Start of iteration <number> - <time>, which based on past experiences indicates the job ran out of RAM and was killed.

One interesting thing is that the RAM reported by /usr/bin/time for the succeeding jobs is way lower than the maximum that should be available. For the 2 succeeding K=16 jobs, the maximum RAM usage according to /usr/bin/time is only 5 GB, while there should be 64 GB available (!). I'm wondering if this has something to do with fragmentation of memory in RAM. I'll see if I can understand this in more detail.

In any case, I didn't obviously see a bug in STITCH other than potentially unexpectedly high RAM requirements, which I still suspect is the driver of your problem, versus some other more simple programmatic bug.

rwdavies commented 6 years ago

Hey, so I spent a good chunk of the last few days on this. I re-wrote some chunks of STITCH to use pre-allocated matrices to reduce matrix re-allocation in the hopes this would lessen the impact of memory fragmentation and solve this problem. This has the nice side benefit of making things slightly faster (and I think I will push these changes when done), but didn't impact whether runs succeeded in running through grid engine.

What I instead learned is that my interpretation of /usr/bin/time was incorrect - it only measures the parent process and not the child processes forked in R by multicore. In my testing, using nCores=4 and K=8, I saw maxvmem of about 18 GB while /usr/bin/time was consistent at about 4 GB. Not quite a perfect fit (4 * 4 < 18), but plausibly close, up to some other overhead grid engine is measuring, and this trend was consistent across the range of K I looked at and across 5 replicates for each K (see image: RAM is "Maximum resident set size" and time is "Elapsed (wall clock) time" from /usr/bin/time -v, while maxvmem and exit_status are from Univa Grid Engine qacct -j). As expected, my jobs started dying when using > 64 GB RAM, as there is 16 GB RAM available per core in my environment.

[image: stitch_with_pre_allocated_matrices_test_runs]
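If you want a crude live view of what the parent plus the forked children are actually using, something like this sketch could work (pgrep/ps options may vary by system, and because forked children share copy-on-write pages this will over-count somewhat):

    # Sum resident memory of the STITCH.R parent and its direct children every minute
    STITCH_PID=$(pgrep -of "STITCH.R")
    while kill -0 "$STITCH_PID" 2>/dev/null; do
        ps -o rss= -p "$STITCH_PID" --ppid "$STITCH_PID" | \
            awk '{ sum += $1 } END { printf "total RSS: %.1f GB\n", sum / 1024 / 1024 }'
        sleep 60
    done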

In conclusion, with your data I never encountered a programmatic bug of STITCH, just program failure due to RAM use outside system boundaries. I speculate that the errors you saw were from child processes being killed when using too much RAM. Imputing 200,000 SNPs at once with high K is likely not feasible using the current implementation, and although this might change with time, as there are more heroic ways to lower RAM usage at the expense of run time, I don't think I'll have the chance to implement them in the near future.

In general, in your situation, I would recommend chunking and using buffers. Also, since you are imputing a relatively small number of samples, I would suggest only considering K up to about 10.

suestring7 commented 6 years ago

Thanks a lot! Happy that you got some good side benefits out of this.

And are you saying that if I impute fewer SNPs at once, the problem would be fixed?

Yeah, I am going to test how well it imputes with different Ks. Maybe I just don't need that many. And it's always good to know the size limits of the inputs.

rwdavies commented 6 years ago

I'm not 100% sure, but fairly confident, that if you use a smaller number of SNPs at once (like 1 Mbp chunks with 500 kbp buffers), you'll be fine.
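For what it's worth, a chunked run could look something like the sketch below; regionStart, regionEnd and buffer are the STITCH options I have in mind (check STITCH.R --help on your version), and the chunk size, K and coordinates are placeholders to adapt to your contig:

    # Array job idea: task i imputes the i-th 1 Mbp chunk with 500 kbp buffers
    CHUNK=1000000
    BUFFER=500000
    START=$(( (SGE_TASK_ID - 1) * CHUNK + 1 ))
    END=$(( SGE_TASK_ID * CHUNK ))
    STITCH.R --chr=C0000000010 \
             --bamlist=bamlist_colony_noS46.txt \
             --posfile=pos_colony_noS46.txt \
             --outputdir=./noS46_chunk_${SGE_TASK_ID} \
             --regionStart=${START} --regionEnd=${END} --buffer=${BUFFER} \
             --K=8 --nGen=80 --nCores=1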