marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

trimReads segmentation fault #495

Closed: johnomics closed this issue 7 years ago

johnomics commented 7 years ago

trimReads failed with a segmentation fault while running commit fa3eede13415829dbb5c8abc769e73d18f3be658. From canu-scripts/canu.07.out:

-- Starting command on Sat May 13 03:37:44 2017 with 2291.659 GB free disk space

    cd trimming/3-overlapbasedtrimming
    /cluster/john/bin/canu/canu/Linux-amd64/bin/trimReads \
      -G  ../asm.gkpStore \
      -O  ../asm.ovlStore \
      -Co ./asm.1.trimReads.clear \
      -e  0.065 \
      -minlength 3000 \
      -ol 1 \
      -oc 1 \
      -o  ./asm.1.trimReads \
    >     ./asm.1.trimReads.err 2>&1
sh: line 9: 35085 Segmentation fault      (core dumped) /cluster/john/bin/canu/canu/Linux-amd64/bin/trimReads -G ../asm.gkpStore -O ../asm.ovlStore -Co ./asm.1.trimReads.clear -e 0.065 -minlength 3000 -ol 1 -oc 1 -o ./asm.1.trimReads > ./asm.1.trimReads.err 2>&1

-- Finished on Sat May 13 03:37:46 2017 (2 seconds) with 2291.659 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 139.  (rc=35584)
ERROR:
Use of uninitialized value $error[0] in join or string at /usr/share/perl5/vendor_perl/Carp.pm line 458.
================================================================================
Please panic.  Canu failed, and it shouldn't have.

Stack trace:

 at /cluster/john/bin/canu/canu/Linux-amd64/bin/lib/canu/OverlapBasedTrimming.pm line 85.
        canu::OverlapBasedTrimming::trimReads("asm") called at /cluster/john/bin/canu/canu/Linux-amd64/bin/canu line 571

Last few lines of the relevant log file (trimming/3-overlapbasedtrimming/asm.1.trimReads.err):

Processing from ID 1 to 53505 out of 53505 reads.

Failed with 'Segmentation fault'; backtrace (libbacktrace):
AS_UTL/AS_UTL_stackTrace.C::102 in _Z17AS_UTL_catchCrashiP9siginfo_tPv()
(null)::0 in (null)()
(null)::0 in (null)()
overlapBasedTrimming/trimReads.C::331 in main()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()

Canu snapshot v1.5 +12 changes (r8212 fa3eede13415829dbb5c8abc769e73d18f3be658) failed with:
  trimReads failed

overlapBasedTrimming/trimReads.C::331 is the last line of this call from main:

      isGood = largestCovered(ovl, ovlLen,
                              read,
                              ibgn, iend, fbgn, fend,
                              logMsg,
                              errorValue,
                              minEvidenceOverlap,
                              minEvidenceCoverage,
                              minReadLength);

Here's the canu command:

/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscormhap="-l h_vmem=20G" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID gnuplotImageFormat=svg -p asm -d asm_canu genomeSize=13.3m  obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 corMhapSensitivity=normal corOutCoverage=100 minOverlapLength=3000 minReadLength=3000 correctedErrorRate=0.065 -nanopore-raw reads.fastq.gz

I'm trying to optimise our assembly, but maybe some of these options are poor choices; I'm not sure what's relevant, so I'll try to explain them all. obtOvlHashBlockLength is reduced to speed up the trimming and unitigging overlap steps: previous runs with the default of 100 Mb launched only 5 overlap jobs, which took a few days to run even though there was plenty of room on the cluster, while this run launched 136 jobs that all completed in 3 hours (see the rough estimate below). The overlap store appears to have been created successfully: file sizes look sensible and the logs are clean.
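For reference, the job count lines up with a rough back-of-the-envelope estimate, assuming roughly one overlap job per hash block of input sequence (the total comes from the corrected-read stats reported further down this thread):

    jobs ≈ total corrected bases / obtOvlHashBlockLength
         = 1,353,218,428 / 10,000,000
         ≈ 135   (136 jobs observed)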

minOverlapLength and minReadLength are increased because the minimum read length in our top 100x of coverage is 14 kb, and we have some complex repeats. I increased minReadLength simply so that it wasn't shorter than minOverlapLength. But is it a bad idea for these to be the same? Should I increase minReadLength a little?

This is R9.4 1D data, so I started with correctedErrorRate=0.075 (from the FAQ) and then reduced it by one percentage point to 0.065 because we have high coverage (following the Parameter Reference). corOutCoverage and corMhapSensitivity are increased because I'm trying to resolve some complex haplotypes. The gridOptionscormhap setting is another clumsy attempt to get Java to play nicely with our cluster; that's still not really resolved (https://github.com/marbl/canu/issues/298), but it worked this time, with no sign of any errors during the correction stage.

Canu configuration from initial job:

-- Canu snapshot v1.5 +12 changes (r8212 fa3eede13415829dbb5c8abc769e73d18f3be658)
-- Detected Java(TM) Runtime Environment '1.8.0_112' (from 'java').
-- Detected gnuplot version '5.0 patchlevel 5' (from 'gnuplot') and image format 'svg'.
-- Detected 64 CPUs and 252 gigabytes of memory.
-- Detected Sun Grid Engine in '/SGE/sge8.1.8/default'.
-- Detected Grid Engine environment 'smp'.
-- Detected Grid Engine consumable 'h_vmem'.
--
-- Found   1 host  with   8 cores and    7 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  346 GB memory under Sun Grid Engine control.
-- Found   1 host  with  20 cores and  251 GB memory under Sun Grid Engine control.
-- Found   1 host  with 160 cores and  755 GB memory under Sun Grid Engine control.
-- Found  26 hosts with   4 cores and    3 GB memory under Sun Grid Engine control.
-- Found   7 hosts with  64 cores and  251 GB memory under Sun Grid Engine control.
-- Found   1 host  with   4 cores and    5 GB memory under Sun Grid Engine control.
-- Found   4 hosts with   8 cores and   15 GB memory under Sun Grid Engine control.
-- Found   1 host  with  24 cores and   62 GB memory under Sun Grid Engine control.
--
-- Run under grid control using    5 GB and   4 CPUs for stage 'meryl'.
-- Run under grid control using    5 GB and   4 CPUs for stage 'mhap (cor)'.
-- Run under grid control using    5 GB and   4 CPUs for stage 'overlapper (obt)'.
-- Run under grid control using    5 GB and   4 CPUs for stage 'overlapper (utg)'.
-- Run under grid control using    6 GB and   2 CPUs for stage 'falcon_sense'.
-- Run under grid control using    2 GB and   1 CPU  for stage 'ovStore bucketizer'.
-- Run under grid control using    8 GB and   1 CPU  for stage 'ovStore sorting'.
-- Run under grid control using    2 GB and   4 CPUs for stage 'read error detection'.
-- Run under grid control using    1 GB and   1 CPU  for stage 'overlap error adjustment'.
-- Run under grid control using    3 GB and   4 CPUs for stage 'bogart'.
-- Run under grid control using    2 GB and   1 CPU  for stage 'GFA alignment and processing'.
-- Run under grid control using   10 GB and   4 CPUs for stage 'consensus'.
--
-- Generating assembly 'asm' in '/cluster/john/assembly/asm_canu'
--
-- Parameters:
--
--  genomeSize        13300000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.0650 (  6.50%)
--    utgOvlErrorRate 0.0650 (  6.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.0650 (  6.50%)
--    utgErrorRate    0.0650 (  6.50%)
--    cnsErrorRate    0.0650 (  6.50%)
----------------------------------------
brianwalenz commented 7 years ago

Just to answer the questions:

The obtOvlHashBlockLength optimization won't impact anything. We probably should make it smaller for smaller assemblies. If it's set too small on a large assembly, though, you can end up with tens of thousands of jobs, which is a pain to run.

There are no algorithmic restrictions on minOverlapLength and minReadLength. It doesn't make a whole lot of sense to have minOverlap > minRead (reads shorter than minOverlap won't have any overlaps), so that isn't allowed. minRead (obviously) throws out data that you're claiming won't be useful for assembly, while minOverlap throws out data that could be used to get across low-coverage areas, at the risk of confusing the assembler. If your coverage supports it, increasing minReadLength past the nasty repeats is ideal (see the sketch below). However, any repeat not spanned by a read won't be assembled through, and edges won't exist in the GFA outputs.
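To illustrate the constraint (hypothetical values, not a recommendation):

    canu ... minReadLength=10000 minOverlapLength=3000 ...    # accepted: every kept read can form overlaps
    canu ... minReadLength=3000 minOverlapLength=10000 ...    # rejected: kept reads could never reach minOverlapLength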

johnomics commented 7 years ago

Thanks for the clarifications. I'll continue to use these options.

Any thoughts on what's causing the crash? I now have three separate assemblies for three separate species failing at this point (I re-ran the assembly above with a larger minReadLength just in case this was causing problems). Here are the command lines (sorry for being cagey about the species, they're not my samples):

/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscormhap="-l h_vmem=4G -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID gnuplotImageFormat=svg -p asm -d asm_canu genomeSize=13.3m  obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 corMhapSensitivity=normal corOutCoverage=100 minOverlapLength=3000 minReadLength=10000 correctedErrorRate=0.065 -nanopore-raw asm.fastq.gz

/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscormhap="-l h_vmem=4g -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID -p asm2 -d asm2_canu genomeSize=14.1m obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 correctedErrorRate=0.075 -nanopore-raw asm2.fastq.gz

/cluster/john/bin/canu/canu/Linux-amd64/bin/canu gridOptions="-q cluster" gridOptionscor="-l h_vmem=4g -pe smp 12" stopOnReadQuality=false stageDirectory=/data/tmp/john-\$JOB_ID-\$SGE_TASK_ID -p asm3 -d asm3_canu genomeSize=100m correctedErrorRate=0.075 obtOvlHashBlockLength=10000000 utgOvlHashBlockLength=10000000 -nanopore-raw asm3.fastq.gz

Corrected read stats going into trimming, from seqkit:

file                          format  type  num_seqs        sum_len  min_len   avg_len  max_len  sum_gap     N50      L50
asm.correctedReads.fasta.gz   FASTA   DNA     53,521  1,353,218,428   10,097  25,283.9   98,585        0  25,676   20,259
asm2.correctedReads.fasta.gz  FASTA   DNA      9,011    552,864,587    1,121  61,354.4  217,770        0  61,160    3,661
asm3.correctedReads.fasta.gz  FASTA   DNA    431,393  3,016,419,984    1,001   6,992.3   62,835        0   9,948  104,858

All overlap stores appear to build successfully, and the errors are exactly the same as above.

Is it worth retrying with the standard OvlHashBlockLength settings? Or rolling back to 1.5, or forward to the newest commit?

brianwalenz commented 7 years ago

I don't see anything obvious that could break right there. I can instrument the code and let you run that.

Can you recompile with BUILDDEBUG=1 and manually run the trimReads command? Bonus points if you can run in gdb (gdb -silent --args <trimReads and options>) then type 'where' to get a better stack trace.
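Concretely, using the options from the failing command in the log above (same paths, assumed unchanged, with the output redirection dropped so gdb can interact), that would look something like:

    cd trimming/3-overlapbasedtrimming
    gdb -silent --args /cluster/john/bin/canu/canu/Linux-amd64/bin/trimReads \
      -G  ../asm.gkpStore \
      -O  ../asm.ovlStore \
      -Co ./asm.1.trimReads.clear \
      -e  0.065 \
      -minlength 3000 \
      -ol 1 \
      -oc 1 \
      -o  ./asm.1.trimReads
    (gdb) run
    (gdb) where    # after the crash, prints the full stack trace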

johnomics commented 7 years ago

Ugh, sorry, dumb mistake: I didn't run 'make clean' before 'make' after pulling. Recompiled after cleaning and it runs fine. Apologies, and thanks for your help.
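For anyone hitting the same thing, the fix was just a clean rebuild (the src path is assumed from the install location above; BUILDDEBUG=1 is only needed if you want the debug symbols mentioned in the previous comment):

    cd /cluster/john/bin/canu/canu/src    # assumed source checkout location
    make clean
    make                                  # or: make BUILDDEBUG=1 for a debug build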

brianwalenz commented 7 years ago

I had that happen too: overlapInCore was crashing spectacularly and oddly. Not sure what caused it; the Makefile should have rebuilt everything. Odd file dates left over from the pull, perhaps?

Anyway, you got one 'bug fix' out of this - that uninitialized value in Carp.pm.