marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu fail: overlap error adj. job #1448

Closed: WallyL closed this issue 5 years ago

WallyL commented 5 years ago

I'm assembling several PacBio-sequenced multiplexed bacterial genomes (Sequel data) using Canu ver. 1.7. All jobs were run on a Linux cluster using a single node with 32 cores and 500 GB RAM.

The assemblies have been refractory and too fragmented, so I've been testing different assembly approaches after correcting and trimming the raw data separately for each sample.

All assemblies completed except this one (the same sample has completed in all other tests using different correction/trim/assembly parameters). I realize that the coverage is way in excess here, but at this point I'm trying anything.

I got the "Don't panic/restart" error, so I restarted the assembly 3-4 times, but to no avail.

Correction and trimming both ran to completion; the assembly failed using the following commands:


```
ml canu/1.7-foss-2016b

canu -correct \
-p sample5 -d sample5 \
genomeSize=5.0m \
corMhapSensitivity=high corOutCoverage=999 corMinCoverage=0 \
-pacbio-raw ../Raw_BAMs/sample5.fa \
gnuplotTested=true useGrid=false \
maxMemory=500 maxThreads=30

canu -trim \
-p sample5 -d sample5 \
genomeSize=5.0m \
-pacbio-corrected sample5/sample5.correctedReads.fasta.gz \
gnuplotTested=true useGrid=false \
maxMemory=500 maxThreads=30

canu -assemble \
-p sample5 -d sample5_ALL_erate_0.020 \
genomeSize=5.0m \
correctedErrorRate=0.020 \
-pacbio-corrected ../Corrected_Data/sample5/sample5.trimmedReads.fasta.gz \
gnuplotTested=true useGrid=false \
maxMemory=500 maxThreads=30 1>std.error 2>std.out
```

###############################################
And here's the std.out for the assembly failure:
###############################################

```
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_112' (from '/usr/local/apps/eb/Java/1.8.0_112/bin/java').
-- Detected 32 CPUs and 503 gigabytes of memory.
-- Limited to 500 gigabytes from maxMemory option.
-- Limited to 30 CPUs from maxThreads option.
-- Detected PBS/Torque '6.1.3' with 'pbsnodes' binary in /opt/apps/torque/6.1.1.1/bin/pbsnodes.
-- Grid engine disabled per useGrid=false option.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl      8 GB    4 CPUs x   1 job      8 GB    4 CPUs  (k-mer counting)
-- Local: cormhap    6 GB   15 CPUs x   2 jobs    12 GB   30 CPUs  (overlap detection with mhap)
-- Local: obtovl     4 GB    6 CPUs x   5 jobs    20 GB   30 CPUs  (overlap detection)
-- Local: utgovl     4 GB    6 CPUs x   5 jobs    20 GB   30 CPUs  (overlap detection)
-- Local: ovb        4 GB    1 CPU  x  30 jobs   120 GB   30 CPUs  (overlap store bucketizer)
-- Local: ovs        8 GB    1 CPU  x  30 jobs   240 GB   30 CPUs  (overlap store sorting)
-- Local: red        4 GB    3 CPUs x  10 jobs    40 GB   30 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x  30 jobs   120 GB   30 CPUs  (overlap error adjustment)
-- Local: bat       16 GB    4 CPUs x   1 job     16 GB    4 CPUs  (contig construction)
-- Local: gfa        8 GB    4 CPUs x   1 job      8 GB    4 CPUs  (GFA alignment and processing)
--
-- In 'Ah_51307.gkpStore', found PacBio reads:
--   Raw:        0
--   Corrected:  601120
--   Trimmed:    601120
--
-- Generating assembly 'Ah_51307' in '/scratch/qbcg/McGarey_project/Canu_tests/Ah_51307_ALL_erate_0.020'
--
-- Parameters:
--
--  genomeSize        5000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0200 (  2.00%)
--    utgOvlErrorRate 0.0200 (  2.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0200 (  2.00%)
--    utgErrorRate    0.0200 (  2.00%)
--    cnsErrorRate    0.0200 (  2.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Mon Aug 19 16:52:46 2019 with 419679.705 GB free disk space (1 processes; 30 concurrently)

    cd unitigging/3-overlapErrorAdjustment
    ./oea.sh 4 > ./oea.000004.out 2>&1

-- Finished on Mon Aug 19 16:52:54 2019 (8 seconds) with 419674.213 GB free disk space
----------------------------------------
--
-- Overlap error adjustment jobs failed, retry.
--   job 00004.oea FAILED.
--
--
-- Running jobs.  Second attempt out of 2.
----------------------------------------
-- Starting 'oea' concurrent execution on Mon Aug 19 16:52:54 2019 with 419674.213 GB free disk space (1 processes; 30 concurrently)

    cd unitigging/3-overlapErrorAdjustment
    ./oea.sh 4 > ./oea.000004.out 2>&1

-- Finished on Mon Aug 19 16:52:57 2019 (3 seconds) with 419671.793 GB free disk space
----------------------------------------
--
-- Overlap error adjustment jobs failed, tried 2 times, giving up.
--   job 00004.oea FAILED.
--

ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
```
brianwalenz commented 5 years ago

I suspect you won't want to redo analyses, but there is a newer version available.

```
git clone https://github.com/marbl/canu.git
cd canu/src
git checkout v1.9
make -j 8
```

Back to your failure. What's in unitigging/3-overlapErrorAdjustment/oea.000004.out? It's failing quickly, which is usually a good sign.

WallyL commented 5 years ago

Thanks, Brian, I will definitely update to the latest version for future runs... haven't run Canu in a while.

Here is the output. Looks like the dreaded segmentation fault...


```
Initializing.
Opening gkpStore '../Ah_51307.gkpStore'.
Correcting reads 159526 to 212935.
Reading 3572082 corrections from './red.red'.
Correcting 300057619 bases with 2192621 indel adjustments.
--Allocate 286 + 16 + 1 MB for bases, adjusts and reads.
Corrected 299859587 bases with 68891 substitutions, 198035 deletions and 3 insertions.
Loading overlaps.
Read_Olaps()--  Loading 39844812 overlaps from '../Ah_51307.ovlStore' for reads 159526 to 212935
--Allocate 1215 MB for overlaps.
Read_Olaps()--  Loaded 39844812 overlaps -- 19956174 normal and 19888638 innie.
Sorting overlaps.

Failed with 'Segmentation fault'; backtrace (libbacktrace):
AS_UTL/AS_UTL_stackTrace.C::97 in _Z17AS_UTL_catchCrashiP9siginfo_tPv()
(null)::0 in (null)()
overlapErrorAdjustment/correctOverlaps.H::172 in _ZN18Olap_Info_t_by_bIDclERK11Olap_Info_tS2_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/predefined_ops.h::123 in _ZN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEclIP11Olap_Info_tS6_EEbT_T0_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1897 in _ZSt21__unguarded_partitionIP11Olap_Info_tN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEET_S7_S7_S7_T0_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1918 in _ZSt27__unguarded_partition_pivotIP11Olap_Info_tN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEET_S7_S7_T0_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1948 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1949 in _ZSt16__introsort_loopIP11Olap_Info_tlN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_T1_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::1963 in _ZSt6__sortIP11Olap_Info_tN9__gnu_cxx5__ops15_Iter_comp_iterI18Olap_Info_t_by_bIDEEEvT_S7_T0_()
/usr/local/apps/eb/GCCcore/5.4.0/include/c++/5.4.0/bits/stl_algo.h::4729 in _ZNSt9__cxx19984sortIP11Olap_Info_t18Olap_Info_t_by_bIDEEvT_S4_T0_()
overlapErrorAdjustment/correctOverlaps.C::169 in main()
(null)::0 in (null)()
(null)::0 in (null)()
(null)::0 in (null)()
./oea.sh: line 109: 29775 Segmentation fault      (core dumped) $bin/correctOverlaps -G ../Ah_51307.gkpStore -O ../Ah_51307.ovlStore -R $minid $maxid -e 0.020 -l 500 -c ./red.red -o ./$jobid.oea.WORKING
```
brianwalenz commented 5 years ago

If you can change code, change line 178 of src/overlapErrorAdjustment/correctOverlaps.H from

    return(a.innie != b.innie);

to

    return(a.innie < b.innie);

and recompile.
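
Presumably the rationale is that std::sort requires a strict weak ordering: a comparator that returns a.innie != b.innie claims "a before b" and "b before a" at the same time whenever the two flags differ, which introsort can turn into out-of-bounds accesses like the one in the backtrace, whereas a.innie < b.innie is a valid ordering. A minimal sketch of applying the suggested one-line change and rebuilding, assuming the v1.7 source layout from the clone above and that the expression appears exactly as quoted (the sed pattern is only illustrative):

```
# edit the comparator in correctOverlaps.H and rebuild Canu;
# the sed pattern assumes the expression appears exactly as quoted above
cd canu/src
sed -i 's/return(a\.innie != b\.innie);/return(a.innie < b.innie);/' \
    overlapErrorAdjustment/correctOverlaps.H
make -j 8
```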

WallyL commented 5 years ago

Will do, thanks again!

brianwalenz commented 5 years ago

Please let me know if this works or not. It's just a guess.

WallyL commented 5 years ago

Brian, I changed the code as you suggested, but it didn't work...

Since my other 4 assemblies actually got worse with more data, it's doubtful this sample will improve. If you have other suggestions for code changes to fix this particular bug in v1.7 and want me to test them, I'll be happy to do so, but I'm going to run any new tests with one of the later versions.

It looks like version 1.8 is the latest released version; would you still recommend that I move to v1.9, which is developmental? I have both installed.

Best, Walt

brianwalenz commented 5 years ago

Well, darn. Would you be willing to share the failing assembly so I can debug it? FTP directions are in the FAQ.

v1.9 is suggested. It is essentially the next release - I'm not entirely sure why we aren't making an actual release for it.

WallyL commented 5 years ago

Not sure what all you needed, so I made a tar file, WallyL_failed_v1.7_run.tar, with everything in it, i.e. the correction/trim dir. and the assembly dir. It's ~50 GB and the upload is complete.

skoren commented 5 years ago

Saw the file there, but I had some trouble extracting it; do you mind re-sending it and posting the md5 here to confirm? The unitigging folder should be sufficient for debugging, no need for the correction/trimming folders. You can also gzip it to make it smaller if you want the transfer to be faster.
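
For reference, a minimal sketch of packaging just the unitigging directory and recording a checksum so the transfer can be verified on the other end (file names are illustrative):

```
# gzip-compress the unitigging directory into one tarball and print its MD5
tar -czf unitigging.tar.gz unitigging/
md5sum unitigging.tar.gz
```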

WallyL commented 5 years ago

Hi Sergey,
We had some connectivity issues in our building last week, so it possibly got corrupted during transfer. I am uploading a tarball of the unitigging dir. now.
The md5 = 7731a794ba14ba5190c09fd836b2a0c8

Walt

skoren commented 5 years ago

Got it now, thanks; the MD5 looks correct as well.

skoren commented 5 years ago

Can you also send the Ah_51307.gkpStore from one level up from unitigging? I forgot to tell you to run tar so that it follows symlinks. It also looks like you ran canu 1.8 on this folder after 1.7 failed? That would likely not work, but it shouldn't have caused your initial error.
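
If the gkpStore (or anything inside it) is a symlink back to the correction/trimming runs, GNU tar's -h (--dereference) flag archives the files the links point to instead of the links themselves, so the tarball is self-contained. A sketch, with the file name taken from the comment above:

```
# -h / --dereference makes tar follow symlinks and archive their targets
tar -czhf Ah_51307.gkpStore.tgz Ah_51307.gkpStore/
md5sum Ah_51307.gkpStore.tgz
```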

WallyL commented 5 years ago

Yeah, I was trying different things to get past the assembly error, so I thought I'd try to assemble the trimmed reads with the later version. So, as a rule, this is not a good idea?

The reason I ask is that I have corrected and trimmed these 5 samples 3 different ways (the v1.7 assembly error occurred only on the 51307 sample corrected as described above).

However, for another v1.7-corrected set (corMhapSensitivity=high corOutCoverage=200 corMinCoverage=6) I ran the v1.9 assembler, pointed it to the corrected/trimmed dirs., and everything appears to have run just fine after removing gnuplotTested=true, which is no longer recognized.

Ah_51307.gkpStore.tgz file is now uploaded. The md5 is 44c70e7af05a0833df428366d4ce5afa

skoren commented 5 years ago

Sorry for getting back so late. We've made some large changes in this code, and when I tested your assembly with the new code I couldn't reproduce the crash, so I think it's been fixed as part of those changes. The assembly isn't great, but you seem to have super-high coverage here; downsampling to 100x or 200x before assembly may help (one way to do this is sketched after this comment). Here are the stats:

Total units: 326
BasesInFasta: 8465850
Min: 2,420
Max: 677,648
N25: 279,570 COUNT: 5
N50: 138,022 COUNT: 14
N75: 18,472 COUNT: 78

Were you ever able to get an assembly? Do you want me to share this one?
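
One way to downsample the trimmed reads before re-running canu -assemble is sketched below. It uses seqtk, which is not part of Canu and is not mentioned in the thread, so treat it purely as an illustration; the file names follow the commands earlier in the thread, and the only real logic is the coverage arithmetic.

```
#!/usr/bin/env bash
# Sketch: randomly downsample the trimmed reads to ~100x of a 5 Mb genome
# before re-running 'canu -assemble'.
GENOME_SIZE=5000000
TARGET_COV=100
READS=sample5.trimmedReads.fasta.gz

# total bases currently in the read set
TOTAL=$(zcat "$READS" | awk '/^>/ {next} {n += length($0)} END {print n}')

# fraction of reads to keep so the expected coverage is ~TARGET_COV
# (assumes the current coverage is well above the target, so FRAC < 1)
FRAC=$(echo "$TARGET_COV * $GENOME_SIZE / $TOTAL" | bc -l)

# keep that fraction of reads, seeded for reproducibility
seqtk sample -s 42 "$READS" "$FRAC" | gzip > sample5.trimmedReads.100x.fasta.gz
```

The downsampled file would then replace sample5.trimmedReads.fasta.gz in the -pacbio-corrected argument of the -assemble command shown earlier.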

WallyL commented 5 years ago

Thanks, no worries. I have generated several assemblies for this sample using ver. 1.9, but I am getting similar metrics to yours... very fragmented for PacBio bacterial data.

All 5 of the genomes in this run are refractory to single-contig assembly; we've also run them through PacBio HGAP4 and got similar results to Canu. We are assuming that it is either a sample quality / library issue or that there are some weird structural/repeat effects... definitely not E. coli! The size should be around 4.8-5.2 Mb. If I run the Knot software post-Canu, I can get some samples down to 5-12 contigs.

We are in the process of generating some ONT data to throw in the mix and hopefully that will resolve things. I'll keep you posted, if you're interested.

skoren commented 5 years ago

Is this a clonal sample, a plate scrape, or something else? Given the genome size, it seems there may be more than one strain present, which would cause the assembly to be split and the genome to come out bigger. If that's the case, ONT isn't going to help, since it will capture the same variations again.

I'm going to close the issue since the oea issue is fixed, feel free to comment with any updates on it though. I'll also try down-sampling the coverage on your data and see if that results in any improvements.