marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu hanging on trimming process #653

Closed Sashanity closed 6 years ago

Sashanity commented 6 years ago

I'm running the canu 1.6 release. I am running all 3 steps (correction, trimming, assembly) separately, and I am assembling nanopore data. I am trying to trim the corrected reads that canu produced, but the trimming process has been running for about 3 days now without any progress, so I think something is wrong with it. Here is the command:

/Users/katjasch/canu-1.6/Darwin-amd64/bin/canu -trim \
-p DB146_6_2_925 \
-d DB146_6_2_925 \
genomeSize=12.1m \
-nanopore-corrected /Users/katjasch/canu-1.6/src/src/DB146_6/DB146_6.correctedReads.fasta.gz \
gnuplotTested=true \
> canu_trim.log 2>&1

Here is the .log file for the process.

-- Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
--   Li H.
--   Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
--   Bioinformatics. 2016 Jul 15;32(14):2103-10.
--   http://doi.org/10.1093/bioinformatics/btw152
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_51' (from 'java').
-- Detected 24 CPUs and 64 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |  algorithm
--        -------  ------  --------   --------  -----------------------------
-- Local: meryl      8 GB    4 CPUs x   6 jobs  (k-mer counting)
-- Local: cormhap    6 GB   12 CPUs x   2 jobs  (overlap detection with mhap)
-- Local: obtovl     8 GB    8 CPUs x   3 jobs  (overlap detection)
-- Local: utgovl     8 GB    8 CPUs x   3 jobs  (overlap detection)
-- Local: cor        6 GB    2 CPUs x  12 jobs  (read correction)
-- Local: ovb        2 GB    1 CPU  x  24 jobs  (overlap store bucketizer)
-- Local: ovs        8 GB    1 CPU  x  24 jobs  (overlap store sorting)
-- Local: red        2 GB    4 CPUs x   6 jobs  (read error detection)
-- Local: oea        1 GB    1 CPU  x  24 jobs  (overlap error adjustment)
-- Local: bat       10 GB    4 CPUs x   6 jobs  (contig construction)
-- Local: cns       10 GB    4 CPUs x   6 jobs  (consensus)
-- Local: gfa        8 GB    4 CPUs x   6 jobs  (GFA alignment and processing)
--
-- Found Nanopore corrected reads in the input files.
--
-- Generating assembly 'DB146_6_2_925' in '/Users/katjasch/canu-1.6/src/src/DB146_6_2_925/DB146_6_2_925/trimRes'
--
-- Parameters:
--
--  genomeSize        12100000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1440 ( 14.40%)
--    utgOvlErrorRate 0.1440 ( 14.40%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1440 ( 14.40%)
--    utgErrorRate    0.1440 ( 14.40%)
--    cnsErrorRate    0.1920 ( 19.20%)
--
--
-- BEGIN TRIMMING
--
----------------------------------------
-- Starting command on Mon Sep 25 20:17:41 2017 with 402.086 GB free disk space

    cd trimming
    /Users/katjasch/canu-1.6/Darwin-amd64/bin/gatekeeperCreate \
      -minlength 1000 \
      -o ./DB146_6_2_925.gkpStore.BUILDING \
      ./DB146_6_2_925.gkpStore.gkp \
    > ./DB146_6_2_925.gkpStore.BUILDING.err 2>&1

-- Finished on Mon Sep 25 20:17:46 2017 (5 seconds) with 401.978 GB free disk space
----------------------------------------
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
--
-- In gatekeeper store 'trimming/DB146_6_2_925.gkpStore':
--   Found 9591 reads.
--   Found 453158363 bases (37.45 times coverage).
--
--   Read length histogram (one '*' equals 35.42 reads):
--        0   4999    109 ***
--     5000   9999     41 *
--    10000  14999     23 
--    15000  19999     18 
--    20000  24999     13 
--    25000  29999     19 
--    30000  34999   1053 *****************************
--    35000  39999   2480 **********************************************************************
--    40000  44999   1749 *************************************************
--    45000  49999   1204 *********************************
--    50000  54999    809 **********************
--    55000  59999    590 ****************
--    60000  64999    415 ***********
--    65000  69999    260 *******
--    70000  74999    225 ******
--    75000  79999    136 ***
--    80000  84999     99 **
--    85000  89999     92 **
--    90000  94999     64 *
--    95000  99999     38 *
--   100000 104999     35 
--   105000 109999     24 
--   110000 114999     14 
--   115000 119999     12 
--   120000 124999     14 
--   125000 129999     11 
--   130000 134999      9 
--   135000 139999      8 
--   140000 144999      6 
--   145000 149999      2 
--   150000 154999      3 
--   155000 159999      2 
--   160000 164999      3 
--   165000 169999      2 
--   170000 174999      2 
--   175000 179999      1 
--   180000 184999      0 
--   185000 189999      0 
--   190000 194999      0 
--   195000 199999      0 
--   200000 204999      0 
--   205000 209999      1 
--   210000 214999      0 
--   215000 219999      1 
--   220000 224999      1 
--   225000 229999      0 
--   230000 234999      0 
--   235000 239999      0 
--   240000 244999      0 
--   245000 249999      0 
--   250000 254999      0 
--   255000 259999      0 
--   260000 264999      0 
--   265000 269999      0 
--   270000 274999      0 
--   275000 279999      1 
--   280000 284999      0 
--   285000 289999      0 
--   290000 294999      0 
--   295000 299999      0 
--   300000 304999      1 
--   305000 309999      0 
--   310000 314999      0 
--   315000 319999      0 
--   320000 324999      0 
--   325000 329999      0 
--   330000 334999      0 
--   335000 339999      0 
--   340000 344999      0 
--   345000 349999      1 
-- Finished stage 'obt-gatekeeper', reset canuIteration.
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'meryl' concurrent execution on Mon Sep 25 20:17:46 2017 with 401.977 GB free disk space (1 processes; 6 concurrently)

    cd trimming/0-mercounts
    ./meryl.sh 1 > ./meryl.000001.out 2>&1

-- Finished on Mon Sep 25 20:20:11 2017 (145 seconds) with 401.641 GB free disk space
----------------------------------------
-- Meryl finished successfully.
-- Finished stage 'merylCheck', reset canuIteration.
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1  39274206 *******************************************************************--> 0.6491 0.0867
--       2-     2   4135646 ********************************************************************** 0.7174 0.1050
--       3-     4   2614202 ********************************************                           0.7449 0.1160
--       5-     7   1538528 **************************                                             0.7713 0.1315
--       8-    11   1193334 ********************                                                   0.7918 0.1503
--      12-    16   1162802 *******************                                                    0.8098 0.1754
--      17-    22   1548774 **************************                                             0.8288 0.2133
--      23-    29   3092400 ****************************************************                   0.8564 0.2898
--      30-    37   3519697 ***********************************************************            0.9103 0.4858
--      38-    46   1743053 *****************************                                          0.9648 0.7340
--      47-    56    452705 *******                                                                0.9901 0.8763
--      57-    67     85324 *                                                                      0.9965 0.9204
--      68-    79     25864                                                                        0.9976 0.9299
--      80-    92     17695                                                                        0.9981 0.9340
--      93-   106     14819                                                                        0.9983 0.9372
--     107-   121     17880                                                                        0.9986 0.9405
--     122-   137     16748                                                                        0.9989 0.9450
--     138-   154     13833                                                                        0.9992 0.9498
--     155-   172      8791                                                                        0.9994 0.9542
--     173-   191      3286                                                                        0.9995 0.9572
--     192-   211      2505                                                                        0.9996 0.9585
--     212-   232      1518                                                                        0.9996 0.9596
--     233-   254      1065                                                                        0.9996 0.9603
--     255-   277      1205                                                                        0.9997 0.9609
--     278-   301       928                                                                        0.9997 0.9616
--     302-   326       850                                                                        0.9997 0.9621
--     327-   352       981                                                                        0.9997 0.9627
--     353-   379      1150                                                                        0.9997 0.9635
--     380-   407      1117                                                                        0.9997 0.9644
--     408-   436      1291                                                                        0.9998 0.9654
--     437-   466      1624                                                                        0.9998 0.9666
--     467-   497       886                                                                        0.9998 0.9682
--     498-   529       465                                                                        0.9998 0.9691
--     530-   562       526                                                                        0.9998 0.9696
--     563-   596       361                                                                        0.9998 0.9703
--     597-   631       346                                                                        0.9998 0.9707
--     632-   667       269                                                                        0.9998 0.9712
--     668-   704       301                                                                        0.9998 0.9716
--     705-   742       223                                                                        0.9999 0.9721
--     743-   781       248                                                                        0.9999 0.9724
--     782-   821       231                                                                        0.9999 0.9728
--
--       27464 (max occurrences)
--   413682746 (total mers, non-unique)
--    21231623 (distinct mers, non-unique)
--    39274206 (unique mers)
----------------------------------------
-- Starting command on Mon Sep 25 20:20:11 2017 with 401.641 GB free disk space

    cd trimming/0-mercounts
    /Users/katjasch/canu-1.6/Darwin-amd64/bin/meryl \
      -Dt \
      -n 1851 \
      -s ./DB146_6_2_925.ms22 \
    > ./DB146_6_2_925.ms22.frequentMers.fasta \
    2> ./DB146_6_2_925.ms22.frequentMers.fasta.err

-- Finished on Mon Sep 25 20:20:13 2017 (2 seconds) with 401.641 GB free disk space
----------------------------------------
-- Reset obtOvlMerThreshold from auto to 1851.
--
-- Found 452956952 22-mers; 60505829 distinct and 39274206 unique.  Largest count 27464.
-- Finished stage 'obt-meryl', reset canuIteration.
--
-- OVERLAPPER (normal) (trimming) erate=0.144
--
----------------------------------------
-- Starting command on Mon Sep 25 20:20:14 2017 with 401.977 GB free disk space

    cd trimming/1-overlapper
    /Users/katjasch/canu-1.6/Darwin-amd64/bin/overlapInCorePartition \
     -g  ../DB146_6_2_925.gkpStore \
     -bl 100000000 \
     -bs 0 \
     -rs 2000000 \
     -rl 0 \
     -ol 500 \
     -o  ./DB146_6_2_925.partition \
    > ./DB146_6_2_925.partition.err 2>&1

-- Finished on Mon Sep 25 20:20:14 2017 (lickety-split) with 401.977 GB free disk space
----------------------------------------
--
-- Configured 5 overlapInCore jobs.
-- Finished stage 'obt-overlapConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'obtovl' concurrent execution on Mon Sep 25 20:20:14 2017 with 401.977 GB free disk space (5 processes; 3 concurrently)

    cd trimming/1-overlapper
    ./overlap.sh 1 > ./overlap.000001.out 2>&1
    ./overlap.sh 2 > ./overlap.000002.out 2>&1
    ./overlap.sh 3 > ./overlap.000003.out 2>&1

skoren commented 6 years ago

This is a known issue when you have very long reads (e.g. the 300k+ reads in your dataset) and are using a high error rate for the corrected nanopore reads. See issue #521. You can try the fast options recommended there (overlapper=mhap utgReAlign=true). You can also lower correctedErrorRate from its default of 0.144, which is meant to handle most nanopore data including older R7, to something like 0.1 or 0.12, assuming you are using a recent chemistry and base caller.

Sashanity commented 6 years ago

Thank you for the quick response! I'll try it as you suggested.

Sashanity commented 6 years ago

@skoren does it make sense to run them together, i.e. a lower correctedErrorRate plus overlapper=mhap utgReAlign=true?

skoren commented 6 years ago

The alternate parameters are fast enough that you don't have to lower the error rate, so you can use either one or the other.
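
For anyone landing here later, the two alternatives discussed above could be sketched roughly as follows. This is only an illustration: the paths, prefix and other options are copied from the original command, the option names are the ones quoted in the reply, 0.12 is just one of the suggested values, and `trim_cmd` is a hypothetical helper that prints each command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch only: prints the two alternative trim commands instead of running them.
# Paths and prefix are taken from the original report; adjust them for your setup.

canu_bin=/Users/katjasch/canu-1.6/Darwin-amd64/bin/canu
reads=/Users/katjasch/canu-1.6/src/src/DB146_6/DB146_6.correctedReads.fasta.gz

# trim_cmd EXTRA_OPTION...
# Hypothetical helper: echoes the trim command with the extra options added.
trim_cmd() {
  echo "$canu_bin" -trim \
    -p DB146_6_2_925 \
    -d DB146_6_2_925 \
    genomeSize=12.1m \
    "$@" \
    -nanopore-corrected "$reads" \
    gnuplotTested=true
}

# Alternative A: the fast overlapper options recommended in issue #521.
trim_cmd overlapper=mhap utgReAlign=true

# Alternative B: keep the default overlapper but lower the error rate.
trim_cmd correctedErrorRate=0.12
```

To actually run one of them, copy the printed line into the terminal and add the `> canu_trim.log 2>&1` redirection from the original command. Per the reply above, only one of the two alternatives is needed.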