marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Two similar nanopore datasets, different assembly outcomes. #781

Closed ml3958 closed 6 years ago

ml3958 commented 6 years ago

Hi, I am using Canu 1.6 to assemble two closely related bacterial genomes of about 2.5 Mb from nanopore whole-genome sequencing. However, I was able to get a closed, circularized assembly for one genome but not the other.

The genome I was able to close had 240X coverage, while the unclosed one had 150X. For the latter, there is 13% repeat-cont in the unitigging step. I tried to increase contiguity by setting corOutCoverage=100, but that actually made the assembly worse.
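For anyone checking the coverage figures, here is a minimal Python sketch of the arithmetic, assuming coverage is simply total bases divided by the genome size given to Canu. The base counts and reported coverages are copied from the CORRECTION/READS sections of the two reports below; the function names are only illustrative.

```python
# Coverage = total bases / genome size, so the genome size each run used
# can be back-calculated from the numbers in the report.

def coverage(total_bases: int, genome_size: float) -> float:
    """Fold coverage for a given genome size."""
    return total_bases / genome_size

def implied_genome_size(total_bases: int, reported_coverage: float) -> float:
    """Genome size implied by a reported 'times coverage' value."""
    return total_bases / reported_coverage

# (total bases, reported coverage) from the two CORRECTION/READS sections
samples = {
    "closed (HC1)": (605_503_596, 242.2),
    "unclosed (oxk)": (373_323_764, 149.92),
}

for name, (bases, reported) in samples.items():
    size_mb = implied_genome_size(bases, reported) / 1e6
    print(f"{name}: {reported}X reported, implied genome size ~{size_mb:.2f} Mb")
```

If the arithmetic assumption holds, the two runs appear to have used slightly different genome size settings, which should not matter much at this scale.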

I wonder

Any suggestions will be highly appreciated! The reports for the closed genome (first) and the unclosed genome (second) are below.

[CORRECTION/READS]
--
-- In gatekeeper store 'correction/HC1_reanalyze.gkpStore':
--   Found 124108 reads.
--   Found 605503596 bases (242.2 times coverage).
--
--   Read length histogram (one '*' equals 1046.92 reads):
--        0   4999  73285 **********************************************************************
--     5000   9999  43306 *****************************************
--    10000  14999   6238 *****
--    15000  19999   1061 *
--    20000  24999    159 
--    25000  29999     38 
--    30000  34999     11 
--    35000  39999      3 
--    40000  44999      1 
--    45000  49999      1 
--    50000  54999      3 
--    55000  59999      0 
--    60000  64999      0 
--    65000  69999      1 
--    70000  74999      0 
--    75000  79999      0 
--    80000  84999      0 
--    85000  89999      0 
--    90000  94999      0 
--    95000  99999      0 
--   100000 104999      0 
--   105000 109999      0 
--   110000 114999      0 
--   115000 119999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 190531714 *******************************************************************--> 0.7433 0.3156
--       2-     2  36141893 ********************************************************************** 0.8843 0.4354
--       3-     4  16949672 ********************************                                       0.9300 0.4935
--       5-     7   5837604 ***********                                                            0.9617 0.5522
--       8-    11   2309377 ****                                                                   0.9765 0.5947
--      12-    16   1058660 **                                                                     0.9834 0.6246
--      17-    22    539822 *                                                                      0.9869 0.6460
--      23-    29    315268                                                                        0.9887 0.6619
--      30-    37    237540                                                                        0.9898 0.6747
--      38-    46    264947                                                                        0.9908 0.6879
--      47-    56    361368                                                                        0.9918 0.7072
--      57-    67    476709                                                                        0.9932 0.7395
--      68-    79    511147                                                                        0.9951 0.7896
--      80-    92    405914                                                                        0.9971 0.8517
--      93-   106    224921                                                                        0.9986 0.9078
--     107-   121     85255                                                                        0.9994 0.9426
--     122-   137     25809                                                                        0.9997 0.9574
--     138-   154     10573                                                                        0.9998 0.9625
--     155-   172      6620                                                                        0.9999 0.9649
--     173-   191      4767                                                                        0.9999 0.9667
--     192-   211      3222                                                                        0.9999 0.9681
--     212-   232      2164                                                                        0.9999 0.9691
--     233-   254      1522                                                                        0.9999 0.9699
--     255-   277      1272                                                                        0.9999 0.9705
--     278-   301       991                                                                        1.0000 0.9711
--     302-   326       823                                                                        1.0000 0.9716
--     327-   352       736                                                                        1.0000 0.9720
--     353-   379       608                                                                        1.0000 0.9724
--     380-   407       552                                                                        1.0000 0.9728
--     408-   436       455                                                                        1.0000 0.9731
--     437-   466       392                                                                        1.0000 0.9734
--     467-   497       376                                                                        1.0000 0.9737
--     498-   529       367                                                                        1.0000 0.9740
--     530-   562       321                                                                        1.0000 0.9743
--     563-   596       234                                                                        1.0000 0.9746
--     597-   631       203                                                                        1.0000 0.9749
--     632-   667       196                                                                        1.0000 0.9751
--     668-   704       177                                                                        1.0000 0.9753
--     705-   742       150                                                                        1.0000 0.9755
--     743-   781       157                                                                        1.0000 0.9757
--     782-   821       127                                                                        1.0000 0.9759
--
--       32981 (max occurrences)
--   413110262 (total mers, non-unique)
--    65787660 (distinct mers, non-unique)
--   190531714 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   8905 reads longer than 9745 bp
--   103573568 bp
-- Expected corrected reads:
--   8905 reads
--   100005637 bp
--   8706 bp minimum length
--   11230 bp mean length
--   22410 bp n50 length
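The numbers in this section can be sanity-checked with a short Python sketch. It assumes a genome size of 2.5 Mb (inferred from the coverage lines in this report, not stated directly) and that the correction target is expressed as fold coverage of that genome size; the helper name is hypothetical.

```python
# Check the "Expected corrected reads" block: mean read length and how many
# fold of the genome the corrected reads are expected to cover.

def correction_summary(n_reads: int, corrected_bp: int, genome_size: int):
    """Return (mean corrected read length, fold coverage of the genome)."""
    mean_len = corrected_bp / n_reads
    fold = corrected_bp / genome_size
    return mean_len, fold

# Values from the report above; genome size of 2.5 Mb is an assumption.
mean_len, fold = correction_summary(8905, 100_005_637, 2_500_000)
print(f"mean corrected length ~{mean_len:.0f} bp, ~{fold:.1f}x of the genome")
```

The result lands close to 40x, which is consistent with what I understand to be Canu's default corOutCoverage of 40 (worth confirming against the parameter reference for your version).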

[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/HC1_reanalyze.gkpStore':
--   Found 9098 reads.
--   Found 99172441 bases (39.66 times coverage).
--
--   Read length histogram (one '*' equals 35.62 reads):
--        0    999      0 
--     1000   1999     31 
--     2000   2999     23 
--     3000   3999     18 
--     4000   4999     33 
--     5000   5999     53 *
--     6000   6999     75 **
--     7000   7999     90 **
--     8000   8999   1578 ********************************************
--     9000   9999   2494 **********************************************************************
--    10000  10999   1523 ******************************************
--    11000  11999    953 **************************
--    12000  12999    623 *****************
--    13000  13999    469 *************
--    14000  14999    333 *********
--    15000  15999    224 ******
--    16000  16999    168 ****
--    17000  17999    134 ***
--    18000  18999     91 **
--    19000  19999     65 *
--    20000  20999     34 
--    21000  21999     24 
--    22000  22999     17 
--    23000  23999     12 
--    24000  24999      6 
--    25000  25999     11 
--    26000  26999      6 
--    27000  27999      3 
--    28000  28999      1 
--    29000  29999      2 
--    30000  30999      1 
--    31000  31999      0 
--    32000  32999      2 
--    33000  33999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   6652858 *******************************************************************--> 0.6195 0.0672
--       2-     2    675910 **************************************************************         0.6825 0.0809
--       3-     4    449882 *****************************************                              0.7088 0.0894
--       5-     7    255675 ***********************                                                0.7347 0.1018
--       8-    11    172525 ***************                                                        0.7531 0.1155
--      12-    16    150759 *************                                                          0.7673 0.1315
--      17-    22    219148 ********************                                                   0.7811 0.1539
--      23-    29    480985 ********************************************                           0.8034 0.2043
--      30-    37    758893 ********************************************************************** 0.8519 0.3478
--      38-    46    626803 *********************************************************              0.9225 0.6120
--      47-    56    233082 *********************                                                  0.9764 0.8599
--      57-    67     44715 ****                                                                   0.9950 0.9633
--      68-    79      9550                                                                        0.9985 0.9866
--      80-    92      4710                                                                        0.9994 0.9933
--      93-   106      1763                                                                        0.9998 0.9970
--     107-   121       804                                                                        0.9999 0.9988
--     122-   137        77                                                                        1.0000 0.9996
--     138-   154        18                                                                        1.0000 0.9997
--     155-   172        12                                                                        1.0000 0.9997
--     173-   191        17                                                                        1.0000 0.9997
--     192-   211        16                                                                        1.0000 0.9998
--     212-   232         5                                                                        1.0000 0.9998
--     233-   254         2                                                                        1.0000 0.9998
--     255-   277         3                                                                        1.0000 0.9998
--     278-   301         4                                                                        1.0000 0.9998
--     302-   326         4                                                                        1.0000 0.9998
--     327-   352         1                                                                        1.0000 0.9999
--     353-   379         2                                                                        1.0000 0.9999
--     380-   407         0                                                                        0.0000 0.0000
--     408-   436         1                                                                        1.0000 0.9999
--     437-   466         1                                                                        1.0000 0.9999
--     467-   497         9                                                                        1.0000 0.9999
--     498-   529         3                                                                        1.0000 0.9999
--     530-   562         0                                                                        0.0000 0.0000
--     563-   596         0                                                                        0.0000 0.0000
--     597-   631         0                                                                        0.0000 0.0000
--     632-   667         0                                                                        0.0000 0.0000
--     668-   704         2                                                                        1.0000 0.9999
--     705-   742         0                                                                        0.0000 0.0000
--     743-   781         0                                                                        0.0000 0.0000
--     782-   821         2                                                                        1.0000 0.9999
--
--         858 (max occurrences)
--    92328525 (total mers, non-unique)
--     4085388 (distinct mers, non-unique)
--     6652858 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--  
--  INPUT READS:
--  -----------
--    9098 reads     99172441 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--    7995 reads     71907257 bases (trimmed reads output)
--    1100 reads     11298140 bases (reads with no change, kept as is)
--       1 reads         1125 bases (reads with no overlaps, deleted)
--       2 reads         2636 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--    4059 reads      6687043 bases (bases trimmed from the 5' end of a read)
--    6964 reads      9276240 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--    9095 reads     99168680 bases (reads processed)
--       3 reads         3761 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--       0 reads            0 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--    9095 reads     99168680 bases (processed for subreads)
--  
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--     726 reads          726 signals (number of subread signal)
--  
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--     726 reads       164795 bases (size of subread signal)
--  
--  TRIMMING:
--  --------
--     259 reads      1569963 bases (trimmed from the 5' end of the read)
--     467 reads      2790246 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/HC1_reanalyze.gkpStore':
--   Found 9095 reads.
--   Found 78845188 bases (31.53 times coverage).
--
--   Read length histogram (one '*' equals 30.4 reads):
--        0    999      0 
--     1000   1999     29 
--     2000   2999     28 
--     3000   3999     29 
--     4000   4999    273 ********
--     5000   5999    823 ***************************
--     6000   6999   1050 **********************************
--     7000   7999    836 ***************************
--     8000   8999   1793 **********************************************************
--     9000   9999   2128 **********************************************************************
--    10000  10999   1090 ***********************************
--    11000  11999    534 *****************
--    12000  12999    224 *******
--    13000  13999    116 ***
--    14000  14999     59 *
--    15000  15999     29 
--    16000  16999     18 
--    17000  17999     16 
--    18000  18999      7 
--    19000  19999      5 
--    20000  20999      4 
--    21000  21999      1 
--    22000  22999      0 
--    23000  23999      0 
--    24000  24999      0 
--    25000  25999      0 
--    26000  26999      2 
--    27000  27999      0 
--    28000  28999      0 
--    29000  29999      1

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   3274640 *******************************************************************--> 0.4788 0.0416
--       2-     2    444192 ***************************************                                0.5437 0.0529
--       3-     4    314520 ***************************                                            0.5724 0.0604
--       5-     7    199397 *****************                                                      0.6019 0.0717
--       8-    11    154350 *************                                                          0.6253 0.0857
--      12-    16    170876 ***************                                                        0.6462 0.1046
--      17-    22    366877 ********************************                                       0.6726 0.1394
--      23-    29    789268 *********************************************************************  0.7341 0.2513
--      30-    37    795551 ********************************************************************** 0.8543 0.5351
--      38-    46    278499 ************************                                               0.9604 0.8485
--      47-    56     36502 ***                                                                    0.9940 0.9701
--      57-    67      7551                                                                        0.9980 0.9877
--      68-    79      5016                                                                        0.9990 0.9933
--      80-    92      1608                                                                        0.9997 0.9977
--      93-   106       356                                                                        0.9999 0.9993
--     107-   121        40                                                                        1.0000 0.9997
--     122-   137        19                                                                        1.0000 0.9997
--     138-   154        14                                                                        1.0000 0.9997
--     155-   172        15                                                                        1.0000 0.9998
--     173-   191         2                                                                        1.0000 0.9998
--     192-   211         2                                                                        1.0000 0.9998
--     212-   232         1                                                                        1.0000 0.9998
--     233-   254         3                                                                        1.0000 0.9998
--     255-   277         1                                                                        1.0000 0.9998
--     278-   301         2                                                                        1.0000 0.9998
--     302-   326         1                                                                        1.0000 0.9998
--     327-   352         0                                                                        0.0000 0.0000
--     353-   379         1                                                                        1.0000 0.9998
--     380-   407         1                                                                        1.0000 0.9999
--     408-   436         5                                                                        1.0000 0.9999
--     437-   466         4                                                                        1.0000 0.9999
--     467-   497         3                                                                        1.0000 0.9999
--     498-   529         0                                                                        0.0000 0.0000
--     530-   562         0                                                                        0.0000 0.0000
--     563-   596         2                                                                        1.0000 0.9999
--     597-   631         0                                                                        0.0000 0.0000
--     632-   667         0                                                                        0.0000 0.0000
--     668-   704         1                                                                        1.0000 0.9999
--     705-   742         3                                                                        1.0000 1.0000
--     743-   781         3                                                                        1.0000 1.0000
--     782-   821         0                                                                        0.0000 0.0000
--
--         760 (max occurrences)
--    75379553 (total mers, non-unique)
--     3564686 (distinct mers, non-unique)
--     3274640 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing          0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   middle-hump             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   no-5-prime              0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   no-3-prime              0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   
--   low-coverage            0    0.00        0.00 +- 0.00             0.00 +- 0.00       (easy to assemble, potential for lower quality consensus)
--   unique               8542   93.92     8646.61 +- 2234.96         31.31 +- 6.70       (easy to assemble, perfect, yay)
--   repeat-cont             6    0.07     5326.00 +- 954.48          68.09 +- 7.54       (potential for consensus errors, no impact on assembly)
--   repeat-dove             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat           409    4.50     9231.35 +- 2482.60       1912.52 +- 1873.64    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont      104    1.14     8069.54 +- 1813.81                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove       33    0.36    10067.21 +- 2421.49                             (will end contigs, potential to misassemble)
--   uniq-anchor             1    0.01     6800.00 +- 0.00           762.00 +- 0.00       (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      4 sequences, total length 2503743 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  812 sequences, total length 6655931 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     2453376             1     2453376
--     20     2453376             1     2453376
--     30     2453376             1     2453376
--     40     2453376             1     2453376
--     50     2453376             1     2453376
--     60     2453376             1     2453376
--     70     2453376             1     2453376
--     80     2453376             1     2453376
--     90     2453376             1     2453376
--    100       11275             4     2503743
--
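To show how a table like the one above can be read, here is a rough Python sketch of one common NG/LG definition: sort contigs by length and report the contig at which the running sum first reaches a given percentage of the genome size. Only the largest (2453376 bp) and smallest (11275 bp) contig lengths and the total (2503743 bp) come from the report; the two middle lengths are hypothetical placeholders chosen to add up to the reported total, and the 2.5 Mb genome size is an assumption.

```python
# NG(x): length of the contig at which the cumulative length of the sorted
# contigs first reaches x% of the genome size. LG(x) is that contig's rank.

def ng_table(contig_lengths, genome_size, steps=range(10, 101, 10)):
    lengths = sorted(contig_lengths, reverse=True)
    rows = []
    for pct in steps:
        target = genome_size * pct / 100
        running = 0
        for rank, length in enumerate(lengths, start=1):
            running += length
            if running >= target:
                rows.append((pct, length, rank, running))
                break
        # If the contigs never reach the target, no row is emitted for pct.
    return rows

# Middle two lengths (20000, 19092) are HYPOTHETICAL; only their sum is
# constrained by the reported total of 2503743 bp.
contigs = [2_453_376, 20_000, 19_092, 11_275]

rows = ng_table(contigs, genome_size=2_500_000)
for pct, ng, lg, total in rows:
    print(f"{pct:4d} {ng:10d} {lg:4d} {total:10d}")
```

Under these assumptions the single large contig already covers up to NG90, and all four contigs are needed to reach 100% of the genome size, which matches the shape of the table.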

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      4 sequences, total length 2507673 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  812 sequences, total length 6655931 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     2457261             1     2457261
--     20     2457261             1     2457261
--     30     2457261             1     2457261
--     40     2457261             1     2457261
--     50     2457261             1     2457261
--     60     2457261             1     2457261
--     70     2457261             1     2457261
--     80     2457261             1     2457261
--     90     2457261             1     2457261
--    100       11261             4     2507673
--

[CORRECTION/READS]
--
-- In gatekeeper store 'correction/oxk_reanalyze.gkpStore':
--   Found 87002 reads.
--   Found 373323764 bases (149.92 times coverage).
--
--   Read length histogram (one '*' equals 980.4 reads):
--        0   4999  68628 **********************************************************************
--     5000   9999  13389 *************
--    10000  14999   3092 ***
--    15000  19999   1152 *
--    20000  24999    456 
--    25000  29999    165 
--    30000  34999     63 
--    35000  39999     25 
--    40000  44999     12 
--    45000  49999     10 
--    50000  54999      3 
--    55000  59999      2 
--    60000  64999      1 
--    65000  69999      0 
--    70000  74999      0 
--    75000  79999      2 
--    80000  84999      0 
--    85000  89999      0 
--    90000  94999      0 
--    95000  99999      0 
--   100000 104999      0 
--   105000 109999      0 
--   110000 114999      0 
--   115000 119999      0 
--   120000 124999      0 
--   125000 129999      1 
--   130000 134999      0 
--   135000 139999      0 
--   140000 144999      0 
--   145000 149999      0 
--   150000 154999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 139467612 *******************************************************************--> 0.8056 0.3749
--       2-     2  19571809 ********************************************************************** 0.9186 0.4801
--       3-     4   7533803 **************************                                             0.9493 0.5230
--       5-     7   2400405 ********                                                               0.9689 0.5626
--       8-    11   1109326 ***                                                                    0.9781 0.5916
--      12-    16    896286 ***                                                                    0.9835 0.6176
--      17-    22    889387 ***                                                                    0.9885 0.6524
--      23-    29    637103 **                                                                     0.9934 0.6987
--      30-    37    273892                                                                        0.9967 0.7394
--      38-    46     95985                                                                        0.9981 0.7608
--      47-    56     50063                                                                        0.9986 0.7706
--      57-    67     35173                                                                        0.9988 0.7772
--      68-    79     26339                                                                        0.9990 0.7829
--      80-    92     20711                                                                        0.9992 0.7880
--      93-   106     17006                                                                        0.9993 0.7927
--     107-   121     13955                                                                        0.9994 0.7972
--     122-   137     11568                                                                        0.9995 0.8014
--     138-   154      9695                                                                        0.9995 0.8054
--     155-   172      7955                                                                        0.9996 0.8092
--     173-   191      6828                                                                        0.9996 0.8127
--     192-   211      5775                                                                        0.9997 0.8160
--     212-   232      4929                                                                        0.9997 0.8191
--     233-   254      4350                                                                        0.9997 0.8220
--     255-   277      3849                                                                        0.9998 0.8249
--     278-   301      3352                                                                        0.9998 0.8276
--     302-   326      2906                                                                        0.9998 0.8302
--     327-   352      2648                                                                        0.9998 0.8326
--     353-   379      2285                                                                        0.9998 0.8350
--     380-   407      2045                                                                        0.9998 0.8373
--     408-   436      1755                                                                        0.9999 0.8394
--     437-   466      1590                                                                        0.9999 0.8414
--     467-   497      1454                                                                        0.9999 0.8433
--     498-   529      1236                                                                        0.9999 0.8452
--     530-   562      1125                                                                        0.9999 0.8469
--     563-   596      1050                                                                        0.9999 0.8486
--     597-   631       900                                                                        0.9999 0.8502
--     632-   667       867                                                                        0.9999 0.8517
--     668-   704       827                                                                        0.9999 0.8532
--     705-   742       789                                                                        0.9999 0.8547
--     743-   781       706                                                                        0.9999 0.8562
--     782-   821       623                                                                        0.9999 0.8577
--
--       24742 (max occurrences)
--   232551122 (total mers, non-unique)
--    33664021 (distinct mers, non-unique)
--   139467612 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   10582 reads longer than 7215 bp
--   108805317 bp
-- Expected corrected reads:
--   10582 reads
--   99600765 bp
--   5415 bp minimum length
--   9412 bp mean length
--   32937 bp n50 length

[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/oxk_reanalyze.gkpStore':
--   Found 11213 reads.
--   Found 99490521 bases (39.95 times coverage).
--
--   Read length histogram (one '*' equals 32.45 reads):
--        0    999      0 
--     1000   1999    136 ****
--     2000   2999    123 ***
--     3000   3999    371 ***********
--     4000   4999    179 *****
--     5000   5999   1757 ******************************************************
--     6000   6999   2272 **********************************************************************
--     7000   7999   1554 ***********************************************
--     8000   8999   1026 *******************************
--     9000   9999    780 ************************
--    10000  10999    582 *****************
--    11000  11999    440 *************
--    12000  12999    387 ***********
--    13000  13999    298 *********
--    14000  14999    231 *******
--    15000  15999    187 *****
--    16000  16999    171 *****
--    17000  17999    143 ****
--    18000  18999    127 ***
--    19000  19999    109 ***
--    20000  20999     63 *
--    21000  21999     65 **
--    22000  22999     31 
--    23000  23999     30 
--    24000  24999     19 
--    25000  25999     26 
--    26000  26999     18 
--    27000  27999     15 
--    28000  28999     14 
--    29000  29999     13 
--    30000  30999     12 
--    31000  31999      3 
--    32000  32999      3 
--    33000  33999      2 
--    34000  34999      5 
--    35000  35999      5 
--    36000  36999      2 
--    37000  37999      4 
--    38000  38999      2 
--    39000  39999      3 
--    40000  40999      1 
--    41000  41999      1 
--    42000  42999      1 
--    43000  43999      0 
--    44000  44999      0 
--    45000  45999      1 
--    46000  46999      0 
--    47000  47999      0 
--    48000  48999      0 
--    49000  49999      0 
--    50000  50999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   6618613 *******************************************************************--> 0.6059 0.0667
--       2-     2    794170 ********************************************************************** 0.6786 0.0827
--       3-     4    527970 **********************************************                         0.7092 0.0928
--       5-     7    310459 ***************************                                            0.7391 0.1073
--       8-    11    216780 *******************                                                    0.7613 0.1241
--      12-    16    200274 *****************                                                      0.7791 0.1444
--      17-    22    274324 ************************                                               0.7972 0.1743
--      23-    29    487021 ******************************************                             0.8236 0.2344
--      30-    37    786891 *********************************************************************  0.8716 0.3786
--      38-    46    542225 ***********************************************                        0.9434 0.6504
--      47-    56    134346 ***********                                                            0.9875 0.8552
--      57-    67     10373                                                                        0.9973 0.9096
--      68-    79      4718                                                                        0.9981 0.9153
--      80-    92      3416                                                                        0.9985 0.9187
--      93-   106      1852                                                                        0.9989 0.9216
--     107-   121      1482                                                                        0.9990 0.9234
--     122-   137      1928                                                                        0.9992 0.9252
--     138-   154      1229                                                                        0.9993 0.9277
--     155-   172       274                                                                        0.9994 0.9294
--     173-   191       326                                                                        0.9995 0.9298
--     192-   211       437                                                                        0.9995 0.9304
--     212-   232       212                                                                        0.9995 0.9313
--     233-   254       134                                                                        0.9995 0.9317
--     255-   277        91                                                                        0.9996 0.9321
--     278-   301       205                                                                        0.9996 0.9323
--     302-   326       128                                                                        0.9996 0.9329
--     327-   352        74                                                                        0.9996 0.9333
--     353-   379        36                                                                        0.9996 0.9336
--     380-   407        81                                                                        0.9996 0.9337
--     408-   436        86                                                                        0.9996 0.9340
--     437-   466        73                                                                        0.9996 0.9344
--     467-   497        37                                                                        0.9996 0.9347
--     498-   529        70                                                                        0.9996 0.9349
--     530-   562        67                                                                        0.9996 0.9353
--     563-   596        81                                                                        0.9996 0.9356
--     597-   631        70                                                                        0.9997 0.9361
--     632-   667        53                                                                        0.9997 0.9365
--     668-   704        37                                                                        0.9997 0.9369
--     705-   742        40                                                                        0.9997 0.9371
--     743-   781        19                                                                        0.9997 0.9374
--     782-   821        56                                                                        0.9997 0.9376
--
--        2458 (max occurrences)
--    92636435 (total mers, non-unique)
--     4305656 (distinct mers, non-unique)
--     6618613 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--  
--  INPUT READS:
--  -----------
--   11213 reads     99490521 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--    8891 reads     75133717 bases (trimmed reads output)
--    2309 reads     17880087 bases (reads with no change, kept as is)
--      11 reads        33505 bases (reads with no overlaps, deleted)
--       2 reads         2027 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--    4510 reads      2980639 bases (bases trimmed from the 5' end of a read)
--    7429 reads      3460546 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--   11200 reads     99454989 bases (reads processed)
--      13 reads        35532 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--       0 reads            0 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--   11200 reads     99454989 bases (processed for subreads)
--  
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--     156 reads          157 signals (number of subread signal)
--  
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--     157 reads        62245 bases (size of subread signal)
--  
--  TRIMMING:
--  --------
--      76 reads       374102 bases (trimmed from the 5' end of the read)
--      81 reads       338108 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/oxk_reanalyze.gkpStore':
--   Found 11199 reads.
--   Found 92300598 bases (37.06 times coverage).
--
--   Read length histogram (one '*' equals 32.84 reads):
--        0    999      0 
--     1000   1999    134 ****
--     2000   2999    165 *****
--     3000   3999    518 ***************
--     4000   4999    324 *********
--     5000   5999   1852 ********************************************************
--     6000   6999   2299 **********************************************************************
--     7000   7999   1560 ***********************************************
--     8000   8999   1031 *******************************
--     9000   9999    755 **********************
--    10000  10999    562 *****************
--    11000  11999    439 *************
--    12000  12999    341 **********
--    13000  13999    267 ********
--    14000  14999    190 *****
--    15000  15999    163 ****
--    16000  16999    149 ****
--    17000  17999    104 ***
--    18000  18999    112 ***
--    19000  19999     80 **
--    20000  20999     46 *
--    21000  21999     41 *
--    22000  22999     11 
--    23000  23999     14 
--    24000  24999      5 
--    25000  25999      9 
--    26000  26999      8 
--    27000  27999      5 
--    28000  28999      5 
--    29000  29999      4 
--    30000  30999      1 
--    31000  31999      1 
--    32000  32999      1 
--    33000  33999      0 
--    34000  34999      0 
--    35000  35999      1 
--    36000  36999      0 
--    37000  37999      2

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   5428849 *******************************************************************--> 0.5686 0.0590
--       2-     2    710515 *************************************************************          0.6430 0.0744
--       3-     4    482923 *****************************************                              0.6749 0.0843
--       5-     7    291071 *************************                                              0.7065 0.0988
--       8-    11    212510 ******************                                                     0.7306 0.1159
--      12-    16    211710 ******************                                                     0.7508 0.1377
--      17-    22    308050 **************************                                             0.7733 0.1728
--      23-    29    586575 **************************************************                     0.8076 0.2464
--      30-    37    807032 ********************************************************************** 0.8734 0.4324
--      38-    46    412248 ***********************************                                    0.9543 0.7193
--      47-    56     70114 ******                                                                 0.9919 0.8829
--      57-    67      7574                                                                        0.9974 0.9117
--      68-    79      3992                                                                        0.9981 0.9162
--      80-    92      2990                                                                        0.9985 0.9193
--      93-   106      1767                                                                        0.9988 0.9220
--     107-   121      2139                                                                        0.9990 0.9238
--     122-   137      1570                                                                        0.9992 0.9265
--     138-   154       302                                                                        0.9993 0.9286
--     155-   172       236                                                                        0.9994 0.9290
--     173-   191       506                                                                        0.9994 0.9295
--     192-   211       249                                                                        0.9994 0.9305
--     212-   232       168                                                                        0.9995 0.9310
--     233-   254       103                                                                        0.9995 0.9314
--     255-   277       160                                                                        0.9995 0.9317
--     278-   301       169                                                                        0.9995 0.9321
--     302-   326        96                                                                        0.9995 0.9327
--     327-   352        35                                                                        0.9995 0.9330
--     353-   379        79                                                                        0.9996 0.9331
--     380-   407        89                                                                        0.9996 0.9335
--     408-   436        74                                                                        0.9996 0.9338
--     437-   466        45                                                                        0.9996 0.9342
--     467-   497        71                                                                        0.9996 0.9344
--     498-   529        59                                                                        0.9996 0.9347
--     530-   562        92                                                                        0.9996 0.9351
--     563-   596        76                                                                        0.9996 0.9356
--     597-   631        50                                                                        0.9996 0.9361
--     632-   667        32                                                                        0.9996 0.9364
--     668-   704        44                                                                        0.9996 0.9366
--     705-   742        32                                                                        0.9996 0.9370
--     743-   781        46                                                                        0.9996 0.9373
--     782-   821        41                                                                        0.9996 0.9376
--
--        2279 (max occurrences)
--    86636570 (total mers, non-unique)
--     4119092 (distinct mers, non-unique)
--     5428849 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing          0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   middle-hump             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   no-5-prime              1    0.01     1025.00 +- 0.00            69.00 +- 0.00       (bad trimming)
--   no-3-prime              2    0.02     3170.50 +- 3048.34        360.50 +- 427.80     (bad trimming)
--   
--   low-coverage            0    0.00        0.00 +- 0.00             0.00 +- 0.00       (easy to assemble, potential for lower quality consensus)
--   unique               8648   77.22     8577.70 +- 3805.35         33.65 +- 6.88       (easy to assemble, perfect, yay)
--   repeat-cont          1480   13.22     5384.24 +- 1567.55       1888.96 +- 383.96     (potential for consensus errors, no impact on assembly)
--   repeat-dove            17    0.15     7505.12 +- 2095.58       1084.98 +- 484.73     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat           789    7.05     9620.21 +- 4432.27       2277.42 +- 2589.93    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont      188    1.68     8710.03 +- 3153.39                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove       47    0.42    11857.19 +- 5755.98                             (will end contigs, potential to misassemble)
--   uniq-anchor            27    0.24     8588.52 +- 3073.73       2899.00 +- 2692.22    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      24 sequences, total length 2699686 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  1707 sequences, total length 15372120 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      744637             1      744637
--     20      744637             1      744637
--     30      625602             2     1370239
--     40      625602             2     1370239
--     50      625602             2     1370239
--     60      620290             3     1990529
--     70      620290             3     1990529
--     80      182551             4     2173080
--     90      112050             5     2285130
--    100       25721            12     2512717
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      24 sequences, total length 2703896 bp (including 0 repeats of total length 0 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  1707 sequences, total length 15372316 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10      744827             1      744827
--     20      744827             1      744827
--     30      626069             2     1370896
--     40      626069             2     1370896
--     50      626069             2     1370896
--     60      623545             3     1994441
--     70      623545             3     1994441
--     80      623545             3     1994441
--     90      112510             5     2290119
--    100       28367            11     2492301
--

Thanks a lot!

skoren commented 6 years ago

The contiguity of the assembly depends primarily on the read length, repeat content of the genome, and heterozygosity. If these genomes have similar repeat content and you have relatively similar average read lengths, it could be the heterozygosity.

The higher repeat-cont fraction in the second genome indicates there are many reads of about 5-7 kbp with extremely high overlap coverage (>1000, where the normal coverage is 35x). This could be either a very abundant repeat in the genome or a contaminant in the sample. The GFA output (asm.unitigging.gfa) should have more information: if the graph looks like it has lots of alternate paths, it is likely a heterozygous sample. I suggest running mash screen (http://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes) to see what is in the sample. It won't discriminate similar strains, but it will identify mixtures of bacteria/viruses.
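To check this directly from a report, here is a minimal Python sketch that reads rows of the [UNITIGGING/OVERLAPS] table and compares the mean overlap coverage of `repeat-cont` reads with that of `unique` reads. The function names are illustrative and not part of Canu, and the regular expression only assumes the row layout seen in the report above; the sample rows are copied from that report.

```python
import re

# Row layout assumed from the report above, e.g.
# --   repeat-cont   1480   13.22   5384.24 +- 1567.55   1888.96 +- 383.96   (...)
ROW = re.compile(
    r"^--\s+(?P<cat>[\w-]+)\s+(?P<reads>\d+)\s+(?P<pct>[\d.]+)\s+"
    r"(?P<len>[\d.]+)\s*\+-\s*[\d.]+"
    r"(?:\s+(?P<cov>[\d.]+)\s*\+-\s*[\d.]+)?"
)

SAMPLE = """
--   category            reads     %          read length        feature size or coverage  analysis
--   unique               8648   77.22     8577.70 +- 3805.35         33.65 +- 6.88       (easy to assemble, perfect, yay)
--   repeat-cont          1480   13.22     5384.24 +- 1567.55       1888.96 +- 383.96     (potential for consensus errors, no impact on assembly)
--   uniq-repeat-cont      188    1.68     8710.03 +- 3153.39                             (should be uniquely placed)
"""

def parse_overlap_report(text):
    """Return {category: stats} for every data row in the table."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, separator, or blank line
        cov = m.group("cov")
        rows[m.group("cat")] = {
            "reads": int(m.group("reads")),
            "percent": float(m.group("pct")),
            "mean_length": float(m.group("len")),
            # Some categories leave the last numeric column empty.
            "mean_coverage": float(cov) if cov else None,
        }
    return rows

def coverage_ratio(rows, category="repeat-cont", baseline="unique"):
    """How many times deeper `category` reads are covered than `baseline` reads."""
    return rows[category]["mean_coverage"] / rows[baseline]["mean_coverage"]

rows = parse_overlap_report(SAMPLE)
print(f"repeat-cont: {rows['repeat-cont']['percent']}% of reads, "
      f"{coverage_ratio(rows):.1f}x the unique coverage")
```

A ratio far above what a handful of genomic repeat copies would give is only a hint. It does not by itself distinguish an abundant repeat from a contaminant, which is why a screen of the reads is still worth running.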

If it is the heterozygosity, you can try varying the unitigging parameters, try the separation option from the FAQ ('batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50'). You could also try smashing the heterozygous genomes together (corOutCoverage=100 overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 500 -cp 50'). This also turns on the faster overlapping algorithm because the default will be slow at this high error rate. Keep in mind though even if you can smash the assembly into a single contig, the consensus will likely be a mix of all variation in your sample.
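As a rough illustration of how the two parameter sets differ, the following Python sketch only builds the Canu command lines and prints them; it does not run anything. The parameter values are the ones quoted above. The read file name, prefix, output directory, the `canu_command` helper, and the use of `genomeSize=2.5m` (taken from the 2.5 Mb estimate in the original post) are assumptions for illustration.

```python
import shlex

# bogart options quoted in the comment above.
SEPARATE = "-dg 3 -db 3 -dr 1 -ca 500 -cp 50"   # keep haplotypes apart
SMASH = "-dg 50 -db 50 -dr 1 -ca 500 -cp 50"    # collapse variation

def canu_command(reads, prefix, outdir, genome_size="2.5m",
                 smash=False, error_rate=0.20):
    """Build (but do not run) a Canu argv list for raw nanopore reads."""
    argv = ["canu", "-p", prefix, "-d", outdir, f"genomeSize={genome_size}"]
    if smash:
        argv += [
            "corOutCoverage=100",
            "overlapper=mhap",          # faster overlapper at high error rates
            "utgReAlign=true",
            f"correctedErrorRate={error_rate:.2f}",
            f"batOptions={SMASH}",
        ]
    else:
        argv.append(f"batOptions={SEPARATE}")
    argv += ["-nanopore-raw", reads]
    return argv

# Hypothetical file and directory names.
print(shlex.join(canu_command("reads.fastq", "asm", "asm-separate")))
print(shlex.join(canu_command("reads.fastq", "asm", "asm-smash", smash=True)))
```

Building the argument list first and quoting it with `shlex.join` keeps the multi-word `batOptions` value as a single shell argument, which is easy to get wrong when typing the command by hand.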

ml3958 commented 6 years ago

Thanks so much for the input!

As you suggested, I tried mash screen. The results are very surprising! We're assembling the genome of Oxalobacter formigenes. For the data that yielded a good assembly, the top (dominant) hit is an O. formigenes genome.

0.998305    965/1000    49  0   GCF_000158495.1_ASM15849v1_genomic.fna.gz   [9 seqs] NZ_GG658178.1 Oxalobacter formigenes OXCC13 genomic scaffold supercont1.9, whole genome shotgun sequence [...]
0.886148    79/1000 2066    1.12003e-201    GCF_000902695.1_ViralProj183144_genomic.fna.gz  NC_019711.1 Enterobacteria phage HK629, complete genome
0.872481    57/1000 74  1.14783e-137    ref|NZ_CP006375.1|  Aureimonas sp. AU20 plasmid pAU20rrn, complete sequence
0.837273    24/1000 34  9.10938e-50 ref|NZ_CP013748.1|  Arthrobacter sulfonivorans strain 

In contrast, for the data that yielded a poor assembly, the top hits are a phage and plasmids.

0.889737    86/1000 5873    1.03654e-241    GCF_000903575.1_ViralProj183142_genomic.fna.gz  NC_019723.1 Enterobacteria phage HK630, complete genome
0.870998    55/1000 40  3.49853e-144    ref|NZ_CP006375.1|  Aureimonas sp. AU20 plasmid pAU20rrn, complete sequence
0.859885    42/1000 1   1.76113e-105    ref|NC_003789.1|    Klebsiella sp. KCL-2 plasmid pMGD2, complete sequence
0.841982    27/1000 11  3.83478e-63 GCF_000158475.2_Oxal_for_HOxBLS_2_V2_genomic.fna.gz [2 seqs] NZ_KI392030.1 Oxalobacter formigenes HOxBLS genomic scaffold supercont2.1, whole genome shotgun sequence [...]
0.837273    24/1000 19  4.58581e-55 ref|NZ_CP013748.1|  Arthrobacter sulfonivorans strain Ar51 plasmid, complete sequence

So I guess in this case, I should try to discard those reads first and then re-run Canu.

Thanks so much!

skoren commented 6 years ago

The other interesting thing is that the top hit in the first sample shows 99% identity to Oxalobacter, whereas the Oxalobacter hit in the second sample is only 84%. The median multiplicity is also lower, down to 11 from 49. This multiplicity is measured using perfect k-mers, so it is lower than the true coverage due to sequencing error. However, this implies that either there is lower coverage of your target genome here or it is very divergent from what is in the DB. I would guess the super-high multiplicity Enterobacteria phage hits are lambda contamination. Removing the contamination can help. I'd also increase corOutCoverage to 200 just to make sure you don't lose any data from your target genome. If you're able to share the reads (see FAQ for instructions to send it to us using FTP), we can take a look at the data here.
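For anyone comparing several samples this way, here is a small Python sketch that parses mash screen output into records so identity and median multiplicity can be compared side by side. It assumes the column order seen in the output pasted above (identity, shared hashes, median multiplicity, p-value, query ID, query comment). The `Hit` type and function name are made up for this example, and the sample lines are copied from the second sample's results.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    identity: float
    shared: int
    sketch_size: int
    multiplicity: int
    p_value: float
    query_id: str
    comment: str

def parse_mash_screen(text):
    """Parse mash screen lines; returns hits sorted by identity, best first."""
    hits = []
    for line in text.splitlines():
        # Split on any whitespace so both tab-separated and pasted output work;
        # the free-text comment stays intact as the sixth field.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        shared, size = fields[1].split("/")
        hits.append(Hit(float(fields[0]), int(shared), int(size),
                        int(fields[2]), float(fields[3]), fields[4],
                        fields[5] if len(fields) > 5 else ""))
    return sorted(hits, key=lambda h: h.identity, reverse=True)

SAMPLE = """\
0.889737    86/1000 5873    1.03654e-241    GCF_000903575.1_ViralProj183142_genomic.fna.gz  NC_019723.1 Enterobacteria phage HK630, complete genome
0.859885    42/1000 1   1.76113e-105    ref|NC_003789.1|    Klebsiella sp. KCL-2 plasmid pMGD2, complete sequence
0.841982    27/1000 11  3.83478e-63 GCF_000158475.2_Oxal_for_HOxBLS_2_V2_genomic.fna.gz [2 seqs] NZ_KI392030.1 Oxalobacter formigenes HOxBLS genomic scaffold supercont2.1, whole genome shotgun sequence [...]
"""

hits = parse_mash_screen(SAMPLE)
for h in hits:
    print(f"{h.identity:.3f}  mult={h.multiplicity:<5d} {h.comment[:50]}")

# Pull out the hits for the organism you expected to sequence.
target = [h for h in hits if "Oxalobacter" in h.comment]
```

Looking at identity and multiplicity together is the point: a hit with moderate identity but very high multiplicity (the phage here) behaves differently from the target genome at low multiplicity.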

ml3958 commented 6 years ago

Thanks for pointing that out. You're exactly right. We believe we're assembling a novel strain that has not been characterized before, whose genome differs from the existing ones even within the same species.

I've put a seq.fna of 0.44 GB under the incoming/sergek directory. These are the nanopore 1D reads from the genome with the potential lambda contamination.

Many thanks!

skoren commented 6 years ago

Sorry for the delayed reply. After looking at the dataset, there does seem to be some variation in the sample, which is preventing a more contiguous assembly. You can probably get a more contiguous assembly by using the heterozygous smash parameters from the FAQ and increasing the error rate to 0.25 from the default (but it will be slower). Another option is to assemble the Canu-corrected reads with something like smartdenovo, which is more willing to collapse haplotypes.

ml3958 commented 6 years ago

No problem at all. I confirmed the presence of a phage by aligning my reads to Enterobacteria phage HK630. The 10 kb tail region of the phage has a coverage of 40,000X, which suggests that the phage in our sample shares similar sequence at the tail but has a different head region.

I am trying to remove those reads and re-assemble. I'll post the results when I have them.
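A minimal sketch of that filtering step in Python, assuming the alignments to the phage reference are available in PAF format (for example from minimap2; the thread does not say which aligner was used). The function names and the `min_block` threshold are hypothetical choices for illustration.

```python
def contaminant_read_ids(paf_lines, min_block=1000):
    """Collect read names with at least one alignment to the phage reference.

    PAF columns used: 1 = query (read) name, 11 = alignment block length.
    `min_block` is an arbitrary cutoff that ignores very short matches.
    """
    ids = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        if int(fields[10]) >= min_block:
            ids.add(fields[0])
    return ids

def filter_fasta(fasta_lines, drop_ids):
    """Yield FASTA lines for every record whose ID is not in drop_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split(None, 1)
            keep = (name[0] if name else "") not in drop_ids
        if keep:
            yield line

# Tiny made-up example: read2 aligns to the phage over 4 kb, read3 over 200 bp.
paf = [
    "read2\t9000\t100\t4100\t+\tNC_019723.1\t40000\t5\t4005\t3500\t4000\t60\n",
    "read3\t7000\t0\t200\t-\tNC_019723.1\t40000\t900\t1100\t150\t200\t10\n",
]
fasta = [">read1\n", "ACGT\n", ">read2 extra info\n", "GGCC\n", ">read3\n", "TTAA\n"]

drop = contaminant_read_ids(paf)
kept = list(filter_fasta(fasta, drop))
print("".join(kept), end="")
```

Reads are removed whole, so a read that only partially matches the phage is dropped entirely. Raising `min_block`, or requiring that most of the read be aligned, would make the filter more conservative.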

skoren commented 6 years ago

Closing, inactive.

ml3958 commented 6 years ago

Sorry for the delayed response.

As I said in the last comment, 39.97% of my reads partially mapped to an Enterobacteria phage. I removed those reads (about 40%) and reassembled the genome with the default Canu parameters for nanopore data. Unfortunately, this did not improve the assembly.

--
-- In gatekeeper store 'correction/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
--   Found 38120 reads.
--   Found 206984294 bases (83.12 times coverage).
--
--   Read length histogram (one '*' equals 328.68 reads):
--        0   4999  23008 **********************************************************************
--     5000   9999  10406 *******************************
--    10000  14999   2918 ********
--    15000  19999   1096 ***
--    20000  24999    423 *
--    25000  29999    157 
--    30000  34999     57 
--    35000  39999     24 
--    40000  44999     11 
--    45000  49999     10 
--    50000  54999      3 
--    55000  59999      2 
--    60000  64999      1 
--    65000  69999      0 
--    70000  74999      0 
--    75000  79999      2 
--    80000  84999      0 
--    85000  89999      0 
--    90000  94999      0 
--    95000  99999      0 
--   100000 104999      0 
--   105000 109999      0 
--   110000 114999      0 
--   115000 119999      0 
--   120000 124999      0 
--   125000 129999      1 
--   130000 134999      0 
--   135000 139999      0 
--   140000 144999      0 
--   145000 149999      0 
--   150000 154999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 108390362 *******************************************************************--> 0.8372 0.5251
--       2-     2  12751429 ********************************************************************** 0.9357 0.6487
--       3-     4   4380637 ************************                                               0.9600 0.6945
--       5-     7   1246754 ******                                                                 0.9744 0.7334
--       8-    11    601027 ***                                                                    0.9806 0.7599
--      12-    16    646517 ***                                                                    0.9848 0.7872
--      17-    22    727511 ***                                                                    0.9898 0.8346
--      23-    29    492906 **                                                                     0.9952 0.9031
--      30-    37    170817                                                                        0.9985 0.9585
--      38-    46     34454                                                                        0.9996 0.9813
--      47-    56      9032                                                                        0.9998 0.9871
--      57-    67      4055                                                                        0.9999 0.9891
--      68-    79      2503                                                                        0.9999 0.9902
--      80-    92      1852                                                                        1.0000 0.9911
--      93-   106      1101                                                                        1.0000 0.9919
--     107-   121       757                                                                        1.0000 0.9924
--     122-   137       490                                                                        1.0000 0.9928
--     138-   154       352                                                                        1.0000 0.9931
--     155-   172       260                                                                        1.0000 0.9933
--     173-   191       201                                                                        1.0000 0.9935
--     192-   211       157                                                                        1.0000 0.9937
--     212-   232       110                                                                        1.0000 0.9938
--     233-   254       104                                                                        1.0000 0.9940
--     255-   277        74                                                                        1.0000 0.9941
--     278-   301        81                                                                        1.0000 0.9942
--     302-   326        80                                                                        1.0000 0.9943
--     327-   352        78                                                                        1.0000 0.9944
--     353-   379        81                                                                        1.0000 0.9945
--     380-   407        65                                                                        1.0000 0.9947
--     408-   436        46                                                                        1.0000 0.9948
--     437-   466        47                                                                        1.0000 0.9949
--     467-   497        40                                                                        1.0000 0.9950
--     498-   529        43                                                                        1.0000 0.9951
--     530-   562        37                                                                        1.0000 0.9952
--     563-   596        42                                                                        1.0000 0.9953
--     597-   631        28                                                                        1.0000 0.9954
--     632-   667        19                                                                        1.0000 0.9955
--     668-   704        16                                                                        1.0000 0.9956
--     705-   742        10                                                                        1.0000 0.9956
--     743-   781        18                                                                        1.0000 0.9957
--     782-   821        12                                                                        1.0000 0.9957
--
--       24559 (max occurrences)
--    98022132 (total mers, non-unique)
--    21074094 (distinct mers, non-unique)
--   108390362 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   11558 reads longer than 6624 bp
--   110220812 bp
-- Expected corrected reads:
--   11558 reads
--   99601872 bp
--   4332 bp minimum length
--   8618 bp mean length
--   31857 bp n50 length

[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
--   Found 12243 reads.
--   Found 100849920 bases (40.5 times coverage).
--
--   Read length histogram (one '*' equals 32.05 reads):
--        0    999      0 
--     1000   1999    135 ****
--     2000   2999    120 ***
--     3000   3999    140 ****
--     4000   4999   2015 **************************************************************
--     5000   5999   2244 **********************************************************************
--     6000   6999   1709 *****************************************************
--     7000   7999   1280 ***************************************
--     8000   8999    995 *******************************
--     9000   9999    746 ***********************
--    10000  10999    545 *****************
--    11000  11999    427 *************
--    12000  12999    360 ***********
--    13000  13999    291 *********
--    14000  14999    220 ******
--    15000  15999    183 *****
--    16000  16999    163 *****
--    17000  17999    135 ****
--    18000  18999    120 ***
--    19000  19999     99 ***
--    20000  20999     57 *
--    21000  21999     63 *
--    22000  22999     27 
--    23000  23999     30 
--    24000  24999     17 
--    25000  25999     24 
--    26000  26999     14 
--    27000  27999     14 
--    28000  28999     14 
--    29000  29999     12 
--    30000  30999     12 
--    31000  31999      4 
--    32000  32999      3 
--    33000  33999      1 
--    34000  34999      4 
--    35000  35999      6 
--    36000  36999      1 
--    37000  37999      2 
--    38000  38999      2 
--    39000  39999      4 
--    40000  40999      1 
--    41000  41999      1 
--    42000  42999      1 
--    43000  43999      0 
--    44000  44999      0 
--    45000  45999      1 
--    46000  46999      0 
--    47000  47999      0 
--    48000  48999      0 
--    49000  49999      0 
--    50000  50999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   6583745 *******************************************************************--> 0.6023 0.0654
--       2-     2    808815 ********************************************************************** 0.6763 0.0815
--       3-     4    535948 **********************************************                         0.7073 0.0916
--       5-     7    310275 **************************                                             0.7376 0.1062
--       8-    11    211404 ******************                                                     0.7596 0.1225
--      12-    16    179937 ***************                                                        0.7766 0.1417
--      17-    22    212949 ******************                                                     0.7925 0.1674
--      23-    29    370652 ********************************                                       0.8129 0.2133
--      30-    37    657793 ********************************************************               0.8491 0.3206
--      38-    46    724870 **************************************************************         0.9117 0.5563
--      47-    56    282545 ************************                                               0.9741 0.8450
--      57-    67     35359 ***                                                                    0.9960 0.9665
--      68-    79      4750                                                                        0.9985 0.9830
--      80-    92      3853                                                                        0.9989 0.9864
--      93-   106      3592                                                                        0.9992 0.9895
--     107-   121      1080                                                                        0.9996 0.9930
--     122-   137      1367                                                                        0.9997 0.9941
--     138-   154      1693                                                                        0.9998 0.9959
--     155-   172       192                                                                        0.9999 0.9983
--     173-   191       120                                                                        0.9999 0.9985
--     192-   211       213                                                                        1.0000 0.9988
--     212-   232       156                                                                        1.0000 0.9992
--     233-   254        19                                                                        1.0000 0.9995
--     255-   277        12                                                                        1.0000 0.9996
--     278-   301         7                                                                        1.0000 0.9996
--     302-   326         5                                                                        1.0000 0.9996
--     327-   352        18                                                                        1.0000 0.9996
--     353-   379         0                                                                        0.0000 0.0000
--     380-   407         0                                                                        0.0000 0.0000
--     408-   436         0                                                                        0.0000 0.0000
--     437-   466         0                                                                        0.0000 0.0000
--     467-   497         2                                                                        1.0000 0.9997
--     498-   529         0                                                                        0.0000 0.0000
--     530-   562         1                                                                        1.0000 0.9997
--     563-   596         2                                                                        1.0000 0.9997
--     597-   631         7                                                                        1.0000 0.9997
--     632-   667         2                                                                        1.0000 0.9998
--     668-   704         0                                                                        0.0000 0.0000
--     705-   742         0                                                                        0.0000 0.0000
--     743-   781         0                                                                        0.0000 0.0000
--     782-   821         0                                                                        0.0000 0.0000
--
--        1753 (max occurrences)
--    94009072 (total mers, non-unique)
--     4347659 (distinct mers, non-unique)
--     6583745 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--  
--  INPUT READS:
--  -----------
--   12243 reads    100849920 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--    9650 reads     76407005 bases (trimmed reads output)
--    2579 reads     18653267 bases (reads with no change, kept as is)
--      11 reads        29427 bases (reads with no overlaps, deleted)
--       3 reads        11169 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--    4937 reads      2661208 bases (bases trimmed from the 5' end of a read)
--    7881 reads      3087844 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.1440    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--   12229 reads    100809324 bases (reads processed)
--      14 reads        40596 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--       0 reads            0 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--   12229 reads    100809324 bases (processed for subreads)
--  
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--      94 reads           94 signals (number of subread signal)
--  
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--      94 reads        28870 bases (size of subread signal)
--  
--  TRIMMING:
--  --------
--      45 reads       253562 bases (trimmed from the 5' end of the read)
--      49 reads       282368 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/oxk_reanalyze_filt_phageHK630_cov40.gkpStore':
--   Found 12229 reads.
--   Found 94524342 bases (37.96 times coverage).
--
--   Read length histogram (one '*' equals 32.95 reads):
--        0    999      0 
--     1000   1999    134 ****
--     2000   2999    147 ****
--     3000   3999    232 *******
--     4000   4999   2141 ****************************************************************
--     5000   5999   2307 **********************************************************************
--     6000   6999   1776 *****************************************************
--     7000   7999   1316 ***************************************
--     8000   8999   1004 ******************************
--     9000   9999    729 **********************
--    10000  10999    529 ****************
--    11000  11999    429 *************
--    12000  12999    322 *********
--    13000  13999    259 *******
--    14000  14999    182 *****
--    15000  15999    155 ****
--    16000  16999    137 ****
--    17000  17999    105 ***
--    18000  18999    103 ***
--    19000  19999     73 **
--    20000  20999     43 *
--    21000  21999     42 *
--    22000  22999      9 
--    23000  23999     13 
--    24000  24999      6 
--    25000  25999      8 
--    26000  26999      6 
--    27000  27999      7 
--    28000  28999      5 
--    29000  29999      2 
--    30000  30999      2 
--    31000  31999      3 
--    32000  32999      1 
--    33000  33999      0 
--    34000  34999      0 
--    35000  35999      1 
--    36000  36999      0 
--    37000  37999      1

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1   5634635 *******************************************************************--> 0.5742 0.0598
--       2-     2    734273 *********************************************************************  0.6490 0.0754
--       3-     4    494517 **********************************************                         0.6807 0.0853
--       5-     7    292137 ***************************                                            0.7122 0.0997
--       8-    11    204997 *******************                                                    0.7353 0.1162
--      12-    16    182673 *****************                                                      0.7540 0.1364
--      17-    22    235910 **********************                                                 0.7722 0.1648
--      23-    29    431399 ****************************************                               0.7977 0.2196
--      30-    37    740960 ********************************************************************** 0.8452 0.3547
--      38-    46    644939 ************************************************************           0.9218 0.6305
--      47-    56    186371 *****************                                                      0.9815 0.8932
--      57-    67     15852 *                                                                      0.9973 0.9769
--      68-    79      4920                                                                        0.9985 0.9845
--      80-    92      3868                                                                        0.9990 0.9884
--      93-   106      2015                                                                        0.9994 0.9919
--     107-   121      1221                                                                        0.9996 0.9938
--     122-   137      1886                                                                        0.9997 0.9953
--     138-   154       469                                                                        0.9999 0.9979
--     155-   172       119                                                                        0.9999 0.9985
--     173-   191       169                                                                        1.0000 0.9987
--     192-   211       218                                                                        1.0000 0.9991
--     212-   232        24                                                                        1.0000 0.9995
--     233-   254         6                                                                        1.0000 0.9996
--     255-   277        13                                                                        1.0000 0.9996
--     278-   301         6                                                                        1.0000 0.9996
--     302-   326        18                                                                        1.0000 0.9996
--     327-   352         1                                                                        1.0000 0.9997
--     353-   379         0                                                                        0.0000 0.0000
--     380-   407         0                                                                        0.0000 0.0000
--     408-   436         0                                                                        0.0000 0.0000
--     437-   466         1                                                                        1.0000 0.9997
--     467-   497         1                                                                        1.0000 0.9997
--     498-   529         0                                                                        0.0000 0.0000
--     530-   562         3                                                                        1.0000 0.9997
--     563-   596         6                                                                        1.0000 0.9997
--     597-   631         2                                                                        1.0000 0.9998
--     632-   667         1                                                                        1.0000 0.9998
--     668-   704         0                                                                        0.0000 0.0000
--     705-   742         0                                                                        0.0000 0.0000
--     743-   781         0                                                                        0.0000 0.0000
--     782-   821         0                                                                        0.0000 0.0000
--
--        1147 (max occurrences)
--    88632898 (total mers, non-unique)
--     4179016 (distinct mers, non-unique)
--     5634635 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing          1    0.01     1630.00 +- 0.00           593.00 +- 0.00       (bad trimming)
--   middle-hump             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (bad trimming)
--   no-5-prime              1    0.01     1032.00 +- 0.00           444.00 +- 0.00       (bad trimming)
--   no-3-prime              1    0.01     1740.00 +- 0.00          1116.00 +- 0.00       (bad trimming)
--   
--   low-coverage            2    0.02     1107.50 +- 68.59            2.46 +- 0.52       (easy to assemble, potential for lower quality consensus)
--   unique               9071   74.18     7504.22 +- 3648.41         36.19 +- 6.66       (easy to assemble, perfect, yay)
--   repeat-cont           101    0.83     5265.41 +- 1695.23         71.87 +- 21.11      (potential for consensus errors, no impact on assembly)
--   repeat-dove             0    0.00        0.00 +- 0.00             0.00 +- 0.00       (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat          2158   17.65     8714.47 +- 4229.00       2602.94 +- 2587.29    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont      725    5.93     7352.06 +- 3047.25                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove       46    0.38    15586.87 +- 5505.69                             (will end contigs, potential to misassemble)
--   uniq-anchor           123    1.01     8634.78 +- 3405.03       2348.91 +- 2830.11    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      31 sequences, total length 2869593 bp (including 1 repeats of total length 37069 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  1809 sequences, total length 14514783 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1264266             1     1264266
--     20     1264266             1     1264266
--     30     1264266             1     1264266
--     40     1264266             1     1264266
--     50     1264266             1     1264266
--     60      526253             2     1790519
--     70      526253             2     1790519
--     80      328024             3     2118543
--     90       59023             5     2289615
--    100       32956            10     2496209
--    110       17154            21     2749093
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      31 sequences, total length 2871943 bp (including 1 repeats of total length 37031 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  1809 sequences, total length 14515081 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1263438             1     1263438
--     20     1263438             1     1263438
--     30     1263438             1     1263438
--     40     1263438             1     1263438
--     50     1263438             1     1263438
--     60      526566             2     1790004
--     70      526566             2     1790004
--     80      329130             3     2119134
--     90       59182             5     2290853
--    100       32902            10     2498423
--    110       17189            21     2752071
--

skoren commented 6 years ago

I think the issue is that there is heterozygosity in the sample: the default error rate, coupled with Canu being conservative when it sees sample variation, is splitting the assembly. Can you share the log files (asm.*) in unitigging/4-unitigger? Have you tried my suggestion to increase the error rate for assembly as well?

ml3958 commented 6 years ago

Definitely! I did not find any asm.* files under unitigging/4-unitigger. These are all the files in the folder. Which ones should I share?

alignGFA.sh                                                                oxk_reanalyze_filt_phageHK630_cov40.best.singletons
oxk_reanalyze_filt_phageHK630_cov40.001.filterOverlaps.thr000.num000.log   oxk_reanalyze_filt_phageHK630_cov40.best.spurs
oxk_reanalyze_filt_phageHK630_cov40.003.buildGreedy.sizes                  oxk_reanalyze_filt_phageHK630_cov40.contigs.aligned.gfa
oxk_reanalyze_filt_phageHK630_cov40.004.placeContains.sizes                oxk_reanalyze_filt_phageHK630_cov40.contigs.aligned.gfa.err
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.sizes                 oxk_reanalyze_filt_phageHK630_cov40.contigs.gfa
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.thr000.num000.log     oxk_reanalyze_filt_phageHK630_cov40.final.assembly.gfa
oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.unassembled           oxk_reanalyze_filt_phageHK630_cov40.initial.assembly.gfa
oxk_reanalyze_filt_phageHK630_cov40.007.breakRepeats.sizes                 oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.bed
oxk_reanalyze_filt_phageHK630_cov40.007.breakRepeats.thr000.num000.log     oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.bed.err
oxk_reanalyze_filt_phageHK630_cov40.008.cleanupMistakes.thr000.num000.log  oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.gfa
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.overlaps           oxk_reanalyze_filt_phageHK630_cov40.unitigs.aligned.gfa.err
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.sizes              oxk_reanalyze_filt_phageHK630_cov40.unitigs.bed
oxk_reanalyze_filt_phageHK630_cov40.009.generateOutputs.thr000.num000.log  oxk_reanalyze_filt_phageHK630_cov40.unitigs.gfa
oxk_reanalyze_filt_phageHK630_cov40.011.generateUnitigs.thr000.num000.log  unitigger.1.out
oxk_reanalyze_filt_phageHK630_cov40.best.contains.histogram                unitigger.err
oxk_reanalyze_filt_phageHK630_cov40.best.edges                             unitigger.jobSubmit-01.out
oxk_reanalyze_filt_phageHK630_cov40.best.edges.gfa                         unitigger.jobSubmit-01.sh
oxk_reanalyze_filt_phageHK630_cov40.best.edges.histogram                   unitigger.sh
oxk_reanalyze_filt_phageHK630_cov40.best.edges.suspicious                  unitigger.success

Hi, I tried the smashing parameters corOutCoverage=100 overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 500 -cp 50' and it did not work.

I haven't tried just increasing the error rate to 0.25, but I can definitely try it now.

skoren commented 6 years ago

The files named oxk_reanalyze_filt_phageHK630_cov40.* would work.

ml3958 commented 6 years ago

Archive.zip

Many thanks!

skoren commented 6 years ago

Yes, definitely looks like some variation is preventing this from being circularized/causing the splits. Initially, the contig is the single chromosome:

cat oxk_reanalyze_filt_phageHK630_cov40.005.mergeOrphans.sizes
CONTIGS (23 tigs) (2828698 length) (122986 average) (1.14x coverage)
ng010   2153084   lg010        1   sum     2153084  (CONTIGS)
ng020   2153084   lg020        1   sum     2153084  (CONTIGS)
ng030   2153084   lg030        1   sum     2153084  (CONTIGS)
ng040   2153084   lg040        1   sum     2153084  (CONTIGS)
ng050   2153084   lg050        1   sum     2153084  (CONTIGS)
ng060   2153084   lg060        1   sum     2153084  (CONTIGS)
ng070   2153084   lg070        1   sum     2153084  (CONTIGS)
ng080   2153084   lg080        1   sum     2153084  (CONTIGS)
ng090    112049   lg090        2   sum     2265133  (CONTIGS)
ng100     36426   lg100        7   sum     2495417  (CONTIGS)
ng110     21013   lg110       17   sum     2751967  (CONTIGS)

but is then split because it has poor support and conflicting evidence. If you want to get that initial contig, run with canu -assemble -nanopore-corrected <your asm folder>/*.trimmedReads.fastq.gz overlapper=mhap utgReAlign=true correctedErrorRate=0.20 'batOptions=-dg 50 -db 50 -dr 1 -ca 0 -cp 0'
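
To see at which bogart stage the large contig gets split, one option is to compare the NG50 line across the numbered *.sizes files in unitigging/4-unitigger. This is only a rough sketch: the function name ng50_by_stage is made up for illustration, and it assumes every *.sizes file uses the same line layout as the mergeOrphans excerpt above (ng050 in the first column, the length in the second, "(CONTIGS)" last).

```shell
#!/bin/sh
# Print the NG50 contig size recorded in each *.sizes file of a directory.
# The glob sorts by name, so files sharing a prefix come out in stage order
# (003, 004, 005, 007, 009); a drop between two stages marks the split.
# Assumed line layout (taken from the excerpt above):
#   ng050   2153084   lg050        1   sum     2153084  (CONTIGS)
ng50_by_stage() {
    dir=$1
    for f in "$dir"/*.sizes; do
        [ -e "$f" ] || continue   # no *.sizes files: glob stayed literal
        # Take the first ng050 line tagged (CONTIGS); field 2 is the length.
        awk -v name="$(basename "$f")" \
            '$1 == "ng050" && $NF == "(CONTIGS)" { print name "\t" $2; exit }' "$f"
    done
}

# Usage: ng50_by_stage unitigging/4-unitigger
```

Each output line is the file name, a tab, and the NG50 length, so the stage where the number falls is where the contig was broken.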

ml3958 commented 6 years ago

Thanks for the explanation!!

I tried your suggestion

/ifs/home/lium14/tools/canu-1.6/*/bin/canu \
    -assemble \
    -p $JOB_NAME_continous_contig \
    -d $output/1_assemble/$JOB_NAME_continous_contig \
    genomeSize=2.49m \
    overlapper=mhap utgReAlign=true \
    correctedErrorRate=0.20 \
    batOptions='-dg 50 -db 50 -dr 1 -ca 0 -cp 0'
    -nanopore-corrected $output/1_assemble/$JOB_NAME/$JOB_NAME.trimmedReads.fasta.gz 

But I keep getting this error:

ERROR:  File supplied on command line; use -s, -pacbio-raw, -pacbio-corrected, -nanopore-raw, or -nanopore-corrected.

I did provide the correct nanopore-corrected reads...

skoren commented 6 years ago

I'd guess it's getting confused by the directory being inside the previous run. Try renaming the trimmed reads to be something else and putting the -d folder outside the previous run.
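
A dry-run sketch of what that could look like is below. It only prints the command so it can be checked before running. The function name print_canu_rerun and the example paths are placeholders, not anything from the actual run. Two shell details in the command as posted may also be worth checking, whatever the directory layout: $JOB_NAME_continous_contig is read by the shell as a single variable with that whole name (so it expands to nothing unless that exact variable is set), and the batOptions line has no trailing backslash, so -nanopore-corrected is not passed to canu.

```shell
#!/bin/sh
# Dry run: print (do not execute) a canu -assemble command whose reads file
# and -d directory both live outside the earlier run. Paths are hypothetical.
print_canu_rerun() {
    job=$1      # name of the earlier run (the value of $JOB_NAME)
    reads=$2    # renamed copy of the trimmed reads, outside the old run
    outdir=$3   # new output directory, outside the old run
    # Braces are required: $job_continous_contig would be parsed as one
    # (unset) variable name rather than $job followed by a suffix.
    prefix="${job}_continous_contig"
    # Each "\\" prints as one backslash, so every line but the last
    # ends with a line continuation.
    cat <<EOF
canu -assemble \\
    -p $prefix \\
    -d $outdir/$prefix \\
    genomeSize=2.49m \\
    overlapper=mhap utgReAlign=true \\
    correctedErrorRate=0.20 \\
    batOptions='-dg 50 -db 50 -dr 1 -ca 0 -cp 0' \\
    -nanopore-corrected $reads
EOF
}

# Usage: print_canu_rerun "$JOB_NAME" /some/other/dir/reads_renamed.fasta.gz /some/other/dir
```

The printed text can be pasted into a job script once it looks right. Paths containing spaces would need extra quoting, which this sketch does not handle.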

ml3958 commented 6 years ago

Thanks. It worked.