marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Very low fraction of corrected reads #584

Closed. janvanoeveren closed this issue 6 years ago.

janvanoeveren commented 7 years ago

I've run canu-1.5 for a huge genome with (only) 25x PacBio data on a grid:

./canu-1.5/Linux-amd64/bin/canu -correct \
    genomeSize=huge(>10g) corErrorRate=0.02 \
    batThreads=64 batMemory=250 maxThreads=128 maxMemory=1024 \
    gridEngineThreadsOption="-pe smp THREADS" \
    gridOptions="-r y -V -p -500" \
    gridEngineMemoryOption="-l mem_free=MEMORY" \
    -pacbio-raw

and initially got no corrected reads because of the corErrorRate=0.02 setting. I then changed the options to

rawErrorRate=0.3 correctedErrorRate=0.03

removed the 2-correction folder and asm.correctedReads.fasta.gz and restarted the pipeline.

It now finished, but yielded only a 2x corrected read set. The input 25x was expected to reduce to ~8x (why?), but actually reduced to 3x (corrected) and 2x (trimmed). The report is given below. Which parameters should I tune to get more corrected reads for the assembly step? Maybe corMinCoverage=0? Or should I redo the pipeline from scratch, since some steps could have been messed up by the corErrorRate setting? (The correction step is taking a lot of time; maybe I should also adjust maxCount?)

janvanoeveren commented 7 years ago
--
-- In gatekeeper store 'correction/S245.gkpStore':
--   Found 38047697 reads.
--   Found 403,673,836,468 bases (25.22 times coverage).
--
--   Read length histogram (one '*' equals 152629.08 reads):
--        0   4999 10684036 **********************************************************************
--     5000   9999 10039231 *****************************************************************
--    10000  14999 6893564 *********************************************
--    15000  19999 5896892 **************************************
--    20000  24999 2988107 *******************
--    25000  29999 1039064 ******
--    30000  34999 344323 **
--    35000  39999 114081 
--    40000  44999  34098 
--    45000  49999  10092 
--    50000  54999   2948 
--    55000  59999    794 
--    60000  64999    228 
--    65000  69999     98 
--    70000  74999     53 
--    75000  79999     34 
--    80000  84999     18 
--    85000  89999     10 
--    90000  94999      9 
--    95000  99999      3 
--   100000 104999      4 
--   105000 109999      4 
--   110000 114999      1 
--   115000 119999      0 
--   120000 124999      0 
--   125000 129999      2 
--   130000 134999      1 
--   135000 139999      0 
--   140000 144999      0 
--   145000 149999      1 
--   150000 154999      0 
--   155000 159999      0 
--   160000 164999      0 
--   165000 169999      0 
--   170000 174999      0 
--   175000 179999      0 
--   180000 184999      0 
--   185000 189999      0 
--   190000 194999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1     13502                                                                        0.0000 0.0000
--       2-     2     58981                                                                        0.0000 0.0000
--       3-     4    612639                                                                        0.0001 0.0000
--       5-     7   4916517 **                                                                     0.0007 0.0001
--       8-    11  23369926 *********                                                              0.0043 0.0002
--      12-    16  63849014 **************************                                             0.0180 0.0009
--      17-    22 112864916 **********************************************                         0.0510 0.0036
--      23-    29 150070237 *************************************************************          0.1055 0.0096
--      30-    37 168375992 *********************************************************************  0.1757 0.0197
--      38-    46 170742609 ********************************************************************** 0.2534 0.0340
--      47-    56 162947190 ******************************************************************     0.3317 0.0519
--      57-    67 149860816 *************************************************************          0.4064 0.0727
--      68-    79 134602847 *******************************************************                0.4750 0.0956
--      80-    92 119042845 ************************************************                       0.5366 0.1200
--      93-   106 104304966 ******************************************                             0.5912 0.1453
--     107-   121  90937395 *************************************                                  0.6390 0.1708
--     122-   137  79114549 ********************************                                       0.6808 0.1964
--     138-   154  68754925 ****************************                                           0.7172 0.2216
--     155-   172  59746951 ************************                                               0.7488 0.2463
--     173-   191  51966684 *********************                                                  0.7763 0.2704
--     192-   211  45239359 ******************                                                     0.8003 0.2937
--     212-   232  39485057 ****************                                                       0.8212 0.3162
--     233-   254  34565616 **************                                                         0.8394 0.3378
--     255-   277  30344262 ************                                                           0.8553 0.3586
--     278-   301  26733489 **********                                                             0.8694 0.3785
--     302-   326  23638409 *********                                                              0.8817 0.3976
--     327-   352  20967790 ********                                                               0.8927 0.4160
--     353-   379  18630927 *******                                                                0.9024 0.4335
--     380-   407  16595182 ******                                                                 0.9110 0.4504
--     408-   436  14817086 ******                                                                 0.9187 0.4665
--     437-   466  13263561 *****                                                                  0.9256 0.4820
--     467-   497  11896250 ****                                                                   0.9317 0.4968
--     498-   529  10698137 ****                                                                   0.9372 0.5110
--     530-   562   9639556 ***                                                                    0.9422 0.5246
--     563-   596   8710691 ***                                                                    0.9466 0.5376
--     597-   631   7880595 ***                                                                    0.9507 0.5501
--     632-   667   7145741 **                                                                     0.9543 0.5620
--     668-   704   6487045 **                                                                     0.9577 0.5735
--     705-   742   5904586 **                                                                     0.9607 0.5845
--     743-   781   5384402 **                                                                     0.9634 0.5951
--     782-   821   4915830 **                                                                     0.9659 0.6053
--
--   184027607 (max occurrences)
-- 403103107511 (total mers, non-unique)
--  2147501366 (distinct mers, non-unique)
--       13502 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   37535364 reads longer than 0 bp
--   397623746990 bp
-- Expected corrected reads:
--   37535364 reads
--   132550121930 bp
--   0 bp minimum length
--   3531 bp mean length
--   28841 bp n50 length
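The ~8x expectation can be reproduced from the two reports above: the gatekeeper store gives the genome size implicitly (total bases divided by coverage), and dividing the expected corrected bases by that genome size gives roughly 8x. A minimal arithmetic sketch, with all numbers copied from the logs above:

```python
# Back-of-the-envelope check of canu's expected corrected coverage,
# using only numbers reported in the logs above.
raw_bases = 403_673_836_468                 # "Found ... bases" in the gatekeeper store
raw_coverage = 25.22                        # "... times coverage"
expected_corrected_bases = 132_550_121_930  # "Expected corrected reads" bp

genome_size = raw_bases / raw_coverage                      # ~16.0 Gbp
expected_coverage = expected_corrected_bases / genome_size  # ~8.3x
```

This matches the ~8x figure the pipeline predicts for the corrected read set.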

janvanoeveren commented 7 years ago

--
-- In gatekeeper store 'trimming/S245.gkpStore':
--   Found 5837622 reads.
--   Found 50384133956 bases (3.14 times coverage).
--
--   Read length histogram (one '*' equals 7793.08 reads):
--        0    999      0 
--     1000   1999 545516 **********************************************************************
--     2000   2999 492982 ***************************************************************
--     3000   3999 465366 ***********************************************************
--     4000   4999 445631 *********************************************************
--     5000   5999 428239 ******************************************************
--     6000   6999 408010 ****************************************************
--     7000   7999 379438 ************************************************
--     8000   8999 345839 ********************************************
--     9000   9999 308919 ***************************************
--    10000  10999 272736 **********************************
--    11000  11999 239700 ******************************
--    12000  12999 212267 ***************************
--    13000  13999 190739 ************************
--    14000  14999 174986 **********************
--    15000  15999 162431 ********************
--    16000  16999 150965 *******************
--    17000  17999 133458 *****************
--    18000  18999 112792 **************
--    19000  19999  91562 ***********
--    20000  20999  71094 *********
--    21000  21999  53768 ******
--    22000  22999  40438 *****
--    23000  23999  29682 ***
--    24000  24999  21979 **
--    25000  25999  16258 **
--    26000  26999  11685 *
--    27000  27999   8455 *
--    28000  28999   6124 
--    29000  29999   4477 
--    30000  30999   3306 
--    31000  31999   2404 
--    32000  32999   1730 
--    33000  33999   1289 
--    34000  34999    982 
--    35000  35999    657 
--    36000  36999    517 
--    37000  37999    359 
--    38000  38999    245 
--    39000  39999    165 
--    40000  40999    125 
--    41000  41999     98 
--    42000  42999     52 
--    43000  43999     51 
--    44000  44999     31 
--    45000  45999     20 
--    46000  46999     23 
--    47000  47999     11 
--    48000  48999      8 
--    49000  49999      1 
--    50000  50999      3 
--    51000  51999      3 
--    52000  52999      2 
--    53000  53999      0 
--    54000  54999      2 
--    55000  55999      0 
--    56000  56999      0 
--    57000  57999      0 
--    58000  58999      0 
--    59000  59999      1 
--    60000  60999      0 
--    61000  61999      0 
--    62000  62999      0 
--    63000  63999      0 
--    64000  64999      0 
--    65000  65999      0 
--    66000  66999      0 
--    67000  67999      0 
--    68000  68999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 6924284039 *******************************************************************--> 0.5866 0.1378
--       2-     2 1249562859 ****************************************************************       0.6925 0.1875
--       3-     4 1363190979 ********************************************************************** 0.7575 0.2333
--       5-     7 1160121650 ***********************************************************            0.8487 0.3285
--       8-    11 612974142 *******************************                                        0.9255 0.4516
--      12-    16 218190666 ***********                                                            0.9640 0.5434
--      17-    22  91916905 ****                                                                   0.9785 0.5929
--      23-    29  52253673 **                                                                     0.9853 0.6250
--      30-    37  33406215 *                                                                      0.9894 0.6504
--      38-    46  22667880 *                                                                      0.9920 0.6716
--      47-    56  16004912                                                                        0.9938 0.6899
--      57-    67  11675011                                                                        0.9951 0.7058
--      68-    79   8719891                                                                        0.9961 0.7198
--      80-    92   6612294                                                                        0.9968 0.7323
--      93-   106   5072105                                                                        0.9973 0.7434
--     107-   121   3956379                                                                        0.9978 0.7533
--     122-   137   3144889                                                                        0.9981 0.7621
--     138-   154   2557903                                                                        0.9983 0.7701
--     155-   172   2093773                                                                        0.9986 0.7775
--     173-   191   1731618                                                                        0.9987 0.7842
--     192-   211   1452283                                                                        0.9989 0.7904
--     212-   232   1223406                                                                        0.9990 0.7962
--     233-   254   1047301                                                                        0.9991 0.8016
--     255-   277    905209                                                                        0.9992 0.8066
--     278-   301    785632                                                                        0.9993 0.8114
--     302-   326    684016                                                                        0.9993 0.8159
--     327-   352    597028                                                                        0.9994 0.8201
--     353-   379    525739                                                                        0.9994 0.8241
--     380-   407    465789                                                                        0.9995 0.8279
--     408-   436    413197                                                                        0.9995 0.8316
--     437-   466    370323                                                                        0.9996 0.8350
--     467-   497    333767                                                                        0.9996 0.8383
--     498-   529    300401                                                                        0.9996 0.8415
--     530-   562    277376                                                                        0.9996 0.8446
--     563-   596    255442                                                                        0.9997 0.8476
--     597-   631    236696                                                                        0.9997 0.8505
--     632-   667    217767                                                                        0.9997 0.8534
--     668-   704    200751                                                                        0.9997 0.8562
--     705-   742    192727                                                                        0.9997 0.8590
--     743-   781    187656                                                                        0.9998 0.8617
--     782-   821    177915                                                                        0.9998 0.8646
--
--     4489071 (max occurrences)
-- 43337259840 (total mers, non-unique)
--  4879144506 (distinct mers, non-unique)
--  6924284039 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0300    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--  
--  INPUT READS:
--  -----------
--  5837622 reads  50384133956 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--  4726414 reads  33664301899 bases (trimmed reads output)
--  228287 reads   1810560017 bases (reads with no change, kept as is)
--  418017 reads   1544988989 bases (reads with no overlaps, deleted)
--  464904 reads   2372378883 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--  4151395 reads   5873957073 bases (bases trimmed from the 5' end of a read)
--  4346958 reads   5117947095 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0300    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--  4954701 reads  46466766084 bases (reads processed)
--  882921 reads   3917367872 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--   80265 reads    556668360 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--  4874436 reads  45910097724 bases (processed for subreads)
--  
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--    1097 reads         1106 signals (number of subread signal)
--  
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--    1106 reads       430611 bases (size of subread signal)
--  
--  TRIMMING:
--  --------
--     611 reads      2287734 bases (trimmed from the 5' end of the read)
--     487 reads      1831761 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/S245.gkpStore':
--   Found 4954696 reads.
--   Found 35470738030 bases (2.21 times coverage).
--
--   Read length histogram (one '*' equals 11499.15 reads):
--        0    999      0 
--     1000   1999 804941 **********************************************************************
--     2000   2999 571379 *************************************************
--     3000   3999 469253 ****************************************
--     4000   4999 412759 ***********************************
--     5000   5999 367490 *******************************
--     6000   6999 326007 ****************************
--     7000   7999 283755 ************************
--     8000   8999 246145 *********************
--     9000   9999 213307 ******************
--    10000  10999 184946 ****************
--    11000  11999 159770 *************
--    12000  12999 139026 ************
--    13000  13999 121889 **********
--    14000  14999 109927 *********
--    15000  15999 100375 ********
--    16000  16999  90883 *******
--    17000  17999  79396 ******
--    18000  18999  65755 *****
--    19000  19999  52790 ****
--    20000  20999  40967 ***
--    21000  21999  30431 **
--    22000  22999  23025 **
--    23000  23999  16618 *
--    24000  24999  12340 *
--    25000  25999   9021 
--    26000  26999   6331 
--    27000  27999   4607 
--    28000  28999   3223 
--    29000  29999   2312 
--    30000  30999   1662 
--    31000  31999   1232 
--    32000  32999    889 
--    33000  33999    669 
--    34000  34999    477 
--    35000  35999    326 
--    36000  36999    250 
--    37000  37999    165 
--    38000  38999    102 
--    39000  39999     75 
--    40000  40999     60 
--    41000  41999     39 
--    42000  42999     28 
--    43000  43999     13 
--    44000  44999     11 
--    45000  45999     13 
--    46000  46999      5 
--    47000  47999      6 
--    48000  48999      2 
--    49000  49999      1 
--    50000  50999      1 
--    51000  51999      2

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 3264009111 *******************************************************************--> 0.4672 0.0923
--       2-     2 828616052 *******************************************************                0.5858 0.1391
--       3-     4 1054097255 ********************************************************************** 0.6685 0.1881
--       5-     7 960518507 ***************************************************************        0.7932 0.2978
--       8-    11 507767805 *********************************                                      0.9013 0.4436
--      12-    16 172219029 ***********                                                            0.9548 0.5506
--      17-    22  68782627 ****                                                                   0.9739 0.6054
--      23-    29  38248189 **                                                                     0.9824 0.6394
--      30-    37  24091289 *                                                                      0.9874 0.6657
--      38-    46  16202868 *                                                                      0.9906 0.6874
--      47-    56  11313797                                                                        0.9928 0.7059
--      57-    67   8203087                                                                        0.9944 0.7219
--      68-    79   6067241                                                                        0.9955 0.7359
--      80-    92   4576255                                                                        0.9963 0.7482
--      93-   106   3480468                                                                        0.9970 0.7591
--     107-   121   2683573                                                                        0.9975 0.7687
--     122-   137   2142267                                                                        0.9978 0.7773
--     138-   154   1726548                                                                        0.9981 0.7850
--     155-   172   1405949                                                                        0.9984 0.7921
--     173-   191   1167840                                                                        0.9986 0.7985
--     192-   211    975937                                                                        0.9988 0.8044
--     212-   232    821724                                                                        0.9989 0.8100
--     233-   254    704823                                                                        0.9990 0.8151
--     255-   277    606566                                                                        0.9991 0.8199
--     278-   301    524276                                                                        0.9992 0.8244
--     302-   326    457700                                                                        0.9993 0.8287
--     327-   352    396956                                                                        0.9993 0.8327
--     353-   379    346413                                                                        0.9994 0.8365
--     380-   407    305412                                                                        0.9994 0.8401
--     408-   436    271362                                                                        0.9995 0.8435
--     437-   466    244477                                                                        0.9995 0.8467
--     467-   497    217335                                                                        0.9996 0.8498
--     498-   529    194029                                                                        0.9996 0.8528
--     530-   562    179652                                                                        0.9996 0.8556
--     563-   596    163727                                                                        0.9996 0.8584
--     597-   631    153704                                                                        0.9997 0.8610
--     632-   667    141263                                                                        0.9997 0.8637
--     668-   704    134764                                                                        0.9997 0.8663
--     705-   742    132363                                                                        0.9997 0.8689
--     743-   781    134921                                                                        0.9997 0.8716
--     782-   821    127945                                                                        0.9998 0.8745
--
--     3189190 (max occurrences)
-- 32102680291 (total mers, non-unique)
--  3722071131 (distinct mers, non-unique)
--  3264009111 (unique mers)

skoren commented 7 years ago

There should be more information in the Canu logs (number of reads without overlaps, etc.), something like this:

--  READS:
--  -----
--  
--        109013 (no overlaps)
--       4070150 (no overlaps filtered)
--       2393499 (<  50% overlaps filtered)
--       5094030 (<  80% overlaps filtered)
--       9712974 (<  95% overlaps filtered)
--      28492431 (< 100% overlaps filtered)

That would give an idea of whether the sensitivity parameters need to be adjusted and whether corMinCoverage=0 would help. There should also be info on what parameters mhap used:

-- OVERLAPPER (mhap) (correction)
--
--
-- PARAMETERS: hashes=512, minMatches=3, threshold=0.78

The coverage is reduced from 25x to 8x because most of the reads don't have 4-fold coverage (the default minimum) across their full length, so they are being trimmed. corMinCoverage would help with this, but if the reads are still very low quality they could be removed during the trimming step.

Another observation from your logs: the k-mer distribution looks suspicious. You have 25x coverage, yet the peak of the distribution looks to be around 40. I wouldn't expect this to be higher than your coverage, which makes me suspect some contaminant in the data.
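To pin down that peak, the modal bin of the 16-mer table can be read off directly. A minimal sketch, with the counts around the mode copied verbatim from the table above:

```python
# Find the modal bin of the 16-mer histogram. Counts are copied from the
# rows around the peak in the report above (occurrence range -> NumMers).
hist = {
    (23, 29): 150_070_237,
    (30, 37): 168_375_992,
    (38, 46): 170_742_609,
    (47, 56): 162_947_190,
    (57, 67): 149_860_816,
}
peak_bin = max(hist, key=hist.get)  # occurrence range holding the most mers
peak_mid = sum(peak_bin) / 2        # rough center of that bin, ~42
```

The mode falls in the 38-46 bin, consistent with the "around 40" reading of the starred histogram.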

janvanoeveren commented 7 years ago

Thanks

--  OVERLAPS:
--  --------
--
--  IGNORED:
--
--             0 (< 0.0000 fraction error)
--             0 (> 0.4095 fraction error)
--             0 (< 0 bases long)
--             0 (> 2097151 bases long)
--
--  FILTERED:
--
--  836777765067 (too many overlaps, discard these shortest ones)
--
--  EVIDENCE:
--
--    1206913429 (longest overlaps)
--
--  TOTAL:
--
--  837984678496 (all overlaps)
--
--  READS:
--  -----
--
--        512333 (no overlaps)
--      11541183 (no overlaps filtered)
--       3543081 (<  50% overlaps filtered)
--       8287523 (<  80% overlaps filtered)
--      12783144 (<  95% overlaps filtered)
--      25994181 (< 100% overlaps filtered)

and

-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 25.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73

About the 16-mer distribution: I've plotted the data using bins of 1, and this shows that the peak is actually at 28x: [image]

skoren commented 7 years ago

The overlap stats look OK, so corMinCoverage=0 should help. You can do the same as before: remove the 2-correction folder and asm.correctedReads.fasta.gz (along with the trimming/unitigging folders) and see how that changes the corrected reads.

A peak at 28 is still suspicious for a 25x dataset given the high error rate in the reads and the low chance of k-mer survival. I'd expect something like 1/4 of your coverage (e.g. I've seen peaks in the low 30s for 100+x coverage datasets).
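The intuition that error rate suppresses the k-mer peak can be sketched with a crude survival model: a k-mer is observed intact only if all k of its bases are sequenced correctly, giving roughly a (1 - e)^k surviving fraction at per-base error rate e. The 15% error rate below is an illustrative assumption, not a measurement from this dataset, and the model ignores quality variation between reads (which is why real peaks sit somewhat higher than it predicts):

```python
# Crude k-mer survival model: a k-mer is error-free with probability
# (1 - e)**k, so the error-free k-mer coverage is roughly that fraction
# of the read coverage. e = 0.15 is an assumed raw PacBio error rate.
def surviving_kmer_coverage(coverage: float, error_rate: float, k: int) -> float:
    return coverage * (1.0 - error_rate) ** k

naive_peak = surviving_kmer_coverage(25, 0.15, 16)  # well under 2x
```

Under any such model, a 25x raw dataset should peak far below 25, so an observed peak at 28 suggests the peak is dominated by something other than the 25x target genome.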

janvanoeveren commented 7 years ago

Can you confirm that this 16-mer analysis is done on the raw data (i.e. no filtering)? We would expect more unique and low-abundance k-mers ...

Lowering corMinCoverage helps a lot in gaining more corrected reads (an increase from 3x to 9x), but now we face a big drop in the trimming step:


--  INPUT READS:
--  -----------
--  13444996 reads 144872263784 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--  8641707 reads  57,876,361,867 bases (trimmed reads output)
--  414450 reads    3,155,975,421 bases (reads with no change, kept as is)
--  3325894 reads  27,383,155,690 bases (reads with no overlaps, deleted)
--  1062945 reads  15,276,328,504 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--  7680471 reads  23,425,668,345 bases (bases trimmed from the 5' end of a read)
--  8121168 reads  17,754,773,957 bases (bases trimmed from the 3' end of a read)

Is there any way to set additional trim parameters to cope with this (although we understand that low overlap will also mean a small chance of building contigs)?

skoren commented 7 years ago

Yep, the 16-mer counts are done on raw data. I think this points to low-quality data with a small, high-coverage contaminant (like a plastid). That would be consistent with trimming eliminating a lot of the data.

You could increase correctedErrorRate to 0.10; this will slow trimming down significantly but will keep more data. Is your corMinCoverage set to 0 or higher? With corMinCoverage=0, I would expect all your data to be corrected, not just 10x. You could also try multiple correction rounds, as seemed to work in issue #420.

janvanoeveren commented 7 years ago

Thanks. Yes, I used both corMinCoverage=0 and correctedErrorRate=0.10.

skoren commented 7 years ago

That implies more than 50% of the reads don't have any overlaps, even with high sensitivity. What is the summary report for correction overlaps in the run?

janvanoeveren commented 7 years ago

Multiple correction rounds will likely result in less corrected coverage, I would say. So how do we get more coverage?

I would expect all your data to be corrected not just 10x.

So how to achieve this? I could start another analysis from scratch with lowered parameters (minOverlap and minReadLength) and maybe an increased rawErrorRate? It's likely that repeats are abundant. I would say that contamination would show an additional peak in the k-mer plot, which I don't see ...

skoren commented 7 years ago

Multiple correction rounds would recover very low-quality data that's being trimmed now, but won't exceed the coverage you get from the first round (e.g. 10x). Given you're already running with high sensitivity and minimum coverage 0, the only option would be to drop the mer size to maybe 14, but that will increase runtime. That's why I was asking about the overlapping report. Repeats won't matter; they would generate more, not fewer, overlaps. Essentially, the data seems to be only the contaminant (that's the only peak); the majority of the target input data looks more like noise.

To run multiple rounds: run correction with your current parameters, take all the corrected reads plus any input reads that didn't get corrected, feed them back into correction as raw PacBio reads with the same parameters, and repeat this 4-5 times.
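That loop could be scripted. A hypothetical sketch that only constructs the command line for a given round; the helper name and file paths are assumptions (each round's input file would hold the previous round's corrected reads plus the still-uncorrected raw reads), while the canu options themselves are the ones already discussed in this thread:

```python
# Hypothetical helper: build a canu -correct command for one round of
# iterative correction. Input/output paths are illustrative assumptions,
# not files canu produces under these names.
def correction_round_cmd(round_no: int, genome_size: str = "16g") -> list[str]:
    # Previous round's corrected reads + uncorrected raw reads, merged.
    prev = f"round{round_no - 1}.reads.fasta.gz"
    return [
        "canu", "-correct",
        "-p", f"asm.r{round_no}",
        "-d", f"round{round_no}",
        f"genomeSize={genome_size}",
        "corMinCoverage=0",         # keep low-coverage corrections
        "corMhapSensitivity=high",  # already in use per the logs
        "-pacbio-raw", prev,
    ]
```

Each round reuses the same parameters; only the input file changes.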

janvanoeveren commented 7 years ago
-----------------
37535364 (input reads)
397623746990 (input bases)

CORRECTION OUTPUTS:
-----------------
24090368 (reads that failed to generate any corrected bases)
13444996 (corrected read pieces)
144872263784 (corrected bases)

PIECES PER READ:
---------------
0 pieces: 24090368
1 pieces: 13444996
skoren commented 7 years ago

I meant these overlap stats not the post-correction stats:

----------------------------------------
--  PARAMETERS:
--  ----------
--  
--       40 (expected coverage)
--        0 (don't use overlaps shorter than this)
--    0.000 (don't use overlaps with erate less than this)
--    1.000 (don't use overlaps with erate more than this)
--  
--  OVERLAPS:
--  --------
--  
--  IGNORED:
--  
--             0 (< 0.0000 fraction error)
--             0 (> 0.4095 fraction error)
--             0 (< 0 bases long)
--             0 (> 2097151 bases long)
--  
--  FILTERED:
--  
--  409492177815 (too many overlaps, discard these shortest ones)
--  
--  EVIDENCE:
--  
--     644111355 (longest overlaps)
--  
--  TOTAL:
--  
--  410136289170 (all overlaps)
--  
--  READS:
--  -----
--  
--        223194 (no overlaps)
--       1923883 (no overlaps filtered)
--       1354637 (<  50% overlaps filtered)
--       3060707 (<  80% overlaps filtered)
--       5008242 (<  95% overlaps filtered)
--      15059837 (< 100% overlaps filtered)
--  
-- Computing expected corrected read lengths 'correction/2-correction/asm.estimate.log'.
brianwalenz commented 7 years ago

The first bit of logging in here is reporting

-- Reads to be corrected:
--   37535364 reads longer than 0 bp
--   397623746990 bp
-- Expected corrected reads:
--   37535364 reads
--   132550121930 bp

which is saying 2/3 of the bases have no overlap at all.

This bit of logging

--  FILTERED:
--  836777765067 (too many overlaps, discard these shortest ones)
--
--  EVIDENCE:
--    1206913429 (longest overlaps)
--
--  TOTAL:
--  837984678496 (all overlaps)

says it threw out the vast majority of overlaps because they're too repetitive.

The last bit of logging hints that 2/3 of the reads had no (or very few) overlaps.

Instead of what Serge is asking for, generate a histogram from:

overlapStore -G *gkpStore -O *ovlStore -d -counts

This will dump a table of the number of overlaps per read: 'readID numberOfOverlaps'.
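Once that dump exists, one quick way to turn it into the requested histogram (assuming the dump was saved to a file called counts.txt) is:

```shell
# Histogram of overlaps-per-read from the 'readID numberOfOverlaps' dump;
# 'counts.txt' is an assumed filename holding the dump output.
awk '{ h[$2]++ } END { for (n in h) print n, h[n] }' counts.txt | sort -n
```

Each output line is "numberOfOverlaps readsWithThatMany"; the mode of this histogram is a rough proxy for effective coverage.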

janvanoeveren commented 7 years ago

Thanks guys. Here's the log Sergey asked for:

----------

     40 (expected coverage)
      0 (don't use overlaps shorter than this)
  0.000 (don't use overlaps with erate less than this)
  1.000 (don't use overlaps with erate more than this)

OVERLAPS:
--------

IGNORED:

           0 (< 0.0000 fraction error)
           0 (> 0.4095 fraction error)
           0 (< 0 bases long)
           0 (> 2097151 bases long)

FILTERED:

836777765067 (too many overlaps, discard these shortest ones)

EVIDENCE:

  1206913429 (longest overlaps)

TOTAL:

837984678496 (all overlaps)

READS:
-----

      512333 (no overlaps)
    11541183 (no overlaps filtered)
     3543081 (<  50% overlaps filtered)
     8287523 (<  80% overlaps filtered)
    12783144 (<  95% overlaps filtered)
    25994181 (< 100% overlaps filtered)

I cannot find the executable overlapStore (in canu-1.5 or 1.6). And in which directory should I run it? I guess /correction?

janvanoeveren commented 7 years ago

I do see ovStoreStats (but that did not work) and OverlapStore.pm in lib/canu

brianwalenz commented 7 years ago

Whoops, sorry, make that ovStoreDump (overlapStore was the old old old name for the command).

janvanoeveren commented 7 years ago

(image: histogram of overlaps per read)

Number of occurrences for 0 to 300 overlaps; max # overlaps = 1951797 (for one read).

janvanoeveren commented 7 years ago

(image: the same histogram on a log scale)

So: mode at 2 overlaps and a very long tail (up to almost 2 M overlaps for one read). What does this tell us? I guess very repeat-rich content ...

brianwalenz commented 7 years ago

Ah, thanks for the log axis. I was squinting at the other one yesterday to see this.

With 25x coverage, the mode should be somewhere between 25 and 50 or so, not 8. This could mean sequencing failed, the source DNA was incredibly diverged, the genome is waaay larger than you think, ...

Looking at the 22-mer histogram for 'unitigging', there seems to be about 5x coverage in those reads. The max count is very high but comes from only a few distinct mers, implying some super-simple repeats or homopolymer junk is causing the enormous number of overlaps for a few reads.

Have you completed an assembly with the corrected reads you have? Do any of the raw reads map to it? I'm attempting to find a way to tell whether the reads are random sequence, or real sequence that is noisy and/or diverged.
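One hedged way to run that mapping check (minimap2's map-pb preset is a reasonable choice for raw PacBio reads; the filenames are assumptions):

```shell
# Map the raw reads back to the draft assembly, then count how many distinct
# reads mapped at all. In PAF output, column 1 is the query (read) name.
# Filenames below are assumptions; adjust to your run.
#   minimap2 -x map-pb assembly.fasta raw_reads.fasta > aln.paf
awk '!seen[$1]++ { n++ } END { print n+0, "reads mapped" }' aln.paf
```

Dividing that count by the total number of raw reads gives the mapped fraction; a low fraction would point at noise or divergence rather than repeats.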

For assembly, I can't think of anything to do except drop the kmer size as Sergey suggested. As he said, this will increase run time, and if it works, the size of overlap output will be much bigger.

skoren commented 7 years ago

Yes, I second what Brian said. I think it is consistent with some very repetitive or high-coverage contaminant sequence, but the majority of the data has no overlaps, indicating it is noisy, very diverged, or the genome is about 4-fold larger than your estimate.

However, I still expect more reads to have been output from correction with corMinCoverage=0; all reads with >=1 overlap should have some bases corrected. You can see in the logs it says it is trying to correct 397,623,746,990 of your total bases, which is about 97%, but then all those reads are lost in correctReads.sh (most end up with 0 pieces). Can you check the shell script and confirm it is passing 0 as the coverage threshold?
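A quick way to check that (the path follows canu's usual layout, but confirm it on your system):

```shell
# Look for the coverage threshold actually passed to falcon_sense; with
# corMinCoverage=0 this should show '--min_cov 0'. The path is an assumption.
grep -n -- '--min_cov' correction/2-correction/correctReads.sh
```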

skoren commented 7 years ago

So if the correctReads.sh shell script is specifying coverage 0 and the count of reads in asm.readsToCorrect is the full dataset, it is possible there is a short circuit in the consensus caller that skips a read when it has no alignments, even with corMinCov=0. If that is the case, we should fix it.

My earlier suggestion, while a bit cumbersome, should work to determine whether read quality is the issue or if it is something else. You would run correction, take all reads that were corrected along with all raw reads not output by correction to get back to full coverage, and run correction again on that dataset. Repeat that 4-5 times and try to run trimming and assembly on the final output.

janvanoeveren commented 7 years ago

Yes, I have an assembly from this set of corrected reads, yielding approx 100,000 unitigs totaling 4 Gb with an N50 of 40 kb. I haven't tried mapping raw reads yet. Another idea for checking contamination would be to look at GC content. The mode of 2 for overlaps is actually the raw number of overlaps (so not on a log scale); I would expect a Poisson-like distribution with a mode around 25, so this puzzles me too. I'm pretty sure it's using the corMinCoverage setting, as I got a major increase from 50 Gb corrected reads (default corMinCoverage=40) to 144 Gb corrected. This is part of correctReads.sh:

$bin/falcon_sense \
  --min_idt 0.7 \
  --min_len 1000 \
  --max_read_len 389014 \
  --min_ovl_len 500 \
  --min_cov 0 \
  --n_core 4 \
  > ./correction_outputs/$jobid.fasta.WORKING \
  2> ./correction_outputs/$jobid.err \
&& \
mv ./correction_outputs/$jobid.fasta.WORKING ./correction_outputs/$jobid.fasta

I appreciate your suggestion of the iterative approach, but as it will take a lot of time and space I'm a little hesitant to start it. Also, will it prevent the assembly of any contamination? Those reads will also be confirmed by repeated rounds of correction, right?
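The GC-content idea mentioned above could be sketched like this ('reads.fasta' is an assumed input name); a clearly bimodal per-read GC distribution would support the contaminant hypothesis:

```shell
# Per-read GC fraction from a FASTA. gsub() returns the number of G/C
# characters it replaced, so it doubles as a counter.
awk '/^>/ { if (id) printf "%s\t%.3f\n", id, gc / len
            id = substr($1, 2); gc = len = 0; next }
     { len += length($0); gc += gsub(/[GCgc]/, "") }
     END { if (id) printf "%s\t%.3f\n", id, gc / len }' reads.fasta
```

Piping the second column into a histogram (or plotting it) would show whether there are two GC populations.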

skoren commented 7 years ago

Multiple rounds shouldn't take any more space than one, since you can erase all the output of a previous round (except the fasta.gz file), but it will take 5 times longer (or however many rounds of correction you run). It won't eliminate the contaminant, but I'm not so much worried about that as about losing >50% of your data. Canu can deal with repeats in the assembly; it isn't going to affect your other data beyond increasing disk usage during correction. I think the reason the contaminant is getting assembled is that it is such high coverage that you have some higher-identity reads by chance, and those are getting corrected well. One other option is to try minimap2 instead of mhap for the correction overlaps, which will more aggressively filter repeats: try corOverlapper=minimap and symlink the minimap2 executable from the canu/Linux-amd64/bin directory. I don't expect this to make much difference in terms of the corrected reads, but it's worth a try.

janvanoeveren commented 7 years ago

So you mean try minimap for the read-correction step (and not for the obt and utg steps)? My latest assembly (with correctedErrorRate=0.1 minOverlapLength=300 minReadLength=500), starting from the same set of corrected reads (144 Gb corrected, 41 Gb trimmed), resulted in double the size (7.7 Gb) but still small pieces: 245 K unitigs, N50 31 K ...

skoren commented 7 years ago

Yep, minimap just for correction.

Interesting that the genome size doubled; that could indicate high heterozygosity. Have you run GenomeScope (http://qb.cshl.edu/genomescope/) on the corrected reads? You should be able to give it the unitigging/0-*/*.hist file as input (just the first two columns, space separated). You could also run BUSCO to see if you have gene duplication leading to the larger assembly size.
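Preparing that GenomeScope input could look like this (the glob follows the path given above; the output filename is an assumption):

```shell
# Keep only the first two columns of the mer histogram, space separated,
# for upload to GenomeScope. 'genomescope_input.txt' is an assumed name.
awk '{ print $1, $2 }' unitigging/0-*/*.hist > genomescope_input.txt
```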

janvanoeveren commented 7 years ago

Yep, ran BUSCO: got approx half of the genes, but very few duplicated.

brianwalenz commented 6 years ago

Idle, reopen if needed.