marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Evaluating Canu reports and assembly quality. #655

Closed mortunco closed 7 years ago

mortunco commented 7 years ago

Dear Authors,

As a visiting scholar who is new to de novo assembly, I couldn't find detailed information in the Canu wiki on how to tell whether my run ended successfully or whether I need to tweak some parameters and retry.

In a previously asked issue, #289, I found that "bad trimming" flags are not necessarily a bad sign, depending on the organism.

Before I ask my questions, I have two small comments about Canu status messages and the interpretation of results.

1) To make it clear whether the run finished or aborted due to an error, could you add a line such as "process ended successfully" (implying that the run has ended but the quality of the results still needs to be interpreted)?

2) Would you be able to add a section about interpreting assembly quality for people who are new to the subject (like me)?

In #289 a couple of histograms were requested, so I have just pasted the full assembly report. I am sorry if this question is too general, but any comments would help me decide which parameters to tweak. I am aware that there is no ultimate/perfect result, but I am planning to do RNA-seq, mutation calling, etc. next, so I want to be sure about the assembly before things get too complicated. Please tell me if you need other output files.

Thank you very much for your time,

Tunc.

tmorova@donut:~/kefal_genome$ cat kefal.report
[CORRECTION/READS]
--
-- In gatekeeper store 'correction/kefal.gkpStore':
--   Found 3629764 reads.
--   Found 28664078334 bases (0.05 times coverage).
--
--   Read length histogram (one '*' equals 4667.25 reads):
--        0    999      0
--     1000   1999 293679 **************************************************************
--     2000   2999 326708 **********************************************************************
--     3000   3999 323552 *********************************************************************
--     4000   4999 312992 *******************************************************************
--     5000   5999 296839 ***************************************************************
--     6000   6999 276515 ***********************************************************
--     7000   7999 253165 ******************************************************
--     8000   8999 231004 *************************************************
--     9000   9999 213937 *********************************************
--    10000  10999 200298 ******************************************
--    11000  11999 183646 ***************************************
--    12000  12999 156526 *********************************
--    13000  13999 127205 ***************************
--    14000  14999 100735 *********************
--    15000  15999  78965 ****************
--    16000  16999  60646 ************
--    17000  17999  46516 *********
--    18000  18999  35555 *******
--    19000  19999  26589 *****
--    20000  20999  20323 ****
--    21000  21999  15401 ***
--    22000  22999  11783 **
--    23000  23999   8947 *
--    24000  24999   6813 *
--    25000  25999   5093 *
--    26000  26999   4008
--    27000  27999   3100
--    28000  28999   2388
--    29000  29999   1729
--    30000  30999   1349
--    31000  31999    928
--    32000  32999    733
--    33000  33999    597
--    34000  34999    409
--    35000  35999    319
--    36000  36999    224
--    37000  37999    174
--    38000  38999    105
--    39000  39999     81
--    40000  40999     54
--    41000  41999     48
--    42000  42999     25
--    43000  43999     21
--    44000  44999     17
--    45000  45999     10
--    46000  46999      5
--    47000  47999      4
--    48000  48999      2
--    49000  49999      1
--    50000  50999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 160168767 ***************************                                            0.0774 0.0056
--       2-     2 198247238 **********************************                                     0.1733 0.0195
--       3-     4 379813678 *****************************************************************      0.2695 0.0403
--       5-     7 404226606 ********************************************************************** 0.4328 0.0930
--       8-    11 310801954 *****************************************************                  0.5989 0.1758
--      12-    16 210091867 ************************************                                   0.7283 0.2724
--      17-    22 136234173 ***********************                                                0.8181 0.3682
--      23-    29  87399535 ***************                                                        0.8776 0.4553
--      30-    37  56227256 *********                                                              0.9165 0.5302
--      38-    46  36651274 ******                                                                 0.9419 0.5928
--      47-    56  24404893 ****                                                                   0.9586 0.6443
--      57-    67  16608676 **                                                                     0.9698 0.6866
--      68-    79  11544991 *                                                                      0.9775 0.7214
--      80-    92   8191129 *                                                                      0.9829 0.7502
--      93-   106   5938800 *                                                                      0.9867 0.7742
--     107-   121   4406241                                                                        0.9895 0.7944
--     122-   137   3339997                                                                        0.9916 0.8117
--     138-   154   2574762                                                                        0.9932 0.8266
--     155-   172   2010196                                                                        0.9944 0.8395
--     173-   191   1587458                                                                        0.9953 0.8508
--     192-   211   1263097                                                                        0.9961 0.8608
--     212-   232   1016402                                                                        0.9967 0.8696
--     233-   254    824168                                                                        0.9972 0.8774
--     255-   277    672477                                                                        0.9976 0.8844
--     278-   301    555904                                                                        0.9979 0.8906
--     302-   326    462948                                                                        0.9982 0.8962
--     327-   352    390165                                                                        0.9984 0.9012
--     353-   379    331979                                                                        0.9986 0.9058
--     380-   407    284818                                                                        0.9987 0.9100
--     408-   436    246224                                                                        0.9989 0.9139
--     437-   466    214181                                                                        0.9990 0.9175
--     467-   497    187928                                                                        0.9991 0.9209
--     498-   529    165070                                                                        0.9992 0.9241
--     530-   562    145560                                                                        0.9993 0.9270
--     563-   596    127895                                                                        0.9993 0.9298
--     597-   631    113813                                                                        0.9994 0.9324
--     632-   667    101065                                                                        0.9994 0.9348
--     668-   704     89880                                                                        0.9995 0.9371
--     705-   742     80328                                                                        0.9995 0.9392
--     743-   781     71196                                                                        0.9996 0.9413
--     782-   821     63666                                                                        0.9996 0.9432
--
--    12082696 (max occurrences)
-- 28449463107 (total mers, non-unique)
--  1908449297 (distinct mers, non-unique)
--   160168767 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   3629515 reads longer than 0 bp
--   28657761232 bp
-- Expected corrected reads:
--   3629515 reads
--   26419680288 bp
--   0 bp minimum length
--   7279 bp mean length
--   0 bp n50 length

[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/kefal.gkpStore':
--   Found 3495244 reads.
--   Found 27246096664 bases (0.04 times coverage).
--
--   Read length histogram (one '*' equals 4574.48 reads):
--        0    999      0
--     1000   1999 261951 *********************************************************
--     2000   2999 316304 *********************************************************************
--     3000   3999 320214 **********************************************************************
--     4000   4999 311883 ********************************************************************
--     5000   5999 296203 ****************************************************************
--     6000   6999 274963 ************************************************************
--     7000   7999 250216 ******************************************************
--     8000   8999 227890 *************************************************
--     9000   9999 211232 **********************************************
--    10000  10999 197312 *******************************************
--    11000  11999 177356 **************************************
--    12000  12999 147854 ********************************
--    13000  13999 117345 *************************
--    14000  14999  91712 ********************
--    15000  15999  70756 ***************
--    16000  16999  54106 ***********
--    17000  17999  40890 ********
--    18000  18999  31238 ******
--    19000  19999  23413 *****
--    20000  20999  17543 ***
--    21000  21999  13261 **
--    22000  22999  10120 **
--    23000  23999   7644 *
--    24000  24999   5808 *
--    25000  25999   4353
--    26000  26999   3375
--    27000  27999   2586
--    28000  28999   2005
--    29000  29999   1448
--    30000  30999   1115
--    31000  31999    801
--    32000  32999    653
--    33000  33999    463
--    34000  34999    351
--    35000  35999    248
--    36000  36999    185
--    37000  37999    140
--    38000  38999     88
--    39000  39999     73
--    40000  40999     41
--    41000  41999     30
--    42000  42999     27
--    43000  43999     15
--    44000  44999     13
--    45000  45999      9
--    46000  46999      3
--    47000  47999      5
--    48000  48999      1
--    49000  49999      1
--    50000  50999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 12260626729 *******************************************************************--> 0.8896 0.4512
--       2-     2 533775095 ********************************************************************** 0.9284 0.4905
--       3-     4 284814077 *************************************                                  0.9416 0.5107
--       5-     7 185535500 ************************                                               0.9543 0.5391
--       8-    11 181107716 ***********************                                                0.9660 0.5804
--      12-    16 166213114 *********************                                                  0.9785 0.6461
--      17-    22  97505608 ************                                                           0.9894 0.7281
--      23-    29  32661591 ****                                                                   0.9953 0.7883
--      30-    37  11428041 *                                                                      0.9973 0.8141
--      38-    46   6744985                                                                        0.9980 0.8270
--      47-    56   4681131                                                                        0.9985 0.8370
--      57-    67   3396372                                                                        0.9988 0.8456
--      68-    79   2475548                                                                        0.9990 0.8531
--      80-    92   1836334                                                                        0.9992 0.8597
--      93-   106   1381945                                                                        0.9994 0.8654
--     107-   121   1073182                                                                        0.9994 0.8703
--     122-   137    854088                                                                        0.9995 0.8748
--     138-   154    695443                                                                        0.9996 0.8788
--     155-   172    572700                                                                        0.9996 0.8825
--     173-   191    474726                                                                        0.9997 0.8859
--     192-   211    403287                                                                        0.9997 0.8891
--     212-   232    351721                                                                        0.9997 0.8920
--     233-   254    308458                                                                        0.9998 0.8949
--     255-   277    270487                                                                        0.9998 0.8976
--     278-   301    233142                                                                        0.9998 0.9003
--     302-   326    205025                                                                        0.9998 0.9027
--     327-   352    182074                                                                        0.9998 0.9051
--     353-   379    162066                                                                        0.9999 0.9074
--     380-   407    145224                                                                        0.9999 0.9095
--     408-   436    132609                                                                        0.9999 0.9116
--     437-   466    120070                                                                        0.9999 0.9137
--     467-   497    110231                                                                        0.9999 0.9157
--     498-   529    100708                                                                        0.9999 0.9176
--     530-   562     91565                                                                        0.9999 0.9195
--     563-   596     83045                                                                        0.9999 0.9214
--     597-   631     74053                                                                        0.9999 0.9231
--     632-   667     66574                                                                        0.9999 0.9248
--     668-   704     60236                                                                        0.9999 0.9264
--     705-   742     55556                                                                        0.9999 0.9279
--     743-   781     50255                                                                        0.9999 0.9294
--     782-   821     46485                                                                        0.9999 0.9308
--
--     4636251 (max occurrences)
-- 14912069811 (total mers, non-unique)
--  1521244268 (distinct mers, non-unique)
-- 12260626729 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0450    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--
--  INPUT READS:
--  -----------
--  3495244 reads  27246096664 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--
--  OUTPUT READS:
--  ------------
--  1908356 reads  12231115752 bases (trimmed reads output)
--   10887 reads     82122962 bases (reads with no change, kept as is)
--  1413575 reads   8343705824 bases (reads with no overlaps, deleted)
--  162426 reads   1076303248 bases (reads with short trimmed length, deleted)
--
--  TRIMMING DETAILS:
--  ----------------
--  1798193 reads   3317375440 bases (bases trimmed from the 5' end of a read)
--  1847292 reads   2195473438 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0450    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--  1919243 reads  17826087592 bases (reads processed)
--  1576001 reads   9420009072 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--    2128 reads     20164402 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--  1917115 reads  17805923190 bases (processed for subreads)
--
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--    1802 reads         1855 signals (number of subread signal)
--
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--    1855 reads       725743 bases (size of subread signal)
--
--  TRIMMING:
--  --------
--     892 reads      2801628 bases (trimmed from the 5' end of the read)
--     910 reads      2898173 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/kefal.gkpStore':
--   Found 1919237 reads.
--   Found 12307534080 bases (0.02 times coverage).
--
--   Read length histogram (one '*' equals 3444.71 reads):
--        0    999      0
--     1000   1999 241130 **********************************************************************
--     2000   2999 206150 ***********************************************************
--     3000   3999 200506 **********************************************************
--     4000   4999 192071 *******************************************************
--     5000   5999 178966 ***************************************************
--     6000   6999 162169 ***********************************************
--     7000   7999 139011 ****************************************
--     8000   8999 122364 ***********************************
--     9000   9999 109188 *******************************
--    10000  10999  96426 ***************************
--    11000  11999  80027 ***********************
--    12000  12999  59739 *****************
--    13000  13999  41489 ************
--    14000  14999  29058 ********
--    15000  15999  20042 *****
--    16000  16999  13341 ***
--    17000  17999   9049 **
--    18000  18999   6105 *
--    19000  19999   3991 *
--    20000  20999   2643
--    21000  21999   1883
--    22000  22999   1268
--    23000  23999    814
--    24000  24999    607
--    25000  25999    399
--    26000  26999    257
--    27000  27999    170
--    28000  28999    110
--    29000  29999     87
--    30000  30999     75
--    31000  31999     39
--    32000  32999     17
--    33000  33999     23
--    34000  34999     11
--    35000  35999      5
--    36000  36999      4
--    37000  37999      1
--    38000  38999      0
--    39000  39999      1
--    40000  40999      1

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 1447015818 *******************************************************************--> 0.6029 0.1180
--       2-     2 221550005 ********************************************************************** 0.6952 0.1541
--       3-     4 165817638 ****************************************************                   0.7368 0.1785
--       5-     7 143696575 *********************************************                          0.7864 0.2216
--       8-    11 154496371 ************************************************                       0.8414 0.2965
--      12-    16 139066530 *******************************************                            0.9025 0.4211
--      17-    22  76704062 ************************                                               0.9544 0.5712
--      23-    29  23794362 *******                                                                0.9809 0.6746
--      30-    37   8073147 **                                                                     0.9890 0.7158
--      38-    46   4799111 *                                                                      0.9920 0.7360
--      47-    56   3322848 *                                                                      0.9939 0.7518
--      57-    67   2375175                                                                        0.9952 0.7653
--      68-    79   1697902                                                                        0.9962 0.7769
--      80-    92   1245202                                                                        0.9969 0.7869
--      93-   106    942099                                                                        0.9974 0.7954
--     107-   121    732654                                                                        0.9978 0.8029
--     122-   137    586262                                                                        0.9981 0.8096
--     138-   154    480968                                                                        0.9983 0.8157
--     155-   172    395916                                                                        0.9985 0.8214
--     173-   191    341037                                                                        0.9987 0.8266
--     192-   211    296078                                                                        0.9988 0.8316
--     212-   232    254651                                                                        0.9989 0.8365
--     233-   254    215522                                                                        0.9990 0.8410
--     255-   277    188700                                                                        0.9991 0.8453
--     278-   301    167571                                                                        0.9992 0.8494
--     302-   326    148208                                                                        0.9993 0.8533
--     327-   352    133242                                                                        0.9993 0.8571
--     353-   379    121520                                                                        0.9994 0.8608
--     380-   407    111145                                                                        0.9994 0.8644
--     408-   436     99146                                                                        0.9995 0.8679
--     437-   466     89273                                                                        0.9995 0.8713
--     467-   497     81724                                                                        0.9996 0.8746
--     498-   529     73095                                                                        0.9996 0.8778
--     530-   562     64545                                                                        0.9996 0.8809
--     563-   596     58255                                                                        0.9996 0.8837
--     597-   631     52634                                                                        0.9997 0.8865
--     632-   667     47991                                                                        0.9997 0.8891
--     668-   704     43826                                                                        0.9997 0.8916
--     705-   742     40439                                                                        0.9997 0.8941
--     743-   781     37113                                                                        0.9997 0.8965
--     782-   821     34485                                                                        0.9998 0.8988
--
--     2771059 (max occurrences)
-- 10820214285 (total mers, non-unique)
--   953021358 (distinct mers, non-unique)
--  1447015818 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing       4912    0.26     7757.87 +- 4266.86        700.29 +- 810.81     (bad trimming)
--   middle-hump          5263    0.27     5052.64 +- 3358.56        411.98 +- 666.82     (bad trimming)
--   no-5-prime          30246    1.58     7248.07 +- 4282.80        283.56 +- 569.21     (bad trimming)
--   no-3-prime          30456    1.59     7274.99 +- 4266.40        284.69 +- 564.61     (bad trimming)
--
--   low-coverage       372242   19.40     3546.36 +- 2513.70          4.35 +- 1.98       (easy to assemble, potential for lower quality consensus)
--   unique             619104   32.26     5900.64 +- 3584.44         17.85 +- 5.61       (easy to assemble, perfect, yay)
--   repeat-cont         48239    2.51     5004.67 +- 3087.43        664.58 +- 762.18     (potential for consensus errors, no impact on assembly)
--   repeat-dove           347    0.02    11948.92 +- 5775.99        410.45 +- 518.99     (hard to assemble, likely won't assemble correctly or even at all)
--
--   span-repeat        238007   12.40     8996.36 +- 4134.79       3630.03 +- 3256.51    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   416695   21.71     6656.96 +- 3071.05                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   136860    7.13    11760.30 +- 3722.25                             (will end contigs, potential to misassemble)
--   uniq-anchor          4264    0.22     8649.11 +- 3565.50       2771.79 +- 2839.78    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      21987 sequences, total length 871658642 bp (including 1180 repeats of total length 12090138 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  607088 sequences, total length 3272717389 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      21987 sequences, total length 870033481 bp (including 1180 repeats of total length 12069127 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  607088 sequences, total length 3272430212 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--

skoren commented 7 years ago

Canu does output a message like:

-- Finished stage 'outputSequence', reset canuIteration.
--
-- Bye.

at the end of the run. There should be a canu.out file in the folder which will have this output when running on the grid.
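
A minimal sketch of checking for that closing message automatically, in Python. It assumes the `-- Bye.` line quoted above is what a clean finish looks like in your Canu version; the function name `run_finished` and the `tail_lines` parameter are made up for illustration and are not part of Canu.

```python
from pathlib import Path

# Assumption: the closing "-- Bye." line quoted above marks a clean finish.
DONE_MARKER = "-- Bye."


def run_finished(log_path, tail_lines=20):
    """Return True if one of the last few log lines is the closing marker.

    Returns False when the file is missing or the marker is absent. This only
    says the pipeline reached its end; it says nothing about assembly quality.
    """
    path = Path(log_path)
    if not path.is_file():
        return False
    lines = path.read_text(errors="replace").splitlines()
    return any(line.strip() == DONE_MARKER for line in lines[-tail_lines:])
```

For example, `run_finished("kefal_genome/canu.out")` (a hypothetical path, adjust it to your own output directory) would return `True` only if the log ends with the message shown above.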

There is also some information in the FAQ about how to improve assembly continuity and deal with different techs/genome characteristics. What kind of information would you like to see? The report provides only basic contiguity stats to see if something went wrong during the assembly (low coverage from correction, strange k-mer distributions) but it won't detect assembly errors or similar issues.

In your case, Canu is reporting only 0.05x coverage but based on everything else in the logs there is about 30-40x. There is also an assembly of almost 900mb in size so I'm going to guess the genome size was not set correctly for this run (I would guess it was set to 1 terabase instead of 1 gig). The report doesn't have any assembly stats because of this since the assembly is <1% of expected size.
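
The arithmetic behind that guess can be sketched as follows. This assumes the coverage printed in the report is simply total bases divided by the `genomeSize` given on the command line, which is consistent with the numbers in this thread; the function names are illustrative only. Because the report rounds coverage to two decimals, the implied genome size is only a rough figure.

```python
import re

# Line copied from the [CORRECTION/READS] section of the report above.
REPORT_LINE = "--   Found 28664078334 bases (0.05 times coverage)."


def parse_bases_and_coverage(line):
    """Extract (total bases, reported coverage) from a 'Found N bases' line."""
    m = re.search(r"Found (\d+) bases \(([\d.]+) times coverage\)", line)
    if m is None:
        raise ValueError("not a bases/coverage line")
    return int(m.group(1)), float(m.group(2))


def coverage(total_bases, genome_size):
    """Assumed definition: coverage = total bases / genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases, reported_coverage):
    """Invert the assumed definition to recover the genome size that was set."""
    return total_bases / reported_coverage


bases, cov = parse_bases_and_coverage(REPORT_LINE)
print(f"implied genomeSize: about {implied_genome_size(bases, cov) / 1e9:.0f} Gbp")
print(f"coverage if the genome were 1 Gbp: {coverage(bases, 1e9):.1f}x")
```

A reported coverage of 0.05x from roughly 28.7 Gbp of reads points to a genome size setting in the hundreds of gigabases, far larger than the assembly of under 1 Gbp.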

mortunco commented 7 years ago

Thank you for your fast response. I think my run is not correct because I entered the wrong genome size.

> What kind of information would you like to see? The report provides only basic contiguity stats to see if something went wrong during the assembly (low coverage from correction, strange k-mer distributions) but it won't detect assembly errors or similar issues.

I just wanted to know, in the simplest way possible, whether my assembly is viable for further analysis.

> In your case, Canu is reporting only 0.05x coverage but based on everything else in the logs there is about 30-40x. There is also an assembly of almost 900mb in size so I'm going to guess the genome size was not set correctly for this run (I would guess it was set to 1 terabase instead of 1 gig). The report doesn't have any assembly stats because of this since the assembly is <1% of expected size.

You are right. I made a mistake when calculating the genome size: for some reason, I estimated it from the file size. I know this question is outside the scope of this issue, but if I set my genome size to 5g, as you suggest, will I get higher coverage? Or is it a must to run software that estimates the genome size? (I just found KmerGenie, which is used for genome size estimation, but I am also open to your suggestions.)

Thank you very much for your help and patience,

Best regards,

Tunc.

This is the command line with which I obtained the aforementioned results.

tmorova@lisa:~$ canu-1.6/Linux-amd64/bin/canu \
  -p kefal \
  -d kefal_genome/ \
  genomeSize=555g \
  -pacbio-raw kefal_pacbio/pacbio/*/*/*.fastq

skoren commented 7 years ago

If you set the genome size to 5g then the stats reported would be more accurate, yes. I'm not sure it would change the assembly very much though. The genome size doesn't have to be exact, as long as it is in the right ballpark (say, 6g instead of 5g is ok). You could also use GenomeScope to estimate genome size and diversity, but both it and KmerGenie work best on Illumina data, not raw PacBio data.

Given that genome size, you only have about 5x of data (28664078334 / 5000000000 from the correction/asm.gkpStore log) which isn't enough to assemble the full genome. It looks like you assembled about 20% of the genome from the log:

--   contigs:      21987 sequences, total length 870033481 bp (including 1180 repeats of total length 12069127 bp).

So I don't think this assembly would be sufficient for downstream analysis since it is so incomplete. You could confirm this by using BUSCO, which looks for single-copy universal genes in the assembly; presumably only about 20% of them would be found if the genome size is accurate.
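The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch: the 5 Gbp genome size is the working assumption at this point in the thread, and the variable names are made up for illustration (they are not Canu output fields).

```python
# Numbers copied from the report lines quoted above.
input_bases = 28_664_078_334          # bases in the correction gkpStore
contig_bases = 870_033_481            # total contig length from the first run
assumed_genome_size = 5_000_000_000   # assumption: 5 Gbp, as discussed above

# Coverage is simply input bases divided by the assumed genome size.
input_coverage = input_bases / assumed_genome_size

# Fraction of the assumed genome that ended up in contigs.
assembled_fraction = contig_bases / assumed_genome_size

print(f"input coverage:     {input_coverage:.2f}x")
print(f"assembled fraction: {assembled_fraction:.1%}")
```

Both figures depend entirely on the assumed genome size, so they are only as good as that estimate.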

mortunco commented 7 years ago

Dear skoren,

I have just finished a new run with the new genome size parameter, and I waited so that I could ask a couple of questions with the newest results. You are right about the 5x: the new genome size parameter produced the results you expected. But can I improve this coverage (am I doing something wrong again that leaves me with low coverage?), or is it what it is and there is nothing to be done?

Thank you very much for your time and patience.

Best regards,

Tunc.

This is my new run.report. Maybe it helps.

[CORRECTION/READS]
--
-- In gatekeeper store 'correction/kefal.gkpStore':
--   Found 3629764 reads.
--   Found 28664078334 bases (5.73 times coverage).
--
--   Read length histogram (one '*' equals 4667.25 reads):
--        0    999      0
--     1000   1999 293679 **************************************************************
--     2000   2999 326708 **********************************************************************
--     3000   3999 323552 *********************************************************************
--     4000   4999 312992 *******************************************************************
--     5000   5999 296839 ***************************************************************
--     6000   6999 276515 ***********************************************************
--     7000   7999 253165 ******************************************************
--     8000   8999 231004 *************************************************
--     9000   9999 213937 *********************************************
--    10000  10999 200298 ******************************************
--    11000  11999 183646 ***************************************
--    12000  12999 156526 *********************************
--    13000  13999 127205 ***************************
--    14000  14999 100735 *********************
--    15000  15999  78965 ****************
--    16000  16999  60646 ************
--    17000  17999  46516 *********
--    18000  18999  35555 *******
--    19000  19999  26589 *****
--    20000  20999  20323 ****
--    21000  21999  15401 ***
--    22000  22999  11783 **
--    23000  23999   8947 *
--    24000  24999   6813 *
--    25000  25999   5093 *
--    26000  26999   4008
--    27000  27999   3100
--    28000  28999   2388
--    29000  29999   1729
--    30000  30999   1349
--    31000  31999    928
--    32000  32999    733
--    33000  33999    597
--    34000  34999    409
--    35000  35999    319
--    36000  36999    224
--    37000  37999    174
--    38000  38999    105
--    39000  39999     81
--    40000  40999     54
--    41000  41999     48
--    42000  42999     25
--    43000  43999     21
--    44000  44999     17
--    45000  45999     10
--    46000  46999      5
--    47000  47999      4
--    48000  48999      2
--    49000  49999      1
--    50000  50999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 160168767 ***************************                                            0.0774 0.0056
--       2-     2 198247238 **********************************                                     0.1733 0.0195
--       3-     4 379813678 *****************************************************************      0.2695 0.0403
--       5-     7 404226606 ********************************************************************** 0.4328 0.0930
--       8-    11 310801954 *****************************************************                  0.5989 0.1758
--      12-    16 210091867 ************************************                                   0.7283 0.2724
--      17-    22 136234173 ***********************                                                0.8181 0.3682
--      23-    29  87399535 ***************                                                        0.8776 0.4553
--      30-    37  56227256 *********                                                              0.9165 0.5302
--      38-    46  36651274 ******                                                                 0.9419 0.5928
--      47-    56  24404893 ****                                                                   0.9586 0.6443
--      57-    67  16608676 **                                                                     0.9698 0.6866
--      68-    79  11544991 *                                                                      0.9775 0.7214
--      80-    92   8191129 *                                                                      0.9829 0.7502
--      93-   106   5938800 *                                                                      0.9867 0.7742
--     107-   121   4406241                                                                        0.9895 0.7944
--     122-   137   3339997                                                                        0.9916 0.8117
--     138-   154   2574762                                                                        0.9932 0.8266
--     155-   172   2010196                                                                        0.9944 0.8395
--     173-   191   1587458                                                                        0.9953 0.8508
--     192-   211   1263097                                                                        0.9961 0.8608
--     212-   232   1016402                                                                        0.9967 0.8696
--     233-   254    824168                                                                        0.9972 0.8774
--     255-   277    672477                                                                        0.9976 0.8844
--     278-   301    555904                                                                        0.9979 0.8906
--     302-   326    462948                                                                        0.9982 0.8962
--     327-   352    390165                                                                        0.9984 0.9012
--     353-   379    331979                                                                        0.9986 0.9058
--     380-   407    284818                                                                        0.9987 0.9100
--     408-   436    246224                                                                        0.9989 0.9139
--     437-   466    214181                                                                        0.9990 0.9175
--     467-   497    187928                                                                        0.9991 0.9209
--     498-   529    165070                                                                        0.9992 0.9241
--     530-   562    145560                                                                        0.9993 0.9270
--     563-   596    127895                                                                        0.9993 0.9298
--     597-   631    113813                                                                        0.9994 0.9324
--     632-   667    101065                                                                        0.9994 0.9348
--     668-   704     89880                                                                        0.9995 0.9371
--     705-   742     80328                                                                        0.9995 0.9392
--     743-   781     71196                                                                        0.9996 0.9413
--     782-   821     63666                                                                        0.9996 0.9432
--
--    12082696 (max occurrences)
-- 28449463107 (total mers, non-unique)
--  1908449297 (distinct mers, non-unique)
--   160168767 (unique mers)

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   3629515 reads longer than 0 bp
--   28657761232 bp
-- Expected corrected reads:
--   3629515 reads
--   26419680288 bp
--   0 bp minimum length
--   7279 bp mean length
--   18191 bp n50 length

[TRIMMING/READS]
--
-- In gatekeeper store 'trimming/kefal.gkpStore':
--   Found 3495244 reads.
--   Found 27246096664 bases (5.44 times coverage).
--
--   Read length histogram (one '*' equals 4574.48 reads):
--        0    999      0
--     1000   1999 261951 *********************************************************
--     2000   2999 316304 *********************************************************************
--     3000   3999 320214 **********************************************************************
--     4000   4999 311883 ********************************************************************
--     5000   5999 296203 ****************************************************************
--     6000   6999 274963 ************************************************************
--     7000   7999 250216 ******************************************************
--     8000   8999 227890 *************************************************
--     9000   9999 211232 **********************************************
--    10000  10999 197312 *******************************************
--    11000  11999 177356 **************************************
--    12000  12999 147854 ********************************
--    13000  13999 117345 *************************
--    14000  14999  91712 ********************
--    15000  15999  70756 ***************
--    16000  16999  54106 ***********
--    17000  17999  40890 ********
--    18000  18999  31238 ******
--    19000  19999  23413 *****
--    20000  20999  17543 ***
--    21000  21999  13261 **
--    22000  22999  10120 **
--    23000  23999   7644 *
--    24000  24999   5808 *
--    25000  25999   4353
--    26000  26999   3375
--    27000  27999   2586
--    28000  28999   2005
--    29000  29999   1448
--    30000  30999   1115
--    31000  31999    801
--    32000  32999    653
--    33000  33999    463
--    34000  34999    351
--    35000  35999    248
--    36000  36999    185
--    37000  37999    140
--    38000  38999     88
--    39000  39999     73
--    40000  40999     41
--    41000  41999     30
--    42000  42999     27
--    43000  43999     15
--    44000  44999     13
--    45000  45999      9
--    46000  46999      3
--    47000  47999      5
--    48000  48999      1
--    49000  49999      1
--    50000  50999      1

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 12260626729 *******************************************************************--> 0.8896 0.4512
--       2-     2 533775095 ********************************************************************** 0.9284 0.4905
--       3-     4 284814077 *************************************                                  0.9416 0.5107
--       5-     7 185535500 ************************                                               0.9543 0.5391
--       8-    11 181107716 ***********************                                                0.9660 0.5804
--      12-    16 166213114 *********************                                                  0.9785 0.6461
--      17-    22  97505608 ************                                                           0.9894 0.7281
--      23-    29  32661591 ****                                                                   0.9953 0.7883
--      30-    37  11428041 *                                                                      0.9973 0.8141
--      38-    46   6744985                                                                        0.9980 0.8270
--      47-    56   4681131                                                                        0.9985 0.8370
--      57-    67   3396372                                                                        0.9988 0.8456
--      68-    79   2475548                                                                        0.9990 0.8531
--      80-    92   1836334                                                                        0.9992 0.8597
--      93-   106   1381945                                                                        0.9994 0.8654
--     107-   121   1073182                                                                        0.9994 0.8703
--     122-   137    854088                                                                        0.9995 0.8748
--     138-   154    695443                                                                        0.9996 0.8788
--     155-   172    572700                                                                        0.9996 0.8825
--     173-   191    474726                                                                        0.9997 0.8859
--     192-   211    403287                                                                        0.9997 0.8891
--     212-   232    351721                                                                        0.9997 0.8920
--     233-   254    308458                                                                        0.9998 0.8949
--     255-   277    270487                                                                        0.9998 0.8976
--     278-   301    233142                                                                        0.9998 0.9003
--     302-   326    205025                                                                        0.9998 0.9027
--     327-   352    182074                                                                        0.9998 0.9051
--     353-   379    162066                                                                        0.9999 0.9074
--     380-   407    145224                                                                        0.9999 0.9095
--     408-   436    132609                                                                        0.9999 0.9116
--     437-   466    120070                                                                        0.9999 0.9137
--     467-   497    110231                                                                        0.9999 0.9157
--     498-   529    100708                                                                        0.9999 0.9176
--     530-   562     91565                                                                        0.9999 0.9195
--     563-   596     83045                                                                        0.9999 0.9214
--     597-   631     74053                                                                        0.9999 0.9231
--     632-   667     66574                                                                        0.9999 0.9248
--     668-   704     60236                                                                        0.9999 0.9264
--     705-   742     55556                                                                        0.9999 0.9279
--     743-   781     50255                                                                        0.9999 0.9294
--     782-   821     46485                                                                        0.9999 0.9308
--
--     4636251 (max occurrences)
-- 14912069811 (total mers, non-unique)
--  1521244268 (distinct mers, non-unique)
-- 12260626729 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0450    (use overlaps at or below this fraction error)
--        1    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        1    (break region if overlap coverage is less than this many read, for 'largest covered' algorithm)
--
--  INPUT READS:
--  -----------
--  3495244 reads  27246096664 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--
--  OUTPUT READS:
--  ------------
--  1906859 reads  12247680932 bases (trimmed reads output)
--   11611 reads     88262599 bases (reads with no change, kept as is)
--  1415652 reads   8360582403 bases (reads with no overlaps, deleted)
--  161122 reads   1066902964 bases (reads with short trimmed length, deleted)
--
--  TRIMMING DETAILS:
--  ----------------
--  1795449 reads   3300784545 bases (bases trimmed from the 5' end of a read)
--  1843927 reads   2181883221 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0450    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--  1918470 reads  17818611297 bases (reads processed)
--  1576774 reads   9427485367 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--    2099 reads     20164371 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--  1916371 reads  17798446926 bases (processed for subreads)
--
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--    1612 reads         1653 signals (number of subread signal)
--
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--    1653 reads       624382 bases (size of subread signal)
--
--  TRIMMING:
--  --------
--     804 reads      2581563 bases (trimmed from the 5' end of the read)
--     808 reads      2563806 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In gatekeeper store 'unitigging/kefal.gkpStore':
--   Found 1918464 reads.
--   Found 12330792807 bases (2.46 times coverage).
--
--   Read length histogram (one '*' equals 3426.61 reads):
--        0    999      0
--     1000   1999 239863 **********************************************************************
--     2000   2999 205272 ***********************************************************
--     3000   3999 199874 **********************************************************
--     4000   4999 191229 *******************************************************
--     5000   5999 178933 ****************************************************
--     6000   6999 162014 ***********************************************
--     7000   7999 139561 ****************************************
--     8000   8999 122586 ***********************************
--     9000   9999 109484 *******************************
--    10000  10999  97025 ****************************
--    11000  11999  80311 ***********************
--    12000  12999  59992 *****************
--    13000  13999  41738 ************
--    14000  14999  29223 ********
--    15000  15999  20174 *****
--    16000  16999  13476 ***
--    17000  17999   9085 **
--    18000  18999   6126 *
--    19000  19999   4030 *
--    20000  20999   2659
--    21000  21999   1899
--    22000  22999   1256
--    23000  23999    814
--    24000  24999    625
--    25000  25999    400
--    26000  26999    260
--    27000  27999    177
--    28000  28999    112
--    29000  29999     88
--    30000  30999     73
--    31000  31999     38
--    32000  32999     19
--    33000  33999     22
--    34000  34999     13
--    35000  35999      6
--    36000  36999      3
--    37000  37999      2
--    38000  38999      0
--    39000  39999      1
--    40000  40999      1

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 1457015698 *******************************************************************--> 0.6040 0.1185
--       2-     2 222392301 ********************************************************************** 0.6962 0.1547
--       3-     4 166192731 ****************************************************                   0.7377 0.1791
--       5-     7 143889575 *********************************************                          0.7872 0.2223
--       8-    11 154736318 ************************************************                       0.8420 0.2972
--      12-    16 139299333 *******************************************                            0.9028 0.4216
--      17-    22  76830942 ************************                                               0.9546 0.5717
--      23-    29  23830374 *******                                                                0.9810 0.6751
--      30-    37   8080124 **                                                                     0.9890 0.7163
--      38-    46   4807446 *                                                                      0.9920 0.7365
--      47-    56   3325156 *                                                                      0.9939 0.7522
--      57-    67   2378775                                                                        0.9953 0.7658
--      68-    79   1700300                                                                        0.9962 0.7774
--      80-    92   1246263                                                                        0.9969 0.7873
--      93-   106    943593                                                                        0.9974 0.7958
--     107-   121    733334                                                                        0.9978 0.8033
--     122-   137    587348                                                                        0.9981 0.8100
--     138-   154    481902                                                                        0.9983 0.8162
--     155-   172    396668                                                                        0.9985 0.8218
--     173-   191    340978                                                                        0.9987 0.8271
--     192-   211    296269                                                                        0.9988 0.8321
--     212-   232    255127                                                                        0.9989 0.8369
--     233-   254    216020                                                                        0.9990 0.8415
--     255-   277    188654                                                                        0.9991 0.8457
--     278-   301    167732                                                                        0.9992 0.8498
--     302-   326    148592                                                                        0.9993 0.8537
--     327-   352    133390                                                                        0.9993 0.8575
--     353-   379    121586                                                                        0.9994 0.8612
--     380-   407    111504                                                                        0.9994 0.8648
--     408-   436     99469                                                                        0.9995 0.8683
--     437-   466     89500                                                                        0.9995 0.8717
--     467-   497     82023                                                                        0.9996 0.8750
--     498-   529     73239                                                                        0.9996 0.8782
--     530-   562     64692                                                                        0.9996 0.8813
--     563-   596     58486                                                                        0.9996 0.8842
--     597-   631     52828                                                                        0.9997 0.8869
--     632-   667     48049                                                                        0.9997 0.8895
--     668-   704     43956                                                                        0.9997 0.8921
--     705-   742     40625                                                                        0.9997 0.8945
--     743-   781     37285                                                                        0.9997 0.8969
--     782-   821     34685                                                                        0.9998 0.8992
--
--     2777938 (max occurrences)
-- 10833489365 (total mers, non-unique)
--   955101113 (distinct mers, non-unique)
--  1457015698 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing       4681    0.24     7922.26 +- 4270.65        696.52 +- 788.02     (bad trimming)
--   middle-hump          5368    0.28     5111.90 +- 3307.65        426.57 +- 713.32     (bad trimming)
--   no-5-prime          31096    1.62     7239.44 +- 4261.69        302.69 +- 600.04     (bad trimming)
--   no-3-prime          31264    1.63     7260.04 +- 4257.05        300.99 +- 595.48     (bad trimming)
--
--   low-coverage       376134   19.61     3568.38 +- 2532.20          4.34 +- 1.98       (easy to assemble, potential for lower quality consensus)
--   unique             620466   32.34     5897.27 +- 3586.42         17.86 +- 5.63       (easy to assemble, perfect, yay)
--   repeat-cont         41696    2.17     5299.95 +- 3253.08        430.77 +- 406.55     (potential for consensus errors, no impact on assembly)
--   repeat-dove           232    0.01    13591.04 +- 5956.38        298.90 +- 282.08     (hard to assemble, likely won't assemble correctly or even at all)
--
--   span-repeat        239201   12.47     9013.93 +- 4131.67       3648.01 +- 3264.46    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   416219   21.70     6674.32 +- 3067.54                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   136186    7.10    11790.83 +- 3714.03                             (will end contigs, potential to misassemble)
--   uniq-anchor          2765    0.14     9217.72 +- 3480.68       2978.97 +- 2902.69    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      21977 sequences, total length 873673405 bp (including 1152 repeats of total length 12175083 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  609194 sequences, total length 3289799996 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10       50294          5474   500047085
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      21977 sequences, total length 872069112 bp (including 1152 repeats of total length 12152210 bp).
--   bubbles:      0 sequences, total length 0 bp.
--   unassembled:  609194 sequences, total length 3289508348 bp.
--
-- Contig sizes based on genome size --
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10       50099          5492   500025153
--
brianwalenz commented 7 years ago

The genomeSize parameter is used only for determining coverage in input reads and reporting of statistics. The coverage reported is just bases_in_input_reads / genome_size_parameter.
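As a rough illustration of that formula, the sketch below recomputes the "times coverage" figures from the base counts in the reports. Truncating to two decimals is an assumption made here because it happens to reproduce the printed values; it is not documented Canu behavior, and the function name is made up.

```python
import math

# Base counts copied from the [*/READS] sections of the report above.
STAGE_BASES = {
    "correction": 28_664_078_334,
    "trimming": 27_246_096_664,
    "unitigging": 12_330_792_807,
}

def reported_coverage(bases: int, genome_size: int) -> float:
    """bases_in_input_reads / genome_size_parameter, truncated to two
    decimals (the truncation is an assumption, see the note above)."""
    return math.floor(bases / genome_size * 100) / 100

# genomeSize=5g, as used for the second run.
for stage, bases in STAGE_BASES.items():
    print(f"{stage:<11} {reported_coverage(bases, 5_000_000_000):.2f}x")

# genomeSize=555g, as used for the first run.
print(f"first run   {reported_coverage(STAGE_BASES['correction'], 555_000_000_000):.2f}x")
```

The same reads give very different coverage figures under the two settings, which is why the first report showed 0.05x.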

The unitigging overlap report is claiming 17x (+-5x) coverage in reads that look to be from unique portions of the genome. The kmer report in the same section is showing a slight peak at about 10x, but this is usually skewed low by noisy reads, and the big peak at low copy number shows this is pretty noisy data.

The unitigging gatekeeper report is showing 12,330,792,807 bases in input reads, but the consensus report says about 3.3 Gbp of those remain 'unassembled'. So that leaves about 9 Gbp in input bases that assembled to about 0.9 Gbp, so around 10x in assembled coverage.

You can try the low coverage settings, but it looks like you might need more input coverage for any better assembly.
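The "around 10x in assembled coverage" estimate above can be reproduced from three numbers in the report. A minimal sketch, with variable names chosen here for illustration:

```python
# Values copied from the second run's report.
unitigging_input_bases = 12_330_792_807  # [UNITIGGING/READS]
unassembled_bases = 3_289_508_348        # [UNITIGGING/CONSENSUS], 'unassembled'
contig_bases = 872_069_112               # [UNITIGGING/CONSENSUS], contigs

# Bases that actually went into contigs, then the depth they provide.
bases_in_contigs = unitigging_input_bases - unassembled_bases
assembled_coverage = bases_in_contigs / contig_bases

print(f"bases in contigs:   {bases_in_contigs / 1e9:.2f} Gbp")
print(f"assembled coverage: {assembled_coverage:.1f}x")
```

Unlike the reported "times coverage", this figure does not depend on the genomeSize parameter at all.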

skoren commented 7 years ago

I would suggest running the unitigging kmer histogram (unitigging/0-*/*.histogram) through something like GenomeScope (http://qb.cshl.edu/genomescope/) to see whether it is able to predict a genome size and heterozygosity. That will give you a better idea of whether the 1gb you've assembled is a small part of your genome or not. However, you probably do need more coverage to improve the assembly result.

mortunco commented 7 years ago

@skoren We also have Illumina data for the same sample and, as you suggested, I ran the Illumina data through KmerGenie. The estimated genome size was ~1 Gbp (892293194 bp).

Regarding @brianwalenz's comment, I understand that the genome size parameter is only used for simple calculations, so changing it cannot really change the output. But now that I have found an estimate about one fifth of the value I used, do you think it is worth rerunning with the new value? If the algorithm assembled about 1gb, is that actually the whole genome?

Thank you both for your patience and for the time you took to help me with my problem. I owe you a lot!

Best regards,

T.

skoren commented 7 years ago

So if your genome size is 1gb, that would imply the Canu assembly you have is almost the complete genome (873 Mbp) with an NG50 of 50kb. This would also be consistent with the slight peak at 10x in the corrected data, 12330792807 / 10 = 1.2 Gbp. Any reason you are setting the genome size to 5 if the estimate was 1? Another way to test this is to use BUSCO (http://busco.ezlab.org) to see how complete the marker genes for your assembly are. If the genome is close to 1 Gbp, the gene set will be largely complete; if it is 5 Gb, it should be < 20% complete.
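To see why the NG10 row reported with a 5 Gbp genome size corresponds to an NG50 with a 1 Gbp genome size, here is a small sketch. The `ng` helper and the toy contig lengths are hypothetical and only illustrate the definition; they are not taken from Canu.

```python
def ng(contig_lengths, genome_size, percent):
    """Length of the contig at which the running total of contig lengths,
    taken longest first, reaches `percent` of genome_size. Returns None
    if the assembly is too small to reach that threshold."""
    target = genome_size * percent / 100
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

# Toy contigs (hypothetical): 10% of a 5x larger genome is the same
# threshold as 50% of the smaller one, so the two values coincide.
toy = [40, 30, 20, 10]
print(ng(toy, genome_size=500, percent=10), ng(toy, genome_size=100, percent=50))

# The same reasoning applied to the 'sum (bp)' column of the report.
ng_sum = 500_025_153
print(f"{ng_sum / 5_000_000_000:.1%} of 5 Gbp, {ng_sum / 1_000_000_000:.1%} of 1 Gbp")

# Genome size implied by a k-mer peak at roughly 10x in the corrected reads.
peak_estimate = 12_330_792_807 / 10
print(f"peak-based estimate: {peak_estimate / 1e9:.2f} Gbp")
```

The peak position of 10x is read off the histogram by eye, so the resulting size is a ballpark figure only.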

As @brianwalenz said, the genome size is just used for basic calculations. You can see that your runs with genome sizes of 555 Gbp and 5 Gbp produced similar assemblies (870 and 873 Mbp total). Setting it correctly would just give you more meaningful stats in the report (e.g. you don't have 2x, you have 10x coverage), but it isn't worth rerunning with it set to 1 instead of 5. Given a 1gb genome, you have an input of 25x, which is relatively low, and your average input read length is <8kb, which is also not very high. This probably means that if you want to improve your assembly contiguity you'd need more/longer reads.
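The average read length mentioned above can be recomputed from the read and base counts in each [*/READS] section. A minimal sketch, with the dictionary layout chosen here for illustration:

```python
# (reads, bases) copied from the [*/READS] sections of the second report.
STAGES = {
    "correction": (3_629_764, 28_664_078_334),
    "trimming": (3_495_244, 27_246_096_664),
    "unitigging": (1_918_464, 12_330_792_807),
}

# Mean read length per stage is total bases divided by number of reads.
mean_length = {stage: bases / reads for stage, (reads, bases) in STAGES.items()}

for stage, length in mean_length.items():
    print(f"{stage:<11} {length:,.0f} bp mean read length")
```

The raw input averages just under 8 kb, and the reads that reach unitigging are shorter still.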

mortunco commented 7 years ago

@skoren I did not know anything about estimating genome size beforehand (I thought fastq file sizes could be used as an approximation, but I was deeply mistaken. Sorry!), and I can say that I learned a lot through these questions.

I will definitely give BUSCO a try.

Thank you very much again for your help!

Best, T.