marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

No contig reconstructed using HiCanu and deep-HiFi libraries. #2197

Closed deoliveira86 closed 1 year ago

deoliveira86 commented 1 year ago

Dear all,

I have recently acquired several HiFi libraries from a single metazoan specimen using the ultra-low-input (PCR-based) library protocol. I followed the recommended pre-processing steps and performed everything in house, starting from the native reads.bam file (see the command sketch after the list):

  1. Demultiplexing with lima v2.6.0;
  2. Removal of PCR duplicates with pbmarkdup v1.0.2;
  3. Extraction of HiFi reads at two predicted accuracy thresholds (QV20 and QV30) with bamtools filter 2.5.2 (parameters: "rq":">=0.99" and "rq":">=0.999", respectively). For convenience, I will call the HiFi reads with QV>=30 ultra-HiFi (uHiFi) from this point onwards.
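Roughly, the commands for these three steps were along the following lines (file and barcode names are placeholders, and the exact flags may differ slightly from what I actually ran):

lima reads.bam barcodes.fasta demux.bam --split-bam-named            # 1. demultiplex
pbmarkdup --rmdup demux.my_barcode.bam dedup.bam                     # 2. remove PCR duplicates
bamtools filter -in dedup.bam -out hifi_q20.bam  -tag "rq:>=0.99"    # 3a. HiFi  (QV>=20)
bamtools filter -in dedup.bam -out uhifi_q30.bam -tag "rq:>=0.999"   # 3b. uHiFi (QV>=30)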

Before running the assembly, I did some sanity checks with the data, namely:

  1. Since I already have a draft genome (~7K contigs, N50 ~300 kb) for this metazoan from the old PacBio CLR sequencing approach (assembled with Canu v2.2 using the recommended Sequel parameters, then polished, purged, and decontaminated with blobtools), I mapped all the HiFi reads against it with minimap2. The mapping rate was >94%, confirming that the HiFi data correspond to the organism I am interested in and that there is likely no contamination in the data.
  2. Using the HiFi and uHiFi data I generated k-mer histograms and GenomeScope2 plots to check whether the k-mer spectra agreed with my previous estimates of genome size, heterozygosity, and repeat content. I also double-checked the ploidy of my organism with smudgeplot and reconfirmed that it is likely diploid. The results were comparable to my earlier estimates (see the GenomeScope2 results from the uHiFi data below), indicating once again that the HiFi data are solid and ready to be assembled. A rough command sketch for both checks follows below the plot.

(GenomeScope2 linear plot for the uHiFi data)
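For completeness, a rough sketch of both checks (paths and the k-mer size are placeholders; this shows the general approach with meryl rather than my exact commands):

# 1. mapping rate of the HiFi reads against the old CLR draft
minimap2 -ax map-hifi clr_draft.fasta uhifi.fastq.gz | samtools sort -o uhifi_vs_draft.bam
samtools flagstat uhifi_vs_draft.bam        # the "mapped (%)" line gave >94%

# 2. k-mer histogram for GenomeScope2 (smudgeplot can reuse the same k-mer database)
meryl k=21 count output uhifi.meryl uhifi.fastq.gz
meryl histogram uhifi.meryl > uhifi.hist
genomescope2 -i uhifi.hist -o genomescope_uhifi -k 21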

To summarise some basic info about the genome:

  1. Estimated size ~1.2Gb
  2. Overall repeat content 65-70% (based on kmer spectra and also RepeatModeler2/RepeatMasker2 analyses)
  3. GC content is around 43%

With this information in hand, I ran Canu v2.2 with the following command line:

canu -p uhifi-100x -d uhifi-100x genomeSize=1.265g maxInputCoverage=100 uHiFi_libraries.*gz

Everything ran normally (and the histogram plots agree with the GenomeScope prediction - see below) until the computation of the cns jobs.

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 112048308 ********************************************************************** 0.1426 0.0032
--       3-     4  84214709 ****************************************************                   0.2100 0.0055
--       5-     7  41414799 *************************                                              0.2743 0.0087
--       8-    11  26746167 ****************                                                       0.3119 0.0116
--      12-    16  34464355 *********************                                                  0.3447 0.0155
--      17-    22  39746738 ************************                                               0.3896 0.0232
--      23-    29  31329474 *******************                                                    0.4376 0.0343
--      30-    37  32899175 ********************                                                   0.4758 0.0459
--      38-    46  39758212 ************************                                               0.5181 0.0625
--      47-    56  52719489 ********************************                                       0.5691 0.0874
--      57-    67  75183169 **********************************************                         0.6381 0.1286
--      68-    79  82957925 ***************************************************                    0.7353 0.1983
--      80-    92  53840228 *********************************                                      0.8388 0.2854
--      93-   106  21179635 *************                                                          0.9031 0.3482
--     107-   121   9546220 *****                                                                  0.9280 0.3762
--     122-   137   7306502 ****                                                                   0.9397 0.3914
--     138-   154   6133209 ***                                                                    0.9489 0.4050
--     155-   172   4588862 **                                                                     0.9566 0.4177
--     173-   191   3339133 **                                                                     0.9623 0.4284
--     192-   211   2631112 *                                                                      0.9665 0.4370
--     212-   232   2215099 *                                                                      0.9698 0.4446
--     233-   254   1866626 *                                                                      0.9726 0.4516
--     255-   277   1572602                                                                        0.9749 0.4581
--     278-   301   1360720                                                                        0.9769 0.4641
--     302-   326   1198653                                                                        0.9786 0.4698
--     327-   352   1068025                                                                        0.9801 0.4752
--     353-   379    951655                                                                        0.9815 0.4804
--     380-   407    869568                                                                        0.9827 0.4854
--     408-   436    761500                                                                        0.9838 0.4903
--     437-   466    698665                                                                        0.9848 0.4950
--     467-   497    620415                                                                        0.9856 0.4995
--     498-   529    571456                                                                        0.9864 0.5038
--     530-   562    521798                                                                        0.9871 0.5080
--     563-   596    481317                                                                        0.9878 0.5121
--     597-   631    455735                                                                        0.9884 0.5161
--     632-   667    417148                                                                        0.9890 0.5202
--     668-   704    388549                                                                        0.9895 0.5241
--     705-   742    367142                                                                        0.9900 0.5279
--     743-   781    334208                                                                        0.9905 0.5317
--     782-   821    307866                                                                        0.9909 0.5354
--
--           0 (max occurrences)
-- 69297027604 (total mers, non-unique)
--   785918395 (distinct mers, non-unique)
--           0 (unique mers)

Thousands and thousands of cns instances were submitted to the cluster, and I could see that after the bogart assembler run, no contigs were reconstructed. See the log below from the canu.report file:

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      377 sequences, total length 3793548 bp (including 0 repeats of total length 0 bp).
--   bubbles:      12 sequences, total length 108472 bp.
--   unassembled:  10955417 sequences, total length 69933679332 bp.
--
-- Contig sizes based on genome size 1.265gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--

From the report above, it is pretty clear that almost none of the HiFi reads could be assembled, which makes me worried about the quality of the data and leaves me extremely confused. From the Unitigging/Overlaps categories (below), I can see that Canu did identify more reads that could contribute to the reconstruction: although the fraction of unique reads is quite low (2.50%), the low-coverage and other categories should still be usable for assembly.

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing      76335    0.70     8305.19 +- 2218.84        917.71 +- 885.08     (bad trimming)
--   middle-hump        590592    5.39     7278.13 +- 1804.00       1906.68 +- 1404.50    (bad trimming)
--   no-5-prime        1846303   16.84     6544.61 +- 1950.29        813.72 +- 947.54     (bad trimming)
--   no-3-prime        1846718   16.84     6545.41 +- 1952.84        815.17 +- 948.58     (bad trimming)
--   
--   low-coverage      4010078   36.57     6368.77 +- 1920.89          9.66 +- 5.44       (easy to assemble, potential for lower quality consensus)
--   unique             274085    2.50     4807.11 +- 1663.13         45.57 +- 17.04      (easy to assemble, perfect, yay)
--   repeat-cont         72399    0.66     5650.88 +- 1780.46        631.22 +- 456.58     (potential for consensus errors, no impact on assembly)
--   repeat-dove           559    0.01    10542.23 +- 1764.60        402.98 +- 217.04     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat        511012    4.66     6809.10 +- 1875.63       2444.25 +- 1782.49    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   583548    5.32     6129.40 +- 1558.55                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   145277    1.32     8518.53 +- 1837.94                             (will end contigs, potential to misassemble)
--   uniq-anchor           850    0.01     8394.72 +- 1870.29       1771.10 +- 1182.45    (repeat read, with unique section, probable bad read)

Because I am using uHiFi reads the error rates should not be high, and the error-rate log (if I understood it correctly) tells me that the error rate is not an issue.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  2636067                      fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->     30.00        0.0300%
--  command line (-ef)                           ->  -----.--      ---.----%
--  command line (-eM)                           ->     30.00        0.0300%
--  mean + std.dev       2.86 +-  12 *     7.44  ->     92.16        0.0922%
--  median + mad         0.00 +-  12 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->     18.00        0.0180%  (enabled)

The best edge filtering log tells me that only a small fraction of reads have at least one edge (727,437 of 10,418,812, roughly 7%), indicating that the vast majority of my HiFi reads have no useful overlap. This is scary, and I wonder why.

--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 0.0300%, reads:
--    available to have edges:     10418812
--    with at least one edge:        727437
--  
--  At max threshold 0.0300%, reads:
--    available to have edges:     10418812
--    with at least one edge:        727437
--  
--  At tight threshold 0.0180%, reads with:
--    both edges below error threshold:    645627  (80.00% minReadsBest threshold = 581949)
--    one  edge  above error threshold:     81762
--    both edges above error threshold:        48
--    at least one edge:                   727437
--  
--  At loose threshold 0.0300%, reads with:
--    both edges below error threshold:    727437  (80.00% minReadsBest threshold = 581949)
--    one  edge  above error threshold:         0
--    both edges above error threshold:         0
--    at least one edge:                   727437
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--     55357 reads are contained
--  10845999 reads have no best edges (singleton)
--     59171 reads have only one best edge (spur) 
--              46918 are mutual best
--      5456 reads have two best edges 
--                814 have one mutual best edge
--               4461 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--     55357 reads are contained
--  10846708 reads have no best edges (singleton)
--     58588 reads have only one best edge (spur) 
--              46750 are mutual best
--      5330 reads have two best edges 
--                706 have one mutual best edge
--               4478 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--   1968468 reads have a gap in overlap coverage
--       531 reads have lopsided best edges

Finally, since almost no overlaps were identified, Canu produced essentially no output, yet for some reason it launched thousands of cns jobs, which I killed after more than 50,237 *out files had been written in the unitigging/5-consensus/ folder. Furthermore, I could see that the uhifi-full.ctgStore/ folder contains a total of 87,022 partitions. Why would cns jobs be submitted without any reconstructed contigs?

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      377 sequences, total length 3793548 bp (including 0 repeats of total length 0 bp).
--   bubbles:      12 sequences, total length 108472 bp.
--   unassembled:  10955417 sequences, total length 69933679332 bp.
--
-- Contig sizes based on genome size 1.265gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--

I investigated the unitigger.err log file, but I couldn't properly interpret the results. I will attach it here in case it is useful.

==> PARAMETERS.

Resources:
  Memory                256 GB
  Compute Threads       16

Lengths:
  Minimum read          0 bases
  Maximum read          4294967295 bases
  Minimum overlap       500 bases

Overlap Error Rates:
  Graph                 0.000 (0.030%)
  Max                   0.000 (0.030%)
  Forced                -.--- (-.---%)   (not used)

Deviations:
  Graph                 12.000
  Bubble                1.000
  Repeat                1.000

Similarity Thresholds:
  Graph                 0.000
  Bubble                0.010
  Repeat                0.010

Edge Confusion:
  Absolute              2500
  Percent               15.0000

Unitig Construction:
  Minimum intersection  500 bases
  Maxiumum placements   2 positions

Debugging Enabled:
  (none)

==> LOADING AND FILTERING OVERLAPS.

ReadInfo()-- Found    10965983 reads.

OverlapCache()-- limited to 262144MB memory (user supplied).

OverlapCache()--      83MB for read data.
OverlapCache()--     334MB for best edges.
OverlapCache()--    1087MB for tigs.
OverlapCache()--     292MB for tigs - read layouts.
OverlapCache()--     418MB for tigs - error profiles.
OverlapCache()--   65536MB for tigs - error profile overlaps.
OverlapCache()--       0MB for other processes.
OverlapCache()-- ---------
OverlapCache()--   67962MB for data structures (sum of above).
OverlapCache()-- ---------
OverlapCache()--     209MB for overlap store structure.
OverlapCache()--  193972MB for overlap data.
OverlapCache()-- ---------
OverlapCache()--  262144MB allowed.
OverlapCache()--
OverlapCache()-- Retain at least 110 overlaps/read, based on 55.31x coverage.
OverlapCache()-- Initial guess at 1159 overlaps/read.
OverlapCache()--
OverlapCache()-- Adjusting for sparse overlaps.
OverlapCache()--
OverlapCache()--               reads loading olaps          olaps               memory
OverlapCache()--   olaps/read       all      some          loaded                 free
OverlapCache()--   ----------   -------   -------     ----------- -------     --------
OverlapCache()--         1159   10943912     22071       281982940  96.09%     189669 MB
OverlapCache()--       564350   10965983         0       293457678 100.00%     189494 MB
OverlapCache()--
OverlapCache()-- Loading overlaps.
OverlapCache()--
OverlapCache()--          read from store           saved in cache
OverlapCache()--   ------------ ---------   ------------ ---------
OverlapCache()--        2758766 (000.94%)         129240 (000.04%)
OverlapCache()--        5578235 (001.90%)         270049 (000.09%)
OverlapCache()--        8245330 (002.81%)         390950 (000.13%)
OverlapCache()--       10981393 (003.74%)         508349 (000.17%)
OverlapCache()--       13723899 (004.68%)         637150 (000.22%)
OverlapCache()--       16378401 (005.58%)         746424 (000.25%)
OverlapCache()--       19038358 (006.49%)         878893 (000.30%)
OverlapCache()--       21738303 (007.41%)        1019297 (000.35%)
OverlapCache()--       24515590 (008.35%)        1168236 (000.40%)
OverlapCache()--       27197465 (009.27%)        1304170 (000.44%)
OverlapCache()--       29820894 (010.16%)        1442618 (000.49%)
OverlapCache()--       32406371 (011.04%)        1575246 (000.54%)
OverlapCache()--       35089513 (011.96%)        1722027 (000.59%)
OverlapCache()--       37730398 (012.86%)        1854388 (000.63%)
OverlapCache()--       40366219 (013.76%)        1987546 (000.68%)
OverlapCache()--       42975704 (014.64%)        2122715 (000.72%)
OverlapCache()--       45641718 (015.55%)        2260023 (000.77%)
OverlapCache()--       48338962 (016.47%)        2402989 (000.82%)
OverlapCache()--       50860909 (017.33%)        2531854 (000.86%)
OverlapCache()--       53565546 (018.25%)        2673960 (000.91%)
OverlapCache()--       56194008 (019.15%)        2803905 (000.96%)
OverlapCache()--       58869710 (020.06%)        2937590 (001.00%)
OverlapCache()--       61556779 (020.98%)        3073602 (001.05%)
OverlapCache()--       64232073 (021.89%)        3203548 (001.09%)
OverlapCache()--       66892384 (022.79%)        3329276 (001.13%)
OverlapCache()--       69616252 (023.72%)        3455098 (001.18%)
OverlapCache()--       72235382 (024.62%)        3576747 (001.22%)
OverlapCache()--       74841915 (025.50%)        3701101 (001.26%)
OverlapCache()--       77381970 (026.37%)        3823115 (001.30%)
OverlapCache()--       79965223 (027.25%)        3945554 (001.34%)
OverlapCache()--       82527080 (028.12%)        4067129 (001.39%)
OverlapCache()--       85166475 (029.02%)        4197837 (001.43%)
OverlapCache()--       87767357 (029.91%)        4327761 (001.47%)
OverlapCache()--       90371151 (030.80%)        4448018 (001.52%)
OverlapCache()--       93039733 (031.70%)        4578939 (001.56%)
OverlapCache()--       95653898 (032.60%)        4698881 (001.60%)
OverlapCache()--       98263695 (033.48%)        4818728 (001.64%)
OverlapCache()--      100917778 (034.39%)        4944699 (001.68%)
OverlapCache()--      103575587 (035.29%)        5067806 (001.73%)
OverlapCache()--      106222513 (036.20%)        5193168 (001.77%)
OverlapCache()--      108938832 (037.12%)        5325683 (001.81%)
OverlapCache()--      111704971 (038.07%)        5456138 (001.86%)
OverlapCache()--      114516240 (039.02%)        5590399 (001.91%)
OverlapCache()--      117325996 (039.98%)        5725050 (001.95%)
OverlapCache()--      120117973 (040.93%)        5859989 (002.00%)
OverlapCache()--      122859778 (041.87%)        5987615 (002.04%)
OverlapCache()--      125652547 (042.82%)        6119541 (002.09%)
OverlapCache()--      128424375 (043.76%)        6256129 (002.13%)
OverlapCache()--      131125411 (044.68%)        6384923 (002.18%)
OverlapCache()--      133863522 (045.62%)        6516510 (002.22%)
OverlapCache()--      136662205 (046.57%)        6655635 (002.27%)
OverlapCache()--      139457317 (047.52%)        6792329 (002.31%)
OverlapCache()--      142220064 (048.46%)        6928200 (002.36%)
OverlapCache()--      144998972 (049.41%)        7059304 (002.41%)
OverlapCache()--      147797092 (050.36%)        7192573 (002.45%)
OverlapCache()--      150601438 (051.32%)        7324925 (002.50%)
OverlapCache()--      153355406 (052.26%)        7454633 (002.54%)
OverlapCache()--      156119270 (053.20%)        7590452 (002.59%)
OverlapCache()--      158763395 (054.10%)        7718434 (002.63%)
OverlapCache()--      161404652 (055.00%)        7843559 (002.67%)
OverlapCache()--      164012493 (055.89%)        7963008 (002.71%)
OverlapCache()--      166613337 (056.78%)        8084105 (002.75%)
OverlapCache()--      169261309 (057.68%)        8211127 (002.80%)
OverlapCache()--      171954790 (058.60%)        8336457 (002.84%)
OverlapCache()--      174520149 (059.47%)        8455262 (002.88%)
OverlapCache()--      177140268 (060.36%)        8581936 (002.92%)
OverlapCache()--      179778156 (061.26%)        8713065 (002.97%)
OverlapCache()--      182335192 (062.13%)        8833687 (003.01%)
OverlapCache()--      184940657 (063.02%)        8954635 (003.05%)
OverlapCache()--      187505568 (063.90%)        9073052 (003.09%)
OverlapCache()--      190154039 (064.80%)        9199689 (003.13%)
OverlapCache()--      192752167 (065.68%)        9322149 (003.18%)
OverlapCache()--      195419049 (066.59%)        9449628 (003.22%)
OverlapCache()--      198000298 (067.47%)        9571465 (003.26%)
OverlapCache()--      200592130 (068.35%)        9694932 (003.30%)
OverlapCache()--      203216574 (069.25%)        9812485 (003.34%)
OverlapCache()--      205858123 (070.15%)        9939886 (003.39%)
OverlapCache()--      208453372 (071.03%)       10065660 (003.43%)
OverlapCache()--      211110404 (071.94%)       10187754 (003.47%)
OverlapCache()--      213845975 (072.87%)       10320866 (003.52%)
OverlapCache()--      216511334 (073.78%)       10448993 (003.56%)
OverlapCache()--      219184091 (074.69%)       10570501 (003.60%)
OverlapCache()--      221818024 (075.59%)       10688045 (003.64%)
OverlapCache()--      224425105 (076.48%)       10809785 (003.68%)
OverlapCache()--      227142060 (077.40%)       10939544 (003.73%)
OverlapCache()--      229745705 (078.29%)       11062481 (003.77%)
OverlapCache()--      232395043 (079.19%)       11187703 (003.81%)
OverlapCache()--      235073552 (080.10%)       11314923 (003.86%)
OverlapCache()--      237685078 (080.99%)       11434403 (003.90%)
OverlapCache()--      240376443 (081.91%)       11560408 (003.94%)
OverlapCache()--      243054761 (082.82%)       11687282 (003.98%)
OverlapCache()--      245762510 (083.75%)       11811758 (004.03%)
OverlapCache()--      248441840 (084.66%)       11936924 (004.07%)
OverlapCache()--      251093849 (085.56%)       12058156 (004.11%)
OverlapCache()--      253753972 (086.47%)       12192422 (004.15%)
OverlapCache()--      256450132 (087.39%)       12329232 (004.20%)
OverlapCache()--      259098548 (088.29%)       12457532 (004.25%)
OverlapCache()--      261804737 (089.21%)       12592919 (004.29%)
OverlapCache()--      264560476 (090.15%)       12726555 (004.34%)
OverlapCache()--      267261120 (091.07%)       12860460 (004.38%)
OverlapCache()--      269967258 (092.00%)       12996233 (004.43%)
OverlapCache()--      272618784 (092.90%)       13125504 (004.47%)
OverlapCache()--      275340609 (093.83%)       13261406 (004.52%)
OverlapCache()--      278076032 (094.76%)       13398750 (004.57%)
OverlapCache()--      280835745 (095.70%)       13539659 (004.61%)
OverlapCache()--      283503084 (096.61%)       13673418 (004.66%)
OverlapCache()--      286213556 (097.53%)       13818669 (004.71%)
OverlapCache()--      288919193 (098.45%)       13950277 (004.75%)
OverlapCache()--      291612452 (099.37%)       14082570 (004.80%)
OverlapCache()--   ------------ ---------   ------------ ---------
OverlapCache()--      293457678 (100.00%)       14178812 (004.83%)
OverlapCache()--
OverlapCache()-- Ignored 6274 duplicate overlaps.
OverlapCache()--
OverlapCache()-- Symmetrizing overlaps.
OverlapCache()--   Finding missing twins.
OverlapCache()--   In 14178812 overlaps:
OverlapCache()--     Found 250091 overlaps with non-symmetric error rates.
OverlapCache()--     Found 1678380 overlaps with missing twins.
OverlapCache()--     Dropped 0 weak missing-twin overlaps.
OverlapCache()--   Shifting overlaps.
OverlapCache()--   Adding missing twins.
OverlapCache()--   Sorting overlaps.
OverlapCache()--   Checking overlap symmetry.
OverlapCache()--   Finished.

BestOverlapGraph()-- Computing Best Overlap Graph.
BestOverlapGraph()-- Allocating best edges (334MB).
BestOverlapGraph()-- Filtering high error edges.
BestOverlapGraph()--   Ignore overlaps with more than 0.018000% error.
BestOverlapGraph()-- Filtering reads with a gap in overlap coverage.
BestOverlapGraph()--   1968468 reads removed.
BestOverlapGraph()-- Filtering reads with lopsided best edges (more than 25.00% different).
BestOverlapGraph()--   531 reads have lopsided edges.
BestOverlapGraph()-- Filtering spur reads.
BestOverlapGraph()--   Initial        8936289 terminal  spur reads -  8940685/8940518  5'/3' spur path reads.
BestOverlapGraph()--   Iteration 1 -  8936289 terminal  spur reads -  8940664/8940499  5'/3' spur path reads -       63/288      edges changed to avoid spur path.
BestOverlapGraph()--   Iteration 2 -  8936289 terminal  spur reads -  8940660/8940497  5'/3' spur path reads -        0/2        edges changed to avoid spur path.
BestOverlapGraph()--   Iteration 3 -  8936289 terminal  spur reads -  8940660/8940495  5'/3' spur path reads -       12/52       edges changed to avoid spur path.
BestOverlapGraph()--   Iteration 4 -  8936289 terminal  spur reads -  8940660/8940495  5'/3' spur path reads -        0/0        edges changed to avoid spur path.
BestOverlapGraph()--   Final          8941610 confirmed spur reads -  8940660/8940495  5'/3' spur path reads.

==> BUILDING GREEDY TIGS.

breakSingletonTigs()-- Removed 0 singleton tigs; reads are now unplaced.
optimizePositions()-- Optimizing read positions for 10965984 reads in 3663 tigs, with 16 threads.
optimizePositions()--   Allocating scratch space for 10965984 reads (342687 KB).
optimizePositions()--   Initializing positions with 16 threads.
optimizePositions()--   Recomputing positions, iteration 1, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965983 reads
optimizePositions()--     changed:        1 reads
optimizePositions()--   Recomputing positions, iteration 2, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965982 reads
optimizePositions()--     changed:        2 reads
optimizePositions()--   Recomputing positions, iteration 3, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965983 reads
optimizePositions()--     changed:        1 reads
optimizePositions()--   Recomputing positions, iteration 4, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965983 reads
optimizePositions()--     changed:        1 reads
optimizePositions()--   Recomputing positions, iteration 5, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965983 reads
optimizePositions()--     changed:        1 reads
optimizePositions()--   Expanding short reads with 16 threads.
optimizePositions()--   Updating positions.
optimizePositions()--   Finished.
splitDiscontinuous()-- Tested 3662 tigs, split none.
detectSpur() done.
tested            0
nEdges      5'    0   3'    0
nPotential        0         0
nVerified         0         0

==> PLACE CONTAINED READS.

computeErrorProfiles()-- Computing error profiles for 3663 tigs, with 16 threads.
computeErrorProfiles()-- Finished.

placeContains()-- placing 55357 contained and 10898737 unplaced reads, with 16 threads.
placeContains()-- Placed 1635 contained reads and 315 unplaced reads.
placeContains()-- Failed to place 53722 contained reads (too high error suspected) and 10898422 unplaced reads (lack of overlaps suspected).
optimizePositions()-- Optimizing read positions for 10965984 reads in 3663 tigs, with 16 threads.
optimizePositions()--   Allocating scratch space for 10965984 reads (342687 KB).
optimizePositions()--   Initializing positions with 16 threads.
optimizePositions()--   Recomputing positions, iteration 1, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965958 reads
optimizePositions()--     changed:       26 reads
optimizePositions()--   Recomputing positions, iteration 2, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965972 reads
optimizePositions()--     changed:       12 reads
optimizePositions()--   Recomputing positions, iteration 3, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965978 reads
optimizePositions()--     changed:        6 reads
optimizePositions()--   Recomputing positions, iteration 4, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965979 reads
optimizePositions()--     changed:        5 reads
optimizePositions()--   Recomputing positions, iteration 5, with 16 threads.
optimizePositions()--     Reset zero.
optimizePositions()--     Checking convergence.
optimizePositions()--     converged: 10965980 reads
optimizePositions()--     changed:        4 reads
optimizePositions()--   Expanding short reads with 16 threads.
optimizePositions()--   Updating positions.
optimizePositions()--   Finished.
splitDiscontinuous()-- Tested 3662 tigs, split none.

==> MERGE ORPHANS.

computeErrorProfiles()-- Computing error profiles for 3663 tigs, with 16 threads.
computeErrorProfiles()-- Finished.

findPotentialOrphans()-- working on 3663 tigs.
findPotentialOrphans()-- found 0 potential orphans.
mergeOrphans()-- flagged       0        bubble tigs with 0 reads
mergeOrphans()-- placed        0 unique orphan tigs with 0 reads
mergeOrphans()-- shattered     0 repeat orphan tigs with 0 reads
mergeOrphans()-- ignored       0               tigs with 0 reads; failed to place
mergeOrphans()--

==> MARK SIMPLE BUBBLES.
    using 0.010000 user-specified threshold

findPotentialOrphans()-- working on 3663 tigs.
findPotentialOrphans()-- found 15 potential orphans.
mergeOrphans()-- flagged      12        bubble tigs with 43 reads
mergeOrphans()-- placed        0 unique orphan tigs with 0 reads
mergeOrphans()-- shattered     0 repeat orphan tigs with 0 reads
mergeOrphans()-- ignored       0               tigs with 0 reads; failed to place
mergeOrphans()--
classifyAsUnassembled()--      0 tigs           0 bases -- singleton
classifyAsUnassembled()--      0 tigs           0 bases -- too few reads        (< 2 reads)
classifyAsUnassembled()--      0 tigs           0 bases -- too short            (< 0 bp)
classifyAsUnassembled()--      0 tigs           0 bases -- single spanning read (> 1.000000 tig length)
classifyAsUnassembled()--   3280 tigs    37994669 bases -- low coverage         (> 0.500000 tig length at < 3 coverage)
classifyAsUnassembled()--    389 tigs     3902020 bases -- acceptable contigs

==> GENERATING ASSEMBLY GRAPH.

computeErrorProfiles()-- Computing error profiles for 3663 tigs, with 16 threads.
computeErrorProfiles()-- Finished.

AssemblyGraph()-- allocating vectors for placements, 501.983MB
AssemblyGraph()-- finding edges for 13839 reads (1635 contained), ignoring 10952144 unplaced reads, with 16 threads.
AssemblyGraph()-- building reverse edges.
AssemblyGraph()-- build complete.

==> BREAK REPEATS.

computeErrorProfiles()-- Computing error profiles for 3663 tigs, with 16 threads.
computeErrorProfiles()-- Finished.

==> CLEANUP MISTAKES.

splitDiscontinuous()-- Tested 3662 tigs, split none.
promoteToSingleton()-- Moved 10952144 unplaced reads to singleton tigs.
splitDiscontinuous()-- Tested 3662 tigs, split none.
promoteToSingleton()-- Moved 0 unplaced reads to singleton tigs.

==> CLEANUP GRAPH.

==> GENERATE OUTPUTS.

Bye.

Does anyone have an idea what is happening? My guess is that playing around with bogart parameters (e.g., the overlap generation and overlap processing limits) might help me produce an assembly (it cannot get any worse, since I had no output). Additionally, I am now running several instances of hifiasm to see how that software performs with the data. I imagine that if the hifiasm results (if any) are also not great, something must be wrong with the data.

Furthermore, does anyone have an idea what could have gone wrong with the library preparation? Since the quality of the libraries is good (in terms of QV) and my HiFi read-length N50 is not bad (~10 kb), the source of the issue presumably lies in the ultra-low-input library prep (I guess?). The extremely high repeat content of the genome I am working on can also cause assembly issues, but then again, people assemble through telomeric repeats with HiFi data, so repeats alone should not prevent getting at least some assembled sequences.

I appreciate the help. Best, André

skoren commented 1 year ago

I think your data is lower quality (or at least in terms of how well the reads match each other). This could be due to the PCR amplification beforehand. There are essentially no overlaps and nothing to reconstruct at typical HiFi error rates. That is why you had so many consensus jobs: there were lots of very short nodes composed of just a few (or one) reads.

I'd suggest turning on trimming and increasing the error rate (correctedErrorRate=0.025 -untrimmed). You could try several error rates; the default is 0.01, so I'd try the above plus maybe 0.015, 0.03, and 0.05.
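For example, something like the following (the -pacbio-hifi read-type flag and file names are assumed from your original command):

canu -p uhifi-e025 -d uhifi-e025 genomeSize=1.265g maxInputCoverage=100 \
  correctedErrorRate=0.025 -untrimmed -pacbio-hifi uHiFi_libraries.*gz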

deoliveira86 commented 1 year ago

Hello @skoren,

thanks once more for the prompt answer. I will wait for my hifiasm results and compile all the information I have before contacting the sequencing facility to try to pinpoint where the issue is. Meanwhile I will give your suggested canu parameters a go. I will close this issue for now and update you with the results afterwards. Maybe it will be helpful to someone facing the same difficulties I am facing right now.

Best, André

deoliveira86 commented 1 year ago

Hello @skoren,

I did some digging, and your comment about the data being lower quality than a standard HiFi library is correct. When I increased the error rate to 2.5% and added the -untrimmed parameter, I got an assembly. A really fragmented one (154k contigs, N50 of 35 kb), but at least some things were reconstructed. Please see the overlap categories from the trimmed run with the 0.025 error rate:

trimmed (2.5% error rate)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing      16833    0.16     6930.20 +- 2279.10        463.30 +- 606.55     (bad trimming)
--   middle-hump          5610    0.05     4084.91 +- 1920.38       1046.51 +- 1237.10    (bad trimming)
--   no-5-prime          36595    0.34     5419.54 +- 2467.18        333.23 +- 558.40     (bad trimming)
--   no-3-prime          17919    0.17     5829.36 +- 2414.60        403.62 +- 635.87     (bad trimming)
--   
--   low-coverage       343167    3.22     4246.45 +- 2115.61         10.35 +- 4.81       (easy to assemble, potential for lower quality consensus)
--   unique            4256967   39.97     5115.65 +- 2215.88         53.59 +- 15.86      (easy to assemble, perfect, yay)
--   repeat-cont       1149297   10.79     4226.55 +- 2004.19       1530.70 +- 1562.14    (potential for consensus errors, no impact on assembly)
--   repeat-dove          6431    0.06     9358.10 +- 2267.42        773.91 +- 787.84     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat       1295225   12.16     6656.00 +- 2157.41       2170.91 +- 1862.89    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont  2609816   24.51     5479.84 +- 1860.06                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   705781    6.63     8237.57 +- 1966.66                             (will end contigs, potential to misassemble)
--   uniq-anchor        198710    1.87     6701.58 +- 2120.72       2449.79 +- 1960.48    (repeat read, with unique section, probable bad read) 

As you can see, trimming greatly decreases the middle-missing, middle-hump, no-5-prime, and no-3-prime percentages, while increasing the low-coverage and unique categories. So this conclusively shows (at least to me) that my data contain higher sequencing error rates than expected. To check how high those error rates are, I mapped the reads with pbmm2 against a high-quality draft reference I have in-house and calculated the error rate from the gap-compressed sequence identity. If you consider my reference as "nearly" perfect (after polishing with arrow at 70x CLR coverage), the estimated error rate in my HiFi libraries is ~3.35%. I also checked the error rate of the trimmed reads output by canu and it is about ~3.32%, which is still quite high even after trimming. Is this normal?
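Roughly, the idea behind that estimate (sketched here with minimap2's gap-compressed divergence tag de:f: rather than the pbmm2 workflow I actually used; paths are placeholders and the mean is a simple unweighted average over alignments):

minimap2 -cx map-hifi polished_draft.fasta hifi.fastq.gz > hifi_vs_ref.paf
awk '{for(i=13;i<=NF;i++) if($i ~ /^de:f:/){sub(/^de:f:/,"",$i); s+=$i; n++}}
     END{printf "mean gap-compressed divergence: %.4f (%.2f%%)\n", s/n, 100*s/n}' hifi_vs_ref.paf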

Now that I have a clearer picture of what is happening with my data, I will set up another canu run with the parameters correctedErrorRate=0.045 -untrimmed maxInputCoverage=100. Do you recommend any other parameters, or an even higher error rate?

Best regards, André

skoren commented 1 year ago

Trimming won't improve the read quality; it just removes bad sequence on the read ends that doesn't have support from other reads. If the whole read has a uniform error rate of 3.5%, it will stay that way after trimming. I think you'd need the corrected error rate to be higher than 4.5%; you want to approximately double your read error rate, so 6.5% is probably more reasonable. You could set the trimming one to 4.5 and the final one to 6.5 (correctedErrorRate=0.045 utgOvlErrorRate=0.065) or just set corrected to 6.5 (correctedErrorRate=0.065).
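As command sketches (other options carried over from your earlier runs; file names are placeholders):

canu -p asm -d asm-trim045-utg065 genomeSize=1.265g maxInputCoverage=100 \
  correctedErrorRate=0.045 utgOvlErrorRate=0.065 -untrimmed -pacbio-hifi hifi_libraries.*gz

canu -p asm -d asm-e065 genomeSize=1.265g maxInputCoverage=100 \
  correctedErrorRate=0.065 -untrimmed -pacbio-hifi hifi_libraries.*gz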

deoliveira86 commented 1 year ago

Hello @skoren,

thanks once more for the feedback. I will run canu with your parameters and come back later to report the outcome. I will close this issue now.

deoliveira86 commented 1 year ago

Dear @skoren,

After months of troubleshooting and many assembly runs, I identified, with the help of PacBio bioinformaticians, what was causing the issues with the assemblies: the use of the ultra-low-input HiFi protocol for sequencing my target species. The genome is relatively large and full of repeats, and due to the PCR-based nature of the protocol the reads did not cover the whole breadth of the genome, causing the massive fragmentation and suboptimal results.

We have now sequenced five different individuals using the low-input protocol, and the per-individual results are much better in terms of completeness and contiguity. However, since each low-input SMRT cell yields around 10-13x coverage, the draft assemblies are a bit fragmented, and I would like to pool all the libraries together to obtain a more contiguous genomic reference. My attempts so far with 2, 3, 4, and 5 pooled libraries in hifiasm have not been very successful: the pooled genome assembly is highly fragmented and the reconstructed genome size is much bigger than expected. I am assuming this might be due to the heterozygosity and the generation of a multitude of haplotigs during the reconstruction. I am currently playing around with hifiasm and adjusting its purge-duplicates parameters to see if I get any improvement.

Additionally, I would like to try HiCanu as well, but I wonder if I will face the same issue I am having with hifiasm. Is there a way to generate a "squashed" haploid assembly that ignores the different haplotigs and the heterozygosity by fine-tuning HiCanu parameters? Or must the haplotig purging be done separately with third-party software like purge_dups?
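For reference, the generic purge_dups recipe I would otherwise run separately (adapted from its README; file names are placeholders):

minimap2 -x map-hifi draft.fasta hifi.fastq.gz | gzip -c > reads_vs_asm.paf.gz
pbcstat reads_vs_asm.paf.gz                   # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs 2> calcuts.log
split_fa draft.fasta > draft.split.fa
minimap2 -x asm5 -DP draft.split.fa draft.split.fa | gzip -c > draft.split.self.paf.gz
purge_dups -2 -T cutoffs -c PB.base.cov draft.split.self.paf.gz > dups.bed 2> purge_dups.log
get_seqs -e dups.bed draft.fasta              # writes purged.fa and hap.fa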

Below you can find the GenomeScope plots for the five pooled libraries:

(GenomeScope2 plot for the five pooled libraries)

Best regards and thanks again for the help, André

skoren commented 1 year ago

You'd likely face similar issues. You can increase the error rate in HiCanu, as you did previously, but that loses one of the main advantages of HiFi data (accuracy). You'd also end up with an assembly mixing different individuals throughout the contigs, which is probably not what you want. Your best option is probably to run the samples separately, then compare the assemblies and take the best representation of each genomic region from across the samples. Random luck or coverage variation might mean one region of the genome is better assembled in one sample than another. That is, chr1 might come from sample 1 while chr2 comes from sample 2. This would still result in a final assembly mixing individuals, but at least those switches would be outside the contigs (the contigs would only mix between the two haplotypes within an individual).

deoliveira86 commented 1 year ago

Hello @skoren,

Thanks for answering so fast and for the advice. I agree that I have a better chance of increasing the contiguity of my reconstructions by post-processing the individual assemblies rather than trying to assemble all five pooled samples together. Do you have any suggestion for a tool that would recognize these homologous genomic regions and extract the best representative? I am aware of RagTag (https://github.com/malonge/RagTag); however, I am unsure whether I can provide multiple assemblies at once, so most likely I need to do this in a pairwise, stepwise manner.

skoren commented 1 year ago

I am not aware of a tool to do the multi-way analysis, no. I would suggest picking one assembly (the most continuous/accurate by QV) and aligning the others against it.
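For example, something like the following (file names are placeholders; the asm5/asm10/asm20 preset choice depends on how divergent the individuals are):

minimap2 -cx asm20 best_assembly.fasta sample2_assembly.fasta > sample2_vs_best.paf
# then inspect the alignments per region, e.g. with a dotplot tool (D-GENIES, pafr, etc.)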

deoliveira86 commented 1 year ago

Hello @skoren, okay. I will try to implement your suggestions, and as soon as I have an outcome I will let you know. I am closing this issue now. Cheers!