marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Possible problem with consensus step #2304

Closed jacopoM28 closed 2 months ago

jacopoM28 commented 2 months ago

Dear Canu developers,

I am working on a diploid insect genome sequenced with HiFi reads on a Sequel IIe platform. GenomeScope estimated a genome size of 380 Mb and low heterozygosity, after also including highly abundant k-mers (maximum k-mer count of 5,000,000).

The genome appears to be quite repetitive, and preliminary analyses of the reads suggest that about 24% of the read data could consist of a single tandem repeat family.

Canu version 2.2 was installed via Conda and run with default settings:

canu -p Fpar_Canu_asm -d . genomeSize=380000000 -pacbio-hifi

The final assembly size, not counting bubbles (566 Mb), was much greater than the GenomeScope estimate, and the same tandem repeat previously identified in the reads made up 40% of the assembly. Similar results were also obtained with hifiasm.

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      628 sequences, total length 566183555 bp (including 1115 repeats of total length 94215999 bp).
--   bubbles:      3275 sequences, total length 303607841 bp.
--   unassembled:  72009 sequences, total length 1035819470 bp.
--
-- Contig sizes based on genome size 380mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    12326352             3    49046055
--     20     9221623             6    77855771
--     30     6499111            11   115592538
--     40     6043323            17   153232310
--     50     5638511            24   194129318
--     60     5193466            31   231280351
--     70     4597142            39   269737412
--     80     3847070            48   307463648
--     90     2889455            59   342859239
--    100     2460160            73   380025907
--    110     2058417            90   418018804
--    120     1566445           112   456422575
--    130      976921           142   494958315
--    140      379219           200   532024760
--

On closer inspection of the Canu report, it seems that the consensus step greatly increased the assembly size compared to the UNITIGGING/CONTIGS step. Is this normal?

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      628 sequences, total length 410492251 bp (including 1115 repeats of total length 68376565 bp).
--   bubbles:      3275 sequences, total length 219107730 bp.
--   unassembled:  72009 sequences, total length 759100931 bp.
--
-- Contig sizes based on genome size 380mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     7351709             4    43007550
--     20     5130461            10    78891375
--     30     4317331            18   115044419
--     40     3897477            28   155834774
--     50     3359697            38   191857088
--     60     2714253            50   228105730
--     70     1879178            68   266524598
--     80     1467629            91   304760759
--     90      995755           123   342798078
--    100      356921           182   380259335
--

Considering the low genome-wide heterozygosity but the apparent huge coverage of a single tandem repeat family, is it possible that the tandem repeat arrays are being artificially extended due to extreme haplotypic variations within these regions?

Here is the complete Canu report:

[TRIMMING/READS]
--
-- In sequence store './Fpar_Canu_asm.seqStore':
--   Found 1581448 reads.
--   Found 22357942731 bases (58.83 times coverage).
--    Histogram of corrected reads:
--    
--    G=22357942731                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        21991     92523   2235807233  ||       1282-2204          453|-
--    00020        19557    200762   4471596534  ||       2205-3127         2759|--
--    00030        17876    320571   6707384820  ||       3128-4050        10835|-----
--    00040        16552    450671   8943185908  ||       4051-4973        26687|-------------
--    00050        15415    590696  11178972002  ||       4974-5896        36319|-----------------
--    00060        14354    740995  13414766501  ||       5897-6819        37671|------------------
--    00070        13273    902869  15650562091  ||       6820-7742        39643|-------------------
--    00080        12024   1079441  17886355276  ||       7743-8665        45273|---------------------
--    00090        10172   1279694  20122150971  ||       8666-9588        56493|--------------------------
--    00100         1282   1581447  22357942731  ||       9589-10511       75787|-----------------------------------
--    001.000x             1581448  22357942731  ||      10512-11434       96889|---------------------------------------------
--                                               ||      11435-12357      117748|------------------------------------------------------
--                                               ||      12358-13280      133203|-------------------------------------------------------------
--                                               ||      13281-14203      138568|---------------------------------------------------------------
--                                               ||      14204-15126      132998|-------------------------------------------------------------
--                                               ||      15127-16049      120645|-------------------------------------------------------
--                                               ||      16050-16972      104059|------------------------------------------------
--                                               ||      16973-17895       86517|----------------------------------------
--                                               ||      17896-18818       71267|---------------------------------
--                                               ||      18819-19741       57389|---------------------------
--                                               ||      19742-20664       46538|----------------------
--                                               ||      20665-21587       37427|------------------
--                                               ||      21588-22510       29632|--------------
--                                               ||      22511-23433       23634|-----------
--                                               ||      23434-24356       18225|---------
--                                               ||      24357-25279       13314|-------
--                                               ||      25280-26202        9144|-----
--                                               ||      26203-27125        5877|---
--                                               ||      27126-28048        3373|--
--                                               ||      28049-28971        1698|-
--                                               ||      28972-29894         781|-
--                                               ||      29895-30817         290|-
--                                               ||      30818-31740         104|-
--                                               ||      31741-32663          77|-
--                                               ||      32664-33586          37|-
--                                               ||      33587-34509          21|-
--                                               ||      34510-35432          14|-
--                                               ||      35433-36355          13|-
--                                               ||      36356-37278           9|-
--                                               ||      37279-38201           5|-
--                                               ||      38202-39124           6|-
--                                               ||      39125-40047           7|-
--                                               ||      40048-40970           7|-
--                                               ||      40971-41893           1|-
--                                               ||      41894-42816           1|-
--                                               ||      42817-43739           1|-
--                                               ||      43740-44662           4|-
--                                               ||      44663-45585           0|
--                                               ||      45586-46508           3|-
--                                               ||      46509-47431           2|-
--

[UNITIGGING/READS]
--
-- In sequence store './Fpar_Canu_asm.seqStore':
--   Found 1581448 reads.
--   Found 22357942731 bases (58.83 times coverage).
--    Histogram of corrected-trimmed reads:
--    
--    G=22357942731                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        21991     92523   2235807233  ||       1282-2204          453|-
--    00020        19557    200762   4471596534  ||       2205-3127         2759|--
--    00030        17876    320571   6707384820  ||       3128-4050        10835|-----
--    00040        16552    450671   8943185908  ||       4051-4973        26687|-------------
--    00050        15415    590696  11178972002  ||       4974-5896        36319|-----------------
--    00060        14354    740995  13414766501  ||       5897-6819        37671|------------------
--    00070        13273    902869  15650562091  ||       6820-7742        39643|-------------------
--    00080        12024   1079441  17886355276  ||       7743-8665        45273|---------------------
--    00090        10172   1279694  20122150971  ||       8666-9588        56493|--------------------------
--    00100         1282   1581447  22357942731  ||       9589-10511       75787|-----------------------------------
--    001.000x             1581448  22357942731  ||      10512-11434       96889|---------------------------------------------
--                                               ||      11435-12357      117748|------------------------------------------------------
--                                               ||      12358-13280      133203|-------------------------------------------------------------
--                                               ||      13281-14203      138568|---------------------------------------------------------------
--                                               ||      14204-15126      132998|-------------------------------------------------------------
--                                               ||      15127-16049      120645|-------------------------------------------------------
--                                               ||      16050-16972      104059|------------------------------------------------
--                                               ||      16973-17895       86517|----------------------------------------
--                                               ||      17896-18818       71267|---------------------------------
--                                               ||      18819-19741       57389|---------------------------
--                                               ||      19742-20664       46538|----------------------
--                                               ||      20665-21587       37427|------------------
--                                               ||      21588-22510       29632|--------------
--                                               ||      22511-23433       23634|-----------
--                                               ||      23434-24356       18225|---------
--                                               ||      24357-25279       13314|-------
--                                               ||      25280-26202        9144|-----
--                                               ||      26203-27125        5877|---
--                                               ||      27126-28048        3373|--
--                                               ||      28049-28971        1698|-
--                                               ||      28972-29894         781|-
--                                               ||      29895-30817         290|-
--                                               ||      30818-31740         104|-
--                                               ||      31741-32663          77|-
--                                               ||      32664-33586          37|-
--                                               ||      33587-34509          21|-
--                                               ||      34510-35432          14|-
--                                               ||      35433-36355          13|-
--                                               ||      36356-37278           9|-
--                                               ||      37279-38201           5|-
--                                               ||      38202-39124           6|-
--                                               ||      39125-40047           7|-
--                                               ||      40048-40970           7|-
--                                               ||      40971-41893           1|-
--                                               ||      41894-42816           1|-
--                                               ||      42817-43739           1|-
--                                               ||      43740-44662           4|-
--                                               ||      44663-45585           0|
--                                               ||      45586-46508           3|-
--                                               ||      46509-47431           2|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2   1233564 *                                                                      0.0082 0.0002
--       3-     4    421104                                                                        0.0101 0.0002
--       5-     7    196194                                                                        0.0115 0.0003
--       8-    11    210488                                                                        0.0126 0.0003
--      12-    16    553530                                                                        0.0142 0.0005
--      17-    22   1958071 **                                                                     0.0186 0.0011
--      23-    29   4554144 ******                                                                 0.0338 0.0041
--      30-    37   5595433 *******                                                                0.0650 0.0121
--      38-    46  17963577 ************************                                               0.1039 0.0248
--      47-    56  51287149 ********************************************************************** 0.2417 0.0819
--      57-    67  44304000 ************************************************************           0.5933 0.2564
--      68-    79   9132803 ************                                                           0.8618 0.4130
--      80-    92   1491619 **                                                                     0.9112 0.4466
--      93-   106   2301418 ***                                                                    0.9209 0.4545
--     107-   121   2415352 ***                                                                    0.9365 0.4693
--     122-   137   1141190 *                                                                      0.9520 0.4860
--     138-   154    759693 *                                                                      0.9591 0.4947
--     155-   172    778635 *                                                                      0.9641 0.5016
--     173-   191    544951                                                                        0.9692 0.5095
--     192-   211    419249                                                                        0.9727 0.5155
--     212-   232    388069                                                                        0.9755 0.5208
--     233-   254    307386                                                                        0.9781 0.5261
--     255-   277    267410                                                                        0.9801 0.5308
--     278-   301    228463                                                                        0.9818 0.5352
--     302-   326    198311                                                                        0.9833 0.5393
--     327-   352    179190                                                                        0.9846 0.5431
--     353-   379    159669                                                                        0.9858 0.5469
--     380-   407    135054                                                                        0.9869 0.5505
--     408-   436    123679                                                                        0.9877 0.5538
--     437-   466    116531                                                                        0.9886 0.5571
--     467-   497    107164                                                                        0.9893 0.5604
--     498-   529     97963                                                                        0.9900 0.5636
--     530-   562     83256                                                                        0.9907 0.5667
--     563-   596     76936                                                                        0.9912 0.5695
--     597-   631     73162                                                                        0.9917 0.5723
--     632-   667     61755                                                                        0.9922 0.5751
--     668-   704     54391                                                                        0.9926 0.5776
--     705-   742     50160                                                                        0.9930 0.5799
--     743-   781     46965                                                                        0.9933 0.5821
--     782-   821     43019                                                                        0.9936 0.5844
--
--           0 (max occurrences)
-- 16043963206 (total mers, non-unique)
--   150980207 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing       2917    0.18    11286.82 +- 5407.80        972.69 +- 1284.54    (bad trimming)
--   middle-hump           108    0.01    13648.19 +- 3391.43       5737.93 +- 2901.23    (bad trimming)
--   no-5-prime           2655    0.17    10343.64 +- 4200.39       2213.46 +- 2759.85    (bad trimming)
--   no-3-prime           2928    0.19     9959.70 +- 4083.73       1941.45 +- 2651.04    (bad trimming)
--   
--   low-coverage        40629    2.57     9871.96 +- 3373.37         12.68 +- 3.57       (easy to assemble, potential for lower quality consensus)
--   unique            1208868   76.44    10043.76 +- 3414.36         49.06 +- 12.49      (easy to assemble, perfect, yay)
--   repeat-cont         29156    1.84     9124.01 +- 3270.98        661.53 +- 732.13     (potential for consensus errors, no impact on assembly)
--   repeat-dove           537    0.03    17010.94 +- 2036.96        452.61 +- 450.18     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat        106811    6.75    11412.15 +- 3319.29       3812.03 +- 3412.28    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   127766    8.08     9748.09 +- 2724.94                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove    35557    2.25    14447.98 +- 2668.95                             (will end contigs, potential to misassemble)
--   uniq-anchor          6019    0.38    10769.52 +- 3294.90       4878.25 +- 3453.10    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  1806107                      fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->     30.00        0.0300%
--  command line (-ef)                           ->  -----.--      ---.----%
--  command line (-eM)                           ->     30.00        0.0300%
--  mean + std.dev       0.30 +-  12 *     2.39  ->     28.98        0.0290%
--  median + mad         0.00 +-  12 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->      1.00        0.0010%  (enabled)
--  
--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 0.0300%, reads:
--    available to have edges:       295242
--    with at least one edge:        270309
--  
--  At max threshold 0.0300%, reads:  (not computed)
--    available to have edges:            0
--    with at least one edge:             0
--  
--  At tight threshold 0.0010%, reads with:
--    both edges below error threshold:    243883  (80.00% minReadsBest threshold = 216247)
--    one  edge  above error threshold:     20625
--    both edges above error threshold:      5801
--    at least one edge:                   270309
--  
--  At loose threshold 0.0290%, reads with:
--    both edges below error threshold:    268442  (80.00% minReadsBest threshold = 216247)
--    one  edge  above error threshold:      1809
--    both edges above error threshold:        58
--    at least one edge:                   270309
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--   1260736 reads are contained
--     68362 reads have no best edges (singleton)
--      1330 reads have only one best edge (spur) 
--                953 are mutual best
--    251020 reads have two best edges 
--               7185 have one mutual best edge
--             243000 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--   1260736 reads are contained
--     69061 reads have no best edges (singleton)
--      1297 reads have only one best edge (spur) 
--               1037 are mutual best
--    250354 reads have two best edges 
--               6889 have one mutual best edge
--             242776 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--     44058 reads have a gap in overlap coverage
--       231 reads have lopsided best edges

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      628 sequences, total length 410492251 bp (including 1115 repeats of total length 68376565 bp).
--   bubbles:      3275 sequences, total length 219107730 bp.
--   unassembled:  72009 sequences, total length 759100931 bp.
--
-- Contig sizes based on genome size 380mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     7351709             4    43007550
--     20     5130461            10    78891375
--     30     4317331            18   115044419
--     40     3897477            28   155834774
--     50     3359697            38   191857088
--     60     2714253            50   228105730
--     70     1879178            68   266524598
--     80     1467629            91   304760759
--     90      995755           123   342798078
--    100      356921           182   380259335
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      628 sequences, total length 566183555 bp (including 1115 repeats of total length 94215999 bp).
--   bubbles:      3275 sequences, total length 303607841 bp.
--   unassembled:  72009 sequences, total length 1035819470 bp.
--
-- Contig sizes based on genome size 380mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    12326352             3    49046055
--     20     9221623             6    77855771
--     30     6499111            11   115592538
--     40     6043323            17   153232310
--     50     5638511            24   194129318
--     60     5193466            31   231280351
--     70     4597142            39   269737412
--     80     3847070            48   307463648
--     90     2889455            59   342859239
--    100     2460160            73   380025907
--    110     2058417            90   418018804
--    120     1566445           112   456422575
--    130      976921           142   494958315
--    140      379219           200   532024760
--

Thank you in advance for your assistance!

Jacopo

skoren commented 2 months ago

The report is a bit confusing here: the pre-consensus lengths are in homopolymer-compressed space, while the post-consensus lengths are not. It's normal to see a 1.4x inflation going from compressed to uncompressed space, so the size change seems normal.
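
To make the compressed versus uncompressed comparison concrete, here is a minimal Python sketch. It is not Canu's code: `homopolymer_compress` and `inflation_ratios` are hypothetical helper names, and the toy sequence is made up. The totals are copied from the UNITIGGING/CONTIGS and UNITIGGING/CONSENSUS sections of the report above.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


# (pre-consensus bp, post-consensus bp), copied from the report above.
REPORT_TOTALS = {
    "contigs": (410_492_251, 566_183_555),
    "bubbles": (219_107_730, 303_607_841),
    "unassembled": (759_100_931, 1_035_819_470),
}


def inflation_ratios(totals):
    """Return post/pre length ratio for each category, rounded to 2 decimals."""
    return {
        name: round(after / before, 2)
        for name, (before, after) in totals.items()
    }


if __name__ == "__main__":
    toy = "AAACCGTTTTGA"  # made-up example sequence
    compressed = homopolymer_compress(toy)
    print(toy, "->", compressed, f"({len(toy) / len(compressed):.2f}x)")

    for name, ratio in inflation_ratios(REPORT_TOTALS).items():
        print(f"{name:12s} {ratio:.2f}x")
```

All three categories grow by a similar factor close to the 1.4x mentioned above, which is what one would expect from a change of coordinate space rather than from repeat arrays being extended during consensus.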

It's quite possible for both HiCanu and hifiasm to leave haplotype duplication in the primary assembly when the haplotypes are too diverged or structurally different. I think the genome is not very homozygous when evaluated with HiFi data, which normally produces a 6 Gb assembly (both haplotypes) for human genomes. I suggest running purge_dups (https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help) and checking whether the assembly size is more in line with the expectation after that. You already have 300 Mb of alt sequence, so if purge_dups removes another 100-150 Mb you would end up with two very similar-sized haplotype assemblies.
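
The arithmetic behind the "two very similar-sized haplotype assemblies" remark can be sketched as follows. This is only bookkeeping on the totals in the report, not a purge_dups run; it assumes that whatever is purged from the primary contigs is counted with the alt (bubble) set, and `after_purge` and `balancing_move` are hypothetical helper names.

```python
# Post-consensus totals from the UNITIGGING/CONSENSUS section (bp).
PRIMARY_BP = 566_183_555  # contigs
ALT_BP = 303_607_841      # bubbles


def after_purge(primary_bp: int, alt_bp: int, moved_bp: int) -> tuple[int, int]:
    """Sizes after moving `moved_bp` from the primary set to the alt set.

    Assumption: purged sequence is added to the alt set rather than discarded.
    """
    return primary_bp - moved_bp, alt_bp + moved_bp


def balancing_move(primary_bp: int, alt_bp: int) -> int:
    """Amount that would have to move for both sets to be the same size."""
    return (primary_bp - alt_bp) // 2


if __name__ == "__main__":
    for moved in (100_000_000, 150_000_000):
        primary, alt = after_purge(PRIMARY_BP, ALT_BP, moved)
        print(f"move {moved / 1e6:.0f} Mb -> primary {primary / 1e6:.1f} Mb, "
              f"alt {alt / 1e6:.1f} Mb")

    balance = balancing_move(PRIMARY_BP, ALT_BP)
    print(f"equal sizes at about {balance / 1e6:.1f} Mb moved")
```

Under that assumption the two sets would be equal in size if about 131 Mb were moved, which falls inside the 100-150 Mb range suggested above.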