marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Extreme low continuity of contigs #2094

Closed YPGG1234 closed 2 years ago

YPGG1234 commented 2 years ago

Hello,

I recently used HiCanu (v2.2) to assemble a mammalian genome (genome size: ~2.3 Gb, heterozygosity: ~1.7%) with the default recommended HiFi parameters. The resulting asm.contigs.fasta is 4 Gb (well short of the expected 4.6 Gb) and its continuity is very low (N50: 747140). Here is my assembly report:

[TRIMMING/READS]
--
-- In sequence store './asm.seqStore':
--   Found 3094977 reads.
--   Found 46377239973 bases (20.16 times coverage).
--    Histogram of corrected reads:
--    
--    G=46377239973                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        20666    201912   4637734582  ||       1361-2326          449|-
--    00020        18704    438834   9275464641  ||       2327-3292          262|-
--    00030        17401    696250  13913174851  ||       3293-4258          217|-
--    00040        16330    971536  18550903804  ||       4259-5224          173|-
--    00050        15365   1264379  23188624073  ||       5225-6190          163|-
--    00060        14435   1575747  27826348259  ||       6191-7156          275|-
--    00070        13489   1907988  32464072742  ||       7157-8122         7968|--
--    00080        12491   2265025  37101792476  ||       8123-9088        42909|--------
--    00090        11434   2652881  41739525332  ||       9089-10054       38511|-------
--    00100         1361   3094976  46377239973  ||      10055-11020      200091|------------------------------------
--    001.000x             3094977  46377239973  ||      11021-11986      356772|---------------------------------------------------------------
--                                               ||      11987-12952      347188|--------------------------------------------------------------
--                                               ||      12953-13918      344218|-------------------------------------------------------------
--                                               ||      13919-14884      333000|-----------------------------------------------------------
--                                               ||      14885-15850      310452|-------------------------------------------------------
--                                               ||      15851-16816      272902|-------------------------------------------------
--                                               ||      16817-17782      227419|-----------------------------------------
--                                               ||      17783-18748      180334|--------------------------------
--                                               ||      18749-19714      134820|------------------------
--                                               ||      19715-20680       96090|-----------------
--                                               ||      20681-21646       67208|------------
--                                               ||      21647-22612       45446|---------
--                                               ||      22613-23578       30464|------
--                                               ||      23579-24544       20168|----
--                                               ||      24545-25510       13115|---
--                                               ||      25511-26476        8595|--
--                                               ||      26477-27442        5598|-
--                                               ||      27443-28408        3657|-
--                                               ||      28409-29374        2350|-
--                                               ||      29375-30340        1531|-
--                                               ||      30341-31306         986|-
--                                               ||      31307-32272         645|-
--                                               ||      32273-33238         380|-
--                                               ||      33239-34204         243|-
--                                               ||      34205-35170         142|-
--                                               ||      35171-36136          85|-
--                                               ||      36137-37102          54|-
--                                               ||      37103-38068          28|-
--                                               ||      38069-39034          34|-
--                                               ||      39035-40000          13|-
--                                               ||      40001-40966           9|-
--                                               ||      40967-41932           6|-
--                                               ||      41933-42898           0|
--                                               ||      42899-43864           0|
--                                               ||      43865-44830           3|-
--                                               ||      44831-45796           2|-
--                                               ||      45797-46762           0|
--                                               ||      46763-47728           1|-
--                                               ||      47729-48694           0|
--                                               ||      48695-49660           1|-
--

[UNITIGGING/READS]
--
-- In sequence store './asm.seqStore':
--   Found 3094977 reads.
--   Found 46377239973 bases (20.16 times coverage).
--    Histogram of corrected-trimmed reads:
--    
--    G=46377239973                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        20666    201912   4637734582  ||       1361-2326          449|-
--    00020        18704    438834   9275464641  ||       2327-3292          262|-
--    00030        17401    696250  13913174851  ||       3293-4258          217|-
--    00040        16330    971536  18550903804  ||       4259-5224          173|-
--    00050        15365   1264379  23188624073  ||       5225-6190          163|-
--    00060        14435   1575747  27826348259  ||       6191-7156          275|-
--    00070        13489   1907988  32464072742  ||       7157-8122         7968|--
--    00080        12491   2265025  37101792476  ||       8123-9088        42909|--------
--    00090        11434   2652881  41739525332  ||       9089-10054       38511|-------
--    00100         1361   3094976  46377239973  ||      10055-11020      200091|------------------------------------
--    001.000x             3094977  46377239973  ||      11021-11986      356772|---------------------------------------------------------------
--                                               ||      11987-12952      347188|--------------------------------------------------------------
--                                               ||      12953-13918      344218|-------------------------------------------------------------
--                                               ||      13919-14884      333000|-----------------------------------------------------------
--                                               ||      14885-15850      310452|-------------------------------------------------------
--                                               ||      15851-16816      272902|-------------------------------------------------
--                                               ||      16817-17782      227419|-----------------------------------------
--                                               ||      17783-18748      180334|--------------------------------
--                                               ||      18749-19714      134820|------------------------
--                                               ||      19715-20680       96090|-----------------
--                                               ||      20681-21646       67208|------------
--                                               ||      21647-22612       45446|---------
--                                               ||      22613-23578       30464|------
--                                               ||      23579-24544       20168|----
--                                               ||      24545-25510       13115|---
--                                               ||      25511-26476        8595|--
--                                               ||      26477-27442        5598|-
--                                               ||      27443-28408        3657|-
--                                               ||      28409-29374        2350|-
--                                               ||      29375-30340        1531|-
--                                               ||      30341-31306         986|-
--                                               ||      31307-32272         645|-
--                                               ||      32273-33238         380|-
--                                               ||      33239-34204         243|-
--                                               ||      34205-35170         142|-
--                                               ||      35171-36136          85|-
--                                               ||      36137-37102          54|-
--                                               ||      37103-38068          28|-
--                                               ||      38069-39034          34|-
--                                               ||      39035-40000          13|-
--                                               ||      40001-40966           9|-
--                                               ||      40967-41932           6|-
--                                               ||      41933-42898           0|
--                                               ||      42899-43864           0|
--                                               ||      43865-44830           3|-
--                                               ||      44831-45796           2|-
--                                               ||      45797-46762           0|
--                                               ||      46763-47728           1|-
--                                               ||      47729-48694           0|
--                                               ||      48695-49660           1|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2   5341735 *                                                                      0.0045 0.0003
--       3-     4   8992966 *                                                                      0.0071 0.0006
--       5-     7  47115976 **********                                                             0.0206 0.0029
--       8-    11 120762137 **************************                                             0.0728 0.0165
--      12-    16 203788652 *******************************************                            0.1813 0.0583
--      17-    22 324447742 ********************************************************************** 0.3642 0.1593
--      23-    29 250298377 ******************************************************                 0.6335 0.3599
--      30-    37  97977190 *********************                                                  0.8169 0.5366
--      38-    46  56757601 ************                                                           0.8891 0.6257
--      47-    56  34200277 *******                                                                0.9340 0.6955
--      57-    67  17554411 ***                                                                    0.9603 0.7453
--      68-    79  10280923 **                                                                     0.9741 0.7766
--      80-    92   6046958 *                                                                      0.9823 0.7985
--      93-   106   3813815                                                                        0.9871 0.8137
--     107-   121   2570560                                                                        0.9902 0.8249
--     122-   137   1797125                                                                        0.9922 0.8336
--     138-   154   1286734                                                                        0.9937 0.8405
--     155-   172    975888                                                                        0.9948 0.8461
--     173-   191    765275                                                                        0.9956 0.8509
--     192-   211    632713                                                                        0.9962 0.8551
--     212-   232    519365                                                                        0.9967 0.8589
--     233-   254    428030                                                                        0.9971 0.8624
--     255-   277    332589                                                                        0.9975 0.8655
--     278-   301    267716                                                                        0.9978 0.8682
--     302-   326    225816                                                                        0.9980 0.8705
--     327-   352    195677                                                                        0.9982 0.8726
--     353-   379    168014                                                                        0.9983 0.8746
--     380-   407    144161                                                                        0.9985 0.8765
--     408-   436    129972                                                                        0.9986 0.8782
--     437-   466    121697                                                                        0.9987 0.8799
--     467-   497    117836                                                                        0.9988 0.8815
--     498-   529    114236                                                                        0.9989 0.8833
--     530-   562     98000                                                                        0.9990 0.8851
--     563-   596     84063                                                                        0.9991 0.8867
--     597-   631     78508                                                                        0.9991 0.8881
--     632-   667     71031                                                                        0.9992 0.8896
--     668-   704     60590                                                                        0.9993 0.8910
--     705-   742     53384                                                                        0.9993 0.8923
--     743-   781     48751                                                                        0.9994 0.8934
--     782-   821     40828                                                                        0.9994 0.8946
--
--           0 (max occurrences)
-- 32923453550 (total mers, non-unique)
--  1199387452 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing      14003    0.45    14415.68 +- 4035.95       1503.94 +- 1381.71    (bad trimming)
--   middle-hump            58    0.00    12571.10 +- 2183.39       3687.81 +- 2363.66    (bad trimming)
--   no-5-prime           6434    0.21    10681.58 +- 2617.16       2071.75 +- 2355.10    (bad trimming)
--   no-3-prime           6188    0.20    10572.18 +- 2559.59       2036.20 +- 2333.40    (bad trimming)
--   
--   low-coverage        14626    0.47     9430.13 +- 2053.87          4.03 +- 1.29       (easy to assemble, potential for lower quality consensus)
--   unique            2230657   72.07    10580.66 +- 2428.77         16.98 +- 5.60       (easy to assemble, perfect, yay)
--   repeat-cont         61878    2.00    10262.38 +- 2119.54        148.36 +- 142.24     (potential for consensus errors, no impact on assembly)
--   repeat-dove          4590    0.15    15273.55 +- 2234.34         91.23 +- 88.25      (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat        348965   11.28    11304.61 +- 2624.75       2814.60 +- 2738.78    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont   203629    6.58     9613.59 +- 1817.58                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   188536    6.09    12119.60 +- 2469.23                             (will end contigs, potential to misassemble)
--   uniq-anchor         12635    0.41    11842.55 +- 2820.28       4054.90 +- 3207.58    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  4126616                      fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->     30.00        0.0300%
--  command line (-ef)                           ->  -----.--      ---.----%
--  command line (-eM)                           ->     30.00        0.0300%
--  mean + std.dev       0.42 +-  12 *     2.80  ->     34.02        0.0340%
--  median + mad         0.00 +-  12 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->      1.00        0.0010%  (enabled)
--  
--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 0.0300%, reads:
--    available to have edges:      1193853
--    with at least one edge:       1159509
--  
--  At max threshold 0.0300%, reads:  (not computed)
--    available to have edges:            0
--    with at least one edge:             0
--  
--  At tight threshold 0.0010%, reads with:
--    both edges below error threshold:   1066822  (80.00% minReadsBest threshold = 927607)
--    one  edge  above error threshold:     76416
--    both edges above error threshold:     16271
--    at least one edge:                  1159509
--  
--  At loose threshold 0.0300%, reads with:
--    both edges below error threshold:   1159509  (80.00% minReadsBest threshold = 927607)
--    one  edge  above error threshold:         0
--    both edges above error threshold:         0
--    at least one edge:                  1159509
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--   1826516 reads are contained
--    190632 reads have no best edges (singleton)
--     10213 reads have only one best edge (spur) 
--               9471 are mutual best
--   1067616 reads have two best edges 
--              24231 have one mutual best edge
--            1041694 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--   1826516 reads are contained
--    193073 reads have no best edges (singleton)
--     10599 reads have only one best edge (spur) 
--              10137 are mutual best
--   1064789 reads have two best edges 
--              23077 have one mutual best edge
--            1040201 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--    164645 reads have a gap in overlap coverage
--      1149 reads have lopsided best edges

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      7891 sequences, total length 2434992220 bp (including 1975 repeats of total length 47567531 bp).
--   bubbles:      8992 sequences, total length 446494239 bp.
--   unassembled:  202585 sequences, total length 2325761010 bp.
--
-- Contig sizes based on genome size 2.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     2818237            53   230465243
--     20     1771229           161   460884252
--     30     1321465           312   690641200
--     40      979001           514   920050607
--     50      734605           785  1150123964
--     60      547449          1150  1380130103
--     70      413756          1635  1610362203
--     80      303882          2285  1840164550
--     90      199122          3216  2070142482
--    100       99201          4819  2300057460
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      7891 sequences, total length 3411191834 bp (including 1975 repeats of total length 66729830 bp).
--   bubbles:      8992 sequences, total length 623707787 bp.
--   unassembled:  202585 sequences, total length 3260665063 bp.
--
-- Contig sizes based on genome size 2.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     4625756            32   232346390
--     20     3034140            94   460938854
--     30     2388320           181   690870353
--     40     1892063           289   920273695
--     50     1560999           423  1150803087
--     60     1258218           587  1380293211
--     70     1024624           789  1610443290
--     80      835985          1039  1840538456
--     90      681598          1345  2070041581
--    100      556310          1720  2300313894
--    110      444127          2183  2530338597
--    120      338710          2771  2760103543
--    130      238826          3573  2990225875
--    140      139542          4811  3220077097
--

Is this normal? Could you help me see where I'm going wrong? Thanks.

skoren commented 2 years ago

First, the N50 you quoted is based on the compressed pre-consensus contigs; the post-consensus N50 is 1.6 Mb. You have fairly low coverage and relatively short reads, so I don't think your result is too surprising. At 20x coverage you have <10x coverage per haplotype. HiFi data has uneven coverage in some contexts, so it's likely less in many places. As for the diversity, I'm not sure your genome is as diverse as you estimated. The k-mer plots don't show the secondary coverage peak you'd expect for a heterozygous genome, though this might also be caused by low coverage. That likely explains why the assembly isn't the full 4.6 Gb in size. I'd run purge_dups and BUSCO to estimate how complete the genome is and how much of the haplotypes is assembled.
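For reference, the NG values in the report tables are measured against the 2.3 Gbp genome size rather than against the assembly's own total length, which is what a plain N50 uses. A minimal sketch of both calculations on a list of contig lengths; the function names `ngx` and `n50` and the toy lengths are illustrative assumptions, not Canu's code:

```python
def ngx(lengths, genome_size, x=50):
    """Return (contig length, contig count) at which the running sum of
    contigs, sorted longest first, reaches x% of genome_size.
    Returns None if the contigs never add up to that target."""
    target = genome_size * x / 100
    running_sum = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running_sum += length
        if running_sum >= target:
            return length, count
    return None


def n50(lengths):
    """Plain N50: the same calculation, measured against the assembly total."""
    return ngx(lengths, sum(lengths), 50)


# Toy contig lengths (hypothetical, for illustration only).
toy_lengths = [50, 40, 30, 20, 10]

print(n50(toy_lengths))           # measured against the 150 bp assembly total
print(ngx(toy_lengths, 200, 50))  # measured against an assumed 200 bp genome
```

Because the two statistics use different denominators, they are only comparable when the assembly total is close to the genome size, which is not the case for an assembly that still contains both haplotypes.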

The biggest improvement you can make to your assembly would be to increase coverage. You could do this by sequencing more or by running DeepConsensus, which can increase the Q20 yield of existing cells.

YPGG1234 commented 2 years ago

Hello, skoren

Thanks for your prompt reply! BUSCO on the contigs shows a lot of duplication (C:96.8%[S:30.6%,D:66.2%]). I then used purge_haplotigs with the HiFi reads to purge the contigs and got ~2.3 Gb of primary contigs and ~1.6 Gb of haplotigs.
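To put the duplication figure in context, the share of complete BUSCOs that are duplicated can be read off the summary string. A small sketch, assuming the short summary format quoted above; the regular expression and the function name `parse_busco` are illustrative and not part of BUSCO itself:

```python
import re

# Matches the C/S/D part of a BUSCO short summary such as
# "C:96.8%[S:30.6%,D:66.2%]" (format assumed from the string quoted above).
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\]"
)


def parse_busco(summary):
    """Return the complete (C), single-copy (S) and duplicated (D) percentages."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"unrecognised BUSCO summary: {summary!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


scores = parse_busco("C:96.8%[S:30.6%,D:66.2%]")
duplicated_share = scores["D"] / scores["C"]
print(f"{duplicated_share:.0%} of complete BUSCOs are duplicated")
```

A high duplicated share before purging would be consistent with both haplotypes being assembled separately for much of the genome, though BUSCO alone cannot confirm that.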

I estimated the heterozygosity with GenomeScope (sorry, the heterozygosity is 1.3%, not 1.7%). [image: GenomeScope plot]

I see from here that 20x coverage is enough for current hybrid methods, but the Hi-C heatmap of the sex chromosome, built from the Hi-C-scaffolded contigs, is messed up, possibly because of the relatively low sequencing coverage or incomplete purging. [image: Hi-C heatmap]

Do you have any suggestions? Thanks.

skoren commented 2 years ago

That 20x you're referencing is just the minimum. Less than this wouldn't get you a complete genome. However, continuity tends to increase until you get to 35-40x so typical projects target at least this much (https://github.com/human-pangenomics/hpgp-data).
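As a rough guide to what that target means for this dataset, the extra yield needed can be worked out from the read total in the report. A back-of-the-envelope sketch; the function names are illustrative, and the genome size is the 2.3 Gb value used in the report, which may itself be off:

```python
def coverage(total_bases, genome_size):
    """Average depth of coverage over a haploid genome of the given size."""
    return total_bases / genome_size


def extra_bases_needed(total_bases, genome_size, target_coverage):
    """Additional bases required to reach target_coverage (0 if already there)."""
    return max(0, target_coverage * genome_size - total_bases)


TOTAL_BASES = 46_377_239_973  # "Found 46377239973 bases" in the report
GENOME_SIZE = 2_300_000_000   # 2.3 Gb, the genome size used in the report

current = coverage(TOTAL_BASES, GENOME_SIZE)
print(f"current coverage: {current:.2f}x")

for target in (35, 40):
    extra = extra_bases_needed(TOTAL_BASES, GENOME_SIZE, target)
    print(f"{target}x needs about {extra / 1e9:.1f} Gb more HiFi bases")
```

If the true genome size differs from 2.3 Gb, both the current coverage and the shortfall scale accordingly.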

As for the GenomeScope plot, I think it is over-estimating the heterozygosity. Compare its model fit line (black) to the actual k-mer counts (blue). The true het peak is much lower and smoother than the modeled one. It's also estimating the genome size as only 1.7 Gb, not 2.3 Gb. So I wouldn't trust those estimates too much in this case.

I'm not sure what to make of your Hi-C plot. It could be consistent with a centromere in the middle, across which interaction is less frequent, or with another biological structure (e.g. see the human X here: https://www.nature.com/articles/s41586-020-2547-7/figures/12). You'd want to validate the assembly using read alignments and other information, as I suggested in #2084. As for what to do, your best option is increasing coverage. The option that doesn't require more sequencing is DeepConsensus (https://github.com/google/deepconsensus), which can give a higher Q20 read yield from the same input, so I'd probably start with that.

YPGG1234 commented 2 years ago

The Hi-C plot (coverage >50x) was made with the Juicer + 3D-DNA + Juicebox pipeline.

I will follow your suggestions, thank you!