Fragmented and large assembly

gunjanpandey commented 2 days ago

Could you please give your interpretation of this log file from an algal assembly attempt, and suggest how to improve assembly contiguity for this highly heterozygous algal genome?

This is Canu 2.2, run as:

canu -assemble -p algae -d ./ genomeSize=1.2g -pacbio-hifi ../01_Data/hifi_decontamianted.fq useGrid=true gridOptions="--time=02-00:00:00 "

--    
--    G=60000011670                      sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        25230    216225   6000008603  ||       1012-2263          909|-
--    00020        22744    467717  12000021196  ||       2264-3515        18289|---
--    00030        21088    742150  18000004959  ||       3516-4767        51620|-------
--    00040        19816   1035933  24000015574  ||       4768-6019        53288|-------
--    00050        18758   1347318  30000018333  ||       6020-7271        47811|------
--    00060        17829   1675521  36000008872  ||       7272-8523        41756|------
--    00070        16973   2020455  42000022999  ||       8524-9775        36502|-----
--    00080        16110   2383132  48000017181  ||       9776-11027       32306|----
--    00090        14977   2768059  54000013441  ||      11028-12279       31468|----
--    00100         1012   3347938  60000011670  ||      12280-13531       52652|-------
--    001.000x             3347939  60000011670  ||      13532-14783      167900|---------------------
--                                               ||      14784-16035      400316|-------------------------------------------------
--                                               ||      16036-17287      522679|---------------------------------------------------------------
--                                               ||      17288-18539      470618|---------------------------------------------------------
--                                               ||      18540-19791      377504|----------------------------------------------
--                                               ||      19792-21043      291199|------------------------------------
--                                               ||      21044-22295      219015|---------------------------
--                                               ||      22296-23547      163507|--------------------
--                                               ||      23548-24799      119442|---------------
--                                               ||      24800-26051       85834|-----------
--                                               ||      26052-27303       60782|--------
--                                               ||      27304-28555       40895|-----
--                                               ||      28556-29807       26685|----
--                                               ||      29808-31059       16445|--
--                                               ||      31060-32311        9319|--
--                                               ||      32312-33563        4961|-
--                                               ||      33564-34815        2455|-
--                                               ||      34816-36067        1007|-
--                                               ||      36068-37319         365|-
--                                               ||      37320-38571         129|-
--                                               ||      38572-39823          70|-
--                                               ||      39824-41075          43|-
--                                               ||      41076-42327          27|-
--                                               ||      42328-43579          35|-
--                                               ||      43580-44831          13|-
--                                               ||      44832-46083          22|-
--                                               ||      46084-47335          13|-
--                                               ||      47336-48587          16|-
--                                               ||      48588-49839          10|-
--                                               ||      49840-51091           9|-
--                                               ||      51092-52343           8|-
--                                               ||      52344-53595           7|-
--                                               ||      53596-54847           2|-
--                                               ||      54848-56099           3|-
--                                               ||      56100-57351           1|-
--                                               ||      57352-58603           0|
--                                               ||      58604-59855           0|
--                                               ||      59856-61107           0|
--                                               ||      61108-62359           1|-
--                                               ||      62360-63611           1|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2  70353412 *********************                                                  0.0786 0.0032
--       3-     5 163067656 **************************************************                     0.1444 0.0073
--       6-    10 226452057 ********************************************************************** 0.3153 0.0246
--      11-    17 184614141 *********************************************************              0.5559 0.0683
--      18-    26  63289857 *******************                                                    0.7338 0.1205
--      27-    37  28132636 ********                                                               0.7944 0.1479
--      38-    50  20924224 ******                                                                 0.8242 0.1676
--      51-    65  16517590 *****                                                                  0.8468 0.1882
--      66-    82  12640673 ***                                                                    0.8648 0.2097
--      83-   101   8333925 **                                                                     0.8786 0.2307
--     102-   122   8099458 **                                                                     0.8876 0.2477
--     123-   145  11588961 ***                                                                    0.8968 0.2690
--     146-   170  11098585 ***                                                                    0.9098 0.3051
--     171-   197   7138387 **                                                                     0.9220 0.3443
--     198-   226   6849776 **                                                                     0.9299 0.3741
--     227-   257   3515630 *                                                                      0.9374 0.4068
--     258-   290   3859564 *                                                                      0.9413 0.4258
--     291-   325   7078389 **                                                                     0.9457 0.4505
--     326-   362   7034613 **                                                                     0.9536 0.5008
--     363-   401   7501105 **                                                                     0.9615 0.5564
--     402-   442   3378618 *                                                                      0.9698 0.6210
--     443-   485   5011975 *                                                                      0.9735 0.6529
--     486-   530   7194526 **                                                                     0.9792 0.7076
--     531-   577   4579124 *                                                                      0.9872 0.7905
--     578-   626   2679194                                                                        0.9922 0.8475
--     627-   677    986289                                                                        0.9952 0.8837
--     678-   730    331835                                                                        0.9963 0.8980
--     731-   785    146144                                                                        0.9966 0.9032
--     786-   842    109799                                                                        0.9968 0.9057
--     843-   901    102121                                                                        0.9969 0.9077
--     902-   962     89584                                                                        0.9970 0.9098
--     963-  1025     84341                                                                        0.9971 0.9117
--    1026-  1090    103332                                                                        0.9972 0.9136
--    1091-  1157    149332                                                                        0.9973 0.9161
--    1158-  1226    457665                                                                        0.9975 0.9200
--    1227-  1297    856871                                                                        0.9980 0.9327
--    1298-  1370    495135                                                                        0.9990 0.9575
--    1371-  1445    126710                                                                        0.9995 0.9722
--    1446-  1522     65082                                                                        0.9997 0.9762
--    1523-  1601     41337                                                                        0.9997 0.9784
--
--           0 (max occurrences)
-- 43805436174 (total mers, non-unique)
--   895285547 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing      24795    0.74    13476.41 +- 3388.92       2322.58 +- 2244.95    (bad trimming)
--   middle-hump           739    0.02    14355.48 +- 3584.40       5780.24 +- 3541.19    (bad trimming)
--   no-5-prime          18712    0.56    12681.51 +- 3525.81       3113.69 +- 3131.66    (bad trimming)
--   no-3-prime          18201    0.54    12411.73 +- 3722.96       3129.88 +- 3171.72    (bad trimming)
--   
--   low-coverage       520646   15.55    12063.76 +- 3598.21          6.73 +- 3.40       (easy to assemble, potential for lower quality consensus)
--   unique             233029    6.96    13212.73 +- 3948.96         47.24 +- 15.63      (easy to assemble, perfect, yay)
--   repeat-cont       2241936   66.96    13203.57 +- 3469.65        438.75 +- 273.38     (potential for consensus errors, no impact on assembly)
--   repeat-dove         58328    1.74    21293.34 +- 2212.20        333.09 +- 212.73     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat         79343    2.37    13634.71 +- 3466.81       4671.17 +- 4283.54    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont    86825    2.59    12438.43 +- 3173.49                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove    36814    1.10    15585.68 +- 3226.67                             (will end contigs, potential to misassemble)
--   uniq-anchor          8571    0.26    14485.44 +- 3635.17       3980.64 +- 3565.95    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/ERROR RATES]
--  
--  ERROR RATES
--  -----------
--                                                   --------threshold------
--  3764374                      fraction error      fraction        percent
--  samples                              (1e-5)         error          error
--                   --------------------------      --------       --------
--  command line (-eg)                           ->     30.00        0.0300%
--  command line (-eM)                           ->   1000.00        1.0000%
--  mean + std.dev       0.47 +-   4 *     2.98  ->     12.41        0.0124%
--  median + mad         0.00 +-   4 *     0.00  ->      0.00        0.0000%
--  90th percentile                              ->      1.00        0.0010%  (enabled)
--  
--  BEST EDGE FILTERING
--  -------------------
--  At graph threshold 0.0300%, reads:
--    available to have edges:       593234
--    with at least one edge:        558720
--  
--  At max threshold 1.0000%, reads:  (not computed)
--    available to have edges:            0
--    with at least one edge:             0
--  
--  At tight threshold 0.0010%, reads with:
--    both edges below threshold:    470068
--    one  edge  above threshold:     70669
--    both edges above threshold:     17983
--    at least one edge:             558720
--  
--  At loose threshold 0.0124%, reads with:
--    both edges below threshold:    501295
--    one  edge  above threshold:     49081
--    both edges above threshold:      8344
--    at least one edge:             558720
--  
--  
--  INITIAL EDGES
--  -------- ----------------------------------------
--   2529404 reads are contained
--   1150834 reads have no best edges (singleton)
--     59063 reads have only one best edge (spur) 
--              53786 are mutual best
--    383296 reads have two best edges 
--              38960 have one mutual best edge
--             336273 have two mutual best edges
--  
--  
--  FINAL EDGES
--  -------- ----------------------------------------
--   2529404 reads are contained
--   1166426 reads have no best edges (singleton)
--     57991 reads have only one best edge (spur) 
--              55393 are mutual best
--    368776 reads have two best edges 
--              31203 have one mutual best edge
--             331931 have two mutual best edges
--  
--  
--  EDGE FILTERING
--  -------- ------------------------------------------
--         0 reads are ignored
--    332467 reads have a gap in overlap coverage
--      7840 reads have lopsided best edges

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      15569 sequences, total length 1113172293 bp (including 1969 repeats of total length 40251135 bp).
--   bubbles:      12105 sequences, total length 339428280 bp.
--   unassembled:  453881 sequences, total length 6160515805 bp.
--
-- Contig sizes based on genome size 1.2gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     3986030            23   120863694
--     20     2371456            62   241139173
--     30      179641           407   360086301
--     40      112244          1281   480111774
--     50       82207          2551   600054836
--     60       62827          4230   720052631
--     70       47673          6430   840029890
--     80       34867          9377   960029583
--     90       21305         13696  1080014296
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      15569 sequences, total length 1532185200 bp (including 1969 repeats of total length 54224892 bp).
--   bubbles:      12105 sequences, total length 465073080 bp.
--   unassembled:  453881 sequences, total length 8359912934 bp.
--
-- Contig sizes based on genome size 1.2gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     6333588            16   123644693
--     20     4280720            39   241558585
--     30     1881180            76   360592077
--     40      261411           368   480254527
--     50      176426           937   600054053
--     60      135459          1717   720026039
--     70      110542          2702   840026471
--     80       90976          3902   960052925
--     90       74724          5358  1080073502
--    100       60575          7141  1200005941
--    110       48147          9358  1320015376
--    120       35183         12255  1440004833
--

The assembly statistics are below for reference. Note that the assembly is much larger than the expected genome size of around 1.2 Gbp.

sum = 1997258280, n = 27674, ave = 72170.93, largest = 12573277
N50 = 91503, n = 4209
N60 = 69640, n = 6718
N70 = 53286, n = 10008
N80 = 40965, n = 14298
N90 = 31636, n = 19849
N100 = 4488, n = 27674
N_count = 0
Gaps = 0
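The N50 values above are presumably computed against the assembly total (about 2.0 Gbp), while the NG tables in the Canu report are computed against the declared genome size (1.2 Gbp), so the two sets of numbers are not directly comparable. A minimal sketch of the difference, using a hypothetical helper `nx` and made-up toy contig lengths (not the real assembly):

```python
def nx(lengths, fraction, total=None):
    """Return (contig length, contig count) at which the cumulative sum of
    contigs, sorted longest first, first reaches fraction * total.

    total=None uses the assembly size (N-style statistic);
    passing a genome size gives the NG-style statistic.
    Returns None if the assembly is too short to reach the target.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * (sum(ordered) if total is None else total)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return None


# Toy contig lengths, purely illustrative.
toy = [10, 8, 6, 4, 2]

n50 = nx(toy, 0.5)             # relative to the assembly total (30)
ng50 = nx(toy, 0.5, total=20)  # relative to an assumed genome size of 20
print(n50, ng50)
```

When the assembly is larger than the assumed genome size, as here, the NG value is reached earlier in the sorted list and is therefore at least as large as the corresponding N value.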

Thanks a lot in advance.

skoren commented 1 day ago

The larger size is expected; it's likely both haplotypes of a diploid genome (see https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help). You can see that about 500 Mbp are already flagged as bubbles (alternate haplotype). The rest is likely too diverged to be flagged automatically, so you'd need to rely on a tool like purge_dups. As for the fragmentation, the coverage looks really low from the k-mer histogram. The primary peak is at 6-10x, which is too low for a good assembly. What coverage were you inputting? Is this a clonal sample or a collection of individuals?
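As a quick check of the numbers in the log, the contig and bubble totals from the [UNITIGGING/CONSENSUS] report add up to the reported assembly sum. The sketch below only redoes that arithmetic; the variable names are illustrative and the 1.2 Gbp expected size is the value given in the original command.

```python
# Totals copied from the [UNITIGGING/CONSENSUS] section of the log.
contigs_bp = 1_532_185_200
bubbles_bp = 465_073_080

# Expected genome size as passed to Canu (genomeSize=1.2g).
expected_bp = 1_200_000_000

# Contigs plus bubbles should match the reported assembly sum.
assembly_bp = contigs_bp + bubbles_bp

# Size of the non-bubble contigs relative to the expected genome size.
contig_ratio = contigs_bp / expected_bp

print(f"assembly total:    {assembly_bp:,} bp")
print(f"contigs / expected: {contig_ratio:.2f}x")
```

Even with the flagged bubbles set aside, the remaining contigs are still larger than 1.2 Gbp, which is consistent with further unflagged haplotype duplication or with a larger true genome size.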

gunjanpandey commented 1 day ago

Thanks for the prompt reply, Sergey.

This genome has puzzled me quite a bit. The total input HiFi data is ~60X (assuming a ~1.2 Gbp genome size, although the genome could be around 2 Gbp).

A GenomeScope profile of the same organism, built from short-read data, is here: https://github.com/schatzlab/genomescope/issues/142

file                                     format  type   num_seqs         sum_len  min_len   avg_len  max_len
../01_Data/hifi_dedup_decontamianted.fq  FASTQ   DNA   4,122,639  73,888,444,677       90  17,922.6   63,566
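The ~60X figure follows from the total base count in the table above. A small sketch of that calculation for the two genome sizes mentioned (1.2 Gbp and 2 Gbp); the function name `coverage` is just illustrative:

```python
def coverage(total_bases, genome_size):
    """Average read depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


# sum_len from the read statistics table above.
total_bases = 73_888_444_677

for genome_size in (1.2e9, 2.0e9):
    depth = coverage(total_bases, genome_size)
    print(f"genome size {genome_size / 1e9:.1f} Gbp -> {depth:.1f}x")
```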

Note that this is Cladocopium spp., where the ploidy and duplication levels are not clear: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9412976/

Any thoughts on how to proceed would be very useful to me.

skoren commented 19 hours ago

The GenomeScope results imply a genome larger than 1.2 Gbp, but also that the haplotypes are extremely similar (if it is diploid), as there are very few single-copy k-mers. You'd probably benefit from a larger k-mer size, such as k=31 instead of 19, for GenomeScope.

The HiFi assembly implies an even larger genome size. The coverage is somewhere around 8x, and Canu used 50x of 1.2 Gbp (60 Gbp of reads), which gives roughly 7 Gbp of total sequence, or about 3.5 Gbp per haplotype if the genome is diploid. HiFi assembly is going to be very sensitive to variation, though, so it makes me wonder whether the samples behind the Illumina and HiFi data are the same. Is it possible the Illumina sample is more clonal than the HiFi sample? Either way, I'd increase either genomeSize or maxInputCoverage, since right now Canu is only using 50x of 1.2 Gbp, so you have more data that was not used in the assembly. After that, your best option is probably to rely on core genes/purge_dups to determine whether there is haplotype duplication in the assembly. You could also try verkko and look at the resulting assembly graphs to see if there is diploid structure (though it would likely be less continuous, as it only produces phased outputs while canu can produce a pseudo-haplotype).
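The size estimate above can be sketched as simple arithmetic. This is a rough back-of-the-envelope calculation, not something Canu reports: it assumes a k-mer peak of about 8x (read from the 6-10 bin of the histogram), takes the bases used by Canu from the `G=` line of the read-length report, and takes the 50x cap from the comment above. The variable names are illustrative.

```python
import math

used_bases = 60_000_011_670   # G= value in the read-length report (50x of 1.2 Gbp)
total_bases = 73_888_444_677  # sum_len of the full HiFi read set
peak_coverage = 8             # assumed k-mer peak, read from the 6-10 bin
coverage_cap = 50             # input coverage Canu kept, per the comment above
declared_genome_size = 1.2e9  # genomeSize=1.2g in the original command

# Total sequence represented by the reads, and haploid size if diploid.
total_sequence = used_bases / peak_coverage
haploid_if_diploid = total_sequence / 2

# Two ways to let Canu use all reads: raise genomeSize or raise maxInputCoverage.
min_genome_size = total_bases / coverage_cap
min_max_input_coverage = math.ceil(total_bases / declared_genome_size)

print(f"total sequence:          {total_sequence / 1e9:.1f} Gbp")
print(f"haploid size if diploid: {haploid_if_diploid / 1e9:.2f} Gbp")
print(f"genomeSize to keep all reads at 50x: >= {min_genome_size / 1e9:.2f} Gbp")
print(f"maxInputCoverage to keep all reads at 1.2 Gbp: >= {min_max_input_coverage}")
```

With a peak of exactly 8x this gives about 7.5 Gbp in total, in line with the rounded 7 Gbp and 3.5 Gbp figures quoted above; the result shifts noticeably if the true peak sits elsewhere in the 6-10x bin.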

marbl / canu

Fragmented and large assembly #2343