marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Improving (expected) low assembly contiguity and (unexpected) small assembly size #1755

Closed: rotoke closed this issue 4 years ago

rotoke commented 4 years ago

Dear Sergey,

I just finished a Canu 2.0 assembly of a diploid 2.3 Gbp plant genome from PacBio data, which turned out to be quite fragmented and much shorter than expected. There are some issues with genome features and read length, and it's clear that I won't end up with chromosome-level contigs. However, this is my first genome assembly, and I would be very happy to get your input on the Canu parameter set and results before I start thinking about re-sequencing.

Organism

This is a diploid desert plant grown from wild-collected seed. A GenomeScope run with Illumina data suggests ~2.6% heterozygosity, ~50% repeat content, and a genome size of 2n ~ 1.4 Gbp. However, flow cytometry analysis suggests a genome size in the range of 2n ~ 2.3 Gbp.

[Figure: GenomeScope linear plot]

Data

I have ~60x PacBio Sequel I data. Unfortunately, the raw N50 is only around 8,900 bp, with very few reads >10 kbp. The sequencing company was unable to troubleshoot the run, so I had to take what I got.

Canu command

I used the suggested parameters for PacBio Sequel I data and for high heterozygosity (separating haplotypes). The other parameters (suppressing repeats etc.) had to be added because of disk space issues and our weird cluster configuration.

[...]/canu \
-p gorteria_springbok \
-d [...]/assembly_gorteria_springbok_canu \
genomeSize=2.3g \
correctedErrorRate=0.085 \
corMhapSensitivity=normal \
corMhapFilterThreshold=0.0000000002 \
corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" \
mhapMemory=60g \
mhapBlockSize=500 \
ovlMerDistinct=0.975 \
corOutCoverage=200 \
"batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
purgeOverlaps=aggressive \
gridOptionsOVS="--cpus-per-task 3" \
gridOptionsCNS="--mem-per-cpu 11g" \
maxMemory=370g \
maxThreads=32 \
useGrid=TRUE \
gridOptions="[...] --time=36:00:00 -p skylake-himem" \
-pacbio [...]/gorteria_spring_genome_pacbio/pacbio_*.fasta.gz

Results

This is the complete .report file:

[CORRECTION/LAYOUT]
--                             original      original
--                            raw reads     raw reads
--   category                w/overlaps  w/o/overlaps
--   -------------------- ------------- -------------
--   Number of Reads           17377110       1581426
--   Number of Bases       128888151567    4403225891
--   Coverage                    56.038         1.914
--   Median                        5069             0
--   Mean                          7417          2784
--   N50                           8967          6874
--   Minimum                       2000             0
--   Maximum                     235106        176025
--   
--                                        --------corrected---------  ----------rescued----------
--                             evidence                     expected                     expected
--   category                     reads            raw     corrected            raw     corrected
--   -------------------- -------------  ------------- -------------  ------------- -------------
--   Number of Reads           18102687       15946475      15946475              0             0
--   Number of Bases       133256511747   121537607146  115181138654              0             0
--   Coverage                    57.938         52.842        50.079          0.000         0.000
--   Median                        5052           5190          4934              0             0
--   Mean                          7361           7621          7222              0             0
--   N50                           8863           9298          9119              0             0
--   Minimum                       2000           2000             1              0             0
--   Maximum                     235106         226625        226612              0             0
--   
--                        --------uncorrected--------
--                                           expected
--   category                       raw     corrected
--   -------------------- ------------- -------------
--   Number of Reads            3012061       3012061
--   Number of Bases        11753770312        235093
--   Coverage                     5.110         0.000
--   Median                        3286             0
--   Mean                          3902             0
--   N50                           5969             0
--   Minimum                          0             0
--   Maximum                     235106        235093
--   
--   Maximum Memory          4327616708

[TRIMMING/READS]
--
-- In sequence store './gorteria_springbok.seqStore':
--   Found 15281028 reads.
--   Found 108953889562 bases (47.37 times coverage).
--    
--    G=108953889562                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        29034    303684  10895404034  ||          1-2646      1453532|--------------
--    00020        21761    739936  21790792354  ||       2647-5292      6988585|---------------------------------------------------------------
--    00030        16129   1321370  32686181717  ||       5293-7938      3109699|-----------------------------
--    00040        11576   2123686  43581561671  ||       7939-10584     1313467|------------
--    00050         8741   3215605  54476948256  ||      10585-13230      653451|------
--    00060         6893   4625724  65372339859  ||      13231-15876      408744|----
--    00070         5545   6393702  76267727551  ||      15877-18522      310678|---
--    00080         4492   8579954  87163113377  ||      18523-21168      252939|---
--    00090         3525  11311732  98058501566  ||      21169-23814      204936|--
--    00100            1  15281027 108953889562  ||      23815-26460      160813|--
--    001.000x            15281028 108953889562  ||      26461-29106      123441|--
--                                               ||      29107-31752       90921|-
--                                               ||      31753-34398       65746|-
--                                               ||      34399-37044       46489|-
--                                               ||      37045-39690       32954|-
--                                               ||      39691-42336       22468|-
--                                               ||      42337-44982       15189|-
--                                               ||      44983-47628        9802|-
--                                               ||      47629-50274        6463|-
--                                               ||      50275-52920        3891|-
--                                               ||      52921-55566        2431|-
--                                               ||      55567-58212        1464|-
--                                               ||      58213-60858         858|-
--                                               ||      60859-63504         597|-
--                                               ||      63505-66150         354|-
--                                               ||      66151-68796         256|-
--                                               ||      68797-71442         203|-
--                                               ||      71443-74088         154|-
--                                               ||      74089-76734         123|-
--                                               ||      76735-79380          83|-
--                                               ||      79381-82026          63|-
--                                               ||      82027-84672          44|-
--                                               ||      84673-87318          40|-
--                                               ||      87319-89964          29|-
--                                               ||      89965-92610          27|-
--                                               ||      92611-95256          24|-
--                                               ||      95257-97902          11|-
--                                               ||      97903-100548          7|-
--                                               ||     100549-103194         17|-
--                                               ||     103195-105840          9|-
--                                               ||     105841-108486          3|-
--                                               ||     108487-111132          6|-
--                                               ||     111133-113778          3|-
--                                               ||     113779-116424          6|-
--                                               ||     116425-119070          3|-
--                                               ||     119071-121716          1|-
--                                               ||     121717-124362          0|
--                                               ||     124363-127008          1|-
--                                               ||     127009-129654          0|
--                                               ||     129655-132300          3|-
--

[TRIMMING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 343594061 ********************************************************************** 0.2395 0.0065
--       3-     4 214720016 *******************************************                            0.3365 0.0104
--       5-     7 105778398 *********************                                                  0.4226 0.0155
--       8-    11  58730961 ***********                                                            0.4762 0.0204
--      12-    16  43280421 ********                                                               0.5108 0.0252
--      17-    22  42579561 ********                                                               0.5390 0.0309
--      23-    29  48324846 *********                                                              0.5684 0.0390
--      30-    37  56457316 ***********                                                            0.6022 0.0514
--      38-    46  64443287 *************                                                          0.6416 0.0697
--      47-    56  67641547 *************                                                          0.6864 0.0957
--      57-    67  66903937 *************                                                          0.7331 0.1288
--      68-    79  62885328 ************                                                           0.7792 0.1679
--      80-    92  56379681 ***********                                                            0.8224 0.2113
--      93-   106  47505867 *********                                                              0.8611 0.2565
--     107-   121  37568150 *******                                                                0.8935 0.3004
--     122-   137  27970191 *****                                                                  0.9191 0.3400
--     138-   154  20060027 ****                                                                   0.9382 0.3734
--     155-   172  13866975 **                                                                     0.9518 0.4003
--     173-   191   9738147 *                                                                      0.9612 0.4212
--     192-   211   7013640 *                                                                      0.9679 0.4376
--     212-   232   5256859 *                                                                      0.9727 0.4506
--     233-   254   4110656                                                                        0.9763 0.4615
--     255-   277   3315196                                                                        0.9791 0.4708
--     278-   301   2725136                                                                        0.9814 0.4790
--     302-   326   2273314                                                                        0.9833 0.4864
--     327-   352   1932652                                                                        0.9849 0.4930
--     353-   379   1644699                                                                        0.9862 0.4992
--     380-   407   1420388                                                                        0.9874 0.5048
--     408-   436   1228454                                                                        0.9883 0.5100
--     437-   466   1074726                                                                        0.9892 0.5149
--     467-   497    953459                                                                        0.9899 0.5194
--     498-   529    843718                                                                        0.9906 0.5238
--     530-   562    748226                                                                        0.9912 0.5278
--     563-   596    675869                                                                        0.9917 0.5316
--     597-   631    610827                                                                        0.9922 0.5353
--     632-   667    553349                                                                        0.9926 0.5388
--     668-   704    502419                                                                        0.9930 0.5422
--     705-   742    460372                                                                        0.9933 0.5454
--     743-   781    422855                                                                        0.9936 0.5486
--     782-   821    390241                                                                        0.9939 0.5516
--
--           0 (max occurrences)
-- 106319743300 (total mers, non-unique)
--  1434904894 (distinct mers, non-unique)
--           0 (unique mers)

[TRIMMING/TRIMMING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0850    (use overlaps at or below this fraction error)
--      500    (break region if overlap is less than this long, for 'largest covered' algorithm)
--        2    (break region if overlap coverage is less than this many reads, for 'largest covered' algorithm)
--  
--  INPUT READS:
--  -----------
--  18958536 reads 108953889562 bases (reads processed)
--       0 reads            0 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  OUTPUT READS:
--  ------------
--  6671862 reads  59614770417 bases (trimmed reads output)
--  8379480 reads  48128985026 bases (reads with no change, kept as is)
--  3823338 reads    495248132 bases (reads with no overlaps, deleted)
--   83856 reads     99786038 bases (reads with short trimmed length, deleted)
--  
--  TRIMMING DETAILS:
--  ----------------
--  3685746 reads    339371368 bases (bases trimmed from the 5' end of a read)
--  4380483 reads    275728581 bases (bases trimmed from the 3' end of a read)

[TRIMMING/SPLITTING]
--  PARAMETERS:
--  ----------
--     1000    (reads trimmed below this many bases are deleted)
--   0.0850    (use overlaps at or below this fraction error)
--  INPUT READS:
--  -----------
--  15051342 reads 108358855392 bases (reads processed)
--  3907194 reads    595034170 bases (reads not processed, previously deleted)
--       0 reads            0 bases (reads not processed, in a library where trimming isn't allowed)
--  
--  PROCESSED:
--  --------
--       0 reads            0 bases (no overlaps)
--       5 reads        13592 bases (no coverage after adjusting for trimming done already)
--       0 reads            0 bases (processed for chimera)
--       0 reads            0 bases (processed for spur)
--  15051337 reads 108358841800 bases (processed for subreads)
--  
--  READS WITH SIGNALS:
--  ------------------
--       0 reads            0 signals (number of 5' spur signal)
--       0 reads            0 signals (number of 3' spur signal)
--       0 reads            0 signals (number of chimera signal)
--   52750 reads        52993 signals (number of subread signal)
--  
--  SIGNALS:
--  -------
--       0 reads            0 bases (size of 5' spur signal)
--       0 reads            0 bases (size of 3' spur signal)
--       0 reads            0 bases (size of chimera signal)
--   52993 reads     16762144 bases (size of subread signal)
--  
--  TRIMMING:
--  --------
--   25244 reads    231255712 bases (trimmed from the 5' end of the read)
--   27510 reads    243264456 bases (trimmed from the 3' end of the read)

[UNITIGGING/READS]
--
-- In sequence store './gorteria_springbok.seqStore':
--   Found 15051340 reads.
--   Found 107269233773 bases (46.63 times coverage).
--    
--    G=107269233773                     sum of  ||               length     num
--    NG         length     index       lengths  ||                range    seqs
--    ----- ------------ --------- ------------  ||  ------------------- -------
--    00010        28551    306358  10726935914  ||       1000-3264      3051601|-----------------------------------
--    00020        21437    742515  21453860834  ||       3265-5529      5645750|---------------------------------------------------------------
--    00030        15871   1323817  32180783288  ||       5530-7794      2567895|-----------------------------
--    00040        11426   2125549  42907697079  ||       7795-10059     1222121|--------------
--    00050         8668   3211767  53634623844  ||      10060-12324      655662|--------
--    00060         6856   4609369  64361540758  ||      12325-14589      406821|-----
--    00070         5528   6357311  75088463873  ||      14590-16854      299831|----
--    00080         4484   8514967  85815391435  ||      16855-19119      247287|---
--    00090         3522  11207729  96542313001  ||      19120-21384      207418|---
--    00100         1000  15051339 107269233773  ||      21385-23649      172492|--
--    001.000x            15051340 107269233773  ||      23650-25914      140252|--
--                                               ||      25915-28179      111994|--
--                                               ||      28180-30444       86868|-
--                                               ||      30445-32709       65502|-
--                                               ||      32710-34974       49014|-
--                                               ||      34975-37239       36447|-
--                                               ||      37240-39504       26732|-
--                                               ||      39505-41769       18925|-
--                                               ||      41770-44034       13376|-
--                                               ||      44035-46299        9200|-
--                                               ||      46300-48564        6106|-
--                                               ||      48565-50829        4076|-
--                                               ||      50830-53094        2460|-
--                                               ||      53095-55359        1522|-
--                                               ||      55360-57624         881|-
--                                               ||      57625-59889         451|-
--                                               ||      59890-62154         276|-
--                                               ||      62155-64419         146|-
--                                               ||      64420-66684          71|-
--                                               ||      66685-68949          49|-
--                                               ||      68950-71214          28|-
--                                               ||      71215-73479          23|-
--                                               ||      73480-75744          11|-
--                                               ||      75745-78009          10|-
--                                               ||      78010-80274           5|-
--                                               ||      80275-82539           6|-
--                                               ||      82540-84804           3|-
--                                               ||      84805-87069           4|-
--                                               ||      87070-89334           4|-
--                                               ||      89335-91599           4|-
--                                               ||      91600-93864           3|-
--                                               ||      93865-96129           0|
--                                               ||      96130-98394           1|-
--                                               ||      98395-100659          1|-
--                                               ||     100660-102924          2|-
--                                               ||     102925-105189          2|-
--                                               ||     105190-107454          2|-
--                                               ||     107455-109719          1|-
--                                               ||     109720-111984          2|-
--                                               ||     111985-114249          2|-
--

[UNITIGGING/MERS]
--
--  22-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2 330440903 ********************************************************************** 0.2342 0.0063
--       3-     4 208840295 ********************************************                           0.3300 0.0102
--       5-     7 103528234 *********************                                                  0.4155 0.0152
--       8-    11  57887643 ************                                                           0.4689 0.0201
--      12-    16  43044280 *********                                                              0.5037 0.0250
--      17-    22  42727638 *********                                                              0.5322 0.0307
--      23-    29  48655704 **********                                                             0.5623 0.0390
--      30-    37  56782635 ************                                                           0.5969 0.0515
--      38-    46  64687226 *************                                                          0.6372 0.0703
--      47-    56  67720327 **************                                                         0.6829 0.0967
--      57-    67  66879314 **************                                                         0.7305 0.1303
--      68-    79  62789644 *************                                                          0.7773 0.1699
--      80-    92  56141627 ***********                                                            0.8212 0.2138
--      93-   106  47127164 *********                                                              0.8603 0.2595
--     107-   121  37186596 *******                                                                0.8930 0.3037
--     122-   137  27599712 *****                                                                  0.9188 0.3434
--     138-   154  19782979 ****                                                                   0.9379 0.3768
--     155-   172  13661939 **                                                                     0.9516 0.4038
--     173-   191   9604923 **                                                                     0.9610 0.4246
--     192-   211   6916157 *                                                                      0.9677 0.4410
--     212-   232   5196147 *                                                                      0.9725 0.4541
--     233-   254   4067253                                                                        0.9761 0.4649
--     255-   277   3276660                                                                        0.9790 0.4743
--     278-   301   2696765                                                                        0.9813 0.4825
--     302-   326   2247858                                                                        0.9832 0.4899
--     327-   352   1912707                                                                        0.9848 0.4966
--     353-   379   1627394                                                                        0.9861 0.5028
--     380-   407   1404697                                                                        0.9872 0.5084
--     408-   436   1216785                                                                        0.9882 0.5137
--     437-   466   1063030                                                                        0.9891 0.5185
--     467-   497    944246                                                                        0.9898 0.5231
--     498-   529    834727                                                                        0.9905 0.5274
--     530-   562    741874                                                                        0.9911 0.5315
--     563-   596    670554                                                                        0.9916 0.5354
--     597-   631    604643                                                                        0.9921 0.5391
--     632-   667    548811                                                                        0.9925 0.5426
--     668-   704    497591                                                                        0.9929 0.5460
--     705-   742    456746                                                                        0.9933 0.5492
--     743-   781    418970                                                                        0.9936 0.5524
--     782-   821    387625                                                                        0.9939 0.5554
--
--           0 (max occurrences)
-- 104833548043 (total mers, non-unique)
--  1411080262 (distinct mers, non-unique)
--           0 (unique mers)

[UNITIGGING/OVERLAPS]
--   category            reads     %          read length        feature size or coverage  analysis
--   ----------------  -------  -------  ----------------------  ------------------------  --------------------
--   middle-missing        319    0.00    16081.66 +- 12240.02       914.21 +- 1593.28    (bad trimming)
--   middle-hump           890    0.01     7423.22 +- 4312.67         78.09 +- 449.72     (bad trimming)
--   no-5-prime           3227    0.02     9214.98 +- 6844.78        148.69 +- 646.71     (bad trimming)
--   no-3-prime           1441    0.01    10472.37 +- 8104.56        315.71 +- 1105.90    (bad trimming)
--   
--   low-coverage       145489    0.97     4278.95 +- 2268.08          7.34 +- 3.70       (easy to assemble, potential for lower quality consensus)
--   unique            1413770    9.39     5332.05 +- 3299.63         47.48 +- 14.15      (easy to assemble, perfect, yay)
--   repeat-cont       6443023   42.81     5067.20 +- 3596.73        456.61 +- 537.41     (potential for consensus errors, no impact on assembly)
--   repeat-dove          6610    0.04    28392.35 +- 11670.09       354.21 +- 326.25     (hard to assemble, likely won't assemble correctly or even at all)
--   
--   span-repeat       1458563    9.69     9231.78 +- 7946.27       4668.80 +- 5684.33    (read spans a large repeat, usually easy to assemble)
--   uniq-repeat-cont  4138544   27.50     7629.52 +- 5914.85                             (should be uniquely placed, low potential for consensus errors, no impact on assembly)
--   uniq-repeat-dove   131127    0.87    28226.61 +- 10355.34                            (will end contigs, potential to misassemble)
--   uniq-anchor       1308086    8.69    13358.32 +- 9152.13       5044.74 +- 6748.57    (repeat read, with unique section, probable bad read)

[UNITIGGING/ADJUSTMENT]
-- No report available.

[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
--   contigs:      13643 sequences, total length 1667234539 bp (including 260 repeats of total length 7841145 bp).
--   bubbles:      24605 sequences, total length 548515751 bp.
--   unassembled:  1013578 sequences, total length 7002103469 bp.
--
-- Contig sizes based on genome size 2.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1250647           120   231021309
--     20      758465           362   460163912
--     30      503998           738   690264092
--     40      344106          1296   920137916
--     50      229863          2114  1150163669
--     60      124841          3451  1380047471
--     70       20914          7507  1610012273
--

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      13643 sequences, total length 1660521749 bp (including 260 repeats of total length 7768462 bp).
--   bubbles:      24605 sequences, total length 544747813 bp.
--   unassembled:  1013578 sequences, total length 7001902815 bp.
--
-- Contig sizes based on genome size 2.3gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     1246794           120   230128601
--     20      753093           364   460003689
--     30      501391           743   690335955
--     40      340735          1305   920049620
--     50      227347          2131  1150072062
--     60      122015          3490  1380071957
--     70       18155          7827  1610011512
--

This is the output of an initial BUSCO v.4.0.6 run. I'll try to use purge_dups to get rid of duplication in a next step.

 ***** Results: *****

       C:97.0%[S:20.8%,D:76.2%],F:0.6%,M:2.4%,n:2326
        2257    Complete BUSCOs (C)
        484     Complete and single-copy BUSCOs (S)
        1773    Complete and duplicated BUSCOs (D)
        14  Fragmented BUSCOs (F)
        55  Missing BUSCOs (M)
        2326    Total BUSCO groups searched

Questions

  1. Did I make any errors in canu parameter choice, or are there any parameters I could tweak to improve assembly contiguity?

  2. Do you have an explanation for the shorter assembly length? Maybe canu has collapsed some homozygous regions, or our flow cytometry data is off?

  3. Could re-running the assembly without haplotype separation improve the result?

Thank you very much for your help!

All the best, Roman

skoren commented 4 years ago

I think the parameters are reasonable and the report looks OK. The NG50 is also possibly not too bad: the report uses 2.3 Gbp as the genome size for the calculation, so if you used 1.6 Gbp the NG50 would go up. It would also go up after purge_dups (confirm that it picks a reasonable threshold when you run it). I wouldn't expect turning off haplotype separation to help.
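To see how much the NG50 depends on the assumed genome size, it can be recomputed directly from the contig lengths. The following is a minimal Python sketch, not a Canu feature: the function name `ngx` and the toy lengths are illustrative assumptions, and reading the lengths out of the assembly FASTA is left to the reader.

```python
def ngx(lengths, genome_size, x=50):
    """Return the NGx contig length, or None if the assembly is too short.

    NGx is the length of the contig at which the cumulative sum of
    contig lengths (longest first) first reaches x% of genome_size.
    """
    target = genome_size * x / 100
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return None  # contigs sum to less than x% of the assumed genome size


# Toy contig lengths (hypothetical, not taken from this assembly).
lengths = [100, 80, 60, 40, 20]
for size in (300, 400):
    print(size, ngx(lengths, size))
```

With the toy data, the smaller assumed genome size gives the larger NG50. The `None` case is also why the report tables stop at NG70: the 1.66 Gbp of contigs is less than 80% of 2.3 Gbp.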

The histogram has a single peak around 55-60x coverage, which would imply a 1.7 Gbp genome, not a 2.3 Gbp genome (with a low amount of sequence in the homozygous part of the histogram). Given that, plus the GenomeScope result and the very complete BUSCOs, I think the flow cytometry may be off. It is possible that some very repetitive sequences got collapsed (and are also being collapsed by the GenomeScope analysis). You could look for this by checking the coverage of the contigs output by Canu (the *contigs.layout.tigInfo file has a coverage column) and trying to estimate the collapsed bases (e.g. if a contig is at 100x but the median is 50x, then it should have had twice the bases had it not been collapsed). I think you would be unlikely to assemble those given your relatively short reads, though, so this may be as good as you can do.
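As a back-of-the-envelope check of the depth argument, genome size can be approximated as the total number of k-mers divided by the depth at the histogram peak. This Python sketch is an assumption about how such an estimate could be made, not necessarily the calculation used above; the function name is illustrative, and the k-mer total is the "total mers, non-unique" value from the [TRIMMING/MERS] section of the report.

```python
def genome_size_from_depth(total_kmers, peak_depth):
    """Estimate genome size as total k-mers divided by the k-mer depth peak."""
    return total_kmers / peak_depth


# Total non-unique 22-mers reported in [TRIMMING/MERS].
total_kmers = 106_319_743_300

for depth in (55, 60):
    size_gbp = genome_size_from_depth(total_kmers, depth) / 1e9
    print(f"peak {depth}x -> {size_gbp:.2f} Gbp")
```

With these inputs the estimate lands at roughly 1.8 to 1.9 Gbp, in the same ballpark as the figure quoted above and well below 2.3 Gbp. The result is sensitive to where exactly the peak is placed.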

rotoke commented 4 years ago

Thank you very much for the fast and detailed reply Sergey!

Good to know that the assembly parameters are ok; I'll try to get new flow cytometry data, which will hopefully resolve the mystery. Does canu have an option to re-calculate the stats with different genome size estimates?

Regarding the estimation of collapsed bases in repeats: This is the head of the *contigs.layout.tigInfo file (ordered according to coverage).

Screenshot 2020-07-06 at 17 43 42

There are suggested repeats among the contigs, but not with such deep coverage:

Screenshot 2020-07-06 at 18 21 33

If I understand correctly, the 'coverage' column shows the mean (or median?) base coverage per contig. This may be a stupid question, but where do I find the second value for the estimation (i.e. contig coverage after collapsing)? I assume this could be calculated by length('NumChildren')/'tigLen', but for this I'd need to know the individual length of each read?

Thank you again and all the best, Roman

skoren commented 4 years ago

I wouldn't rely on the repeat annotation. The coverage is the sum of bases in the reads divided by the tig length. Your first list has some deep contigs. For example, assuming the median coverage of all your contigs is 50x, the first tig alone should represent roughly 1 Mb or more of sequence despite being only 100 kb long. So if you take all the contigs, multiply their lengths by the ratio of their coverage to the median, sum them up, and compare that to their actual lengths, it will give you a sense of how many more bases you could have if all of those repeats were resolved.
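The calculation described above can be sketched in Python. This is only an illustration, not part of Canu: the function name `extra_bases` and the example values are hypothetical, and the `(tig_len, coverage)` pairs are assumed to come from the tigLen and coverage columns of the tigInfo file.

```python
def extra_bases(tigs, expected_coverage):
    """Estimate bases hidden in collapsed contigs.

    tigs: iterable of (tig_len, coverage) pairs.
    expected_coverage: coverage of an uncollapsed contig, e.g. the median
    over all contigs (a value you have to choose yourself).
    """
    total = 0.0
    for tig_len, coverage in tigs:
        ratio = coverage / expected_coverage
        if ratio > 1.0:
            # A contig at `ratio` times the expected coverage should have been
            # `ratio` times as long, so it hides (ratio - 1) * tig_len bases.
            total += tig_len * (ratio - 1.0)
    return total


# Hypothetical values, not taken from the assembly in this thread.
example = [(100_000, 500.0), (2_000_000, 50.0), (300_000, 25.0)]
print(extra_bases(example, expected_coverage=50.0))
```

Contigs at or below the expected coverage contribute nothing, so the result depends strongly on which expected coverage is chosen.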

rotoke commented 4 years ago

Thank you very much for the clarification. The median coverage of all contigs is only 16.61x, so if I sum up the 'extra bases' of e.g. all the contigs with a three-digit coverage, I get an additional 247 mbp, which would give an effective genome size of 2n ~ 1.9 gbp. I don't really know where to set the threshold though. If I sum up 'extra bases' for all contigs above the median I get > 4 gbp, which obviously makes no sense. I assume one could look into the fasta file and only include contigs with deep coverage that effectively contain repeats?

One last thing: I understand that I won't be able to assemble these repeats in full length given the short reads I have. However, I also increased repeat suppression (as in the FAQ) to save disk space during the assembly. Would setting this to default improve the resolution here?

Thank you again, Roman

skoren commented 4 years ago

To answer your question about changing the genome size, you can run:

tgStoreDump \
      -S *.seqStore \
      -T unitigging/*.ctgStore 2 \
      -sizes -s 1700000000

for 1.7gb or any other size you want to provide as denominator.
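To see why the NG50 changes with the genome size passed as denominator, here is a minimal Python sketch of the usual NG50 definition. It is not Canu's implementation, and the contig lengths are made up for illustration.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total of contig lengths,
    sorted from longest to shortest, reaches half of genome_size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0  # the assembly covers less than half of genome_size


# Made-up contig lengths: a smaller assumed genome size gives a larger NG50.
lengths = [50, 40, 30, 20, 10]
print(ng50(lengths, genome_size=200))
print(ng50(lengths, genome_size=150))
```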

I guess you want to get a median of the reliable contigs (e.g. those over some minimum length like 100kb or 1mb). The short low coverage things are probably skewing the median now. You can look at coverage vs length on a plot and see if there is a threshold that makes sense.
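A length-filtered median as suggested above could look like the following Python sketch. The function name, the default 100 kb cutoff, and the example values are illustrative assumptions; the `(tig_len, coverage)` pairs are again assumed to come from the tigInfo file.

```python
from statistics import median


def median_coverage(tigs, min_length=100_000):
    """Median coverage over contigs of at least min_length bases.

    tigs: iterable of (tig_len, coverage) pairs.
    """
    coverages = [cov for tig_len, cov in tigs if tig_len >= min_length]
    if not coverages:
        raise ValueError("no contigs at or above min_length")
    return median(coverages)


# Hypothetical values: two short, low-coverage contigs pull the
# unfiltered median down.
tigs = [(5_000, 3.0), (8_000, 5.0), (150_000, 48.0),
        (400_000, 52.0), (1_200_000, 61.0)]
print(median_coverage(tigs, min_length=0))
print(median_coverage(tigs, min_length=100_000))
```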

The repeat suppression in correction is probabilistic, unlike typical approaches. It will still find repeat overlaps, it will just very likely not pick repetitive seeds and the parameters decrease this likelihood further.

rotoke commented 4 years ago

Dear Sergey,

I've plotted that quickly: gorteria_springbok_assembly_tigLen_coverage

If I take e.g. 35x (the red line) as a threshold I end up with ~930 mbp 'extra' and a total of ~2.59 gbp, which is already relatively close to the expected range. Let's see what the new flow cytometry data will look like...

I'll close the issue now - thank you so much for all your input!

All the best, Roman

skoren commented 4 years ago

You're not trying to pick a cutoff like a histogram, you're looking for the coverage of most of your bases in the genome. It looks like the peak is slightly below 100x, maybe 60-75x. That is consistent with a smaller genome size (107269233773 / 70 = 1.5g). Then take all the tigs >100x and count the extra bases in them (assuming expected coverage of 70x); it looks like there isn't much there since they're all pretty short. This all seems consistent with the GenomeScope results and the assembly sizes, so I think the cytometry data is over-estimating the size.
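The arithmetic in the comment above is the base count divided by the coverage at the peak. A short Python sketch, assuming 107269233773 is the total number of bases in the reads (the thread does not say so explicitly):

```python
def genome_size_from_coverage(total_read_bases, peak_coverage):
    """Genome size implied by a total base count and a peak coverage."""
    return total_read_bases / peak_coverage


TOTAL_READ_BASES = 107_269_233_773  # number used in the comment above

# The 60-75x range mentioned above, with 70x as the working value.
for coverage in (60, 70, 75):
    size = genome_size_from_coverage(TOTAL_READ_BASES, coverage)
    print(f"{coverage}x -> {size / 1e9:.2f} Gbp")
```

The estimate is quite sensitive to where the peak is placed, which is why the choice of peak matters so much in this discussion.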

rotoke commented 4 years ago

Ah sorry, that cutoff doesn't make any sense indeed. I'm still confused though, so please correct me where I'm wrong:

This is a histogram showing the total number of bases per coverage of 'reliable' tigs with ≥100kb length (I've binned all tigs ≥100kb in 5x-steps and summed up tigLength within each bin. The graph is cut at 200x):

gorteria_springbok_coverage_100kbtigs

Here, the coverage of most bases is still at ~45x rather than 70x. I think the first plot is a bit misleading as the longest tigs are indeed within 60-75x, which seems to mask the true peak? This would give a genome size estimation of 107269233773/45 = 2.38gbp.
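The binning described above (tigs of at least 100 kb, 5x-wide bins, cut at 200x) can be sketched in Python as follows. The function names and example values are hypothetical; this is one possible reading of the description, not the script actually used for the plot.

```python
from collections import defaultdict


def bases_per_coverage_bin(tigs, bin_width=5, min_length=100_000, max_coverage=200):
    """Sum contig lengths per coverage bin.

    tigs: iterable of (tig_len, coverage) pairs.
    Returns {bin_start: total_bases}, sorted by bin_start.
    """
    bins = defaultdict(int)
    for tig_len, coverage in tigs:
        if tig_len < min_length or coverage >= max_coverage:
            continue  # skip short contigs and anything beyond the plot cut-off
        bin_start = int(coverage // bin_width) * bin_width
        bins[bin_start] += tig_len
    return dict(sorted(bins.items()))


def peak_bin(bins):
    """Start of the bin holding the most bases."""
    return max(bins, key=bins.get)


# Hypothetical values, not taken from the assembly in this thread.
tigs = [(150_000, 42.0), (200_000, 44.9), (120_000, 47.0),
        (300_000, 63.0), (50_000, 44.0), (300_000, 250.0)]
bins = bases_per_coverage_bin(tigs)
print(bins)
print(peak_bin(bins))
```

Weighting by bases rather than counting contigs is what lets many mid-sized contigs outweigh a few very long ones.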

I indeed don't get much more if I sum up the extra bases of all tigs >100x, even if I'm using 45x as threshold (ca. 74 mbp).

skoren commented 4 years ago

When you say 2.3gb for the cytometry, is that for both haplotypes (that is, the human genome would be 6gb then) or is that the haploid size? I was interpreting it as the haploid size, but reading your original message it seems like it is both haplotypes. In that case, it doesn't seem so far off the assembly you have. The full genome isn't going to be separated, since some regions have lower divergence than the overall 2.6% estimate, so you shouldn't expect to get a 2.3g assembly (just like you don't get a 6gb human assembly from CLR data).

This plots the total bases at a given coverage? It seems like there may be two peaks, around 40-50x and 80-90x, which would make sense as heterozygous and homozygous contigs. The longest contigs would then be homozygous and shared between the haplotypes but represented once. The total assembly size is 1.7g and, based on BUSCO, it's missing about 25% of the second haplotype (duplicated 76%). How many bases do you have between the first peak and the second peak? Those are likely contigs that should be twice their length if they had been separated, and I expect they will account for about 300-400mb of sequence. All together, the pseudo-haplotype contigs + the collapsed homozygous contigs + the repeats will come close to the 2.3gb estimate you have, I think.

rotoke commented 4 years ago

Yes, 2.3 gbp is for 2n / the full diploid genome - I'm very sorry if that didn't come out clearly from the beginning. I didn't think the assembly size was completely off either, I was just wondering where the missing 30% got lost.

Yes, the plot shows total bases at a given 5x-wide coverage bin. I count around 170 mbp between the peaks, but this is probably an underestimation as the bins are relatively broad.

It sounds plausible that the missing 30% is a cumulative result of collapsing both homozygous segments and repeats. I think I'll take this assembly further (despite the lower NG50) and see what purge_dups, Illumina polishing etc. will do with it.

Thank you again for all your help!

dcopetti commented 4 years ago

Nice work @rotoke, I feel your pain with such complex genomes. If I can add my two cents here, I am facing a similar problem and I see some commonalities with your case.

I am assembling a 2.5 Gb (haploid size), highly heterozygous plant genome with ~70x (haploid genome) ONT coverage, so roughly 35x per allele. I had to ditch Canu due to the large size of the intermediate files and the long computation time on the resources I have, but Flye worked well (in particular, no haplotype switches within contigs, which is what everyone should aim for) and gave me a 3.6 Gb assembly (out of 5 Gb) with N50 680 kb and N90 130 kb, which I am quite satisfied with. After long- and short-read polishing, the BUSCO score is 95% complete (29% single, 66% duplicated), so also good. To get a more granular view of what is collapsed and what is not, I realigned the raw reads, calculated coverage in 10 kb windows, and plotted it. I get a bimodal curve with a smaller peak at 2x the coverage of the first, spanning about 400 Mb (which I would be very happy to get back). So now I am focusing on those (regions of the) contigs that are around that higher-coverage peak, trying to re-assemble them and get 2x as much sequence. The bad news is that neither Canu (also with the diploid settings), nor NECAT, nor Flye (--keep-haplotypes) can improve the total bp of the contigs I want to re-assemble. I am running out of options for splitting those sequences, but they may simply be too homozygous.

@skoren: do you think that using --haplotypes and maybe some settings more stringent than default may help to get allele-specific corrected reads? How about duplicating (copy-paste) those collapsed (regions of) contigs, adding them to the assembly, and aligning reads in the hope that they will split into two groups? Would the 10x Chromium linked reads be of any help in building 2 different haplotypes on the reference to make it more sticky to the long reads?
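The windowed coverage mentioned above can be sketched in Python. The comment does not say which tools were used, so this assumes per-base depth in the three-column, tab-separated format that `samtools depth -a` writes (contig, 1-based position, depth); the function name is hypothetical.

```python
from collections import defaultdict


def window_mean_depth(depth_lines, window=10_000):
    """Mean depth per fixed-size window.

    depth_lines: iterable of 'contig<TAB>pos<TAB>depth' lines, 1-based pos.
    Returns {(contig, window_index): mean_depth}, where window_index 0
    covers positions 1..window.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in depth_lines:
        contig, pos, depth = line.rstrip("\n").split("\t")[:3]
        key = (contig, (int(pos) - 1) // window)
        sums[key] += int(depth)
        counts[key] += 1
    # Divide by the positions actually seen, so a short final window
    # at the end of a contig is not diluted.
    return {key: sums[key] / counts[key] for key in sums}


# Tiny made-up input with a window of 2 bases for readability.
lines = ["tig1\t1\t10", "tig1\t2\t20", "tig1\t3\t40"]
print(window_mean_depth(lines, window=2))
```

A histogram of the window means, weighted by window size, would then show whether a second peak at about twice the coverage of the first is present.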

skoren commented 4 years ago

--haplotype is for when you have trio information, so it wouldn't help here. You shouldn't expect to be able to assemble just these reads into two bins, since they didn't get separated before. The raw reads are likely too noisy to discriminate haplotype differences from random error. You could potentially use some other information to bin the reads; there are some tools that work with Strand-seq data. You could also try to split them using the 10x reads, which is the strategy used in the Korean reference genome paper: https://www.nature.com/articles/nature20098.