Closed wyim-pgl closed 3 years ago
What version of Canu are you running? It's a little strange that absolutely no bubbles are marked here, unless your genome is extremely heterozygous or you have an old Canu version. It's possible some of the extra contigs are false duplications due to under-corrected reads in the assembly. That said, your k-mer distribution shows a peak in the 8-11x range, which would be consistent with a 2.4 Gbp genome size like the assembly generated. How confident are you in the 900 Mbp genome size?
I'd suggest using the latest 2.1 release and checking genome size predictions from GenomeScope or similar. You should also run purge_dups after the assembly, as that will remove the alternate haplotype along with false duplications, if they exist.
Thanks, Sergey. I compiled it from GitHub. The version is shown below.
(base) wyim @ login-0 13:25:20 ~/scratch/data/
canu/Linux-amd64/bin/canu --version
Canu branch hicanu_rc +325 changes (r9818 86bb2e221546c76437887d3a0ff5ab9546f85317)
The 900 Mbp genome size came from its diploid progenitor, so I assumed this genome should be double that size. I also checked it with flow cytometry, which showed ~2 Gbp.
I will try it with the latest 2.1 release and keep you posted. Regards, Won
Now I read this... To install from source code (DO NOT download the Source code files provided by GitHub as these will not compile, use the canu-2.1.tar.gz instead):
As usual, I compiled from the GitHub source.
I will try it again with Canu 2.1. Thanks.
Your version is relatively old at this point, I'd expect 2.1 to be improved. You can either download the source tar.gz or the pre-compiled binaries for your system.
I am rerunning now and will keep you posted. Thanks.
Hi Sergey, I ran it with the current release, and now it makes bubbles. The final size is still bigger than I expected. Do you have any recommendations? Thanks. Won
cat data.report
[UNITIGGING/READS]
--
-- In sequence store './data.seqStore':
-- Found 2248974 reads.
-- Found 27348782473 bases (30.38 times coverage).
--
-- G=27348782473 sum of || length num
-- NG length index lengths || range seqs
-- ----- ------------ --------- ------------ || ------------------- -------
-- 00010 15752 160204 2734882862 || 3006-3739 259|-
-- 00020 14435 342124 5469761347 || 3740-4473 263|-
-- 00030 13514 538181 8204636976 || 4474-5207 448|-
-- 00040 12771 746497 10939524901 || 5208-5941 780|-
-- 00050 12129 966330 13674391641 || 5942-6675 1376|-
-- 00060 11551 1197443 16409273498 || 6676-7409 2255|-
-- 00070 11012 1439970 19144153658 || 7410-8143 3377|-
-- 00080 10495 1694366 21879027176 || 8144-8877 9886|--
-- 00090 9981 1961518 24613907502 || 8878-9611 106893|-------------------
-- 00100 3006 2248973 27348782473 || 9612-10345 351989|--------------------------------------------------------------
-- 001.000x 2248974 27348782473 || 10346-11079 363434|---------------------------------------------------------------
-- || 11080-11813 319291|--------------------------------------------------------
-- || 11814-12547 269986|-----------------------------------------------
-- || 12548-13281 220331|---------------------------------------
-- || 13282-14015 174938|-------------------------------
-- || 14016-14749 133997|------------------------
-- || 14750-15483 99938|------------------
-- || 15484-16217 71597|-------------
-- || 16218-16951 48407|---------
-- || 16952-17685 31106|------
-- || 17686-18419 18709|----
-- || 18420-19153 10322|--
-- || 19154-19887 5150|-
-- || 19888-20621 2335|-
-- || 20622-21355 935|-
-- || 21356-22089 413|-
-- || 22090-22823 183|-
-- || 22824-23557 126|-
-- || 23558-24291 61|-
-- || 24292-25025 40|-
-- || 25026-25759 37|-
-- || 25760-26493 28|-
-- || 26494-27227 21|-
-- || 27228-27961 16|-
-- || 27962-28695 12|-
-- || 28696-29429 7|-
-- || 29430-30163 7|-
-- || 30164-30897 4|-
-- || 30898-31631 5|-
-- || 31632-32365 2|-
-- || 32366-33099 1|-
-- || 33100-33833 3|-
-- || 33834-34567 2|-
-- || 34568-35301 0|
-- || 35302-36035 1|-
-- || 36036-36769 1|-
-- || 36770-37503 0|
-- || 37504-38237 0|
-- || 38238-38971 1|-
-- || 38972-39705 1|-
--
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 6627411 **** 0.0115 0.0007
-- 3- 4 19849202 ************* 0.0242 0.0019
-- 5- 7 73354396 ************************************************ 0.0790 0.0097
-- 8- 11 105504295 ********************************************************************** 0.2246 0.0413
-- 12- 16 79565647 **************************************************** 0.3876 0.0931
-- 17- 22 76648180 ************************************************** 0.5171 0.1526
-- 23- 29 71773495 *********************************************** 0.6450 0.2328
-- 30- 37 59985107 *************************************** 0.7653 0.3324
-- 38- 46 32093039 ********************* 0.8627 0.4348
-- 47- 56 14799307 ********* 0.9126 0.4997
-- 57- 67 9649761 ****** 0.9366 0.5381
-- 68- 79 6550057 **** 0.9526 0.5689
-- 80- 92 4433248 ** 0.9635 0.5937
-- 93- 106 3181902 ** 0.9709 0.6135
-- 107- 121 2379865 * 0.9763 0.6300
-- 122- 137 1807781 * 0.9803 0.6442
-- 138- 154 1399394 0.9834 0.6565
-- 155- 172 1112607 0.9857 0.6672
-- 173- 191 897579 0.9876 0.6768
-- 192- 211 737050 0.9892 0.6854
-- 212- 232 608902 0.9904 0.6933
-- 233- 254 512184 0.9915 0.7004
-- 255- 277 431508 0.9923 0.7070
-- 278- 301 368755 0.9931 0.7131
-- 302- 326 316512 0.9937 0.7187
-- 327- 352 274527 0.9943 0.7240
-- 353- 379 236545 0.9947 0.7289
-- 380- 407 206635 0.9951 0.7335
-- 408- 436 180510 0.9955 0.7378
-- 437- 466 157111 0.9958 0.7419
-- 467- 497 140017 0.9961 0.7456
-- 498- 529 124349 0.9963 0.7492
-- 530- 562 111066 0.9965 0.7526
-- 563- 596 99749 0.9967 0.7558
-- 597- 631 91388 0.9969 0.7589
-- 632- 667 82364 0.9970 0.7619
-- 668- 704 74307 0.9972 0.7647
-- 705- 742 69081 0.9973 0.7674
-- 743- 781 66064 0.9974 0.7701
-- 782- 821 61844 0.9976 0.7728
--
-- 0 (max occurrences)
-- 18771184576 (total mers, non-unique)
-- 577918346 (distinct mers, non-unique)
-- 0 (unique mers)
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 6383 0.28 9639.17 +- 2647.72 1389.84 +- 1334.91 (bad trimming)
-- middle-hump 622 0.03 10079.20 +- 1614.30 2348.29 +- 1385.90 (bad trimming)
-- no-5-prime 16928 0.75 8688.17 +- 1635.00 1981.53 +- 1950.74 (bad trimming)
-- no-3-prime 15445 0.69 8734.54 +- 1658.78 2268.40 +- 2063.64 (bad trimming)
--
-- low-coverage 322204 14.33 8200.21 +- 1406.77 5.65 +- 1.85 (easy to assemble, potential for lower quality consensus)
-- unique 929949 41.35 8350.63 +- 1469.41 20.77 +- 6.56 (easy to assemble, perfect, yay)
-- repeat-cont 9305 0.41 8026.80 +- 1338.05 100.10 +- 37.60 (potential for consensus errors, no impact on assembly)
-- repeat-dove 401 0.02 11204.49 +- 1502.88 96.80 +- 38.03 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 328538 14.61 8659.70 +- 1582.17 2902.59 +- 2485.45 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 268575 11.94 7751.03 +- 1035.53 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 234859 10.44 9342.10 +- 1522.81 (will end contigs, potential to misassemble)
-- uniq-anchor 347 0.02 8862.42 +- 1870.23 2134.26 +- 2285.80 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/ERROR RATES]
--
-- ERROR RATES
-- -----------
-- --------threshold------
-- 2974421 fraction error fraction percent
-- samples (1e-5) error error
-- -------------------------- -------- --------
-- command line (-eg) -> 30.00 0.0300%
-- command line (-eM) -> 1000.00 1.0000%
-- mean + std.dev 0.78 +- 4 * 3.94 -> 16.54 0.0165%
-- median + mad 0.00 +- 4 * 0.00 -> 0.00 0.0000%
-- 90th percentile -> 1.00 0.0010% (enabled)
--
-- BEST EDGE FILTERING
-- -------------------
-- At graph threshold 0.0300%, reads:
-- available to have edges: 1102333
-- with at least one edge: 944316
--
-- At max threshold 1.0000%, reads: (not computed)
-- available to have edges: 0
-- with at least one edge: 0
--
-- At tight threshold 0.0010%, reads with:
-- both edges below threshold: 844539
-- one edge above threshold: 78769
-- both edges above threshold: 21008
-- at least one edge: 944316
--
-- At loose threshold 0.0165%, reads with:
-- both edges below threshold: 889279
-- one edge above threshold: 47820
-- both edges above threshold: 7217
-- at least one edge: 944316
--
--
-- INITIAL EDGES
-- -------- ----------------------------------------
-- 1079489 reads are contained
-- 272183 reads have no best edges (singleton)
-- 19917 reads have only one best edge (spur)
-- 18371 are mutual best
-- 877385 reads have two best edges
-- 30163 have one mutual best edge
-- 844272 have two mutual best edges
--
--
-- FINAL EDGES
-- -------- ----------------------------------------
-- 1079489 reads are contained
-- 276081 reads have no best edges (singleton)
-- 19741 reads have only one best edge (spur)
-- 18653 are mutual best
-- 873663 reads have two best edges
-- 27683 have one mutual best edge
-- 843329 have two mutual best edges
--
--
-- EDGE FILTERING
-- -------- ------------------------------------------
-- 0 reads are ignored
-- 103380 reads have a gap in overlap coverage
-- 1428 reads have lopsided best edges
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 7979 sequences, total length 1476787006 bp (including 1592 repeats of total length 20950723 bp).
-- bubbles: 10985 sequences, total length 259471232 bp.
-- unassembled: 292879 sequences, total length 2476730635 bp.
--
-- Contig sizes based on genome size 900mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 5291328 12 93138315
-- 20 2773657 36 181856844
-- 30 1417871 82 270387738
-- 40 913324 163 360285301
-- 50 674909 281 450404201
-- 60 533504 432 540515149
-- 70 446789 616 630436641
-- 80 372694 837 720116947
-- 90 315763 1099 810284330
-- 100 269516 1407 900039718
-- 110 229086 1770 990041518
-- 120 189625 2202 1080025950
-- 130 152439 2730 1170038631
-- 140 117214 3402 1260096765
-- 150 79994 4320 1350067079
-- 160 33063 5984 1440017427
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 7979 sequences, total length 2134680991 bp (including 1592 repeats of total length 30250112 bp).
-- bubbles: 10985 sequences, total length 374940873 bp.
-- unassembled: 292879 sequences, total length 3578120417 bp.
--
-- Contig sizes based on genome size 900mbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 9290601 7 92668417
-- 20 5939898 20 185599728
-- 30 3795553 39 272304223
-- 40 2278086 69 360815916
-- 50 1619836 117 451400255
-- 60 1226869 181 541047031
-- 70 1013475 263 630851441
-- 80 857363 360 720465313
-- 90 735938 474 810633631
-- 100 659422 603 900514005
-- 110 573461 750 990159678
-- 120 513061 917 1080482960
-- 130 454263 1103 1170448747
-- 140 409305 1311 1260021121
-- 150 365009 1544 1350100316
-- 160 325568 1805 1440133640
-- 170 285867 2100 1530146016
-- 180 249145 2437 1620136827
-- 190 212963 2828 1710079055
-- 200 176670 3292 1800170649
-- 210 142522 3855 1890003798
-- 220 101779 4599 1980007599
-- 230 54179 5772 2070007410
--
Given the histogram peak at 8-11x, that still makes me think 2.4 Gbp is a reasonable size. I'd suggest checking genome size estimates using GenomeScope and seeing what that gives. I'd also run purge_dups and see how large the purged assembly ends up (you may have to manually adjust the purge_dups cutoffs).
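To make the reasoning about the histogram peak concrete, here is a rough Python sketch of the arithmetic. It is not part of Canu or GenomeScope. The read total and the k-mer rows are copied from the report above (only the first eight non-empty bins are included), and the candidate genome sizes are the ones mentioned in this thread. The helper names `peak_bin` and `implied_coverage` are made up for illustration. Note that the report's bins have unequal widths, so picking the bin with the most k-mers is only a coarse indicator of the peak.

```python
# Total bases in the sequence store, from "[UNITIGGING/READS]" in the report.
TOTAL_BASES = 27_348_782_473

# (low, high, num_mers) rows copied from "[UNITIGGING/MERS]" in the report.
# Only the first few bins are listed; this is an illustrative subset.
MER_BINS = [
    (2, 2, 6_627_411),
    (3, 4, 19_849_202),
    (5, 7, 73_354_396),
    (8, 11, 105_504_295),
    (12, 16, 79_565_647),
    (17, 22, 76_648_180),
    (23, 29, 71_773_495),
    (30, 37, 59_985_107),
]

# Genome sizes discussed in the thread (assumed values, in bp).
CANDIDATE_SIZES = {
    "900 Mbp": 900e6,
    "1.8 Gbp": 1.8e9,
    "2.4 Gbp": 2.4e9,
}


def peak_bin(bins):
    """Return the (low, high, num_mers) bin holding the most k-mers."""
    return max(bins, key=lambda b: b[2])


def implied_coverage(total_bases, genome_size_bp):
    """Read coverage you would expect if the genome had the given size."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    low, high, _ = peak_bin(MER_BINS)
    print(f"Largest k-mer bin: {low}-{high}x")
    for label, size in CANDIDATE_SIZES.items():
        cov = implied_coverage(TOTAL_BASES, size)
        print(f"{label}: implied coverage {cov:.1f}x")
```

The candidate size whose implied coverage lands closest to the observed peak is the one the reads support best. With ~12 kbp reads and 22-mers, k-mer depth and base depth are nearly the same, so comparing them directly is a reasonable first approximation; GenomeScope fits a proper model and should be preferred for a real estimate.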
Thanks I will keep you posted.
Hi Sergey, I tried to assemble an allotetraploid genome with a haploid genome size of 900 Mbp. I ran HiCanu with the options below, and the output data.contigs.fasta is 2.4 Gbp. I expected the assembly to be around 1.8 Gbp, but it came out much larger. Do you have any suggestions? Thanks. Won