large number of bubbles

guangtugao commented 2 years ago

Hello,

I ran HiCanu (Canu 2.2) on a cluster (USDA ARS Ceres) with 67x HiFi reads for a 2.4 Gb genome assembly. The command was:

canu -p bct -d canu_default genomeSize=2.4g \
  batMemory=128 corMemory=64 cnsMemory=128 \
  -pacbio-hifi ../haps/haplotype/haplotype-BCT.fasta.gz

You can see that the input reads are the output from trio-binning for one haplotype.

At the end I got the following results:

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      2002 sequences, total length 2424310854 bp (including 2261 repeats of total length 62308581 bp).
--   bubbles:      13037 sequences, total length 355816681 bp.
--   unassembled:  900425 sequences, total length 12321358889 bp.
--
-- Contig sizes based on genome size 2.4 gbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10    18988145            10   245381771
--     20    13053145            26   489591026
--     30    10147561            47   728819180
--     40     7718659            73   962639093
--     50     5638902           109  1200292433
--     60     4021147           160  1441515365
--     70     2572114           236  1682063636
--     80     1705226           350  1920414653
--     90      848495           547  2160534934
--    100       78583          1257  2400069750

The number of contigs and their size make sense to me, but what I don't understand is why I got so many bubbles?

Thanks, Guangtu Gao

skoren commented 2 years ago

Your bubbles aren't that large relative to the assembly: about 15% of the total contig length (roughly 356 Mbp of bubbles against 2.42 Gbp of contigs). Trio binning of the reads isn't perfect, so there are likely to be some mis-binned reads, which can cause bubbles. There may also be recurrent errors in the HiFi reads, which would lead to bubbles.
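For anyone who wants to check that figure against their own run, here is a minimal sketch that pulls the totals out of the summary lines and computes the ratio. The regular expression and the `parse_summary` helper are assumptions written for this example, based only on the three report lines quoted above; they are not part of Canu.

```python
import re

# Summary lines copied from the report quoted earlier in this thread.
CANU_SUMMARY = """\
--   contigs:      2002 sequences, total length 2424310854 bp (including 2261 repeats of total length 62308581 bp).
--   bubbles:      13037 sequences, total length 355816681 bp.
--   unassembled:  900425 sequences, total length 12321358889 bp.
"""

# Assumed line shape: "--   <category>:  <N> sequences, total length <BP> bp"
LINE_RE = re.compile(r"--\s+(\w+):\s+(\d+) sequences, total length (\d+) bp")


def parse_summary(text):
    """Map each category to (number of sequences, total length in bp)."""
    totals = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            totals[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return totals


totals = parse_summary(CANU_SUMMARY)
bubble_fraction = totals["bubbles"][1] / totals["contigs"][1]
print(f"bubble length / contig length = {bubble_fraction:.1%}")
```

With the numbers from this run the ratio comes out at roughly 14.7%, which is where the "about 15%" comes from.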

The assembly with HiFi data is very stringent when considering overlaps: they have to be perfect to be used. That means any differences will show up as bubbles. In general, the assembly with HiFi reads is more stringent than trio binning. In cases of very diverse genomes, I actually prefer to assemble the full dataset and then use the trio information afterwards to split a contig if needed. Most contigs will be fully phased with HiFi data, assuming there are enough variants (e.g. more frequent than a HiFi read length).
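To make the "more frequent than a HiFi read length" condition concrete, here is a rough sketch that flags stretches where neighbouring heterozygous variants are farther apart than one read, so no single read could link them. The function name, the variant positions, and the 15 kbp read length are all hypothetical values chosen for illustration. This is a simplification of the idea, not how Canu decides phasing.

```python
def unlinked_gaps(variant_positions, read_length):
    """Return (left, right) pairs of adjacent heterozygous variants
    that lie farther apart than a single read can span."""
    positions = sorted(variant_positions)
    return [
        (left, right)
        for left, right in zip(positions, positions[1:])
        if right - left > read_length
    ]


# Hypothetical variant coordinates on one contig, and an assumed read length.
variants = [1_000, 9_000, 40_000, 52_000]
gaps = unlinked_gaps(variants, read_length=15_000)
print(gaps)
```

In this toy example only the 9,000 to 40,000 stretch exceeds the read length, so that is the one place where phase could be lost. In a diverse genome such gaps should be rare.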

guangtugao commented 2 years ago

Thank you, Sergey! Actually, this is a very diverse genome. We made a hybrid F1 individual from parents of two subspecies. With the approach you suggested, do you mean I first run canu using the sequences from both haplotypes, and then run canu -haplotype to separate the contigs? I think this is a smart approach. Thanks a lot! Guangtu Gao

skoren commented 2 years ago

I mean run Canu with all reads and then run merqury (https://github.com/marbl/merqury) to identify haplotype blocks and split contigs if needed.

guangtugao commented 2 years ago

Thanks, Sergey!

marbl / canu

large number of bubbles #2066