marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

bubble contigs #2278

wez078 closed 7 months ago

wez078 commented 7 months ago

Attachments: Genome_Survey, busco_figure, busco_figure, report_Cabu.pdf

Dear all, I surveyed the genome and the heterozygosity is close to zero. After assembling with HiCanu, I still get some bubble contigs. I wonder whether they are alternative haplotypes or sequencing errors. I used PacBio HiFi reads. The Canu report is below.

Thanks, Wentao

[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
--   contigs:      188 sequences, total length 68332715 bp (including 120 repeats of total length 2348159 bp).
--   bubbles:      145 sequences, total length 2808364 bp.
--   unassembled:  37102 sequences, total length 227250770 bp.
--
-- Contig sizes based on genome size 80mbp:
--
--            NG (bp)  LG (contigs)    sum (bp)
--         ----------  ------------  ----------
--     10     3472630             3    10534786
--     20     2182914             6    17418427
--     30     1881149            10    25335064
--     40     1361830            15    32768007
--     50     1107699            22    41002401
--     60      885302            30    48606077
--     70      487575            42    56319638
--     80      166589            66    64127837
--
skoren commented 7 months ago

The overlap threshold in HiFi mode is very high, so even single-base differences can prevent reads from overlapping. Thus, the bubbles likely contain both alternate haplotypes and noise/systematic errors in the reads. I would suggest setting some filters based on length (50 kb or so) and number of reads (reported in the FASTA header line; 5-10 would make sense, I think) to see how many bubbles remain.
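A minimal sketch of the kind of filter suggested above, in case it helps. It assumes the Canu FASTA headers carry space-separated `key=value` fields such as `len=`, `reads=` and `suggestBubble=yes`; please check your own `contigs.fasta` headers, since the field names are an assumption here. It also assumes that bubbles below either cutoff are the ones to discard. The function names and the default cutoffs (50 kb, 5 reads) are illustrative only.

```python
from typing import Dict, Iterable, List, Tuple


def parse_header(header: str) -> Tuple[str, Dict[str, str]]:
    """Split a FASTA header (without '>') into a name and key=value attributes."""
    fields = header.split()
    name = fields[0]
    attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
    return name, attrs


def fasta_headers(lines: Iterable[str]) -> List[str]:
    """Collect header lines from FASTA text, dropping the leading '>'."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def filter_bubbles(headers: Iterable[str],
                   min_len: int = 50_000,
                   min_reads: int = 5) -> List[str]:
    """Return names of bubble contigs that pass both cutoffs.

    Assumption: bubbles are flagged with suggestBubble=yes, and length and
    read count are given as len=<int> and reads=<int> in the header.
    """
    kept = []
    for header in headers:
        name, attrs = parse_header(header)
        if attrs.get("suggestBubble") != "yes":
            continue  # not a bubble, ignore
        length = int(attrs.get("len", 0))
        reads = int(attrs.get("reads", 0))
        if length >= min_len and reads >= min_reads:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Made-up headers for illustration, not taken from a real assembly.
    example = [
        ">tig00000001 len=3472630 reads=900 class=contig suggestBubble=no",
        ">tig00000002 len=61000 reads=8 class=contig suggestBubble=yes",
        ">tig00000003 len=12000 reads=2 class=contig suggestBubble=yes",
    ]
    print(filter_bubbles(fasta_headers(example)))
```

To run it on a real assembly, pass the lines of the FASTA file to `fasta_headers` (for example `fasta_headers(open(path))`, where `path` is your contigs file) and compare the number of names returned with the bubble count in the report.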

wez078 commented 7 months ago

Dear skoren, thanks for your comments. I will give it a try and see. Cheers, Wentao

skoren commented 7 months ago

Idle