marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu genome assembly size double than expected size #1820

Closed by farhan-phd 3 years ago

farhan-phd commented 3 years ago

Dear Canu developer,

I just finished my first draft genome assembly with Canu, but the assembled genome is almost double the size (1.7 Gb, as evaluated by QUAST) expected from the GenomeScope estimate (0.9 Gb). Could you please look at my command and parameters and tell me what the main problem might be? Details are below:

Assembly command and parameters:

canu-2.1/bin/canu -p Asta-latifasciata -d canu-01-assembly genomeSize=1g -pacbio-raw /00-raw-data_pacbio/ala-1b-Ge-Mus-M_PacBio.fastq correctedErrorRate=0.035 utgOvlErrorRate=0.065 trimReadsCoverage=2 trimReadsOverlap=500 > std.error 2> std.out &

The assembly report (corrected and trimmed reads) and the GenomeScope output are attached below.

Asta.report.txt GenomeScope_statistic.docx

Note: This assembly is of a cichlid fish with an extra (B) chromosome, whose sequences are mainly duplicated from autosomes. Could this extra chromosome have caused the genome size issue?

Many thanks in advance,
Best regards,
Farhan

skoren commented 3 years ago

It's likely you're getting haplotypes separated in your assembly. I'd suggest using purge_dups as listed on the FAQ to remove the redundancy in the assembly. You can then use tools like KAT or Merqury to check if the assembly is indeed capturing both haplotypes and/or if purge_dups is working correctly.
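As a quick sanity check on the numbers in this thread, the sketch below compares the assembled size against the expected haploid size. A ratio close to 2 is consistent with both haplotypes being assembled separately, though it does not prove it; KAT or Merqury remain the proper checks. The function names and the 1.5 to 2.2 window are illustrative assumptions, not part of Canu, QUAST, or purge_dups.

```python
import io


def total_assembly_length(fasta_handle):
    """Sum the lengths of all sequences in a FASTA stream."""
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Header lines start with '>'; everything else is sequence.
        if line and not line.startswith(">"):
            total += len(line)
    return total


def size_ratio(assembled_bp, expected_bp):
    """Ratio of assembled size to the expected haploid genome size."""
    return assembled_bp / expected_bp


def looks_like_separated_haplotypes(ratio, low=1.5, high=2.2):
    """Heuristic flag; the window is an assumed, illustrative threshold."""
    return low <= ratio <= high


# Values reported in the issue: 1.7 Gb assembled (QUAST),
# 0.9 Gb expected (GenomeScope).
ratio = size_ratio(1.7e9, 0.9e9)
print(f"assembled/expected = {ratio:.2f}")
print("possible haplotype duplication:", looks_like_separated_haplotypes(ratio))
```

`total_assembly_length` can be pointed at an open contig FASTA file to get the assembled size directly. After running purge_dups, repeating the check on the purged assembly should give a ratio much closer to 1 if haplotype duplication was the cause.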