marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

False chromosome fusion caused by assembly mistake #1627

Closed xieyichun50 closed 4 years ago

xieyichun50 commented 4 years ago

I am currently trying to assemble a fungal genome with a Canu-racon-pilon pipeline using Oxford Nanopore reads and Illumina short reads. The genome size is estimated at 37.5 Mb. The Nanopore data set contains 716,455 long reads totalling 8.9 Gb (Canu reports coverage over 230x), with an N50 over 24 kb. The Illumina reads are paired-end 2x150 bp, 1.5 Gb in total. All parameters in Canu/racon/pilon were left at their defaults except genome_size.
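As a quick sanity check on the figures above, mean depth is total sequenced bases divided by genome size. This is a minimal sketch using only the numbers quoted in the post; the function and variable names are illustrative.

```python
# Figures quoted in the post above.
genome_size_bp = 37.5e6   # estimated genome size (37.5 Mb)
nanopore_bases = 8.9e9    # total Nanopore bases (8.9 Gb)
illumina_bases = 1.5e9    # total Illumina bases (1.5 Gb)


def mean_depth(total_bases, genome_size):
    """Expected mean depth = sequenced bases / genome size."""
    return total_bases / genome_size


nanopore_depth = mean_depth(nanopore_bases, genome_size_bp)
illumina_depth = mean_depth(illumina_bases, genome_size_bp)

# Prints: Nanopore: ~237x, Illumina: ~40x
print(f"Nanopore: ~{nanopore_depth:.0f}x, Illumina: ~{illumina_depth:.0f}x")
```

The Nanopore figure agrees with the "over 230x" that Canu reports.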

When I finished assembly and annotation, I compared the genome to a high-quality genome of the same species using MCScanX (which reports the synteny and collinearity between two genomes). The assembly reached chromosome level. [Image: MCScanX synteny plot of the two genomes]

However, I found that my longest scaffold (scaffold_1, lower panel) is actually a fusion of two chromosomes, 20000011 and 20000031 (upper panel), of the other genome of the same species. Previous karyotype studies also give evidence that this part of my assembly is wrong. In the long-read alignment file, a 4 kb gap can clearly be distinguished, as only two reads cross it. Also, only 9 short reads align to the gap region. [Two images: read alignments around the gap]
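The "only two reads cross the gap" check can be sketched as counting alignments that cover the whole suspect interval. The coordinates below are made up for illustration; in practice the (start, end) pairs would come from the long-read alignment file, and the helper name `spanning_reads` is hypothetical.

```python
def spanning_reads(alignments, gap_start, gap_end):
    """Count alignments that cover the whole interval [gap_start, gap_end)."""
    return sum(
        1
        for start, end in alignments
        if start <= gap_start and end >= gap_end
    )


# Hypothetical alignment intervals on the scaffold (0-based, half-open).
alignments = [
    (0, 9_000),
    (2_000, 9_500),
    (1_000, 16_000),    # spans the gap
    (5_000, 20_000),    # spans the gap
    (14_500, 30_000),
    (15_000, 28_000),
]

# Hypothetical 4 kb suspect region.
gap_start, gap_end = 10_000, 14_000
n_spanning = spanning_reads(alignments, gap_start, gap_end)
print(f"{n_spanning} reads span {gap_start}-{gap_end}")
```

A junction supported by only a couple of spanning reads, against a mean depth above 200x, is a strong hint of a misjoin.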

I would like to know which Canu parameter I could change to avoid this chromosome fusion problem.

Thanks in advance!

skoren commented 4 years ago

I'm not sure which version of Canu you're using for assembly; there have been some recent changes to trimming and unitigging which may address this specific error. Since you have very high coverage, you could probably increase trimming stringency (trimReadsCoverage) from the default of 2 to 4 or 5. That said, there are always going to be a few errors in an assembly, so we typically rely on orthogonal data (HiC/coverage, as you've done) to fix those. The easiest solution may be to just break the contig at the coverage drop.
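Breaking a contig at a coverage drop can be sketched as follows. This is an illustrative outline, not part of Canu: it assumes a per-base depth list is already available (for example, computed from the read alignments), and the function names and the `min_depth` threshold are hypothetical.

```python
def find_low_coverage_run(depth, min_depth):
    """Return (start, end) of the longest run with depth < min_depth, or None."""
    best = None
    run_start = None
    # A sentinel value at the end closes a run that reaches the last base.
    for i, d in enumerate(list(depth) + [min_depth]):
        if d < min_depth:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def break_contig(seq, depth, min_depth):
    """Split seq around its longest low-coverage run, dropping that run."""
    run = find_low_coverage_run(depth, min_depth)
    if run is None:
        return [seq]
    start, end = run
    return [piece for piece in (seq[:start], seq[end:]) if piece]


# Toy example: two well-covered flanks joined by a thinly covered junction.
seq = "AAAACCGGGG"
depth = [30, 31, 29, 30, 2, 1, 28, 30, 31, 29]
pieces = break_contig(seq, depth, min_depth=5)
print(pieces)  # ['AAAA', 'GGGG']
```

On real data the threshold would be chosen relative to the mean depth, and the two pieces would be written out as separate FASTA records before polishing.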

xieyichun50 commented 4 years ago

My current assembly was done using Canu v1.8. Maybe I should try Canu v2.0, which I can get from GitHub, and redo the assembly with stricter parameters. Thank you very much!
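A rerun with stricter trimming might be assembled along these lines. This sketch only builds and prints the command, it does not launch Canu. The prefix, output directory and read file name are placeholders, and the read-type flag is an assumption to check against the installed version's help output (Canu 2.0 documents `-nanopore`, while 1.x releases used `-nanopore-raw`).

```python
import shlex


def canu_command(prefix, out_dir, genome_size, reads, trim_reads_coverage):
    """Build a Canu argument list with a non-default trimReadsCoverage."""
    return [
        "canu",
        "-p", prefix,                                 # assembly prefix (placeholder)
        "-d", out_dir,                                # output directory (placeholder)
        f"genomeSize={genome_size}",
        f"trimReadsCoverage={trim_reads_coverage}",   # default is 2 per the reply above
        "-nanopore", reads,                           # assumed flag spelling for Canu 2.0
    ]


cmd = canu_command(
    prefix="fungus",
    out_dir="fungus-canu2",
    genome_size="37.5m",
    reads="nanopore_reads.fastq.gz",
    trim_reads_coverage=4,
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the assembly; printing it first makes it easy to review the options.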

skoren commented 4 years ago

Idle, post if you do run newer assemblies.