marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
misassembly #343

Closed aswathyseb closed 7 years ago

aswathyseb commented 7 years ago

Hi

I have assembled my genome with canu using PacBio reads. It assembled the genome into one large contig; however, there is a mis-assembly. The last half of the genome is placed at the beginning and the first half toward the end. There is a repeat region near the middle of the genome.

What parameters can I tweak to correct the mis-assembly?

Thank you very much

skoren commented 7 years ago

How were you comparing the assembly to a reference to identify the mis-assembly? If you have a dot plot from nucmer (or similar), uploading it would help clarify the error.

aswathyseb commented 7 years ago

I don't have a dot plot right now. I will try to make one and add it here. What I have done is map the assembly to the reference using bwa. It produced 4 alignments. When I look at the clipping on the alignments, I see that the 1st alignment, mapped to the beginning, has around 34 kb hard-clipped on the left side, while the last alignment, mapped to the end of the genome, has only 19 base pairs hard-clipped on the left.
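For anyone checking clipping the same way, here is a minimal sketch of summing the clipped bases at each end of a SAM CIGAR string. The function name `end_clips` and the example CIGAR strings are hypothetical and are not taken from the poster's alignments. Note that SAM reports the CIGAR in reference orientation, so for a reverse-strand alignment the "left" clip corresponds to the end of the contig rather than its start.

```python
import re

# Matches one CIGAR operation, e.g. "34000H" or "120M".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_clips(cigar):
    """Return (left_clip, right_clip): hard- or soft-clipped bases at each end."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

    # Sum consecutive clip operations from the left end.
    left = 0
    for length, op in ops:
        if op not in "HS":
            break
        left += length

    # Sum consecutive clip operations from the right end.
    right = 0
    for length, op in reversed(ops):
        if op not in "HS":
            break
        right += length

    return left, right


# Hypothetical CIGAR strings, for illustration only.
first_alignment = end_clips("34000H5000M110000H")
last_alignment = end_clips("19H60000M")
```

A large left clip on the alignment that maps to the start of the reference means the contig has that much sequence before the point where the reference begins, which is what suggested the rearrangement here.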

mjpdejong commented 7 years ago

You are not trying to map a circular genome to a linear reference?
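The reason this question matters: a circular genome can be written starting at any position, so a correct assembly may look like the "second half first" pattern when aligned to a linear reference. A toy sketch of the idea, assuming error-free sequences of equal length (real assemblies would need an aligner instead of exact matching; `rotation_offset` is a hypothetical helper):

```python
def rotation_offset(reference, assembly):
    """Return the reference position where `assembly` starts if it is an
    exact circular rotation of `reference`, otherwise None."""
    if len(reference) != len(assembly):
        return None
    # Every rotation of a string appears as a substring of the string doubled.
    index = (reference + reference).find(assembly)
    return index if index != -1 else None


# Toy sequences, for illustration only.
reference = "ACGTTGCA"
rotated = "TGCAACGT"  # same circle, written starting at reference position 4
offset = rotation_offset(reference, rotated)
```

If such an offset exists, the apparent rearrangement is only a difference in start coordinate, not a mis-assembly.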

aswathyseb commented 7 years ago

No; it is a linear viral genome. I tried the parameter corMaxEvidenceErate=0.15, and it gave me a single contig and a single alignment bridging the repeats. I am satisfied with this assembly except that it is slightly under-assembled: the genome is 152 kb and the assembly produced a 149 kb contig. How can I improve this? The other parameters I am using are corMhapSensitivity=normal corOutCoverage=100 minOverlapLength=20

brianwalenz commented 7 years ago

Can you share this data? It sounds like a good test case for us.

I'll guess that it is missing a few kb on each end. A drop in read coverage at the ends of the genome could result in those bases not being corrected. Setting corMhapSensitivity=high corMinCoverage=0 could help: the first setting should find more overlaps, and the second makes correction output all bases, even those with inferior corrections.

Updating to the latest code on GitHub could help too. We changed some low-level details of correction, which seems to result in more bases being corrected.

The contig construction algorithm (bogart) is probably dropping reads with no overlaps at either end. This generally improves assembly since these are usually noise (poorly corrected or poorly trimmed). You can turn this off with batOptions=-nofilter spur (this probably also requires the latest code).
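To make the combined suggestion concrete, here is a minimal sketch that assembles a canu argument list from the options named in this thread. The prefix, output directory, and read file name are hypothetical placeholders, and `-pacbio-raw` is an assumption that matches canu releases from around the time of this issue; check `canu -help` for the input flag your version expects. Merging the original poster's settings with the suggested ones (replacing corMhapSensitivity=normal with high) is also an assumption about how the advice would be applied.

```python
import shlex


def canu_command(prefix, out_dir, genome_size, reads, options):
    """Build a canu argument list; `options` maps parameter names to values."""
    args = [
        "canu",
        "-p", prefix,
        "-d", out_dir,
        f"genomeSize={genome_size}",
        "-pacbio-raw", reads,  # assumption: input flag for older canu releases
    ]
    # Each parameter is passed as a single name=value argument.
    args += [f"{name}={value}" for name, value in options.items()]
    return args


# Option values are taken from the comments above.
options = {
    "corMaxEvidenceErate": "0.15",
    "corOutCoverage": "100",
    "minOverlapLength": "20",
    "corMhapSensitivity": "high",    # suggested in place of "normal"
    "corMinCoverage": "0",           # output all bases from correction
    "batOptions": "-nofilter spur",  # keep reads lacking overlaps at one end
}

# Hypothetical prefix, directory, and file names.
cmd = canu_command("virus", "virus-asm", "152k", "reads.fastq", options)
print(shlex.join(cmd))
```

Keeping each `name=value` pair as one list element means the space inside the batOptions value is quoted correctly when the list is printed for a shell or passed to `subprocess.run`.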

ghost commented 3 years ago

Hi,

I have the same problem using Canu 2.0. I have drawn the dot plot.

As you can see in the plot, there are at least 3 mis-assemblies in the contigs I highlighted. FYI, the x-axis shows the gene order on the reference genome and the y-axis shows the gene order on the contigs from Canu 2.0. All the contigs have been polished to improve the gene mapping identity.

Any suggestions for tracking down this problem? Many thanks!

[Image: Screen Shot 2020-12-26 at 10 32 15 PM]

Best, Meiyuan

ghost commented 3 years ago

Sorry, I forgot to say that the dashed lines separate different chromosomes. The corresponding chromosomes are labeled along the top.

Meiyuan