marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
655 stars 179 forks source link

How to Output Unitigs for Chloroplast Assembly #1826

Closed malonge closed 3 years ago

malonge commented 3 years ago

Hi there,

I see that Canu v2.1.1 does not report unitigs in fasta format like previous versions. Is there a way to generate this file for special cases? In my experience, contigging (using CCS data) is typically not correct for chloroplast assemblies and I have had luck manually merging unitigs instead.

If you are interested, I can provide an example of incorrect chloroplast genome contigging.

Thanks

skoren commented 3 years ago

The unitigs were not very accurate (over-fragmented) and the edges were incomplete which is why they are not generated. There's no way to generate them without going back to an older version.

I'd rather fix the contigs, what do you mean not correct? Can you share examples along with the truth versions?

malonge commented 3 years ago

Hi Sergey,

Thanks for taking a look. I apologize if you already are familiar with this chloroplast genome background.

Chloroplast genomes are relatively conserved across species. They are around 155kbp circular genomes. The main source of difficulty is a large inverted repeat. Here is a diagram showing the unique (red) and duplicated (blue) regions. (ref: Ferrarini, Marco, et al. BMC genomics, 2013)

Screen Shot 2020-10-21 at 2 42 05 PM

I am working with data from a wild Nightshade plant. I baited CCS chloroplast reads and assembled with Canu v2.0. Here is what the unitig graph looks like:

H   VN:Z:1.0
S   tig00000001 *   LN:i:39149
S   tig00000002 *   LN:i:25382
S   tig00000003 *   LN:i:106979
L   tig00000002 -   tig00000001 +   9186M1D40M
L   tig00000002 -   tig00000001 -   103M1D10753M1D40M
L   tig00000001 -   tig00000002 +   43M1I9183M
L   tig00000001 +   tig00000002 +   43M1I10753M1I100M
L   tig00000003 +   tig00000002 -   9620M1I191M
L   tig00000003 -   tig00000002 -   9950M
L   tig00000002 +   tig00000003 +   9950M
L   tig00000002 +   tig00000003 -   195M1D9616M
Screen Shot 2020-10-21 at 2 40 57 PM

Here are the unitigs aligned to the reference genome (nucmer --maxmatch):

Screen Shot 2020-10-21 at 2 40 57 PM

Here is the unitigs.bed file:

ctg00000001  0      39148   utg00000001  0  +
ctg00000001  28251  53634   utg00000002  0  +
ctg00000001  43684  150663  utg00000003  0  +

Here is the Canu contig aligned to the reference genome:

Screen Shot 2020-10-21 at 2 40 57 PM

And here is the reference genome aligned to itself (I rotated the reference genome so that it starts at the same place as the canu assembly):

Screen Shot 2020-10-21 at 2 40 57 PM

Seeing as the contigs looked 5kbp too short and structurally inaccurate, I manually performed contigging. Here is what I came up with in AGP format:

ctg00000001  1       28251   1  W  tig00000001    1     28251   -
ctg00000001  28252   44407   2  W  tig00000002    9227  25382   +
ctg00000001  44408   141436  3  W  tig00000003    9951  106979  +
ctg00000001  141437  157007  4  W  tig00000002.2  1     15571   -

And when I align this contig to the reference, it looks correct:

Screen Shot 2020-10-21 at 2 40 57 PM

My contig is also closer to the correct size and was validated by aligning the CCS reads back to the contig.

To conclude, it appears that the inverted duplication can confuse contigging. I am assembling a separate tomato chloroplast genome now and I am seeing slightly different problems. I am currently in the process of figuring that out.

Thanks again for taking a look

malonge commented 3 years ago

Looking at it again for the first time in a while, I think the specific problem here is that tig2 basically represents a whole repeat unit, but tigs 1 and 3 also extend into the repeats themselves. And in fact, those repeat representations (I assume) are more reliable because they are anchored to the unique regions. So my strategy in the AGP contigging was to keep as much of the repeat units from tigs 1 and 3 and then fill in the remaining repeat portions with complementary fragments of tig2.

skoren commented 3 years ago

These examples are from 2.0 though right, have you checked what the assembled contig looks like in 2.1?

malonge commented 3 years ago

Yes these examples precede v2.1.*, but I reran it just now using v2.1.1. The dotplot shows more structural inaccuracies. Also, there are 3 contigs and the whole assembly is about 15k too long.

Screen Shot 2020-10-21 at 2 42 05 PM

It looks like there may be the same problem where tig4 represents a repeat unit but it is somewhat redundant with what the other tigs have represented.

skoren commented 3 years ago

Those are unfiltered nucmer plots right, because some of the inversions seem like they're redundant with the forward matches. The tigs look quite similar to the unitigs from 2.0 and were likely in a split and would have been in the order of tig3, tig4, tig5. I would guess they were split because tig4 is the repeat and it's not clear how it should be traversed.

Overlaps between tigs are normal and expected, it should also be larger because the chloroplast is circular so it'll wrap around. tig5 looks structurally correct. I'm not sure about tig3, it looks like it has an inversion which you were also correcting in 2.0. How confident are you this cannot be a true inversion?

skoren commented 3 years ago

Canu 2.1 contains seem as accurate as previous unitigs so it should be fine to use contigs instead.