Closed malonge closed 3 years ago
The unitigs were not very accurate (over-fragmented) and the edges were incomplete which is why they are not generated. There's no way to generate them without going back to an older version.
I'd rather fix the contigs, what do you mean not correct? Can you share examples along with the truth versions?
Hi Sergey,
Thanks for taking a look. I apologize if you already are familiar with this chloroplast genome background.
Chloroplast genomes are relatively conserved across species. They are around 155kbp circular genomes. The main source of difficulty is a large inverted repeat. Here is a diagram showing the unique (red) and duplicated (blue) regions. (ref: Ferrarini, Marco, et al. BMC genomics, 2013)
I am working with data from a wild Nightshade plant. I baited CCS chloroplast reads and assembled with Canu v2.0. Here is what the unitig graph looks like:
H VN:Z:1.0
S tig00000001 * LN:i:39149
S tig00000002 * LN:i:25382
S tig00000003 * LN:i:106979
L tig00000002 - tig00000001 + 9186M1D40M
L tig00000002 - tig00000001 - 103M1D10753M1D40M
L tig00000001 - tig00000002 + 43M1I9183M
L tig00000001 + tig00000002 + 43M1I10753M1I100M
L tig00000003 + tig00000002 - 9620M1I191M
L tig00000003 - tig00000002 - 9950M
L tig00000002 + tig00000003 + 9950M
L tig00000002 + tig00000003 - 195M1D9616M
Here are the unitigs aligned to the reference genome (nucmer --maxmatch):
Here is the unitigs.bed file:
ctg00000001 0 39148 utg00000001 0 +
ctg00000001 28251 53634 utg00000002 0 +
ctg00000001 43684 150663 utg00000003 0 +
Here is the Canu contig aligned to the reference genome:
And here is the reference genome aligned to itself (I rotated the reference genome so that it starts at the same place as the canu assembly):
Seeing as the contigs looked 5kbp too short and structurally inaccurate, I manually performed contigging. Here is what I came up with in AGP format:
ctg00000001 1 28251 1 W tig00000001 1 28251 -
ctg00000001 28252 44407 2 W tig00000002 9227 25382 +
ctg00000001 44408 141436 3 W tig00000003 9951 106979 +
ctg00000001 141437 157007 4 W tig00000002.2 1 15571 -
And when I align this contig to the reference, it looks correct:
My contig is also closer to the correct size and was validated by aligning the CCS reads back to the contig.
To conclude, it appears that the inverted duplication can confuse contigging. I am assembling a separate tomato chloroplast genome now and I am seeing slightly different problems. I am currently in the process of figuring that out.
Thanks again for taking a look
Looking at it again for the first time in a while, I think the specific problem here is that tig2 basically represents a whole repeat unit, but tigs 1 and 3 also extend into the repeats themselves. And in fact, those repeat representations (I assume) are more reliable because they are anchored to the unique regions. So my strategy in the AGP contigging was to keep as much of the repeat units from tigs 1 and 3 and then fill in the remaining repeat portions with complementary fragments of tig2.
These examples are from 2.0 though right, have you checked what the assembled contig looks like in 2.1?
Yes these examples precede v2.1.*, but I reran it just now using v2.1.1. The dotplot shows more structural inaccuracies. Also, there are 3 contigs and the whole assembly is about 15k too long.
It looks like there may be the same problem where tig4 represents a repeat unit but it is somewhat redundant with what the other tigs have represented.
Those are unfiltered nucmer plots right, because some of the inversions seem like they're redundant with the forward matches. The tigs look quite similar to the unitigs from 2.0 and were likely in a split and would have been in the order of tig3, tig4, tig5. I would guess they were split because tig4 is the repeat and it's not clear how it should be traversed.
Overlaps between tigs are normal and expected, it should also be larger because the chloroplast is circular so it'll wrap around. tig5 looks structurally correct. I'm not sure about tig3, it looks like it has an inversion which you were also correcting in 2.0. How confident are you this cannot be a true inversion?
Canu 2.1 contains seem as accurate as previous unitigs so it should be fine to use contigs instead.
Hi there,
I see that Canu v2.1.1 does not report unitigs in fasta format like previous versions. Is there a way to generate this file for special cases? In my experience, contigging (using CCS data) is typically not correct for chloroplast assemblies and I have had luck manually merging unitigs instead.
If you are interested, I can provide an example of incorrect chloroplast genome contigging.
Thanks