ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

Mitochondrial contig terminating early, failing to circularize? #50

Closed cgmayers closed 1 year ago

cgmayers commented 6 years ago

I am assembling a fungal mitochondrial contig in order to then test for heterogeneity. I already have a fully circularized mitochondrial genome from another assembler that is ~132,000 bp, but I want to produce the de novo assembly in NOVOPlasty as suggested in the readme. However, when NOVOPlasty attempts to assemble the genome from the same reads, it always produces a (non-circularized) contig of ~118,000 bp, which is too small. I noticed there are some ambiguous bases near the very end of the contig, so I wonder if this is the problem. If not, what could it be? I attached the extended log here: log_extended_C4128Hetero.txt

And the resulting contigs here:

Contigs_1_C4128Hetero.fasta.txt

Thank you!

ndierckx commented 6 years ago

Hi,

NOVOPlasty has some problems with the 300 bp reads because they have a very high error rate compared to the 250 bp or shorter reads. I hope Illumina will improve it soon, because for the moment it's better to have 250 bp reads with higher accuracy. Maybe I will make a new setting for it.

Which assembler did you use to get the complete sequence? I will take a look to see if it can be circularized with NOVOPlasty, but you could already check heteroplasmy in the 118,000 bp contig. I would suggest cutting away the ends with ambiguous nucleotides. Also check for tandem repeats; those can disrupt heteroplasmy detection as well. But I am not sure the results will be that good for the 300 bp reads, because the high error rate can give false positives and I haven't tested any dataset like that yet.
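
For the suggestion to cut away the contig ends with ambiguous nucleotides, a minimal Python sketch could look like the following. It is not part of NOVOPlasty. The function name `trim_ambiguous_ends` and the `clean_run` parameter are hypothetical, and it assumes that any character other than A, C, G or T counts as ambiguous. It trims each end back to the nearest run of `clean_run` consecutive unambiguous bases and leaves ambiguous bases in the interior untouched.

```python
import re


def trim_ambiguous_ends(seq: str, clean_run: int = 20) -> str:
    """Trim both contig ends back to a run of unambiguous bases.

    Hypothetical helper: the contig is cut so that it starts at the first
    run of at least `clean_run` consecutive A/C/G/T bases and stops at the
    end of the last such run. Anything else (N, IUPAC codes, other symbols)
    is treated as ambiguous. Returns an empty string if no such run exists.
    """
    seq = seq.upper()
    # Each match is a maximal run of at least `clean_run` unambiguous bases.
    pattern = re.compile(r"[ACGT]{%d,}" % clean_run)
    runs = list(pattern.finditer(seq))
    if not runs:
        return ""
    return seq[runs[0].start():runs[-1].end()]


if __name__ == "__main__":
    # Toy sequence, not real data: ambiguous bases at both ends.
    toy = "NACGTACGTANRACGTAY"
    print(trim_ambiguous_ends(toy, clean_run=5))
```

A larger `clean_run` trims more aggressively; the right value depends on how far the ambiguous region extends into the contig, so it is worth checking the trimmed ends by eye before running the heteroplasmy step.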

cgmayers commented 6 years ago

Interesting. I used the proprietary Geneious assembler to circularize the mitochondrial genome, with 1400+X mitochondrial coverage.

Could I just use the circularized contig created in Geneious, or could there be some issue there? I suspect that somewhere between 1 in 40 and 20 in 40 mitochondria per cell are heterogeneous in the absence or presence of one or more specific Group II introns, which are quite large (at least a couple thousand bp) and so I hoped they would be easily detected. Let me know if this is not the case!
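
As a rough back-of-the-envelope check on whether such variants should be visible, the expected read depth supporting the minor form can be estimated from the figures above (1400X coverage, between 1 in 40 and 20 in 40 mitochondria). This is only a sketch under simplifying assumptions: coverage is taken as uniform, every mitochondrial copy is assumed to contribute reads equally, and sequencing errors and mapping bias are ignored. The function name `expected_variant_depth` is hypothetical.

```python
def expected_variant_depth(coverage: float, variant_copies: int, total_copies: int) -> float:
    """Expected depth supporting a variant present in a fraction of mitochondria.

    Simplifying assumptions: uniform coverage and equal read contribution
    from every mitochondrial copy; errors and mapping bias are ignored.
    """
    if total_copies <= 0 or not 0 <= variant_copies <= total_copies:
        raise ValueError("variant_copies must be between 0 and total_copies")
    return coverage * variant_copies / total_copies


if __name__ == "__main__":
    coverage = 1400  # lower bound of the "1400+X" coverage mentioned above
    for copies in (1, 20):
        depth = expected_variant_depth(coverage, copies, 40)
        print(f"{copies} in 40 mitochondria: about {depth:.0f}X supporting the variant")
```

Under these assumptions the low end (1 in 40) corresponds to roughly 35X at the intron junctions, which is small next to the total depth, so the high error rate of 300 bp reads could matter most in that range.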

If you are interested in 300bp paired end datasets like this for testing, just let me know. I have around 30 genomes like this with similar coverage and mitochondrial genome sizes.

ndierckx commented 6 years ago

So you are expecting very large introns? It will be possible to assemble those, but I didn't know this was possible in mitochondrial heteroplasmy, so the code may need some adjustments. It seems interesting to test, so you could send a dataset and I can try it out.

ndierckx commented 6 years ago

Hi,

Any success with detecting those introns? I think the online version is not ready to handle this problem, but I think the next version will be, so if you want I can test one of your datasets. It would be interesting to adjust the code for these large introns.

cgmayers commented 6 years ago

Great! I will send you an email with a link to download the dataset to test.