ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

heteroplasmic assembly not at heteroplasmic position #131

Open rspfau opened 4 years ago

rspfau commented 4 years ago

In the attached results, there are several assemblies in the Heteroplasmy_assemblies_project.fasta file that do not contain the heteroplasmic position. The VCF file indicates that there is a SNP, but the assembly does not align to that position. One example is position 704.

I've included a fasta file containing the 704 assembly aligned to the assembled genome, showing that the 704 heteroplasmy assembly doesn't start until position 716. This alignment also includes the 804 assembly, which does align to position 804.
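The check described above (does an aligned assembly actually span the reported position?) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `reference_span` and `covers_position` are hypothetical, the inputs are two rows of a pairwise or multiple alignment with `-` as the gap character, and nothing here reflects how NOVOPlasty itself works.

```python
def reference_span(aligned_ref, aligned_query, gap="-"):
    """Return the 1-based (first, last) reference positions covered by the query.

    Both arguments are rows of the same alignment, so they have equal length.
    Columns where the reference has a gap do not advance the reference position.
    """
    ref_pos = 0
    first = last = None
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if ref_char != gap:
            ref_pos += 1
            if query_char != gap:
                if first is None:
                    first = ref_pos
                last = ref_pos
    return None if first is None else (first, last)


def covers_position(aligned_ref, aligned_query, position):
    """True if the query spans the given 1-based reference position."""
    span = reference_span(aligned_ref, aligned_query)
    return span is not None and span[0] <= position <= span[1]


# Toy alignment: the query only starts at reference position 5.
ref = "ACGTACGTAC"
query = "----ACGT--"
print(reference_span(ref, query))      # (5, 8)
print(covers_position(ref, query, 3))  # False
print(covers_position(ref, query, 6))  # True
```

With the real files, the two rows would come from the aligned fasta, and the position would be the one listed in the VCF (704 in this case).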

What does this mean? TK151508_Heteroplasmy.zip

ndierckx commented 4 years ago

I would be happy to check this, but could you run it again with the extended log set to 1? I need that to see what's wrong and to fix the code if necessary.

ndierckx commented 4 years ago

I also saw that your coverage is probably too low to check for heteroplasmy around 1%; it may be better to increase the threshold to 2% or 3%.
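To make the coverage point concrete, here is a rough back-of-the-envelope sketch. It assumes that a variant needs some minimum number of supporting reads to be called; the value of 10 used below is an arbitrary illustration, not NOVOPlasty's actual criterion, and the function name is hypothetical. Sequencing errors push the real requirement higher.

```python
import math


def min_depth_for(maf_percent, min_supporting_reads=10):
    """Smallest depth at which the expected number of variant reads
    reaches min_supporting_reads, for a minor allele frequency in percent.

    min_supporting_reads is an assumed threshold for illustration only.
    """
    if maf_percent <= 0:
        raise ValueError("maf_percent must be positive")
    return math.ceil(min_supporting_reads * 100 / maf_percent)


for maf in (1, 2, 3):
    print(f"{maf}% heteroplasmy: depth of at least {min_depth_for(maf)}")
# 1% heteroplasmy: depth of at least 1000
# 2% heteroplasmy: depth of at least 500
# 3% heteroplasmy: depth of at least 334
```

Raising the threshold from 1% to 2% or 3% therefore cuts the depth needed for the same number of supporting reads to roughly a half or a third.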

rspfau commented 4 years ago

All files are attached. Thanks! TK151508.zip

ndierckx commented 4 years ago

Hi,

Thanks, but the extended log option was still set to 0. You should set it to 1, otherwise I can't see what is wrong.

ndierckx commented 4 years ago

Oh no, sorry, I just saw that I forgot to download your new zip file and opened the old one. Never mind :)

ndierckx commented 4 years ago

Those fastq files, is that all the data? And is it capture data then, because it can't be WGS? And why are the read lengths so different, is it quality trimmed?

rspfau commented 4 years ago

That is all the data we have--for now. The libraries were generated from multiple displacement amplification products, because we are having great difficulties due to NUMT contamination. https://www.qiagen.com/us/products/discovery-and-translational-research/pcr-qpcr/preamplification/repli-g/repli-g-mitochondrial-dna-kit/#orderinginformation

There were some problems at the sequencing lab and we are having to wait until the lab reopens following the COVID-19 shut down. For some samples, the reverse reads terminated prematurely during the Illumina sequencing run. Also, our MiSeq Reporter software is malfunctioning so we had Illumina do the post-processing. Unfortunately, they also did quality trimming--which we did not ask for. We hope to eventually resolve both of these problems. Are they causing the issues that I am having?

ndierckx commented 4 years ago

No, the problem is not related to that. The assemblies were correct, but NOVOPlasty sometimes cut a small part of the sequence in the output. I will fix this bug. Most of the short assemblies failed, though, probably because they originate from sequencing errors.

The big problem is the type of data you have. Illumina 300 bp reads have very bad quality; I don't know why, but every 300 bp dataset I have seen is horrible, so I would recommend sticking to 250 bp max.

This increased error rate makes it very difficult to detect heteroplasmy because the data doesn't look like other Illumina data. The heavy trimming is also not good for NOVOPlasty.

Don't you have any other data of this species, like a large WGS dataset?

rspfau commented 4 years ago

Thanks! Heteroplasmy wasn't our main objective of the project--we just wanted complete mitogenome sequences free of NUMTs. NOVOplasty helped us accomplish that objective quite well--except for this last Illumina run, which the lab will be doing over. If I had funding, I would like to compare levels of heteroplasmy across 12 related species--but the cost of WGS for 12 species is above my current funding. I am suspecting that some of these species have a very high mitochondrial mutation rate, so I am curious to see if that results in an elevated heteroplasmy rate. Perhaps funding will be available in the future for WGS.

Does NOVOplasty omit all reads less than the stated read length (301 in this case)? Or does it attempt to align the shorter reads? How does it cause a problem for NOVOplasty?

ndierckx commented 4 years ago

But what is the problem with the last run? If the de novo assembly was successful, there was no interference from NUMTs (I saw you have no ambiguous nucleotides in the reference you used).

And it doesn't have to be WGS. The other method will probably exclude more NUMTs, but for the NUMTs that remain, it will be harder to prove they are NUMTs. Just don't use 300 bp libraries, that's the most important thing: 250 bp max.

And reads below 300 bp will still be included, but I saw some are half the length, which could cause some problems in the code. The normal assembly seems to have been OK though, so it's not a big problem, I guess. It's the reduced quality that makes heteroplasmy detection unreliable.
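For anyone wanting to see how heavily a dataset was trimmed before running the assembler, a small Python sketch like the one below summarises the read lengths in a FASTQ file. It assumes plain four-line FASTQ records (optionally gzipped); the function names and the "shorter than half the nominal length" cut-off are illustrative choices, not anything NOVOPlasty uses internally.

```python
import gzip


def lengths_from_lines(lines):
    """Yield read lengths from an iterable of FASTQ lines.

    Assumes strict four-line records, so the sequence is the
    second line of every group of four.
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield len(line.rstrip("\r\n"))


def read_lengths(path):
    """Yield read lengths from a FASTQ file, plain or gzipped."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from lengths_from_lines(handle)


def summarize(lengths, nominal_length):
    """Report read count, min/max length, and the fraction of reads
    shorter than half the nominal read length."""
    lengths = list(lengths)
    if not lengths:
        return {"reads": 0, "min": None, "max": None, "fraction_below_half": None}
    short = sum(1 for length in lengths if length < nominal_length / 2)
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "fraction_below_half": short / len(lengths),
    }
```

Usage would look like `summarize(read_lengths("reads_1.fastq.gz"), 301)`, where the file name is a placeholder and 301 is the read length stated in this thread.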

rspfau commented 4 years ago

With this last Illumina run, we only successfully assembled mitogenomes from about half of our samples. I don't yet understand why, though, as some of the datasets are large. If you're interested in looking at these datasets to see why NOVOplasty failed to assemble mitogenomes, I can send the fastq files.

Based on some PCR/Sanger sequencing results, it's possible that the entire mitogenome is in the nucleus (as it is in the Panthera genus of large cats). In that case, there would be only two positions where NUMT sequences transition to non-NUMT sequences in the genome.

ndierckx commented 4 years ago

For those failed cases, what do the log files look like?

rspfau commented 4 years ago

Now that I understand things better, I suspect that these assemblies failed because the multiple displacement amplification enrichment failed. Perhaps there were not enough mitochondrial reads. Log files and QC report attached. failed mitogenome assemblies.zip

ndierckx commented 4 years ago

It seems all except TK151533 have no mitochondrial reads in the dataset. TK151533 didn't really fail though, the complete genome should be in the output