ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller
Other
174 stars 63 forks source link

mtDNA assembly starts at random position #168

Open dana-obk opened 3 years ago

dana-obk commented 3 years ago

Hello,

I am working with human mtDNA. When I assemble it the circularized version starts from the random positions (like 6789, 7345 and so on splitting the mtDNA in the middle) with respect to the cambridge reference chrM. So the order of the assembled mtDNA is different. I plan to do heteroplasmy variant calling further on several samples as a time and compare the results. Is there a way to set the start position of the assembly as the starting position of the reference mtDNA?

ndierckx commented 3 years ago

Hi,

For assembly it is possible with a little chance in the code, I can send an adjusted version to you. Normally it assembles in both directions until both sides meet, so I would have to switch one side off.. If you don't have many samples, you can also just cut paste the sequence..

If you just worry for the heteroplasmy positions it is not a problem because you can use the cambridge reference as a reference genome for the heteropalsmy detection, or you using a different tool for heteroplasmy detection?

dana-obk commented 3 years ago

Hello,

Thank you very much your your reply. It would be great if you could send us the adjusted version, so we could test it. I actually run the heteroplasmy call both with cut and paste sequence and also using the cambridge version as a reference. Both analyses failed for some reason. It seems to be blocked at a certain nucleotide position. While for the output sequence that I get from assembly it worked fine, but I have problems with the nt position as I mentioned above. It is possible to solve this problem? I attached the config file I am using. Thanks again. config_step2.txt

ndierckx commented 3 years ago

I see you have read lengths of 300 bp, I am not sure if Illumina has fixed it or not, but they have much higher error rates than 100-250 bp reads. I would advice for the future to stick with 2x250bp, unless the problem was resolved by Illumina If the error rate is still that high, it will be impossible to detect heteroplasmy with a MAF of 0.006, which is already on the limit for good illumina data. You can do a run wit extended log set to 1 and send me that file, I can check what the problem is..

dana-obk commented 3 years ago

Hello Nicolas,

I attached a log file. Looking forward to your reply. Thanks

log_extended_MT_1349.txt

ndierckx commented 3 years ago

Hi,

I will have a look, is it WGS data, or targeted, or something else..?

dana-obk commented 3 years ago

it is only mtDNA data

ndierckx commented 3 years ago

But what kind of protocol did you used to sequence? Did a normal assembly work with this dataset?

dana-obk commented 3 years ago

the protocol we used is 300 paired end miseq, libraries were prepared using nextera xt illumina. We assembled using NOVOPlasty, then we cut and paste the mtDNA sequence so that it starts from nt 1 like the cambridge version. Actually if we use the cambridge version we have the same problem.

ndierckx commented 3 years ago

Sorry, wanted to know how you extracted your mitochondrial DNA. I need to know if the complete mitochondrial genome is covered or if there are gaps in the genome (which is usually the case for targeted approaches). But you did try a normal assembly (without heteroplasmy detection) on the sample you send me, and if so was the assembly complete?

dana-obk commented 3 years ago

No problem, so our libraries start from PCR-amplified whole mtDNA (two long PCR amplicons) and the sequences have good coverage (range 400-10000X) with no gaps. With NOVOPlasty the first assembly is successful but only setting 250bp as read length, not with 300bp. Then reassembling on the first assembly for heteroplasmy calling there is the same problem with 300bp, still it runs with 250bp. Maybe the problem is the bad quality of the reads from 200bp on? But anyway we can use the tool with reads of 250bp and try to see estimated heteroplasmies.

ndierckx commented 3 years ago

Reducing the read length in the config file to 250 is not solving the low quality problem of course. Maybe it is best to trim the reads to 200 bp? Illumina should stop selling those 300 bp kits, you should complain to them and ask a refund because they don't deliver the promised quality :) Besides the 300 bp problem, the fact that you PCR amplified your DNA makes heteroplasmy calling also more complicated. It induces additional errors and I witnessed a lot of cross-over between homologous regions during PCR amplification, which makes phasing the heteroplasmy positions impossible.

You do have sufficient coverage so I would recommend to trim your reads to 200, longer reads are only useful if you want to phase and that would be limited anyway because of the PCR amplification. And I would also first try to detect heteroplasmy with a MAF of at least 0.008, maybe best 0.01. That will improve your results significantly and will also speed up the analysis. The biggest strength of NOVOPlasty is that it can assemble around each SNP and phase nearby positions, although with PCR amplified data it will work only for a few hundred bps. cross-over percentages grow fast with increasing length between SNPs

If you need help with the interpretation of the heteroplasmy output you can always send them to me (best to run with extended log to 1 if you send met he results)