ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

Issue with the seed #134

Closed (foala closed 3 years ago)

foala commented 4 years ago

Hello, thank you for your great effort in developing this software. I have 43 samples of closely related bird species (sequenced at 30x coverage). I used the RefSeq mitogenome of these birds as the Reference sequence in the config and have tried seeds from different regions of the mt genome, but I got circularized assemblies for only a few samples. Those few assemblies are either: 1) correct but about 2 kb shorter than the reference, which is 18 kb; or 2) 16-17 kb long, but less than 1 kb of the assembly aligns to the reference.

The majority of the 43 samples fail to give a result, except for fragmented contigs. Most of these contigs do not map to the RefSeq, and those that do map only to short regions, such as part of the 16S rRNA. I also tried without a reference sequence, using only a seed, but the issue is the same; the fragmented contigs are usually only a few hundred bases around the seed in both directions.

Please assist. Here is my updated config, where I also tried optimizing other options such as K-mer, Genome Range, etc.:

Project:
Project name = mt
Type = mito
Genome Range = 17000-19000
K-mer = 39
Max memory = 50000
Extended log = 0
Save assembled reads = no
Seed Input = loopseed.fasta
Extend seed directly = no
Reference sequence = ref.fasta
Variance detection =
Chloroplast sequence =

Dataset 1:
Read Length = 151
Insert size = 350
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = SP43_0200591331_R1.fastq.gz
Reverse reads = SP43_0200591331_R2.fastq.gz

Heteroplasmy:
MAF =
HP exclude list =
PCR-free =

Optional:
Insert size auto = yes
Insert Range = 1.9
Insert Range strict = 1.3
Use Quality Scores = yes

ndierckx commented 4 years ago

Can you run one or two of the failed samples again, but with the Extended log option set to 1, and send me the file?
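For rerunning many samples with the extended log enabled, the config change can be scripted. The following is a minimal sketch, assuming the config file stores one `key = value` pair per line as in the config posted above; `set_config_option` is a hypothetical helper, not part of NOVOPlasty.

```python
import re


def set_config_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with the first 'key = ...' line set to value.

    Hypothetical helper: assumes one 'key = value' pair per line.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}[ \t]*=)[^\n]*$", re.MULTILINE
    )
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)} {value}", config_text, count=1
    )
    if count == 0:
        raise KeyError(f"option not found in config: {key}")
    return new_text


# Toy config fragment using key names from the config in this thread.
example = "Project name = mt\nExtended log = 0\nSave assembled reads = no\n"
updated = set_config_option(example, "Extended log", "1")
print(updated)
```

Only the matched line is rewritten; every other line of the config is left as it was.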

foala commented 4 years ago

Hi, thank you for your reply. Here are two attempts using the same raw data, with all other config options kept the same.

Without a reference (seed only): log_extended_mt no ref1.txt

With a reference sequence (+ same seed). Result: circularized assembly (16 kbp), but only the last 1 kb of the control region (CR) aligns to the reference.
log_extended_mt Ref.txt

Thank you very much

ndierckx commented 4 years ago

The region it is trying to assemble is very complex and seems heavily duplicated. I haven't seen this before, especially in mitochondrial genomes. Have you tried a different seed? Sometimes it is better to start the assembly from a different location. The seed can be short (only the first 200 bp matter) and it can come from a distant species. It is all flexible, so don't worry about that; just try from a different location.

foala commented 4 years ago

Hello, thank you for looking into this. I have actually tried many regions, but this time I will keep the extended log on, and I will update you here either way. Thanks.

ndierckx commented 4 years ago

Weird, mitochondrial genomes are usually very easy to assemble (apart from repetitive control regions in some species). Have you tried other software too?

foala commented 4 years ago

Just NOVOPlasty, actually. I was attracted to the heteroplasmy feature, as I am interested in studying it in my samples.

foala commented 4 years ago

Can you suggest a well-tested universal seed?

foala commented 4 years ago

Hello again, I tried the seed you used in the test for the human mt (CytB), but it was reported as an invalid seed. I then aligned that seed to my RefSeq mitogenome, copied the matching sequence from the RefSeq, and used it for assembly. I got a partial assembly of the CytB region plus a 19 kb contig that covers a few kb of the control region, while the rest of the sequence doesn't align to the mt genome at all. Here is the extended log.

log_extended_mt_CytB.txt

Thank you for your help

ndierckx commented 4 years ago

Hi, I think you also have long repetitive regions in the control region; I know some insects have them too. The problem is that you use a reference as a seed. Those references on NCBI usually end in that repetitive region, which is therefore not a good spot to start an assembly. I would take 200 bp right in the middle of your reference and use that as a seed. Then you should get most of the genome, but that repetitive region is probably impossible to assemble with short reads. If it does circularize, you should not be confident that it is a complete assembly. But usually those repetitive regions don't contain any genes.

If you want to test for heteroplasmy, you should remove that repetitive control region anyway; for that option it is best to remove any repetitive sequences.
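Taking 200 bp from the middle of a reference can be done by hand, but for 43 samples a small script may be easier. Below is a minimal sketch in plain Python, assuming the reference is an ordinary single-record FASTA file; the function names and the toy sequence are hypothetical and only illustrate the idea.

```python
def read_first_fasta_record(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # stop at the second record
            header = line[1:]
        elif header is not None:
            parts.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(parts)


def middle_seed(sequence, seed_len=200):
    """Return seed_len bases centred on the midpoint of sequence."""
    if len(sequence) <= seed_len:
        return sequence
    start = (len(sequence) - seed_len) // 2
    return sequence[start:start + seed_len]


def format_fasta(header, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Toy reference: 100 A, 200 C, 100 G, so the middle 200 bp are all C.
toy_reference = ">toy_ref\n" + "A" * 100 + "\n" + "C" * 200 + "\n" + "G" * 100 + "\n"
header, seq = read_first_fasta_record(toy_reference)
seed = middle_seed(seq, 200)
seed_fasta = format_fasta(header + "_middle_seed", seed)
print(seed_fasta)
```

In practice the text would be read from the reference file (for example `ref.fasta` from the config above) and `seed_fasta` written to the file named under Seed Input. The midpoint is only a starting guess; the point of the advice is to stay away from the repetitive control region.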

What kind of data do you have, WGS or captured?

foala commented 4 years ago

Hi, thank you for your detailed feedback. So I shouldn't use CytB either, right? I am using WGS.

ndierckx commented 4 years ago

My GitHub was still open from yesterday, so I didn't see that last message.

Seems that last seed was fine, as long as it is not too close to the repetitive regions. But I don't think this genome can be circularized. That third contig with all those repetitions is too long (it got stuck in a loop) but I think a part of the repetition belongs to the genome. Maybe only around 400 bp long. The fact that it is not in the reference doesn't mean it doesn't belong to the mitochondrial genome, there are many mitochondrial references on NCBI that are incomplete! Some are even incorrect, because some assemblers just remove repetitive regions.

I have a new version you could test; it should work better on repetitive regions. Maybe I can send it next week.

But for heteroplasmy detection you don't have enough coverage; it's only around 60.
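As a rough cross-check of a coverage figure like this, mean depth can be estimated from read count, read length, and genome length. The sketch below uses the generic definition (total bases divided by genome length); how NOVOPlasty derives its own coverage value is not stated in this thread, and the function names and the read count are hypothetical.

```python
import math


def mean_depth(n_reads, read_length, genome_length):
    """Mean depth = total sequenced bases / genome length."""
    return n_reads * read_length / genome_length


def reads_needed(target_depth, read_length, genome_length):
    """Smallest number of reads that reaches target_depth on average."""
    return math.ceil(target_depth * genome_length / read_length)


# Read length 151 and an 18 kb genome are taken from this thread;
# the number of mitochondrial reads is a made-up illustration.
depth = mean_depth(n_reads=7200, read_length=151, genome_length=18000)
needed = reads_needed(target_depth=60, read_length=151, genome_length=18000)
print(f"mean depth: {depth:.1f}x, reads needed for 60x: {needed}")
```

Only reads that actually originate from the mitochondrial genome count toward `n_reads`, so for WGS data this is a small fraction of the full read set.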

foala commented 4 years ago

Thank you very much. I am looking forward to the new version.

foala commented 4 years ago

Hi, A kind reminder. Thank you very much

ndierckx commented 4 years ago

Hi, I haven't had time to work on NOVOPlasty, but you could try this version:

NOVOPlasty4.0b.zip