mikolmogorov / Ragout

Chromosome-level scaffolding using multiple references

Unlocalized Scaffolds #20

Closed ReemaSingh closed 7 years ago

ReemaSingh commented 7 years ago

I would like to ask a question (please note this is not an issue) related to the scaffolds generated by Ragout. My question is: what are these unlocalized scaffolds? If the contigs that are assembled into a large unlocalized scaffold match the given references, why are they named unlocalized?

Many Thanks, Reema,

mikolmogorov commented 7 years ago

Hi Reema,

These scaffolds are homologous to the corresponding reference chromosomes, but represent only a minor portion of them. This naming convention follows the one used in NCBI databases. Please refer to the "Naming reference" section of the manual for details.

Best, Mikhail

ReemaSingh commented 7 years ago

Hello Mikhail,

Thanks for your reply. I read the "Naming reference" section and it is clearer now. However, I still have some questions. Please see below:

1) I tried to generate the scaffolds using 5 reference genomes and 10 reference genomes separately. The number of unlocalized scaffolds is different in the two cases:

With the 5-reference set:

chr_ref|NC_011035.1|
chr_ref|NC_011035.1|_unlocalized

With the 10-reference set:

chr_ref|NC_011035.1|
chr_ref|NC_011035.1|_unlocalized.1
chr_ref|NC_011035.1|_unlocalized.2
chr_ref|NC_011035.1|_unlocalized.3
chr_ref|NC_011035.1|_unlocalized.4

Why is there this difference in number?

2) Multiple scaffolds with the same name (as shown above) belong to the same reference. Would it then be right to say that these scaffold fragments can be joined together using primer walking?

I look forward to the reply.

Many Thanks, Reema,

mikolmogorov commented 7 years ago

Hi Reema,

  1. Usually, the more references you have, the harder the reconstruction problem becomes, which might introduce additional imperfections into the resulting assemblies. Every case can be unique, so it is hard to tell in general why this is happening.

  2. Since Ragout did not join these fragments into one sequence, there are some inconsistencies among the references with respect to the fragment ends. The naming procedure only considers the single closest reference and is designed for the convenience of future analysis. It does not necessarily imply that fragments homologous to the same chromosome in one of the references are adjacent in the target genome.
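
To see at a glance how many scaffolds were assigned to each reference chromosome, the names can be grouped by their reference part. The following is a minimal Python sketch that assumes only the naming pattern visible in this thread (`chr_`, then the reference sequence name, then an optional `_unlocalized` suffix with an optional `.N` index). The helpers `split_scaffold_name` and `group_by_reference` are hypothetical and are not part of Ragout.

```python
from collections import defaultdict


def split_scaffold_name(name):
    """Split a scaffold name into (reference_chromosome, is_unlocalized).

    Assumes the pattern seen in this thread, for example
    "chr_ref|NC_011035.1|_unlocalized.2". Other outputs may differ.
    """
    prefix = "chr_"
    if not name.startswith(prefix):
        raise ValueError("unexpected scaffold name: " + name)
    body = name[len(prefix):]
    # Everything before "_unlocalized" is taken as the reference name.
    reference, marker, _index = body.partition("_unlocalized")
    return reference, bool(marker)


def group_by_reference(names):
    """Group scaffold names by the reference chromosome they are named after."""
    groups = defaultdict(lambda: {"main": [], "unlocalized": []})
    for name in names:
        reference, is_unlocalized = split_scaffold_name(name)
        key = "unlocalized" if is_unlocalized else "main"
        groups[reference][key].append(name)
    return dict(groups)


# Scaffold names from the 10-reference run quoted earlier in the thread.
ten_ref_names = [
    "chr_ref|NC_011035.1|",
    "chr_ref|NC_011035.1|_unlocalized.1",
    "chr_ref|NC_011035.1|_unlocalized.2",
    "chr_ref|NC_011035.1|_unlocalized.3",
    "chr_ref|NC_011035.1|_unlocalized.4",
]

for reference, group in group_by_reference(ten_ref_names).items():
    print(reference, len(group["main"]), "main,",
          len(group["unlocalized"]), "unlocalized")
```

Scaffolds that land in the same group only share the closest reference chromosome. As noted above, that alone does not show that they are adjacent in the target genome.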

Hope this helps, Mikhail

ReemaSingh commented 7 years ago

Thanks Mikhail. Your answers are very helpful. I compared the final scaffolds again, this time also including the assembly produced with the single best reference (NC_011035.1 in this case). Here are the results:

With the 5-reference set:

chr_ref|NC_011035.1| (length = 1715216, GC = 52.19) and one unlocalized scaffold

With the 10-reference set:

chr_ref|NC_011035.1| (length = 583896, GC = 52.54) and four unlocalized scaffolds

With the 1 best reference:

chr_ref|NC_011035.1| (length = 2228825, GC = 52.35)

So I am thinking of doing this in two rounds: the first round would include all the reference sets (the closely related ones), and the second round would use only the one best reference (selected from the first round). This way there would be fewer, longer scaffolds. I look forward to hearing your opinion on this.
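
For anyone who wants to reproduce this kind of comparison, per-scaffold length and GC content can be computed from a scaffolds FASTA file with a few lines of Python. This is only a sketch: the thread does not say which tool produced the numbers above, and the choice to compute GC over A/C/G/T bases only (so that gap characters such as `N` do not lower the percentage) is an assumption. The function names `read_fasta` and `length_and_gc` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.

    `lines` can be an open file or any iterable of strings.
    The name is the first whitespace-delimited token of the header.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_and_gc(sequence):
    """Return (total length, GC percentage over A/C/G/T bases only)."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    acgt = gc + sequence.count("A") + sequence.count("T")
    gc_percent = 100.0 * gc / acgt if acgt else 0.0
    return len(sequence), gc_percent


def summarize(path):
    """Print one line per scaffold: name, length, GC percentage."""
    with open(path) as handle:
        for name, sequence in read_fasta(handle):
            length, gc_percent = length_and_gc(sequence)
            print(f"{name}\tlength = {length}\tGC = {gc_percent:.2f}")
```

Calling `summarize` on the scaffolds file from each run (the path depends on your own output directory) gives one line per scaffold, which makes the three runs easy to compare side by side.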

Many Thanks, Reema,

mikolmogorov commented 7 years ago

Hi,

It is actually quite strange that the results are so different in these cases. How different are the references you are using in terms of their distance from the target genome? If one is significantly closer than the others, I would recommend using only that reference. Adding significantly more divergent genomes usually does not make much sense.

Mikhail

ReemaSingh commented 7 years ago

Hi,

These references are quite different from the target genome. We can say they are similar, as they belong to the same species, but they are not identical. NC_011035.1 is actually the closest one. The datasets that we will be using will come from different isolates (without any prior knowledge of which strain they belong to). In that case it would be hard to say which genome a dataset belongs to without using all the references. That is why I thought it might be good to use all the references in the first round and then restrict the second round to the best reference only.

Reema,

mikolmogorov commented 7 years ago

Hi,

I think I misunderstood you initially - do you want to just select the best reference in the first round, and then make a new run on the original set of fragments with this reference? If so, yes, this makes perfect sense.
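
The two rounds described here differ only in which references are listed in the recipe file. As a rough sketch, the helper below builds recipe text for a given set of references, so the second round can reuse the same target contigs with a single reference. It assumes the basic recipe layout from the Ragout manual (`.references`, `.target`, and one `<name>.fasta` line per genome); please check the manual for the exact syntax of your version. The function `make_recipe`, the genome names, and all file paths are hypothetical.

```python
def make_recipe(target_name, target_fasta, references):
    """Build recipe text for one run.

    `references` maps a genome name to its FASTA path. The layout is an
    assumption based on the Ragout manual and should be verified there.
    """
    if not references:
        raise ValueError("at least one reference is required")
    lines = [
        ".references = " + ",".join(references),
        ".target = " + target_name,
        "",
    ]
    for name, path in references.items():
        lines.append(f"{name}.fasta = {path}")
    lines.append(f"{target_name}.fasta = {target_fasta}")
    return "\n".join(lines) + "\n"


# Hypothetical names and paths, for illustration only.
all_references = {
    "ref_a": "references/ref_a.fasta",
    "ref_b": "references/ref_b.fasta",
    "nc_011035": "references/NC_011035.fasta",
}

# Round 1: every candidate reference.
round_one = make_recipe("target", "contigs.fasta", all_references)

# Round 2: the same original contigs, but only the reference judged closest.
best = "nc_011035"
round_two = make_recipe("target", "contigs.fasta", {best: all_references[best]})

print(round_two)
```

Note that the second round starts again from the original fragments, as described above, rather than from the scaffolds produced in the first round.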

Mikhail

ReemaSingh commented 7 years ago

Thanks Mikhail. Yes, that's what I meant. After the second round, the full scaffold actually seems to be of good quality (I checked the full genome alignment in MAUVE). I have one more question, but I am going to open a new ticket for it.

Reema,