malonge / RaGOO

RaGOO is no longer supported. Please use RagTag instead: https://github.com/malonge/RagTag
MIT License

Question about the confident scores #27

Open YPGG1234 opened 5 years ago

YPGG1234 commented 5 years ago

Hello, I see that you report confidence scores for the grouping, localization, and orientation of each contig, and I would like to know more about them. For example, one contig in my final FASTA file has a location confidence score of 0.03104861142651071 and an orientation confidence score of 0.9638021314266446. (I think this contig belongs to chr Y, which the reference does not contain, and it should not be broken, but it was assigned to chr X and it was broken.) Are these scores good or bad? And how can I judge which scores are reliable? Here is my command:

ragoo.py -R raw.corrected.fasta -C -m /bin/minimap2 -gff stringtie.generated.gff3 -T corr -t 28 assembly.fa ref.fna

If you can help me, I will be very grateful.

malonge commented 5 years ago

Hi there,

I am happy to give a detailed explanation. First, can you tell me what the grouping confidence score is? That will be in the "groupings" folder.

Thanks

YPGG1234 commented 5 years ago

Hi, this contig's grouping confidence score is 0.9568822757353695.

malonge commented 5 years ago

Thanks for sharing this. I would say that the grouping and orientation scores look pretty good. The location score is low, though that is perhaps the least descriptive since it is based on alignment coordinates with respect to the reference.

In general, it is not optimal to use a reference which is missing chromosomes (in your case, Y). In that case, as long as a contig has a >10kbp alignment anywhere else in the genome, that is where it will get placed. Is it possible to use a reference with the Y chromosome?

If not, perhaps the next best thing to do would be to increase the specificity by requiring a minimum alignment length that is much longer than 10kbp. Though I would like to add this functionality at some point in the future, it is not currently available.

However, RaGOO will not try to remake alignment files if they are already present in the output directories. So you can filter those alignment files (for example, only include alignments > 50kbp) and place them in the output directories. If they have the same names as they do now, RaGOO will not recreate them. If that doesn't make sense, I can give a more detailed example.
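As a rough illustration of the filtering step described above, here is a minimal Python sketch that keeps only PAF records longer than a threshold. It assumes the alignment files are standard minimap2 PAF (tab-separated, with the alignment block length in column 11). The function name `filter_paf` and the choice of column 11 as the "alignment length" are assumptions for this example, not RaGOO's own code.

```python
MIN_ALN_LEN = 50_000  # 50 kbp, the example threshold mentioned above


def filter_paf(lines, min_len=MIN_ALN_LEN):
    """Yield PAF lines whose alignment block length (column 11) exceeds min_len.

    Assumption: 'alignment length' means PAF column 11; RaGOO may measure
    alignment length differently internally.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed records
        if int(fields[10]) > min_len:
            yield line


# Usage on real files (file names follow the thread; adjust to your output directory):
# with open("contigs_against_ref.paf.old") as src, \
#         open("contigs_against_ref.paf", "w") as dst:
#     dst.writelines(filter_paf(src))
```

The filtered file has to keep the original file name so that RaGOO finds it and skips re-alignment.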

malonge commented 5 years ago

Also, please see the preprint for a better description of the confidence scores:

https://www.biorxiv.org/content/10.1101/519637v1

YPGG1234 commented 5 years ago

Thanks for your help, I will try it. This contig is longer than 5 Mb, but it was broken at position 236K. The first part was placed on chr13, and the second part was placed on chrX.

I have another question. You say that if a contig has a >10kbp alignment anywhere else in the genome, that is where it will get placed. Does this mean that a contig with a lot of repetitive content (such as one from a sex chromosome) might be wrongly assigned to another chromosome (or to the other sex chromosome)? And might it appear in many places in the final FASTA file?

malonge commented 5 years ago

No, that is not what I mean. Allow me to clarify.

By default, each contig is placed exactly once, unaltered, in the final ragoo.fasta file. So the final file represents just an ordered and oriented version of the input contig set.

Beyond that, one can correct misassemblies as you have, but that just breaks contigs in certain places. So if a repetitive contig has many alignments, RaGOO will pick the "best" alignment to use. However, that is exactly the sort of thing that would make the confidence scores go down.
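To make the effect of repeats concrete, here is a toy Python sketch of a fraction-based score: the share of a contig's aligned bases that land on its chosen chromosome. This is only an illustrative assumption about how such a score could behave; the function `grouping_score` is hypothetical, and the preprint remains the reference for how RaGOO actually defines its confidence scores.

```python
from collections import defaultdict


def grouping_score(alignments):
    """Toy score: fraction of aligned bases that fall on the best chromosome.

    alignments: iterable of (ref_chrom, aligned_bases) pairs for ONE contig.
    Returns (best_chrom, score). Illustrative only, not RaGOO's implementation.
    """
    totals = defaultdict(int)
    for chrom, bases in alignments:
        totals[chrom] += bases
    if not totals:
        return None, 0.0
    best = max(totals, key=totals.get)  # chromosome with the most aligned bases
    return best, totals[best] / sum(totals.values())


# A mostly unique contig: nearly all aligned bases agree on one chromosome.
unique = [("chrX", 95_000), ("chr13", 5_000)]
# A repetitive contig: aligned bases are spread over several chromosomes.
repetitive = [("chrX", 40_000), ("chr13", 35_000), ("chr2", 25_000)]

print(grouping_score(unique))      # ('chrX', 0.95)
print(grouping_score(repetitive))  # ('chrX', 0.4)
```

Under this toy definition both contigs are still placed once, on chrX, but the repetitive one gets a much lower score.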

YPGG1234 commented 5 years ago

Ok, I understand. Thanks for your answer !

malonge commented 5 years ago

No problem. I will respond again to this issue when I have made the alignment length a tunable parameter.

YPGG1234 commented 5 years ago

Hi malonge,

Recently I ran into a new problem. When I used my assembly's scaffolds and the reference genome to draw a CIRCOS plot, it looked clean, but when I used the RaGOO assembly and the reference, it looked much messier.

image

Here is my RaGOO command: ragoo.py -R raw.corrected.fasta -m /bin/minimap2 -gff stringtie.generated.gff3 -T corr -t 28 -i 0.8 -j Y.candidate.txt assembly.fa ref.fna

For the former plot, I used lastal to generate link.txt; for the latter, I used minimap2. I am not sure whether the problem is in my RaGOO assembly or in my alignment tools.

Can you help me? Thanks.

malonge commented 5 years ago

Hi there,

Can you tell me what exactly is in the link.txt file?

Personally, I think a dotplot would be the best visualization here. You can use mummerplot or assemblytics.

YPGG1234 commented 5 years ago

OK, link.txt is an input file that CIRCOS requires to draw a collinearity graph. It records the collinear regions between the assembly and the reference, in the following format:

QueryChr/ScaffoldName QueryChr/ScaffoldStart QueryChr/ScaffoldEnd RefChr RefChrStart RefChrEnd
Scaffold_1 0 100000 Chr2 50000 150000

It can be generated from lastal, minimap2, and similar alignment tools.
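For reference, a minimal Python sketch of how one minimap2 PAF record could be turned into a link.txt line in the six-column format described above. It assumes standard PAF column order (query name, length, start, end, strand, target name, length, start, end, ...). The helper name `paf_to_link` is hypothetical, and strand is ignored here.

```python
def paf_to_link(paf_line):
    """Convert one PAF record to a six-column, space-separated link line.

    Output columns: query name, query start, query end,
                    reference name, reference start, reference end.
    Assumption: strand (PAF column 5) is not encoded in the link line.
    """
    f = paf_line.rstrip("\n").split("\t")
    qname, qstart, qend = f[0], f[2], f[3]
    tname, tstart, tend = f[5], f[7], f[8]
    return " ".join([qname, qstart, qend, tname, tstart, tend])


example = "Scaffold_1\t200000\t0\t100000\t+\tChr2\t5000000\t50000\t150000\t99000\t100000\t60"
print(paf_to_link(example))  # Scaffold_1 0 100000 Chr2 50000 150000
```

Generating both link files from the same aligner and the same conversion makes the two plots directly comparable.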

malonge commented 5 years ago

It sounds like you used 2 different aligners to generate the plots. Can you show me what they look like if you use minimap2 for both of them? Also, what does your minimap2 command look like?

RaGOO scaffolds strictly based on minimap2 alignments, so it doesn't make sense that they would disagree that much.

YPGG1234 commented 5 years ago

My contigs_against_ref.paf.log contains this minimap2 command: minimap2 -k19 -w19 -t24 ref.fa assembly.fa

So my minimap2 command is: minimap2 -k19 -w19 -t 24 --secondary=no -cx asm10 ref.fa assembly.fa

It is possible that enabling the "assembly correction" option led to the scaffold being broken. But when I ran RaGOO without any optional parameters, the plot still did not change. My colleague told me lastal may be better than minimap2 in this case, so I will try it.

malonge commented 5 years ago

What organism is this? And what is the expected genome size/ploidy?

Peng-Y3 commented 5 years ago

The organism is sheep, and the expected genome size is 2.6-2.7 Gb, just like the reference genome.

malonge commented 5 years ago

Well I am puzzled because those two minimap2 commands should give very similar results. And I don't see why minimap2 would not work just fine on this genome.

One thing you can do is replace the original contigs_against_ref.paf with the PAF file used to generate the circos plot. Let's say you have circos.paf. You can do the following.

cd ragoo_output
mv contigs_against_ref.paf contigs_against_ref.paf.old
cp /path/to/circos/circos.paf .
mv circos.paf contigs_against_ref.paf

Then, remove every other file/directory in ragoo_output except those paf files (you can keep the log file around too). Finally, rerun ragoo.

Ragoo will use your circos alignments for scaffolding instead of generating its own alignments.

malonge commented 5 years ago

Of course, you would have to rerun minimap2 on the original scaffolds rather than the ragoo pseudomolecules

YPGG1234 commented 5 years ago

I wonder whether I can modify RaGOO's built-in minimap2 parameters. Where should I change them? For example, I want to change the built-in "minimap2 -k19 -w19 -t24 ref.fa assembly.fa" to "minimap2 -k19 -w19 -t 24 --secondary=no -cx asm10 ref.fa assembly.fa".

malonge commented 5 years ago

Well you can fork the repo and change it in the source code by all means, but I was just suggesting how to run whatever minimap2 command you want, save it to a paf file, then just plug that paf file into ragoo. Ragoo won't make a new paf file if it already sees one there.

YPGG1234 commented 5 years ago

Ok, I will try it, thank you.

malonge commented 4 years ago

Hi there,

RagTag, the successor to RaGOO, is now available here:

https://github.com/malonge/RagTag

This feature is implemented in RagTag, and will likely not ever be implemented in RaGOO, which will eventually be deprecated.

Thanks