sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Roary seemed to have stopped prematurely; any way to continue the run? #380

Closed sentausa closed 6 years ago

sentausa commented 6 years ago

Hi,

I don't know why, but my Roary run seems to have stopped too soon, right after producing the gene_presence_absence.Rtab file. Is there a way to continue the run from this point?

This is the command that I used to run it: roary -p 14 -e --mafft -r -o pan-bacteria -v ./gff/*.gff

And this is the output that I have:

-rw-r--r-- 1 es249628 big 6,5G janv. 11 21:16 gene_presence_absence.Rtab
-rw-r--r-- 1 es249628 big 11G janv. 11 20:10 gene_presence_absence.csv
-rw-r--r-- 1 es249628 big 96M janv. 11 18:34 core_accessory_graph.dot
-rw-r--r-- 1 es249628 big 96M janv. 11 18:21 accessory_graph.dot
-rw-r--r-- 1 es249628 big 123K janv. 11 16:01 _accessory_clusters.clstr
-rw-r--r-- 1 es249628 big 594K janv. 11 16:01 _accessory_clusters
-rw-r--r-- 1 es249628 big 115K janv. 11 15:54 accessory_binary_genes.fa.newick
-rw-r--r-- 1 es249628 big 13M janv. 11 15:51 accessory_binary_genes.fa
-rw-r--r-- 1 es249628 big 368M janv. 11 10:01 pan-bacteria
-rw-r--r-- 1 es249628 big 368M janv. 11 08:45 _labeled_mcl_groups
-rw-r--r-- 1 es249628 big 354M janv. 11 08:45 _inflated_mcl_groups
drwx------ 2 es249628 big 4,0K janv. 11 08:43 2agvKs2tFr
drwx------ 2 es249628 big 188K janv. 11 08:23 07NXVoFAfK
-rw-r--r-- 1 es249628 big 354M janv. 11 07:12 _inflated_unsplit_mcl_groups
-rw-r--r-- 1 es249628 big 43M janv. 11 07:11 _uninflated_mcl_groups
-rw-r--r-- 1 es249628 big 65 janv. 11 06:15 blast_identity_frequency.Rtab
-rw-r--r-- 1 es249628 big 819M déc. 26 10:33 _clustered.clstr
-rw-r--r-- 1 es249628 big 635M déc. 26 10:32 _clustered
-rw-r--r-- 1 es249628 big 6,2G déc. 26 00:47 _combined_files
-rw-r--r-- 1 es249628 big 0 déc. 24 12:45 _combined_files.groups

The 07NXVoFAfK directory contains .gff.proteome.faa files, while 2agvKs2tFr contains "group" files.

Thanks in advance for your kind help.

andrewjpage commented 6 years ago

Hi,

It can sometimes be possible to restart and run the other Roary scripts (roary-*). However, looking at the enormous size of your gene_presence_absence.Rtab file, I suspect there's something up with your input data. I would double-check that all your samples are from the same species and QC the input data.

Regards, Andrew
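One way to see whether the presence/absence matrix looks reasonable is to stream through the Rtab file and count clusters. The following is only a rough sketch: it assumes the file is tab-separated with a header row (a gene column followed by one column per sample) and 0/1 values, and the function name `summarize_rtab` and the 0.99 core threshold are illustrative choices, not part of Roary.

```python
def summarize_rtab(lines, core_fraction=0.99):
    """Summarize a presence/absence table read line by line.

    Assumes a tab-separated layout: a header row (gene column, then one
    column per sample) followed by one row of 0/1 values per gene cluster.
    """
    it = iter(lines)
    header = next(it).rstrip("\n").split("\t")
    n_samples = len(header) - 1

    clusters = core = singletons = 0
    for line in it:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        present = sum(1 for value in fields[1:] if value == "1")
        clusters += 1
        if present >= core_fraction * n_samples:
            core += 1  # cluster found in (nearly) every sample
        if present == 1:
            singletons += 1  # cluster found in exactly one sample

    return {
        "samples": n_samples,
        "clusters": clusters,
        "core": core,
        "singletons": singletons,
    }
```

Passing an open file handle (for example `summarize_rtab(open("gene_presence_absence.Rtab"))`) reads the file one line at a time, so a multi-gigabyte table does not need to fit in memory. A very large cluster count with almost no core clusters would be consistent with input that is too diverse for the settings used.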

sentausa commented 6 years ago

Actually they are from the same genus, different species. The whole genus. Roary doesn't like that?

If I want to continue anyway, how do I run the other Roary scripts? Where can I find them? I just want to get the core_gene_alignment.aln file to make a tree later.

Thanks very much.

andrewjpage commented 6 years ago

Roary is tuned for datasets from a single species. For more diverse datasets you'll need to play around with the identity and the inflation factor (different for every genus). I wouldn't try to continue in this instance (the scripts are probably in your path if you type roary- then tab complete). I would recommend you QC your assemblies first, then test a small subset (it's quick) and go from there.
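As a rough illustration of testing a small subset with different settings, the sketch below picks a reproducible random sample of GFF files and builds a Roary command line. It uses Roary's `-i` (minimum blastp percentage identity) and `-iv` (MCL inflation value) options; the helper name `subset_command`, the subset size, and the identity value of 80 are placeholders to experiment with, not recommended settings.

```python
import random
import shlex


def subset_command(gff_files, n=10, identity=80, inflation=1.5, threads=4, seed=0):
    """Build a roary command for a random subset of the given GFF files.

    The identity and inflation values are placeholders; suitable values
    differ between genera and have to be found by experiment.
    """
    rng = random.Random(seed)  # fixed seed so the same subset is chosen each time
    files = sorted(gff_files)
    subset = sorted(rng.sample(files, min(n, len(files))))
    return [
        "roary",
        "-p", str(threads),
        "-i", str(identity),    # minimum percentage identity for blastp
        "-iv", str(inflation),  # MCL inflation value
        *subset,
    ]


if __name__ == "__main__":
    example = [f"gff/sample{i}.gff" for i in range(50)]  # hypothetical file names
    print(shlex.join(subset_command(example, n=5)))
```

The function only assembles the argument list, so the printed command can be inspected before anything is run. Repeating the test with a few identity values on the same subset shows how the number of clusters responds before committing to the full dataset.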

sentausa commented 6 years ago

So you mean that roary failed to produce the alignment? Because I couldn't see anything that indicates that it was done during the run.

andrewjpage commented 6 years ago

It failed before producing an alignment. The cause is likely to be your input data.

sentausa commented 6 years ago

Does it have anything to do with the fact that there are many draft genomes in my data? Do you think it'll be better if I use only complete genomes?

Thank you again.

andrewjpage commented 6 years ago

I use draft genomes all of the time and it's not a problem. Have you QC'd your assemblies? For example, removing highly fragmented assemblies and assemblies that are too big or too small, and checking that the species is as expected and that there is no contamination.
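A minimal sketch of the first two checks (fragmentation and size) might look like the following. It assumes the assembly sequences are available as FASTA text; the function names, the 500-contig limit, and the 30% size tolerance are arbitrary illustrative values, and across a whole genus genuine genome size differences may call for a looser tolerance. It does not check species identity or contamination.

```python
from statistics import median


def assembly_stats(fasta_text):
    """Return (contig_count, total_length, n50) for FASTA-formatted text."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0  # start a new contig
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)

    total = sum(lengths)
    # N50: walking from the longest contig down, the length of the contig
    # at which the running sum first reaches half of the total length.
    n50 = 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50


def flag_outliers(stats, max_contigs=500, size_tolerance=0.3):
    """Flag assemblies that look fragmented or unusually large or small.

    stats maps an assembly name to (contig_count, total_length, n50).
    Both thresholds are illustrative assumptions, not established cut-offs.
    """
    typical = median(total for _, total, _ in stats.values())
    flagged = {}
    for name, (contigs, total, _n50) in stats.items():
        reasons = []
        if contigs > max_contigs:
            reasons.append("fragmented")
        if abs(total - typical) > size_tolerance * typical:
            reasons.append("unusual size")
        if reasons:
            flagged[name] = reasons
    return flagged
```

Comparing each assembly against the median size of the set avoids hard-coding an expected genome size, at the cost of being less meaningful when the set itself is very heterogeneous.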

sentausa commented 6 years ago

I'm reusing publicly available genomes from a whole genus, and it's true that there are highly fragmented assemblies. So well, I think I'll try to remove them. Or it might be easier to just ignore the draft genomes.... I just need a global picture of the genus anyway.

Thank you for all your help!

Regards,