sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

roary_plots.py generating flawed plots #221

Closed swlong closed 8 years ago

swlong commented 8 years ago

I am running roary_plots.py after successfully running Roary as well as FastTree similar to the instructions provided, and the following occurs:

1) A warning is generated: FutureWarning: order is deprecated. use sort_values(...) idx = roary.sum(axis=1).order(ascending=False).index

2) The three plots are generated but they are all erroneous in one way or another.

3) Unsure if it is related, but I am finding the following error generated apparently during the MAFFT step (apologies if this is a completely separate issue, I'll branch to a different issue report):

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Could not open pan_genome_sequences/group_16429.fa.aln: No such file or directory
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:486
STACK: Bio::Root::IO::_initialize_io /usr/share/perl5/Bio/Root/IO.pm:351
STACK: Bio::SeqIO::_initialize /usr/share/perl5/Bio/SeqIO.pm:491
STACK: Bio::SeqIO::fasta::_initialize /usr/share/perl5/Bio/SeqIO/fasta.pm:87
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:372
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:413
STACK: Bio::Roary::SortFasta::_input_seqio /usr/local/share/perl/5.18.2/Bio/Roary/SortFasta.pm:27
STACK: Bio::Roary::SortFasta::sort_fasta /usr/local/share/perl/5.18.2/Bio/Roary/SortFasta.pm:68
STACK: Bio::Roary::CommandLine::GeneAlignmentFromNucleotides::run /usr/local/share/perl/5.18.2/Bio/Roary/CommandLine/GeneAlignmentFromNucleotides.pm:107
STACK: /usr/local/bin/protein_alignment_from_nucleotides:14

This seems to be happening for some (but not all) clusters... yet core_gene_alignment.aln is still being generated and contains data.
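On the FutureWarning in point 1: the message itself names the replacement, `sort_values`, for the deprecated pandas `order` method. Below is a minimal sketch of that substitution on a toy presence/absence matrix. The gene and strain names are invented, and `roary` here is only a stand-in for the table the script builds, not the script's actual data.

```python
import pandas as pd

# Toy presence/absence matrix: rows are genes, columns are strains.
# All names and values are invented for illustration.
roary = pd.DataFrame(
    {"strainA": [0, 1, 1], "strainB": [0, 1, 0], "strainC": [1, 1, 1]},
    index=["geneZ", "geneX", "geneY"],
)

# Deprecated form quoted in the warning:
#   idx = roary.sum(axis=1).order(ascending=False).index
# Replacement named by the warning text:
idx = roary.sum(axis=1).sort_values(ascending=False).index

# Reorder the matrix so the most widely shared genes come first.
roary_sorted = roary.loc[idx]
print(list(idx))
```

The row sums are 1, 3 and 2, so the genes are reordered from most to least shared.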

Has anyone else seen this problem? I will attach example data momentarily. I am running Roary on a Biolinux 8 box.

Best, S. W. Long

andrewjpage commented 8 years ago

Something seems to have gone quite wrong indeed, sorry about that. How many core genes do you get in your summary statistics file? Where did the input files come from (PROKKA?) and does each one have a unique prefix so that the IDs of each gene are unique to the set?
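A quick way to answer the question about unique gene IDs is to scan the GFF inputs for IDs that occur in more than one file. The sketch below is an assumption-laden illustration, not part of Roary: it reads the `ID=` attribute from column 9 of each feature line, stops at the `##FASTA` section, and uses invented file names and contents.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def gene_ids(gff_text):
    """Yield the ID attribute of every feature line in a GFF3 string."""
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue
        match = ID_PATTERN.search(columns[8])
        if match:
            yield match.group(1)


def shared_ids(gffs):
    """Map each ID found in more than one input to the inputs containing it."""
    seen = defaultdict(set)
    for name, text in gffs.items():
        for gene_id in gene_ids(text):
            seen[gene_id].add(name)
    return {gid: sorted(names) for gid, names in seen.items() if len(names) > 1}


# Invented example inputs: both files reuse the ID "gene_0001".
example = {
    "a.gff": "##gff-version 3\n"
             "seq1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=gene_0001;product=x\n"
             "seq1\tsrc\tCDS\t100\t190\t.\t+\t0\tID=A_0002\n",
    "b.gff": "seq2\tsrc\tCDS\t1\t90\t.\t-\t0\tID=gene_0001\n"
             "##FASTA\n>seq2\nACGT\n",
}
clashes = shared_ids(example)
print(clashes)
```

An empty result would suggest the prefixes are already unique across the set.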

swlong commented 8 years ago

Summary stats file:

Core genes (99% <= strains <= 100%) 835
Soft core genes (95% <= strains < 99%) 2547
Shell genes (15% <= strains < 95%) 2145
Cloud genes (0% <= strains < 15%) 10903
Total genes (0% <= strains <= 100%) 16430

This was run with default settings for blastp. Input files were downloaded from genbank as gb files with full sequence then converted to GFF3 using the bp_genbank2gff3.pl script. Input files have unique prefixes - a few input files were "fixed" by Roary for having duplicate gene IDs.

I was going to upload a smaller sample run to see if I could replicate problems but IT forced a reboot on my system overnight and killed my run... hopefully later today or tomorrow I should have some actual datafiles to share.

Additional oddity: pangenome_matrix.png reports 51 strains in tree even though only 48 strains were used to generate the dataset.
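One way to chase the 51-versus-48 mismatch is to compare the leaf names in the tree with the strain columns of gene_presence_absence.csv. The sketch below uses only the standard library. The 14-column metadata offset and the tiny regex-based Newick reader (unquoted labels only) are assumptions, and the sample tree and header are invented.

```python
import csv
import io
import re

# Assumption: the first 14 columns of gene_presence_absence.csv hold
# per-gene metadata and every later column is one strain.
METADATA_COLUMNS = 14


def newick_leaves(newick):
    """Return leaf labels from a Newick string with unquoted labels."""
    # A leaf label directly follows "(" or ","; internal node labels follow ")".
    return [label.strip() for label in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def matrix_strains(csv_text):
    """Return the strain column names from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header[METADATA_COLUMNS:]


def compare(newick, csv_text):
    """Return (names only in the tree, names only in the matrix)."""
    leaves = set(newick_leaves(newick))
    strains = set(matrix_strains(csv_text))
    return sorted(leaves - strains), sorted(strains - leaves)


# Invented example: the tree has one leaf ("C") that the matrix lacks.
TREE = "((A:0.1,B:0.2)0.95:0.3,C:0.4);"
CSV_TEXT = ",".join([f"meta{i}" for i in range(1, 15)] + ["A", "B"]) + "\n"

only_in_tree, only_in_matrix = compare(TREE, CSV_TEXT)
print(only_in_tree, only_in_matrix)
```

If the metadata offset is wrong for a given Roary version, metadata column names would show up as extra "strains", which is one possible source of an inflated count.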

swlong commented 8 years ago

I have replicated the issues with a smaller dataset (5 Cdiff genomes) and the issues appear to be the same. The tar.gz of the directory is a bit too large to upload directly here so I created a repository to allow for easy access. Hoping to get to the bottom of this, as Roary is a very useful tool.

Directory containing all files and output can be found here: https://github.com/swlong/SampleData.git

In short, here was my workflow:

1) Downloaded 5 Cdiff complete genomes from Genbank.
2) Converted .gb to .gff using bp_genbank2gff3.pl.
3) Ran Roary with "roary -p 12 -e --mafft -r -v *.gff"
4) Made a tree with "fasttreeMP -nt -gtr core_gene_alignment.aln > CdiffTree.newick"
5) Ran roary_plots with "roary_plots.py CdiffTree.newick gene_presence_absence.csv"
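Steps 3 to 5 above can be sketched as a small driver script. The commands and flags are copied from the workflow; the function names, the dry-run default, and the idea of wrapping the steps in Python are illustrative assumptions only.

```python
import glob
import shlex
import subprocess


def build_pipeline(gff_files, tree="CdiffTree.newick"):
    """Return (argv, stdout_path) pairs for steps 3-5 of the workflow."""
    roary = ["roary", "-p", "12", "-e", "--mafft", "-r", "-v", *gff_files]
    fasttree = ["fasttreeMP", "-nt", "-gtr", "core_gene_alignment.aln"]
    plots = ["roary_plots.py", tree, "gene_presence_absence.csv"]
    # Only FastTree writes its result to stdout, so only it gets a redirect.
    return [(roary, None), (fasttree, tree), (plots, None)]


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for argv, stdout_path in steps:
        line = shlex.join(argv) + (f" > {stdout_path}" if stdout_path else "")
        print(line)
        if dry_run:
            continue
        if stdout_path:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, stdout=handle, check=True)
        else:
            subprocess.run(argv, check=True)


# The shell expands *.gff itself; from Python the glob must be explicit.
steps = build_pipeline(sorted(glob.glob("*.gff")))
run_pipeline(steps)  # dry run: prints the commands without executing them
```

With `check=True`, a failing step stops the run instead of letting later steps work on incomplete output.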

The issues, including the Bio::Root::Exception error during the post-analysis step and the appearance of the roary_plots output, remain the same. I hope this data helps find a solution. Let me know if I can be of any further service.

Best, S. Wesley Long

P.S. Summary stats for the Cdiff dataset (only 5 genomes):

Core genes (99% <= strains <= 100%) 2629
Soft core genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 2635
Cloud genes (0% <= strains < 15%) 0
Total genes (0% <= strains <= 100%) 5264
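As a sanity check on the Cdiff summary above, the four categories should add up to the reported total. A minimal sketch, assuming the summary file is tab-separated with label, range, and count on each line; that layout is an assumption here, and the numbers are the ones quoted in the P.S.

```python
# Assumed layout: "<label>\t<range>\t<count>" per line.
SUMMARY = (
    "Core genes\t(99% <= strains <= 100%)\t2629\n"
    "Soft core genes\t(95% <= strains < 99%)\t0\n"
    "Shell genes\t(15% <= strains < 95%)\t2635\n"
    "Cloud genes\t(0% <= strains < 15%)\t0\n"
    "Total genes\t(0% <= strains <= 100%)\t5264\n"
)


def parse_summary(text):
    """Return {label: count} from summary-statistics text."""
    counts = {}
    for line in text.strip().splitlines():
        label, _strain_range, count = line.split("\t")
        counts[label] = int(count)
    return counts


counts = parse_summary(SUMMARY)
category_sum = sum(n for label, n in counts.items() if label != "Total genes")
print(category_sum, counts["Total genes"])
```

The same check holds for the 48-strain run reported earlier (835 + 2547 + 2145 + 10903 = 16430).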

andrewjpage commented 8 years ago

Thanks for the data. It looks like Roary ran to completion. I reran the roary_plots script and it produced a proper tree, so I suspect there's an issue with the versions of the Python dependencies for this script (Phylo). We'll take a look to see if we can track it down.

JScandy should be able to show you the same information (it's an experimental interactive viewer) if you want to give it a shot. Once you load up your data you can change the viewing mode by clicking on the JScandy logo. http://jameshadfield.github.io/JScandy/

There's also the roary2svg.pl script, which gives a similar view (but not against a tree).

swlong commented 8 years ago

Andrew,

Thanks for the help. I do agree that it looks like roary is running appropriately. I'm not sure what to make of the downstream Bio::Root::Exception. Thanks for the JSCandy suggestion, it appears to work in a similar manner to generate the matrix plot and may have other uses as well.

Best, Wesley

mgalardini commented 8 years ago

Hi Wesley,

I believe that the latest version of the script fixes all the problems you've witnessed:

1) The deprecation warning.
2) The tree drawn as a straight line is due to a bug in the Bio.Phylo package, solved by upgrading Biopython.

The other errors are probably due to the change in format of the gene_presence_absence.csv file, which is now properly taken care of.

Plus, now there's a new option "--labels" to add sample names to the tree.

I also agree that JScandy is very cool and useful.

Marco

alichenari2018 commented 6 years ago

Hi everybody,

I used the tutorial commands:

roary -f -e -n -v *.gff

It finished successfully. But when I run "python roary_plots.py core_gene_alignment.nwk gene_presence_absence.csv" I get an error:

python: can't open file 'roary_plots.py': [Errno 2] No such file or directory

By the way, there is no core_gene_alignment.nwk among the files and folders I obtained. Please help me.

Regards

mgalardini commented 6 years ago

Hi there,

from the look of the error, it seems that you do not have the script in the directory you are calling it from. Please download it from here (https://raw.githubusercontent.com/sanger-pathogens/Roary/master/contrib/roary_plots/roary_plots.py) and place it in your working directory.

Also, please keep in mind that the script doesn't necessarily expect your input files to be named in a certain way; just run the command above, changing the arguments to match your input files:

python roary_plots.py YOUR_TREE.nwk YOUR_ROARY_OUTPUT.csv

Hope this helps, Marco
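Both symptoms reported above (the script not being found, and no .nwk file in the output) come down to files missing from the working directory. As the workflow earlier in this thread shows, the tree is built separately with FastTree. A small pre-flight check can list what is absent before calling the script. This is a sketch; the helper name is hypothetical and the temporary directory only simulates a working directory.

```python
import tempfile
from pathlib import Path


def missing_inputs(names, workdir="."):
    """Return the entries of names that are not files inside workdir."""
    base = Path(workdir)
    return [name for name in names if not (base / name).is_file()]


# Simulated working directory that only contains the Roary CSV output.
with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "gene_presence_absence.csv").write_text("Gene\n")
    missing = missing_inputs(
        ["roary_plots.py", "core_gene_alignment.nwk", "gene_presence_absence.csv"],
        workdir,
    )

print(missing)
```

In a real run, pass the actual script, tree, and CSV names and leave `workdir` at its default.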

alichenari2018 commented 6 years ago

Dear Marco,

Thanks for your email. But please send me a clear script to draw the pan genome plots.

Regards


-- Ali Chenari Bouket Ph.D. in Plant Pathology


mgalardini commented 6 years ago

Hi,

I'm not sure I understood your last message: if you look again at my previous reply you'll see a link to the roary_plots.py script. At any rate, you can download it like this:

wget -O roary_plots.py "https://raw.githubusercontent.com/sanger-pathogens/Roary/master/contrib/roary_plots/roary_plots.py"

Hope this helps, Marco
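For systems without wget, the same download can be sketched with the Python standard library. The URL is the one given in the command above; the function name and default destination are illustrative choices.

```python
import urllib.request

# URL taken from the wget command in this thread.
SCRIPT_URL = (
    "https://raw.githubusercontent.com/sanger-pathogens/Roary/"
    "master/contrib/roary_plots/roary_plots.py"
)


def download_script(url=SCRIPT_URL, destination="roary_plots.py"):
    """Fetch the script into the working directory and return its path."""
    urllib.request.urlretrieve(url, destination)
    return destination
```

Calling `download_script()` from the directory that holds the tree and CSV files puts the script where the later `python roary_plots.py ...` command expects it.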

vappiah commented 4 years ago

Hi All,

I ran roary_plots.py for 12 GFF files and the trees were drawn correctly, but without any labels.

mgalardini commented 4 years ago

Hi, did you add the --labels option to it?

vappiah commented 4 years ago

Hi @mgalardini. I was confused about the --labels option so I did not include it. Below is the command I used:

./roary_plots.py roaryresult2/mytree.newick roaryresult2/gene_presence_absence.csv

mgalardini commented 4 years ago

I see; retry it with the --labels option added and see if that solves your problem

vappiah commented 4 years ago

Thanks @mgalardini, the --labels option worked. I noticed that some of my labels were truncated. The labels with 4 characters were okay, but longer ones (such as mycobacterium_ulcerans_strain) were truncated to Mycobacter. Is there a way to show the full names?

mgalardini commented 4 years ago

I believe you could add the --format svg option, so that the output files are saved in that format (e.g. pangenome_matrix.svg), which can then be edited with Inkscape or Illustrator. I believe the full labels are there, just hidden below the presence/absence matrix.

Hope this helps.
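Putting the two suggestions from this exchange together, the command can be assembled as below. The --labels and --format svg options and the paths are the ones mentioned in the thread; the helper function is hypothetical, and placing the options before the positional arguments is a choice made here, not something the thread specifies.

```python
import shlex


def plots_command(tree, matrix, labels=False, image_format=None):
    """Build the argv list for roary_plots.py with optional flags."""
    argv = ["./roary_plots.py"]
    if labels:
        argv.append("--labels")
    if image_format:
        argv += ["--format", image_format]
    argv += [tree, matrix]
    return argv


command = plots_command(
    "roaryresult2/mytree.newick",
    "roaryresult2/gene_presence_absence.csv",
    labels=True,
    image_format="svg",
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to execute it.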


-- Marco Galardini

vappiah commented 4 years ago

Thanks @mgalardini I added the --format svg option. I can now edit using inkscape.

mgalardini commented 4 years ago

Great, glad it worked!

Julio92-C commented 3 years ago

Hi @mgalardini, is there a way to change the color of the output graph?

mgalardini commented 3 years ago

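Yes, see this line (https://github.com/sanger-pathogens/Roary/blob/12a726e9ef87bb73a19ed4d22fe7e6b3551d6da1/contrib/roary_plots/roary_plots.py#L119) and change the plt.cm.Blues part to have a different color for the heatmap.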
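To illustrate the kind of change meant here, the sketch below draws a toy presence/absence heatmap with a different matplotlib colormap. It is not the script's own code: the matrix is invented, the other `imshow` arguments are illustrative, and `plt.cm.Reds` is just one of matplotlib's built-in colormaps standing in for `plt.cm.Blues`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Toy presence/absence matrix (rows: strains, columns: genes); invented data.
matrix = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]

fig, ax = plt.subplots()
# Swapping the cmap argument is the whole change: Blues -> Reds here.
image = ax.imshow(matrix, cmap=plt.cm.Reds, vmin=0, vmax=1,
                  aspect="auto", interpolation="none")

# Write to an in-memory buffer so the example leaves no files behind.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Any other built-in colormap, such as `plt.cm.Greens` or `plt.cm.viridis`, can be substituted the same way.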

Julio92-C commented 3 years ago

Hi there,

Thanks for your reply, I am going to check it out.

Greetings, Julio

On Fri, Jul 9, 2021, 8:21 PM Marco Galardini @.***> wrote:

Yes, see this line https://github.com/sanger-pathogens/Roary/blob/12a726e9ef87bb73a19ed4d22fe7e6b3551d6da1/contrib/roary_plots/roary_plots.py#L119 and change the plt.cm.Blues part to have a different color for the heatmap.
