sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Roary 3.6.5 giving different (erroneous) results compared to 3.5.7 and 3.6.1/3.6.3/3.6.4 #263

Closed dutchscientist closed 8 years ago

dutchscientist commented 8 years ago

I have just installed a new Biolinux8 workstation and included Roary (of course). It installed v3.6.5, and when I ran it on a known Campylobacter coli test set, it reported 0 core and 0 soft core genes and put everything into the shell and cloud categories, even with the -i 70 and -s switches.

I previously used v3.5.7, which gave me >1,000 core genes with the same set. When I downgraded to earlier versions from CPAN (3.6.3 and 3.6.4) and to an older one I had on a virtual machine (3.6.1), they also gave me the earlier result of 1,350 core genes. Has something changed in 3.6.5 that could cause this difference/error?

dutchscientist commented 8 years ago

After 3.6.4, I installed 3.6.5 specifically (instead of via the standard route), and it again drops to no core genes. Hence the problem is specific to 3.6.5.

If you want specific logging, let me know :-)

andrewjpage commented 8 years ago

I've just uploaded a fix for this, thanks for reporting it. Version 3.6.6 should be in CPAN in a few hours.

dutchscientist commented 8 years ago

Unfortunately, the update to 3.6.6 has not resolved the problem; I still get the 0 core, 0 soft core output. All dependencies are up to date.

Happy to make a ZIP with the files for you?

This is the summary.txt:

Core genes (99% <= strains <= 100%) 0
Soft core genes (95% <= strains < 99%) 0
Shell genes (15% <= strains < 95%) 1678
Cloud genes (0% <= strains < 15%) 1138
Total genes (0% <= strains <= 100%) 2816
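The category boundaries in this summary can be written out as a small sketch. It is not Roary's own code: the function name `classify_gene` is hypothetical, the integer comparison is an assumption (Roary may round differently), and only the 99/95/15 percent thresholds are taken from the output above.

```python
def classify_gene(n_present, n_strains):
    """Return the pan-genome category for a gene found in n_present of n_strains genomes.

    Thresholds follow the summary rows quoted in this thread:
    core >= 99%, soft core >= 95%, shell >= 15%, cloud below that.
    """
    if n_strains <= 0 or not 0 <= n_present <= n_strains:
        raise ValueError("need 0 <= n_present <= n_strains and n_strains > 0")
    # Compare integers (100 * present vs. threshold * total) to avoid float rounding.
    scaled = 100 * n_present
    if scaled >= 99 * n_strains:
        return "core"
    if scaled >= 95 * n_strains:
        return "soft core"
    if scaled >= 15 * n_strains:
        return "shell"
    return "cloud"


# The test set in this thread has 62 genomes.
for n in (62, 61, 58, 9):
    print(n, classify_gene(n, 62))
```

With 62 genomes, a gene has to be found in all 62 to count as core (61 of 62 is about 98.4%), so a gene that goes missing from even one genome leaves the core category.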

This is the verbose output:

arnoud@T130[roary] roary -v -i 80 -s *.gff [ 1:21AM]

Please cite Roary if you use any of the results it produces: Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, "Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693 doi: http://doi.org/10.1093/bioinformatics/btv421 Pubmed: 26198102

2016/07/26 01:21:53 Fixing input GFF files
2016/07/26 01:22:10 Extracting proteins from GFF files
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12895.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12896.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12897.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12903.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12904.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12905.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12910.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12912.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12913.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12918.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12921.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12926.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12927.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12928.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12929.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12934.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12935.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12936.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12937.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12942.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12943.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12944.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12945.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12950.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12951.gff
Extracting proteins from /home/arnoud/data/roary/CL1_AGR_LDI12952.gff
Extracting proteins from /home/arnoud/data/roary/CL2_HUM_LDI9893.gff
Extracting proteins from /home/arnoud/data/roary/CL2_HUM_LDI9898.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI12965.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI4911.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI6751.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI6782.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI6791.gff
Extracting proteins from /home/arnoud/data/roary/CL2_WBI_LDI9152.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9149.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9163.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9195.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9196.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9198.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9203.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9205.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9868.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9871.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9876.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9879.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9882.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9888.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9901.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9909.gff
Extracting proteins from /home/arnoud/data/roary/CL3_ENV_LDI9921.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI12894.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI12979.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6735.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6743.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6745.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6759.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6781.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI6783.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI9153.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI9160.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI9161.gff
Extracting proteins from /home/arnoud/data/roary/CL4_WBI_LDI9169.gff
Combine proteins into a single file
Iteratively run cd-hit
Parallel all against all blast
Cluster with MCL
2016/07/26 01:28:17 Running command: pan_genome_post_analysis -o clustered_proteins -p pan_genome.fa -s gene_presence_absence.csv -c _clustered.clstr -i /home/arnoud/data/roary/ev2984x4_I//_gff_files -f /home/arnoud/data/roary/ev2984x4_I//_fasta_files -t 11 --dont_create_rplots --dont_split_groups -v -j Local --processors 1 --group_limit 50000 -cd 99
2016/07/26 01:28:17 Reinflate clusters
2016/07/26 01:28:17 Split groups with paralogs
2016/07/26 01:28:17 Labelling the groups
2016/07/26 01:28:17 Transfering the annotation to the groups
2016/07/26 01:28:36 Creating accessory binary gene presence and absence fasta
2016/07/26 01:28:37 Creating accessory binary gene presence and absence tree
2016/07/26 01:28:37 Running command: /usr/bin/fasttree -fastest -nt accessory_binary_genes.fa > accessory_binary_genes.fa.newick
FastTree Version 2.1.7 SSE3
Alignment: accessory_binary_genes.fa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Fastest+2nd +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.04 seconds
Refining topology: 24 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
Total branch-length 9.218 after 0.76 sec, 101 of 122 nodes
ML-NNI round 1: LogLk = -73659.596 NNIs 13 max delta 190.05 Time 1.12
Switched to using 20 rate categories (CAT approximation)1 of 20
Rate categories were divided by 0.947 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -71584.459 NNIs 3 max delta 16.21 Time 1.36
ML-NNI round 3: LogLk = -71582.788 NNIs 0 max delta 0.00 Time 1.42
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 4: LogLk = -71543.812 NNIs 1 max delta 26.06 Time 1.69 (final)
Optimize all lengths: LogLk = -71519.255 Time 1.77
Total time: 2.32 seconds Unique: 62/62 Bad splits: 1/59 Worst delta-LogLk 0.11

dutchscientist commented 8 years ago

This is the output with 3.6.4:

Core genes (99% <= strains <= 100%) 1350
Soft core genes (95% <= strains < 99%) 20
Shell genes (15% <= strains < 95%) 466
Cloud genes (0% <= strains < 15%) 1353
Total genes (0% <= strains <= 100%) 3189
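To compare two runs like these, the summary text can be parsed and differenced. This is a minimal sketch that assumes the rows look like the ones pasted in this thread; `parse_summary` and `diff_summaries` are hypothetical helper names, not part of Roary.

```python
import re
from typing import Dict

# One row looks like: "Core genes (99% <= strains <= 100%) 1350"
SUMMARY_ROW = re.compile(r"(Core|Soft core|Shell|Cloud|Total) genes\s+\([^)]*\)\s+(\d+)")


def parse_summary(text: str) -> Dict[str, int]:
    """Map each category name to its gene count; works on one-line or multi-line text."""
    return {m.group(1): int(m.group(2)) for m in SUMMARY_ROW.finditer(text)}


def diff_summaries(old: Dict[str, int], new: Dict[str, int]) -> Dict[str, int]:
    """Return new minus old for every category present in both summaries."""
    return {name: new[name] - old[name] for name in old if name in new}


# Counts copied from the two summaries posted in this thread.
v364 = parse_summary(
    "Core genes (99% <= strains <= 100%) 1350 Soft core genes (95% <= strains < 99%) 20 "
    "Shell genes (15% <= strains < 95%) 466 Cloud genes (0% <= strains < 15%) 1353 "
    "Total genes (0% <= strains <= 100%) 3189"
)
v366 = parse_summary(
    "Core genes (99% <= strains <= 100%) 0 Soft core genes (95% <= strains < 99%) 0 "
    "Shell genes (15% <= strains < 95%) 1678 Cloud genes (0% <= strains < 15%) 1138 "
    "Total genes (0% <= strains <= 100%) 2816"
)
delta = diff_summaries(v364, v366)
print(delta)
# {'Core': -1350, 'Soft core': -20, 'Shell': 1212, 'Cloud': -215, 'Total': -373}
```

The shift is not just a relabelling of categories: the total number of gene clusters also drops by 373 between the two runs.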

arnoud@T130[roary] roary -v -i 80 -s *.gff [ 1:41AM]

Please cite Roary if you use any of the results it produces: Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, "Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693 doi: http://doi.org/10.1093/bioinformatics/btv421 Pubmed: 26198102

2016/07/26 01:46:08 Fixing input GFF files
2016/07/26 01:46:25 Extracting proteins from GFF files

Combine proteins into a single file
Iteratively run cd-hit
Parallel all against all blast
Cluster with MCL
2016/07/26 01:56:56 Running command: pan_genome_post_analysis -o clustered_proteins -p pan_genome.fa -s gene_presence_absence.csv -c _clustered.clstr -i /home/arnoud/data/roary/yxviRwP7O8//_gff_files -f /home/arnoud/data/roary/yxviRwP7O8//_fasta_files -t 11 --dont_create_rplots --dont_split_groups -v -j Local --processors 1 --group_limit 50000 -cd 99
2016/07/26 01:56:56 Reinflate clusters
2016/07/26 01:56:57 Split groups with paralogs
2016/07/26 01:56:57 Labelling the groups
2016/07/26 01:56:57 Transfering the annotation to the groups
2016/07/26 01:57:15 Creating accessory binary gene presence and absence fasta
2016/07/26 01:57:16 Creating accessory binary gene presence and absence tree
2016/07/26 01:57:16 Running command: /usr/bin/fasttree -fastest -nt accessory_binary_genes.fa > accessory_binary_genes.fa.newick
FastTree Version 2.1.7 SSE3
Alignment: accessory_binary_genes.fa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Fastest+2nd +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.02 seconds
Refining topology: 23 rounds ME-NNIs, 2 rounds ME-SPRs, 12 rounds ML-NNIs
Total branch-length 3.557 after 0.25 sec, 101 of 112 nodes
ML-NNI round 1: LogLk = -13812.193 NNIs 15 max delta 20.13 Time 0.37
Switched to using 20 rate categories (CAT approximation)1 of 20
Rate categories were divided by 0.853 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -13295.617 NNIs 7 max delta 16.76 Time 0.46
ML-NNI round 3: LogLk = -13294.223 NNIs 1 max delta 0.00 Time 0.51
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 4: LogLk = -13287.413 NNIs 2 max delta 2.06 Time 0.60 (final)
Optimize all lengths: LogLk = -13287.028 Time 0.62
Total time: 0.81 seconds Unique: 57/62 Bad splits: 0/54

andrewjpage commented 8 years ago

Sorry about that, yes a zip would be really useful. Andrew

dutchscientist commented 8 years ago

The GFF files (from Prokka): https://drive.google.com/open?id=0B6RiTqKBNQg6Zm5UdE93OEdrOVk

The Roary 3.6.4 output: https://drive.google.com/open?id=0B6RiTqKBNQg6ODNtdUEzc2hBVlk

The Roary 3.6.6 output: https://drive.google.com/open?id=0B6RiTqKBNQg6ZG5KV0U0SG8yZVE

Runs on Biolinux8, all dependencies up to date.

andrewjpage commented 8 years ago

Thanks a million

andrewjpage commented 8 years ago

Thanks for the files; they allowed me to track down the underlying issue. I've added tests that replicate the bug, fixed it, and deployed a new version (v3.6.7).

dutchscientist commented 8 years ago

Great, will give it a go!

dutchscientist commented 8 years ago

Cool, 3.6.7 gives the "correct" outcome. Will test it with some other datasets soon.

Out of curiosity, what was wrong?

andrewjpage commented 8 years ago

Excellent, I owe you a pint for putting up with my bugs! It was only reading genes from every second contig because I incorrectly used sed. Andrew
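The thread does not show the offending sed command, so the following is only one hypothetical way a sed-style address range can skip every second record, not a reconstruction of the actual bug. It emulates `sed -n '/PAT/,/PAT/p'` in Python, where the same header pattern is used as both the start and the end of the range; the function name `sed_range` and the toy records are made up for illustration.

```python
from typing import Callable, Iterable, List


def sed_range(lines: Iterable[str], is_boundary: Callable[[str], bool]) -> List[str]:
    """Emulate `sed -n '/PAT/,/PAT/p'`.

    A range opens on a line matching PAT and closes on the NEXT line matching PAT
    (inclusive). The closing header is consumed by the range, so it cannot open a
    new one, and the record that follows it is never printed.
    """
    printed = []
    in_range = False
    for line in lines:
        if not in_range:
            if is_boundary(line):
                in_range = True
                printed.append(line)
        else:
            printed.append(line)
            if is_boundary(line):
                in_range = False
    return printed


# Toy input: four contigs, one gene each (hypothetical names).
records = [">contig1", "geneA", ">contig2", "geneB",
           ">contig3", "geneC", ">contig4", "geneD"]
kept = sed_range(records, lambda line: line.startswith(">"))
genes = [line for line in kept if not line.startswith(">")]
print(genes)  # ['geneA', 'geneC']
```

In this toy case the genes on contig2 and contig4 vanish. A loss of that kind in each of the 62 genomes would be consistent with the reported collapse of the core category, since a core gene has to be found in essentially every genome.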

dutchscientist commented 8 years ago

Glad I could help. Roary has improved my bioinformatics skills no end! So the pints are on me ;-)
