sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Error: Couldnt open GFF file #314

Closed ivaatanas closed 7 years ago

ivaatanas commented 7 years ago

Hello! I am trying to use Roary to make a core-genome alignment for around 900 isolates of Pseudomonas aeruginosa. For each isolate I have an assembled genome annotated with Prokka. I did some test runs on 400 random files and these worked fine. When I try a run on the entire data set of 900 files, I get this error: Couldnt open GFF file at /usr/local/share/perl5/Bio/Roary/ContigsToGeneIDsFromGFF.pm line 24.

This happens 8 hours into the run, and some files do get generated: accessory_binary_genes.fa, _combined_files.groups, blast_identity_frequency.Rtab, _inflated_mcl_groups, _clustered, _inflated_unsplit_mcl_groups, _clustered.clstr, _labeled_mcl_groups, clustered_proteins, _uninflated_mcl_groups, _combined_files.

accessory_binary_genes.fa is the only empty file. Do you know what the problem might be? Could it be that I am trying to run too many files? Thank you :)

Iva

andrewjpage commented 7 years ago

Hi Iva, the point at which it fails suggests that GNU awk or GNU sed is not available on your system. Linux systems should be fine, but other Unix systems (like OSX) can ship a different implementation by default; the OSX installation instructions take this into account. If you type 'awk --version' and 'sed --version', the output should look like this:

$ awk --version
GNU Awk 3.1.8
Copyright (C) 1989, 1991-2010 Free Software Foundation.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

$ sed --version
GNU sed version 4.2.1
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, to the extent permitted by law.

GNU sed home page: http://www.gnu.org/software/sed/.
General help using GNU software: http://www.gnu.org/gethelp/.
E-mail bug reports to: bug-gnu-utils@gnu.org.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.
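The check described above can be scripted. The following is only a minimal sketch: the helper name is_gnu is made up for illustration, and it assumes that a GNU tool prints the word "GNU" on the first line of its --version banner, as both sample outputs above do. Other implementations either fail on --version or print a banner without "GNU".

```shell
#!/bin/sh
# is_gnu is a hypothetical helper, not part of Roary.
# It succeeds only if the first line printed by "<tool> --version"
# contains "GNU". A missing tool, an unsupported --version option, or
# a banner without "GNU" all count as "not GNU".
is_gnu() {
    banner=$("$1" --version 2>/dev/null | head -n 1)
    case "$banner" in
        *GNU*) return 0 ;;
        *)     return 1 ;;
    esac
}

for tool in awk sed; do
    if is_gnu "$tool"; then
        echo "$tool: GNU implementation found"
    else
        echo "$tool: not GNU (or not installed)"
    fi
done
```

If either tool is reported as not GNU, installing the GNU versions and putting them first on the PATH is the usual remedy.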


ivaatanas commented 7 years ago

Dear Andrew,

Thank you very much for your fast reply! I am using Fedora, and my Awk version is 4.0.1. Sed is 4.2.1. So it looks like both of these are available on my system. Is the problem that Awk is not 3.1.8? Or might there be something else I have to change?

Thank you again!

Iva

ivaatanas commented 7 years ago

(In other words - my installation worked fine and I managed to do runs on up to 400 files. Now when I am trying to run 900 files, it gives the aforementioned error.)

andrewjpage commented 7 years ago

Yes indeed, since you've run it successfully before, your installation is working. Roary has been run on 10,000 genomes, so it's not the size of the dataset that's the issue. Do you have enough free disk space, or are the GFF files you are working with on a network storage system?


ivaatanas commented 7 years ago

The GFF files I am working with are stored on my computer. Regarding the available space, I will copy in the df -h output: devtmpfs (available 7.8 G, mounted on /dev), tmpfs (available 7.8 G, mounted on /dev/shm), tmpfs (available 7.1 G, mounted on /run), tmpfs (available 7.8 G, mounted on /sys/fs/cgroup), /dev/sdb3 (available 30 G, mounted on /), tmpfs (available 7.8 G, mounted on /tmp), /dev/sdb5 (available 103 G, mounted on /tmp), /dev/sdb1 (available 297 M, mounted on /boot). I am running Roary on files in the /home directory, where I have 103 G available.

andrewjpage commented 7 years ago

A quick back-of-the-envelope calculation suggests your GFF files are about 13 GB in size. If Roary happens to be writing to anything other than the 103 GB partition, you will run into issues. Could you send me the raw output of 'df -h'? Something looks off about the layout of your disks.
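To compare the input size against the free space where output and temporary files end up, something like the following could be run before starting. It is a sketch only: GFF_DIR and WORK_DIR are hypothetical variables (both default to the current directory), the helper names are made up, and which directories Roary actually writes to on a given setup is an assumption to verify.

```shell
#!/bin/sh
# Free space, in KiB, on the filesystem that holds the given path.
# "df -P" forces the portable one-line-per-filesystem layout, in which
# column 4 is the available space.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Total size, in KiB, of everything under the given directory.
used_kib() {
    du -sk "$1" | awk '{ print $1 }'
}

# Hypothetical locations; adjust to your own layout.
GFF_DIR=${GFF_DIR:-.}
WORK_DIR=${WORK_DIR:-.}
TMP_DIR=${TMPDIR:-/tmp}

echo "GFF input size:      $(used_kib "$GFF_DIR") KiB"
echo "Free in working dir: $(free_kib "$WORK_DIR") KiB"
echo "Free in temp dir:    $(free_kib "$TMP_DIR") KiB"
```

If the free space in either location is close to or below the input size, the run is likely to fail partway through.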

ivaatanas commented 7 years ago

memory.txt

ivaatanas commented 7 years ago

I have 962 GFF files, which is 8.9 GB. Maybe it is also important to point out that I was running everything on 8 threads. Thank you again Andrew for replying so quickly!

ivaatanas commented 7 years ago

This could also help in solving the puzzle: in my previous runs I had 4 separate batches of GFF files. I ran Roary with the mafft command for each of these batches, and it worked perfectly fine (the size of the core and the other numbers from the summary statistics look ok). So I know that all my GFF files should be fine. Now I have to combine these 4 batches into one and run Roary on all 962 files together. This is where I get the error.

andrewjpage commented 7 years ago

It is most likely an issue of insufficient resources if smaller batches work fine and a combined larger batch does not. I would recommend trying to run it on a bigger machine (or VM on the Amazon cloud) with more RAM and disk space.
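A quick way to see how much memory is available before a large run is to read /proc/meminfo. This is a Linux-only sketch: the helper name is made up, and it assumes a kernel recent enough to report the MemAvailable field.

```shell
#!/bin/sh
# mem_available_gib is a hypothetical helper, not part of Roary.
# It prints the MemAvailable value, converted from kB to GiB, from a
# meminfo-style file (default: /proc/meminfo). It prints nothing if the
# field is absent.
mem_available_gib() {
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "Available RAM: $(mem_available_gib) GiB"
else
    echo "No /proc/meminfo here; use your system's own memory tool"
fi
```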

ivaatanas commented 7 years ago

Dear Andrew,

Thank you again for the fast reply. I truly hope that this is the problem. I will try to get access to one of the servers at our department. I will get back to you and hopefully close this question if the run goes fine.

ivaatanas commented 7 years ago

Dear Andrew,

Do you think it would be possible to add Roary to Galaxy? I managed to get access to the CLIMB server and I would like to use it for running Roary on my dataset. My apologies if this question has been discussed somewhere else before.

andrewjpage commented 7 years ago

I'm afraid we don't use Galaxy, but if you want to integrate it, fire ahead. I use CLIMB as well and I find SSHing in works best for me.

andrewjpage commented 7 years ago

@ivaatanas Thanks to the great work of @slugger70 Roary will be in Galaxy very soon.