sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>') #346

semiramisCJ closed this issue 7 years ago

semiramisCJ commented 7 years ago

We have no problems running Roary with GFF3 files from Prokka, but Roary dies when we use different GFF3 files (described at the end). All of these GFF3 files have the nucleotide sequence at the end of the file, include the optional '##FASTA' line, and have FASTA headers.
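The structure described above (a '##FASTA' line followed by a FASTA header) can be checked with a small script. This is only a minimal sketch: the function name `has_fasta_section` is hypothetical and not part of Roary, and passing this check does not by itself mean Roary will accept the file.

```python
def has_fasta_section(lines):
    """Return True if a '##FASTA' directive is followed by a '>' header line.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Hypothetical helper for illustration; not part of Roary.
    """
    seen_directive = False
    for line in lines:
        stripped = line.strip()
        if not seen_directive:
            # Keep scanning until the '##FASTA' directive appears.
            seen_directive = stripped == "##FASTA"
        elif stripped:
            # The first non-blank line after the directive must be a header.
            return stripped.startswith(">")
    return False
```

With a real file this could be called as `has_fasta_section(open("py_A_sp_B1.gff3.txt"))`.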

Roary gives the following message:

2017/09/01 20:32:01 Extracting proteins from GFF files

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::fasta::next_seq /usr/local/share/perl5/Bio/SeqIO/fasta.pm:126
STACK: Bio::Roary::FilterUnknownsFromFasta::_filter_fasta_sequences_and_return_new_file /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:58
STACK: Bio::Roary::FilterUnknownsFromFasta::filtered_fasta_files /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:28
STACK: Bio::Roary::PrepareInputFiles::_input_fasta_files_filtered /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:58
STACK: Bio::Roary::PrepareInputFiles::fasta_files /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:82
STACK: Bio::Roary::CommandLine::Roary::run /usr/local/share/perl5/Bio/Roary/CommandLine/Roary.pm:277
STACK: /usr/local/bin/roary:14

I converted the GBK files to GFF3 in two ways:
a) the seqret module plus Python to move all the FASTA records to the end of the file [seqret* files]
b) GFF from BCBio and SeqIO in Python 2.7, plus SeqIO again to add the nucleotide sequence at the end [py* files]

py_A_sp_B1.gff3.txt py_A_denitrificans_K601.gff3.txt py_A_denitrificans_BC.gff3.txt seqret_A_sp_B1.gbk.gff3.txt seqret_A_denitrificans_K601.gbk.gff3.txt seqret_A_denitrificans_BC.gbk.gff3.txt

Could somebody help us to find out how to solve this issue? Thanks in advance & best regards.

P.S.- roary -a gives the following details

Please cite Roary if you use any of the results it produces: Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, "Roary: Rapid large-scale prokaryote pan genome analysis", Bioinformatics, 2015 Nov 15;31(22):3691-3693 doi: http://doi.org/10.1093/bioinformatics/btv421 Pubmed: 26198102

2017/09/01 20:18:49 Looking for 'Rscript' - found /usr/local/R/bin/Rscript
2017/09/01 20:18:49 Determined Rscript version is 3.2
2017/09/01 20:18:49 Looking for 'awk' - found /bin/awk
2017/09/01 20:18:49 Looking for 'bedtools' - found /usr/local/bedtools/bin/bedtools
2017/09/01 20:18:49 Determined bedtools version is 2.25
2017/09/01 20:18:49 Looking for 'blastp' - found /usr/local/blast+/bin/blastp
2017/09/01 20:18:49 Determined blastp version is 2.2.28
2017/09/01 20:18:49 Looking for 'grep' - found /bin/grep
2017/09/01 20:18:49 Optional tool 'kraken' not found in your $PATH
2017/09/01 20:18:49 Optional tool 'kraken-report' not found in your $PATH
2017/09/01 20:18:49 Looking for 'mafft' - found /usr/bin/mafft
Use of uninitialized value in concatenation (.) or string at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 129.
2017/09/01 20:18:49 Determined mafft version is
2017/09/01 20:18:49 Looking for 'makeblastdb' - found /usr/local/blast+/bin/makeblastdb
2017/09/01 20:18:49 Determined makeblastdb version is 2.2.28
2017/09/01 20:18:49 Looking for 'mcl' - found /usr/local/bin/mcl
2017/09/01 20:18:49 Determined mcl version is 12-068
2017/09/01 20:18:49 Looking for 'parallel' - found /usr/local/masurca/bin/parallel
2017/09/01 20:18:49 Determined parallel version is 20120822
2017/09/01 20:18:49 Roary needs parallel 20130422 or higher. Please upgrade and try again.
2017/09/01 20:18:49 Looking for 'prank' - found /usr/local/bin/prank
2017/09/01 20:18:49 Looking for 'sed' - found /bin/sed
2017/09/01 20:18:49 Looking for 'cd-hit' - found /usr/local/cd-hit/cd-hit
Use of uninitialized value in concatenation (.) or string at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 129.
2017/09/01 20:18:49 Determined cd-hit version is
Use of uninitialized value in numeric lt (<) at /usr/local/share/perl5/Bio/Roary/External/CheckTools.pm line 130.
2017/09/01 20:18:49 Roary needs cd-hit 4.6 or higher. Please upgrade and try again.
2017/09/01 20:18:49 Looking for 'FastTree' - found /usr/local/bin/FastTree
2017/09/01 20:18:50 Determined FastTree version is 2.1
2017/09/01 20:18:50 Roary version 3.7.0

tseemann commented 7 years ago

The pyA GFF3 file is a bit unusual: it has the ##sequence-region lines scattered throughout rather than at the top.

I loaded your pyA file into http://genometools.org/cgi-bin/gff3validator.cgi and got this error:

Validation unsuccessful!

GenomeTools error: attribute "pseudo=" on line 5 in file "/var/www/servers/genometools.org/htdocs/cgi-bin/gff3/py_A_sp_B1.gff3.txt" has no value

The converter you used has a problem. The /pseudo tag in GenBank is a value-less key. However, GFF3 does not support value-less keys, so the converter writes pseudo= into the file, which is invalid.

You could try sed -e 's/;pseudo=//g' < old.gff > new.gff and see if that works.
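For anyone who would rather do the same cleanup in Python, the sketch below drops every empty-valued attribute (not only pseudo=) from the ninth column of feature lines. The function name `strip_empty_attributes` is hypothetical, and it assumes tab-separated, nine-column feature lines; it is an illustration, not what Roary or the validator does internally.

```python
def strip_empty_attributes(gff_line):
    """Remove attributes such as 'pseudo=' that have a key but no value.

    Hypothetical helper: assumes feature lines are tab-separated with
    nine columns. All other lines are returned unchanged.
    """
    # Directives, comments, FASTA headers and blank lines are left alone.
    if gff_line.startswith(("#", ">")) or not gff_line.strip():
        return gff_line
    newline = "\n" if gff_line.endswith("\n") else ""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        # Not a feature line (e.g. raw sequence in the FASTA section).
        return gff_line
    kept = []
    for attribute in columns[8].split(";"):
        name, separator, value = attribute.partition("=")
        # Drop empty fragments and 'key=' pairs with nothing after '='.
        if attribute and not (separator and not value):
            kept.append(attribute)
    # '.' is the GFF3 placeholder for an empty column.
    columns[8] = ";".join(kept) if kept else "."
    return "\t".join(columns) + newline
```

Applied line by line, this could be used as `new.writelines(strip_empty_attributes(line) for line in old)` with `old` and `new` being open file handles.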

semiramisCJ commented 7 years ago

Thank you very much for your prompt reply!

I fixed the py_* files to resolve all the issues reported by the GFF3 validator, and I moved the ##sequence-region lines to the top. However, Roary still dies with the same message, even though the online GFF3 validator reports that validation was successful for each of the files:
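Moving the ##sequence-region lines to the top can be sketched as follows. This is a minimal illustration under stated assumptions, not the script actually used here: the function name `hoist_sequence_regions` is hypothetical, and it assumes the whole file fits in memory as a list of lines.

```python
def hoist_sequence_regions(lines):
    """Return the lines with '##sequence-region' directives moved to the top.

    The '##gff-version' line stays first, followed by every
    '##sequence-region' line in original order, then everything else.
    Content after '##FASTA' is never reordered. Hypothetical helper.
    """
    version, regions, body = [], [], []
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if not in_fasta and line.startswith("##gff-version"):
            version.append(line)
        elif not in_fasta and line.startswith("##sequence-region"):
            regions.append(line)
        else:
            body.append(line)
    return version + regions + body
```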

A_sp_B1.gff3.txt A_denitrificans_K601.gff3.txt A_denitrificans_BC.gff3.txt

Could you please help us to find what else is wrong with the converted files? Thank you very much in advance and best regards.

2017/09/04 20:14:31 Fixing input GFF files
2017/09/04 20:14:42 Extracting proteins from GFF files

MSG: The sequence does not appear to be FASTA format (lacks a descriptor line '>')
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:472
STACK: Bio::SeqIO::fasta::next_seq /usr/local/share/perl5/Bio/SeqIO/fasta.pm:126
STACK: Bio::Roary::FilterUnknownsFromFasta::_filter_fasta_sequences_and_return_new_file /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:58
STACK: Bio::Roary::FilterUnknownsFromFasta::filtered_fasta_files /usr/local/share/perl5/Bio/Roary/FilterUnknownsFromFasta.pm:28
STACK: Bio::Roary::PrepareInputFiles::_input_fasta_files_filtered /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:58
STACK: Bio::Roary::PrepareInputFiles::fasta_files /usr/local/share/perl5/Bio/Roary/PrepareInputFiles.pm:82
STACK: Bio::Roary::CommandLine::Roary::run /usr/local/share/perl5/Bio/Roary/CommandLine/Roary.pm:277
STACK: /usr/local/bin/roary:14

andrewjpage commented 7 years ago

If the filename ends in '.gff' it is assumed to be a GFF file, otherwise it is assumed to be a FASTA file of genes. So the solution is to rename your file extensions from '.gff3.txt' to '.gff'. I've run your data and it works fine after renaming.
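The renaming can be scripted. The sketch below copies each '.gff3.txt' file to a sibling with a '.gff' extension, leaving the originals in place; the function names are hypothetical and the directory layout is an assumption.

```python
import shutil
from pathlib import Path


def roary_gff_name(path):
    """Return `path` with a trailing '.gff3.txt' or '.gff3' replaced by '.gff'.

    Hypothetical helper; paths with any other ending are returned as-is.
    """
    path = Path(path)
    for suffix in (".gff3.txt", ".gff3"):
        if path.name.endswith(suffix):
            return path.with_name(path.name[: -len(suffix)] + ".gff")
    return path


def copy_with_gff_extension(directory):
    """Copy every '*.gff3.txt' file in `directory` to a '.gff' sibling."""
    copies = []
    for source in sorted(Path(directory).glob("*.gff3.txt")):
        target = roary_gff_name(source)
        shutil.copyfile(source, target)  # keep the original untouched
        copies.append(target)
    return copies
```

After `copy_with_gff_extension(".")`, Roary can be pointed at the resulting '*.gff' files.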

andrewjpage commented 7 years ago

Additionally, I have updated Roary to catch this case and fix it on the fly.

tseemann commented 7 years ago

@semiramisCJ to use the auto-detect, you will need to upgrade via CPAN.