sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

MSG: Got a sequence without letters. Could not guess alphabet #127

Closed mushalallam closed 9 years ago

mushalallam commented 9 years ago

Hi, I got this error when I tried to create a core alignment. Thanks

andrewjpage commented 9 years ago

Hi, just to double check, did you install muscle and RevTrans.py? Could you send me a directory listing (ls -alrt) so I can get an idea of what's gone wrong? Thanks, Andrew

mushalallam commented 9 years ago

Hi Andrew, this is how I ran the command:

roary -e -i 70 --core_definition 90 --dont_delete_files *.gff

I have Muscle and revtrans.py installed in my path. Below is the ls -alrt output:

ma11:v3 ma11$ ls -alrt
total 34480
drwxr-xr-x 20 ma11 staff 680 May 27 10:39 ..
-rwxr-xr-x 1 ma11 staff 2818200 May 27 10:39 NT45_03212015.gff
-rwxr-xr-x 1 ma11 staff 2684389 May 27 10:39 NT224_03212015.gff
-rwxr-xr-x 1 ma11 staff 2753976 May 27 10:39 NT12_03212015.gff
-rwxr-xr-x 1 ma11 staff 2763286 May 27 10:39 NT11_03212015.gff
-rw-r--r-- 1 ma11 staff 37095 May 27 13:10 database_masking.asnb
-rw-r--r-- 1 ma11 staff 224049 May 27 13:10 _combined_files.groups
-rw-r--r-- 1 ma11 staff 571707 May 27 13:10 _combined_files
-rw-r--r-- 1 ma11 staff 115891 May 27 13:10 _clustered.clstr
-rw-r--r-- 1 ma11 staff 381953 May 27 13:10 _clustered
-rw-r--r-- 1 ma11 staff 211 May 27 13:10 blast_identity_frequency.Rtab
-rw-r--r-- 1 ma11 staff 41872 May 27 13:10 _uninflated_mcl_groups
-rw-r--r-- 1 ma11 staff 73 May 27 13:10 _gff_files
-rw-r--r-- 1 ma11 staff 125 May 27 13:10 _fasta_files
-rw-r--r-- 1 ma11 staff 604198 May 27 13:10 _blast_results
-rw-r--r-- 1 ma11 staff 314397 May 27 13:10 _labeled_mcl_groups
-rw-r--r-- 1 ma11 staff 288108 May 27 13:10 _inflated_unsplit_mcl_groups
-rw-r--r-- 1 ma11 staff 288108 May 27 13:10 _inflated_mcl_groups
-rw-r--r-- 1 ma11 staff 170 May 27 13:10 number_of_unique_genes.Rtab
-rw-r--r-- 1 ma11 staff 153 May 27 13:10 number_of_new_genes.Rtab
-rw-r--r-- 1 ma11 staff 200 May 27 13:10 number_of_genes_in_pan_genome.Rtab
-rw-r--r-- 1 ma11 staff 200 May 27 13:10 number_of_conserved_genes.Rtab
-rw-r--r-- 1 ma11 staff 413887 May 27 13:10 gene_presence_absence.csv
-rw-r--r-- 1 ma11 staff 0 May 27 13:10 core_accessory.tab
-rw-r--r-- 1 ma11 staff 314397 May 27 13:10 clustered_proteins
-rw-r--r-- 1 ma11 staff 156 May 27 13:10 core_accessory.header.embl
-rw-r--r-- 1 ma11 staff 0 May 27 13:10 accessory.tab
-rw-r--r-- 1 ma11 staff 156 May 27 13:10 accessory.header.embl
drwxr-xr-x 4569 ma11 staff 155346 May 27 13:13 pan_genome_sequences
-rw-r--r-- 1 ma11 staff 662815 May 27 13:14 NT11_03212015.gff.proteome.faa
-rw-r--r-- 1 ma11 staff 661577 May 27 13:14 NT12_03212015.gff.proteome.faa
-rw-r--r-- 1 ma11 staff 646061 May 27 13:14 NT224_03212015.gff.proteome.faa
-rw-r--r-- 1 ma11 staff 282267 May 27 13:14 pan_genome_results
-rw-r--r-- 1 ma11 staff 677878 May 27 13:14 NT45_03212015.gff.proteome.faa
drwxr-xr-x 38 ma11 staff 1292 May 27 13:16 .
-rw-r--r--@ 1 ma11 staff 15364 May 27 13:18 .DS_Store
-rw-r--r-- 1 ma11 staff 65 May 27 13:41 output_alignment.aln
-rw-r--r-- 1 ma11 staff 65 May 27 13:50 core_gene_alignment.aln

Thanks

andrewjpage commented 9 years ago

Thanks for that. Could you email me the spreadsheet file called gene_presence_absence.csv?
It's path-help@sanger.ac.uk as usual. Regards, Andrew

andrewjpage commented 9 years ago

Hi Mushal, I've just released a new version (2.3.0) which I 'hope' will resolve the issue you're having. Could you give it a whirl and let me know how you get along? Andrew

mushalallam commented 9 years ago

Many thanks @andrewjpage, it's working well :)

andrewjpage commented 9 years ago

Thanks for letting me know.

jsan4christ commented 4 years ago

Hi @andrewjpage

I'm using roary 3.13.0 and have this problem:

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet

and an alignment file that looks like this:


PRES009

PRES012

PRES014

PRES019

PRES021

PRES024

PRES025

PRES026

PRES028

The command line I used is:

roary -e --mafft -p 8 -t 1 -f prokka/gffs/roary_output/ prokka/gffs/*.gff

Please advise,

tseemann commented 4 years ago

I think the current version of roary is 3.14.0. That warning comes from bioperl and it usually means you have lots of - or N letters in your sequence. What version of prokka did you use?
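As a rough way to check for this, here is a minimal Python sketch that reports, for each record in a FASTA file, the fraction of characters that are N or -. It is only an illustration: `read_fasta` and `ambiguity_report` are hypothetical helper names, not part of Roary or BioPerl, and treating an empty record as fully ambiguous (1.0) is my own assumption.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def ambiguity_report(text: str) -> Dict[str, float]:
    """Map each record ID to the fraction of its characters that are N or '-'.

    Assumption: a record with no sequence at all is reported as 1.0.
    """
    report = {}
    for name, seq in read_fasta(text):
        if not seq:
            report[name] = 1.0
            continue
        ambiguous = sum(1 for c in seq.upper() if c in "N-")
        report[name] = ambiguous / len(seq)
    return report
```

Records that come back as 1.0 are the ones with nothing but N, gaps, or no sequence at all. The sketch does not pick a threshold; what fraction is acceptable depends on the dataset.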

jsan4christ commented 4 years ago

Thanks for your response,

And what is the best approach to the warning, identify and remove the sequences with many Ns? What is considered an acceptable threshold for Ns?

Kind regards.


jsan4christ commented 4 years ago

I'm having trouble locating an installer for version 3.14.0.

Thanks


thorellk commented 3 years ago

Hi! I would also like to revive this issue. I am running roary (3.12.0) on a dataset consisting of 2170 bacterial genomes. The command I ran was the following:

roary -p 16 -e -s -n -f roary_id85-s -i 85 *gff

The process seemingly runs fine and the correct output files are generated, but I get the following error message twice:

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

Also the core_gene_alignment.aln file is very small (seems to consist only of one gene or so) despite the summary statistics file stating that there should be 1141 core genes.

I previously QC'd my genomes by running sendsketch and validating the dubious ones with Kraken. I also made a mash tree to double-check and remove outliers, and removed assemblies with over 200 contigs. After this I used prokka v1.12 for annotation (I know it's an old version). Is this error message due to low quality or high divergence among the genomes, as suggested by some answers I have found, or to Ns in the sequences, or something else? And most importantly, how can I mitigate it? I visualised the nwk and gene_presence_absence.csv files in Phandango and I cannot see any genome behaving weirdly there (e.g. containing very few core genes or being very divergent from the others).

Thank you for your help!

thorellk commented 3 years ago

To follow up on this, is there any way to identify the sequences that give rise to this error and modify or exclude them? Since the error message was repeated twice, I assume there are two of them? @andrewjpage @tseemann
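One possible way to hunt for the offending records is to scan the per-gene alignment files for entries that contain no letters at all. The Python sketch below does that. It assumes the warning is triggered by records that are empty or made up only of gap characters, which nobody in this thread has confirmed. The function name `letterless_records` and the default `*.aln` glob are hypothetical, so adjust the pattern to whatever the files in your output directory are actually called.

```python
import re
from pathlib import Path
from typing import List, Tuple

LETTER = re.compile(r"[A-Za-z]")


def letterless_records(directory: str, pattern: str = "*.aln") -> List[Tuple[str, str]]:
    """Return (file name, record header) for every FASTA record with no letters.

    The glob pattern is an assumption; change it to match your files.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        if not path.is_file():
            continue
        name, has_letters = None, False
        for line in path.read_text().splitlines():
            if line.startswith(">"):
                # Close out the previous record before starting a new one.
                if name is not None and not has_letters:
                    hits.append((path.name, name))
                name, has_letters = line[1:].strip(), False
            elif name is not None and LETTER.search(line):
                has_letters = True
        # Close out the last record in the file.
        if name is not None and not has_letters:
            hits.append((path.name, name))
    return hits
```

Calling it on the directory of alignments (for example the pan_genome_sequences directory kept by --dont_delete_files in the listing earlier in this thread) gives a list of file and record names to inspect by hand before deciding whether to exclude anything.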

xin-bang commented 10 months ago

I had this problem too, but in my case it was caused by the roary version. When I first installed conda, I didn't add any new channels to the conda config, so running conda install bioconda::roary gave me roary 3.7.0 from conda's default channel instead of version 3.13.0 from anaconda.org. You can check your version with roary -w. Version 3.7.0 will run into this problem. You can solve it with these commands:

conda config --add channels conda-forge
conda config --add channels r
conda config --add channels bioconda

and then use conda install bioconda::roary to install roary version 3.13.0.
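If you want to script the version check, something like the following Python sketch could work. It runs roary -w and pulls the first dotted version number out of whatever is printed. I have not verified the exact output format of roary -w, so the regular expression is a guess, and `parse_version` and `roary_version` are hypothetical helper names.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract the first dotted version number (e.g. 3.13.0) as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group().split("."))


def roary_version() -> Optional[Tuple[int, ...]]:
    """Return the installed roary version, or None if roary is not on PATH."""
    if shutil.which("roary") is None:
        return None
    done = subprocess.run(["roary", "-w"], capture_output=True, text=True)
    # Assumption: the version may be printed on either stdout or stderr.
    return parse_version(done.stdout + done.stderr)
```

Tuples compare element by element, so `roary_version() < (3, 13, 0)` is true for 3.7.0, which a plain string comparison would get wrong.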