sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

MSG: Got a sequence without letters. Could not guess alphabet #229

Closed abremges closed 8 years ago

abremges commented 8 years ago

I tried to run Roary on 7 different bacterial strains (draft Illumina assemblies, annotated with Prokka). I'm not entirely sure how related the strains are, maybe we have more diversity in there than initially thought. The command-line call was roary -v -e --mafft -p 4 *.gff 2>&1, and everything seems to run fine until:

[snip]
2016/02/04 15:04:45 Running command: /usr/local/bin/FastTree -fastest -nt accessory_binary_genes.fa > accessory_binary_genes.fa.newick 
FastTree Version 2.1.8 Double precision (No SSE3), OpenMP (4 threads)
Alignment: accessory_binary_genes.fa
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Fastest+2nd +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 11 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
Total branch-length 3.147 after 0.03 sec
ML-NNI round 1: LogLk = -10156.152 NNIs 2 max delta 13.48 Time 0.06
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.670 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -9984.591 NNIs 1 max delta 0.00 Time 0.08
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -9984.591 NNIs 0 max delta 0.00 Time 0.10 (final)
      0.10 seconds: ML Lengths 1 of 5 splits
Optimize all lengths: LogLk = -9984.591 Time 0.11
Total time: 0.15 seconds Unique: 7/7 Bad splits: 0/4
Use of uninitialized value in require at (eval 7471) line 1.
2016/02/04 15:04:56 Running command: pan_genome_core_alignment -cd 99

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

This seems to be related to https://github.com/tseemann/snippy/issues/24, which @tseemann resolved by telling Bio::SeqIO->new the alphabet upfront. Maybe a similar strategy would resolve this issue?
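The warning text suggests that the parser met records with a header but no residues. As a way to confirm that before changing any parser options, here is a minimal Python sketch that lists such records in a FASTA file. It is not part of Roary or BioPerl; the function name `empty_fasta_records` and the example records are made up for illustration.

```python
import re
from typing import Iterable, List


def empty_fasta_records(lines: Iterable[str]) -> List[str]:
    """Return the IDs of FASTA records whose sequence contains no letters."""
    empty: List[str] = []
    current_id = None
    has_letters = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current_id is not None and not has_letters:
                empty.append(current_id)
            header = line[1:].split()
            current_id = header[0] if header else ""
            has_letters = False
        elif current_id is not None and re.search(r"[A-Za-z]", line):
            has_letters = True
    # Close the final record.
    if current_id is not None and not has_letters:
        empty.append(current_id)
    return empty


# Hypothetical records: strain2 is blank, strain3 holds only gap characters.
example = [">strain1", "ACGT", ">strain2", "", ">strain3", "----", ">strain4", "TTGA"]
print(empty_fasta_records(example))  # ['strain2', 'strain3']
```

To check a real file, pass an open file handle instead of the list, e.g. `empty_fasta_records(open(path))`, where `path` is whichever alignment you want to inspect.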

andrewjpage commented 8 years ago

Hi, Could you send me the output of your summary statistics file? Andrew

abremges commented 8 years ago

Uh, empty (soft) core set? Is this the reason for "Use of uninitialized value in require at (eval 7471) line 1." and all the warnings?

Core genes       (99% <= strains <= 100%)   0
Soft core genes  (95% <= strains < 99%)     0
Shell genes      (15% <= strains < 95%)     2435
Cloud genes      (0% <= strains < 15%)      203
Total genes      (0% <= strains <= 100%)    2638

andrewjpage commented 8 years ago

It looks like you might have some outliers? I find Kraken (https://ccb.jhu.edu/software/kraken/) is good for QCing data. Or, if you open the gene presence and absence spreadsheet in Excel, you should be able to spot the odd one out quite easily.
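For anyone who would rather not eyeball the spreadsheet, the same check can be sketched in Python. This is only an illustration under stated assumptions: it expects a tab-delimited 0/1 matrix with gene names in the first column and one column per sample (the layout I would expect from Roary's gene_presence_absence.Rtab; the CSV spreadsheet carries extra annotation columns and would need them stripped first). The function name and the toy data are hypothetical.

```python
import csv
import io
from typing import Dict, Set


def mean_jaccard_per_sample(table_text: str, delimiter: str = "\t") -> Dict[str, float]:
    """Mean Jaccard similarity of each sample's gene set to every other sample."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # assumption: first column is the gene name
    genes: Dict[str, Set[str]] = {s: set() for s in samples}
    for row in reader:
        if not row:
            continue
        gene, flags = row[0], row[1:]
        for sample, flag in zip(samples, flags):
            if flag.strip() == "1":
                genes[sample].add(gene)
    scores: Dict[str, float] = {}
    for s in samples:
        sims = []
        for t in samples:
            if t == s:
                continue
            union = genes[s] | genes[t]
            sims.append(len(genes[s] & genes[t]) / len(union) if union else 0.0)
        scores[s] = sum(sims) / len(sims) if sims else 0.0
    return scores


# Toy matrix: samples A and B share three genes, C shares none with them.
toy = "Gene\tA\tB\tC\ng1\t1\t1\t0\ng2\t1\t1\t0\ng3\t1\t1\t0\ng4\t0\t0\t1\n"
scores = mean_jaccard_per_sample(toy)
print(min(scores, key=scores.get))  # C
```

A sample whose mean similarity sits well below the rest is a candidate outlier; what counts as "well below" depends on the dataset, so no threshold is built in.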


abremges commented 8 years ago

Thanks, will have a closer look at my data, which seems to be the cause of this issue. I now believe my input genomes are not as closely related as we thought, making them unsuitable – at least as a whole – for a Roary analysis.

As a minor enhancement, I'd propose catching this error (empty core set) in your pipeline, instead of printing the somewhat obscure "Use of uninitialized value" and Bio::SeqIO warnings. This would boost usability in some edge cases.
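To make the proposal concrete, here is a rough Python sketch of such a guard. It is not Roary's code (Roary is written in Perl) and the function names are invented; it only assumes the summary statistics look like the lines I posted above, i.e. a label, a bracketed range, and a count.

```python
import re
from typing import Dict

# The summary statistics reported earlier in this thread.
REPORTED = """\
Core genes  (99% <= strains <= 100%)    0
Soft core genes (95% <= strains < 99%)  0
Shell genes (15% <= strains < 95%)  2435
Cloud genes (0% <= strains < 15%)   203
Total genes (0% <= strains <= 100%) 2638
"""


def parse_summary(text: str) -> Dict[str, int]:
    """Map each category label (the text before the bracket) to its count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^()]+?)\s*\(.*\)\s+(\d+)\s*$", line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def require_core_genes(counts: Dict[str, int]) -> int:
    """Return the core plus soft core count, or raise if it is zero."""
    core = counts.get("Core genes", 0) + counts.get("Soft core genes", 0)
    if core == 0:
        raise ValueError(
            "no core or soft core genes found; the input genomes may be "
            "too distantly related for a core alignment"
        )
    return core


try:
    require_core_genes(parse_summary(REPORTED))
except ValueError as err:
    print(f"Stopping before the core alignment step: {err}")
```

Failing early with a message like this would point straight at the input data instead of at the sequence parser.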

Thanks for your input & efforts!

andrewjpage commented 8 years ago

Yes I agree that I should be catching this case. Thanks for the feedback.

tseemann commented 8 years ago

@abremges thanks for the reminder on how I fixed that! I knew I'd solved it somewhere before but I couldn't remember.

jsan4christ commented 4 years ago

@tseemann, how do I apply this to Roary?

my $fh = Bio::SeqIO->new(-file=>$aln_file, -format=>'fasta', -alphabet=>'dna');

I'm having the same problem. The following warning is printed over and over:

MSG: Got a sequence without letters. Could not guess alphabet