merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

A solution for prodigal segmentation fault errors #2282

Closed meren closed 3 months ago

meren commented 3 months ago

We recently have been hearing about some prodigal issues from our users that led to errors similar to this one:

    gene_calls_dict, amino_acid_sequences_dict = gene_caller.process(self.fasta_file_path, output_dir)
  File "folderpath/github/anvio/anvio/drivers/prodigal.py", line 161, in process
    state = prodigal_runner.run()
  File "folderpath/github/anvio/anvio/threadingops.py", line 182, in run
    **self._run_commands(),
  File "folderpath/github/anvio/anvio/threadingops.py", line 291, in _run_commands
    self._check_threads_for_errors()
  File "folderpath/github/anvio/anvio/threadingops.py", line 333, in _check_threads_for_errors
    raise thread.target_return_value

anvio.errors.CommandError: 

Command Error: Command failed to run. What command, you say? This: 'prodigal -m -i folderpath/sample.fasta.0 -o folderpath/contigs.genes.split_0 -a folderpath/contigs.amino_acid_sequences.split_0 -p meta'

By digging into the error logs with the --debug flag and re-running the command manually, I realized that this was related to some weird memory issue on the prodigal side:

$ prodigal -m -i err.fa -o genes.txt -a aa.txt -p meta
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request:  Metagenomic, Phase:  Training
Initializing training files...done!
-------------------------------------
Request:  Metagenomic, Phase:  Gene Finding
Finding genes in sequence #1 (841195 bp)...Segmentation fault: 11

I further realized that removing the -p meta parameter solved the issue for the exact same file:

$ prodigal -m -i err.fa -o genes.txt -a aa.txt
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request:  Single Genome, Phase:  Training
Reading in the sequence(s) to train...841195 bp seq created, 57.83 pct GC
Locating all potential starts and stops...84766 nodes
Looking for GC bias in different frames...frame bias scores: 0.71 0.24 2.05
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
-------------------------------------
Request:  Single Genome, Phase:  Gene Finding
Finding genes in sequence #1 (841195 bp)...done!
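
The two runs above differ only in the `-p meta` argument, and the failing one is killed by a signal instead of exiting with an ordinary error code. As a minimal sketch (this is not anvi'o's actual code, and the helper names `prodigal_command`, `killed_by_segfault`, and `run_prodigal` are made up for illustration), a wrapper could build both variants and recognize the crash from the return code. It assumes a POSIX system, where Python's `subprocess` reports a child terminated by a signal as a negative return code:

```python
import signal
import subprocess


def prodigal_command(fasta, genes_out, aa_out, meta=True):
    """Build the prodigal argument list used in the runs above.

    With meta=False the '-p meta' pair is dropped, so prodigal uses its
    default single-genome procedure.
    """
    cmd = ["prodigal", "-m", "-i", fasta, "-o", genes_out, "-a", aa_out]
    if meta:
        cmd += ["-p", "meta"]
    return cmd


def killed_by_segfault(returncode):
    """True if a child process was terminated by SIGSEGV (POSIX only)."""
    return returncode == -signal.SIGSEGV


def run_prodigal(fasta, genes_out, aa_out, meta=True):
    """Run prodigal and return its return code (needs prodigal on PATH)."""
    cmd = prodigal_command(fasta, genes_out, aa_out, meta=meta)
    return subprocess.run(cmd, capture_output=True, text=True).returncode
```

Given the behavior shown above, `killed_by_segfault(run_prodigal("err.fa", "genes.txt", "aa.txt"))` would be expected to be true for the problematic file and false when `meta=False` is passed.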

This problem is triggered only by some contigs FASTA files, but when it happens it fails the entire process. There is no good solution for that, but this PR adds a new parameter to anvi-gen-contigs-db, --prodigal-single-mode, which omits -p meta and avoids the issue.
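
To illustrate how such a switch can work, here is a rough sketch of a boolean command-line option that controls whether `-p meta` is passed to prodigal. Only the flag name `--prodigal-single-mode` comes from this PR. The parser setup and the `prodigal_mode_args` helper are assumptions for illustration and not the code in anvi'o:

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the new switch."""
    parser = argparse.ArgumentParser(prog="anvi-gen-contigs-db")
    parser.add_argument(
        "--prodigal-single-mode",
        action="store_true",
        help="Run prodigal without '-p meta'.",
    )
    return parser


def prodigal_mode_args(single_mode):
    """Return the extra prodigal arguments for the chosen mode."""
    # Default behaviour keeps '-p meta'; single mode omits it entirely.
    return [] if single_mode else ["-p", "meta"]


# Simulate a user passing the new flag on the command line.
args = build_parser().parse_args(["--prodigal-single-mode"])
extra = prodigal_mode_args(args.prodigal_single_mode)
```

Because the option is a plain on/off switch, runs that do not pass it keep the existing `-p meta` behavior.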

We included -p meta by default since it seemed to perform better for metagenomic assemblies compared to -p single. But this case shows that we need an option to prevent this weird problem.

I thank @tdelmont, who first brought this up, and all the others who helped us diagnose it on the anvi'o Discord channel.