merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Error in pangenome creating protein fasta #486

Closed jmeppley closed 7 years ago

jmeppley commented 7 years ago

Running the pangenome command:

anvi-pan-genome -g anvio/Parvarchaea-GENOMES.h5 -J Parvarchaea

Gives me this error:

Genomes storage ..............................: Initialized (storage hash: 98ba7297)
Num genomes in storage .......................: 2
Num genomes will be used .....................: 2
Pan database .................................: A new database, /lus/scratch/usr/jmeppley/opt/workflows/test/scratch/pang/Parvarchaea/Parvarchaea-PAN.db, has been created.
Exclude partial gene calls ...................: False

[23 Mar 17 20:34:23 Uniquing the output FASTA file] ...
Traceback (most recent call last):
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/bin/anvi-pan-genome", line 4, in <module>
    __import__('pkg_resources').run_script('anvio==2.1.0', 'anvi-pan-genome')
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/lus/scratch/usr/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/anvio-2.1.0-py2.7-linux-x86_64.egg-info/scripts/anvi-pan-genome", line 99, in <module>
    pan.process()
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/anvio/panops.py", line 1075, in process
    unique_proteins_FASTA_path, unique_proteins_names_dict = self.genomes_storage.gen_combined_protein_sequences_FASTA(combined_proteins_FASTA_path, exclude_partial_gene_calls=self.exclude_partial_gene_calls)
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/anvio/auxiliarydataops.py", line 357, in gen_combined_protein_sequences_FASTA
    unique_proteins_FASTA_path, unique_proteins_names_file_path, unique_proteins_names_dict = utils.unique_FASTA_file(output_file_path, store_frequencies_in_deflines=False)
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/anvio/utils.py", line 950, in unique_FASTA_file
    input_fasta = u.SequenceSource(input_file_path, unique=True)
  File "/home/jmeppley/opt/workflows/test/conda/envs/anvi2/lib/python2.7/site-packages/anvio/fastalib.py", line 101, in __init__
    raise FastaLibError, "File '%s' does not seem to be a FASTA file." % self.fasta_file_path
anvio.fastalib.FastaLibError: Fasta Lib Error: File '/lus/scratch/usr/jmeppley/opt/workflows/test/scratch/pang/Parvarchaea/combined-proteins.fa' does not seem to be a FASTA file.

However, this only happens on our Cray system, which uses a Lustre file share. Our vanilla CentOS boxes work fine, even over NFS. I've diagnosed the problem: in the gen_combined_protein_sequences_FASTA function in auxiliarydataops.py, the output file is never closed before unique_FASTA_file is called, so the data written to it may not have been flushed to disk by the time the file is read back.
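
To illustrate the pattern, here is a minimal sketch. It is not the anvi'o code: the function names `write_without_closing`, `write_and_close`, and `looks_like_fasta` are hypothetical, and the check for a leading `>` is only an assumption about what a FASTA sanity check might do. The point is that a writer that leaves its handle open can hand a half-written (or empty) file to the next reader, while a `with` block guarantees the flush and close first.

```python
import os
import tempfile


def looks_like_fasta(path):
    """Hypothetical sanity check: a FASTA file should start with '>'."""
    with open(path) as handle:
        return handle.read(1) == ">"


def write_without_closing(path, records):
    """Problematic pattern: the handle is left open when we return.

    Written data may still sit in a buffer, so a reader that opens
    `path` right away can see an empty or truncated file.
    """
    output = open(path, "w")
    for name, sequence in records:
        output.write(">%s\n%s\n" % (name, sequence))
    return output  # caller is responsible for closing it


def write_and_close(path, records):
    """Safer pattern: leaving the `with` block flushes and closes the file."""
    with open(path, "w") as output:
        for name, sequence in records:
            output.write(">%s\n%s\n" % (name, sequence))


if __name__ == "__main__":
    records = [("gene_1", "MKV"), ("gene_2", "MAL")]
    path = os.path.join(tempfile.mkdtemp(), "combined-proteins.fa")

    write_and_close(path, records)
    print(looks_like_fasta(path))
```

Whether the unclosed version fails depends on buffer sizes and on how the file system makes pending writes visible, which would be consistent with the error showing up on Lustre but not on the other machines.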

I haven't yet figured out if this is still a problem with later versions.

meren commented 7 years ago

Thank you very much, John! Great detective work there. I just merged your PR. I hope it wasn't too frustrating :(

Best,