phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

sistr_cmd consistently drops one genome from output tables #26

Closed: sabrinadiemert closed this issue 6 years ago

sabrinadiemert commented 6 years ago

Hi @peterk87,

I'm having a strange problem running sistr_cmd. It seems like all of the output .tab files that are produced through sistr_cmd are missing the first genome that is assessed. For example, if I run the following command within a folder containing three genomes (genome1.fa, genome2.fa, and genome3.fa):

sistr -i *.fa -f tab -o SISTR_output.tab

... the SISTR_output.tab file only has two row entries. Even weirder, it seems to be combining the first two genomes it encounters (at least, as far as I can tell from two of the columns in SISTR_output.tab):

fasta_filepath                     genome
/home/.../test_sistr/genome1.fa    genome2.fa
/home/.../test_sistr/genome3.fa    genome3

Any idea what might be happening here? I'm running this in Linux (Ubuntu 17.04) within a conda environment with Python 2.7; I previously ran sistr_cmd in a separate conda environment with Python 3.6 but it seemed like I was having some package interference with ETE3. Looking back over my results from those runs, this problem was happening at that time, too.

peterk87 commented 6 years ago

Hi @sabrinadiemert

I think I may know what's going on. The -i option takes exactly two values, a FASTA file path and a genome name. Your shell expands *.fa before sistr runs, so the first 2 FASTA files are consumed as that pair: genome1.fa becomes the file path and genome2.fa becomes the genome name. The remaining files are leftover arguments and are handled as normal input files.

The -i option is useful for when you have FASTA files that don't have the desired genome name encoded in the filename:

  -i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
                        fasta file path to genome name pair

You could try running the command without the -i option:

sistr -f tab -o SISTR_output.tab *.fa
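To illustrate the difference between those two commands, here is a minimal argparse sketch. It is not the actual sistr_cmd source: the parser, its `dest` names, and the `show` helper are hypothetical, and I'm assuming -i is declared with two values per use as the help text above suggests. The argument lists stand in for what the shell passes after expanding *.fa in a folder with three genomes.

```python
import argparse


def build_parser():
    """Hypothetical parser approximating the options discussed in this thread."""
    parser = argparse.ArgumentParser(prog="sistr-sketch")
    # Plain input files: any number of positional FASTA paths.
    parser.add_argument("fastas", nargs="*", metavar="F")
    # -i takes exactly two values per use: a FASTA path and a genome name.
    parser.add_argument(
        "-i",
        dest="pairs",
        nargs=2,
        action="append",
        metavar=("fasta_path", "genome_name"),
    )
    parser.add_argument("-f", dest="fmt")
    parser.add_argument("-o", dest="out")
    return parser


def show(argv):
    """Parse argv and print how the FASTA arguments were split up."""
    args = build_parser().parse_args(argv)
    print("pairs :", args.pairs)
    print("fastas:", args.fastas)


if __name__ == "__main__":
    # What `-i *.fa -f tab -o SISTR_output.tab` looks like after glob expansion.
    show(["-i", "genome1.fa", "genome2.fa", "genome3.fa",
          "-f", "tab", "-o", "SISTR_output.tab"])
    # What `-f tab -o SISTR_output.tab *.fa` looks like after glob expansion.
    show(["-f", "tab", "-o", "SISTR_output.tab",
          "genome1.fa", "genome2.fa", "genome3.fa"])
```

In the first call, genome1.fa and genome2.fa end up together as a single path/name pair and only genome3.fa is left as a plain input, which matches the two-row table you saw. In the second call all three files are plain inputs.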

sistr_cmd should work with Python 3.6 if it's installed into a clean conda env. If you happen to have the error message or stacktrace you received when trying to run it with Python 3.6, then that would help me figure out what the issue might be.

Hope that helps!

sabrinadiemert commented 6 years ago

Aha! Thanks @peterk87, that definitely solved the problem. Thanks to that explanation, I can see that I misinterpreted the description of the -i flag.

Here's the error that I received when running in my conda env with Python 3.6:

Traceback (most recent call last):
  File "/home/sabrina/miniconda3/envs/bioinfo/bin/sistr", line 11, in <module>
    load_entry_point('sistr-cmd==1.0.2', 'console_scripts', 'sistr')()
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/sistr_cmd.py", line 324, in main
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/sistr_cmd.py", line 324, in <listcomp>
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/sistr_cmd.py", line 194, in sistr_predict
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/src/cgmlst/__init__.py", line 342, in run_cgmlst
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/src/cgmlst/__init__.py", line 134, in get_allele_sequences
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/src/cgmlst/msa.py", line 58, in msa_ref_vs_novel
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/site-packages/sistr_cmd-1.0.2-py3.6.egg/sistr/src/cgmlst/msa.py", line 37, in msa_mafft
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/home/sabrina/miniconda3/envs/bioinfo/lib/python3.6/subprocess.py", line 1344, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mafft': 'mafft'
(bioinfo)

I think this is because of ete3's mafft installation, since I noticed the issue right after that installation, although I'm not positive. At any rate, I reinstalled sistr_cmd into a clean Python 3.6 conda env and it's working fine.

peterk87 commented 6 years ago

That's great that it's working now!

I'll make the usage of the -i option clearer in the next version and in the docs.

Thanks for the info about ete3's mafft not playing nice with the version sistr_cmd needs. That's definitely something to keep in mind as development continues.
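For anyone else who hits the same FileNotFoundError, the traceback above means Python could not find a `mafft` executable on the PATH of the active environment. A quick check along these lines can confirm that before a long run. This is only a sketch: `missing_executables` is a hypothetical helper, not part of sistr_cmd, and the tool list contains only `mafft` because that is the one program the traceback names.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Only mafft is confirmed by the traceback; add other tools as needed.
    missing = missing_executables(["mafft"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

Run it inside the same conda env you use for sistr, since each env has its own PATH entries.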