phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

Multiple instances of SISTR on same machine can interfere with each other #18

Closed apetkau closed 7 years ago

apetkau commented 7 years ago

I've found that if you run multiple instances of SISTR on the same machine, starting all of them at the exact same time, they can interfere with each other's results.

For example, running:

for i in {1..2}; do sistr -f csv -o predictions_$i AE014613.fasta 2> $i.err 1> $i.out & done

Will produce the following in the stderr files:

...
2017-03-24 14:25:36,982 ERROR: Missing cgmlst_results for NZ_AOXE01000004.1_101 [in /home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/cgmlst/__init__.py:357]
2017-03-24 14:25:36,982 ERROR: Missing cgmlst_results for NZ_AOXE01000008.1_59 [in /home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/cgmlst/__init__.py:357]
2017-03-24 14:25:36,983 ERROR: Missing cgmlst_results for NZ_AOXE01000053.1_113 [in /home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/cgmlst/__init__.py:357]
/home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/cgmlst/__init__.py:293: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
2017-03-24 14:25:38,061 ERROR: blastn on db AE014613_fasta and query wzy.fasta did not produce expected output file at /tmp/20170324142534-SISTR-AE014613/wzy.fasta-AE014613_fasta-2017Mar24_14_25_37.blast [in /home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/blast_wrapper/__init__.py:125]
Traceback (most recent call last):
  File "/home/aaron/miniconda2/bin/sistr", line 11, in <module>
    load_entry_point('sistr-cmd==0.3.4', 'console_scripts', 'sistr')()
  File "/home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/sistr_cmd.py", line 320, in main
  File "/home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/sistr_cmd.py", line 221, in sistr_predict
  File "/home/aaron/miniconda2/lib/python2.7/site-packages/sistr_cmd-0.3.4-py2.7.egg/sistr/src/blast_wrapper/__init__.py", line 130, in cleanup
  File "/home/aaron/miniconda2/lib/python2.7/shutil.py", line 239, in rmtree
    onerror(os.listdir, path, sys.exc_info())
  File "/home/aaron/miniconda2/lib/python2.7/shutil.py", line 237, in rmtree
    names = os.listdir(path)
OSError: [Errno 2] No such file or directory: '/tmp/20170324142534-SISTR-AE014613'

This error does not occur when only one instance runs at a time. I'm guessing the instances are interfering with each other's tmp files.

peterk87 commented 7 years ago

Yes, the error is definitely occurring because each instance creates and uses the same tmp directory in that case. One instance completes before the other and cleans up the tmp directory while the other instance is still using it.

Would there be a scenario where files with the same base filename are run at the same time?

A potential workaround would be to distinguish different input files by providing a genome_name along with the path to the input fasta using the -i arg:

for i in {1..2}; do 
  sistr -f csv -o predictions_$i \
    -i /path/to/AE014613.fasta <genome_name>_$i \
    2> $i.err 1> $i.out & done

This should produce tmp dirs:

/tmp/<timestamp>-SISTR-<genome_name>_1
/tmp/<timestamp>-SISTR-<genome_name>_2

Or you could specify different base tmp directories to produce the output files in.

I could add a condition to the tmp dir creation to check if the directory already exists, and if so, create a tmp dir with a slightly different name (e.g. append _<number>).
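The existence check described above could be sketched roughly as follows. This is only an illustration, not the actual SISTR code: `make_unique_tmp_dir` and its parameters are hypothetical names, it is written for Python 3 (it relies on `FileExistsError`), and the directory naming simply mirrors the `<timestamp>-SISTR-<genome_name>` pattern seen in the logs. It attempts `os.mkdir` directly rather than testing for existence first, so two processes starting at the same moment cannot both claim the same directory.

```python
import os
import tempfile
import time


def make_unique_tmp_dir(genome_name, base_dir=None, timestamp=None, max_attempts=1000):
    """Create <base_dir>/<timestamp>-SISTR-<genome_name>, appending _<number> on collision.

    Hypothetical helper for illustration; not part of the SISTR code base.
    """
    if base_dir is None:
        base_dir = tempfile.gettempdir()
    if timestamp is None:
        timestamp = time.strftime("%Y%m%d%H%M%S")
    stem = os.path.join(base_dir, "{}-SISTR-{}".format(timestamp, genome_name))
    candidate = stem
    for attempt in range(1, max_attempts + 1):
        try:
            # os.mkdir either creates the directory or raises, so the
            # "check" and the "create" happen as a single step.
            os.mkdir(candidate)
            return candidate
        except FileExistsError:
            # Another instance already owns this name; try the next suffix.
            candidate = "{}_{}".format(stem, attempt)
    raise RuntimeError("Could not create a unique tmp dir based on {}".format(stem))
```

With a fixed timestamp, two calls for the same genome name would yield `<timestamp>-SISTR-<genome_name>` and then `<timestamp>-SISTR-<genome_name>_1`, so each instance cleans up only its own directory.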

apetkau commented 7 years ago

Hmmm... with my current setup, the files are all named the same because I do an assembly first, so the file becomes something like contigs.fasta.

The scenario I'm thinking of is automatically running SISTR on upload of sequencing data from a sequencing run. However, in general, they probably won't all run at the same time, except for my small test data.

I do think it's something to fix up though, either through your suggestion or by using one of the tempfile functions (which will just assign random names).
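For reference, the tempfile route might look something like the sketch below. It uses the standard library's `tempfile.mkdtemp`, which creates a new directory with a random name component; the wrapper `run_in_private_tmp_dir`, the `work` callback, and the `SISTR-<genome_name>-` prefix are hypothetical and only stand in for whatever SISTR does with its tmp directory.

```python
import shutil
import tempfile


def run_in_private_tmp_dir(genome_name, work):
    """Run work(tmp_dir) inside a freshly created, uniquely named tmp dir.

    Hypothetical wrapper for illustration; not part of the SISTR code base.
    """
    # mkdtemp creates the directory itself and adds a random component to
    # the name, so concurrent instances do not end up sharing a directory.
    tmp_dir = tempfile.mkdtemp(prefix="SISTR-{}-".format(genome_name))
    try:
        return work(tmp_dir)
    finally:
        # Remove only the directory this call created.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

The trade-off compared with a timestamped name is that the directory name is less readable, although keeping the genome name in the prefix still makes it possible to tell the directories apart while debugging.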

peterk87 commented 7 years ago

Okay, I'll work up a fix and a new release with the check on tmp dir creation.

In the scenario you describe, would you be able to provide a genome name (or some kind of unique and useful identifier) for your input fasta? You could keep the path as /path/to/contigs.fasta but also supply a genome_name, e.g.

sistr -o output -i /path/to/contigs.fasta genome_1337

The SISTR output would then show the name as genome_1337, which might be useful in the other output files such as the cgMLST profile output or the detailed cgMLST allele search results.

apetkau commented 7 years ago

Awesome, thanks :)

Yes, I'll also look at giving the genomes passed to SISTR a better name.