Fix problem with computing clusters that use separate file systems

On our system, and presumably on other large compute clusters, the compute nodes share a network file system that is accessible from every node, and each machine typically also has a local scratch drive mounted at /data or /tmp. Because /data and /tmp are on a different file system from the shared one, os.link fails: hard links cannot cross file systems. With the default temp folder settings, or when pointing the temp directory specifically at the /data drive, I get an OSError:

python ../ls_bsr.py -d genomes -g genes/ecoli_markers.fasta -b blastn
LOG: 2016/12/19 14:36:58 - Testing paths of dependencies
/nfs1/BPP/Chang_Lab/opt/libs/bin/blastn
citation: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402
Traceback (most recent call last):
  File "../ls_bsr.py", line 492, in <module>
    options.filter_scaffolds,options.prefix,options.temp_dir,options.min_pep_length,options.debug)
  File "../ls_bsr.py", line 113, in main
    os.link("%s" % infile, "%s/%s.new" % (fastadir,name))
OSError: [Errno 18] Invalid cross-device link

To resolve this, I imported the copy function from shutil and wrapped the os.link call in a try/except statement, so that it falls back to copy() when os.link fails.

New output:

python ../ls_bsr.py -d genomes -g genes/ecoli_markers.fasta -b blastn
LOG: 2016/12/19 14:39:08 - Testing paths of dependencies
/nfs1/BPP/Chang_Lab/opt/libs/bin/blastn
citation: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402
LOG: 2016/12/19 14:39:08 - Using pre-compiled set of predicted genes
LOG: 2016/12/19 14:39:08 - using blastn
LOG: 2016/12/19 14:39:08 - starting BLAST
LOG: 2016/12/19 14:39:15 - BLAST done
LOG: 2016/12/19 14:39:15 - starting matrix building
LOG: 2016/12/19 14:39:16 - matrix built
LOG: 2016/12/19 14:39:16 - all Done

You could also just call os.system("cp ... ...") to match the style of the rest of your code. I prefer copy() because it takes the same arguments as os.link.
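For reference, the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the exact code in the patch: the helper name link_or_copy is hypothetical, and narrowing the except clause to the cross-device errno (EXDEV, the Errno 18 in the traceback) is my assumption, so that other OSErrors such as a missing input file still surface.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst; copy instead if they are on different file systems.

    Hypothetical helper for illustration only.
    """
    try:
        os.link(src, dst)
    except OSError as err:
        # errno.EXDEV is "Invalid cross-device link" (Errno 18 on Linux).
        if err.errno != errno.EXDEV:
            raise
        # shutil.copy takes the same (src, dst) arguments as os.link.
        shutil.copy(src, dst)
```

In ls_bsr.py this would stand in for the os.link call at line 113, for example link_or_copy(infile, "%s/%s.new" % (fastadir, name)).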

Additionally, I noticed that the provided stx2a nucleotide sequence, although it corresponds to the STEC strain, fails when run with tblastn (note the LOG output):

LOG: 2016/12/19 14:31:11 - The following genes had no hits in datasets or are too short, values changed to 0, check names and output: stx2a

I updated the test dataset with the coding sequence of the stx2a gene so that it can properly be translated.

Output:

cat 20161219143111_bsr_matrix.txt
        O157_H7_sakai_all       H10407_all      SSON_046_all    E2348_69_all
IpaH3   0.10    0.05    1.00    0.00
LT      0.00    1.00    0.00    0.00
ST2     0.00    0.93    0.00    0.00
bfpB    0.03    0.00    0.00    1.00
stx2a   0.00    0.00    0.00    0.00
stx2a_revised   1.00    0.00    0.00    0.00
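As a rough way to check whether a marker sequence is an in-frame coding sequence before using it with tblastn, something like the sketch below may help. It is not part of LS-BSR; the function name cds_problems is hypothetical, and the start and stop codon sets assume the standard bacterial genetic code.

```python
# Assumed codon sets (standard bacterial code); adjust for other organisms.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(seq):
    """Return a list of reasons why seq does not look like an in-frame CDS."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("does not begin with a start codon")
    # A stop codon anywhere before the last codon truncates the translation.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains an internal stop codon")
    return problems
```

An empty list means none of these simple checks failed; it does not guarantee that tblastn will find a hit.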

jasonsahl / LS-BSR

Fix problem with computing clusters that use separate file systems #14