jasonsahl / LS-BSR

Large scale Blast Score Ratio (BSR) analysis
GNU General Public License v3.0

Fix problem with computing clusters that use separate file systems #14

Closed davised closed 7 years ago

davised commented 7 years ago

On our system, and presumably on other large compute clusters, the compute nodes share a network file system that is accessible to all of them, and each machine typically also has a local scratch drive mounted at /data or /tmp. Because /data and /tmp are on a different file system from the input data, os.link fails: hard links cannot span file systems. With the default temp folder settings, or when I point the temp folder at the /data drive, I get an OSError:

python ../ls_bsr.py -d genomes -g genes/ecoli_markers.fasta -b blastn
LOG: 2016/12/19 14:36:58 - Testing paths of dependencies
/nfs1/BPP/Chang_Lab/opt/libs/bin/blastn
citation: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402
Traceback (most recent call last):
  File "../ls_bsr.py", line 492, in <module>
    options.filter_scaffolds,options.prefix,options.temp_dir,options.min_pep_length,options.debug)
  File "../ls_bsr.py", line 113, in main
    os.link("%s" % infile, "%s/%s.new" % (fastadir,name))
OSError: [Errno 18] Invalid cross-device link

To resolve this, I imported the copy function from shutil and wrapped the os.link call in a try/except statement that falls back to copy when os.link fails.
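For readers who want to see the shape of that fallback, here is a minimal sketch rather than the exact patch. The helper name link_or_copy is hypothetical, and checking specifically for errno.EXDEV (the Errno 18 in the traceback above) is my assumption; the actual change may simply catch any OSError.

```python
import errno
import os
import shutil


def link_or_copy(src, dst):
    """Hard-link src to dst; copy instead if they are on different file systems.

    Hypothetical helper for illustration, not part of ls_bsr.py.
    """
    try:
        os.link(src, dst)
    except OSError as exc:
        # EXDEV is "Invalid cross-device link" (Errno 18 on Linux).
        # Any other failure is re-raised instead of being hidden by the copy.
        if exc.errno != errno.EXDEV:
            raise
        shutil.copy(src, dst)
```

The call site would then read link_or_copy(infile, "%s/%s.new" % (fastadir, name)), with the same argument order as the original os.link call.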

New output:

python ../ls_bsr.py -d genomes -g genes/ecoli_markers.fasta -b blastn
LOG: 2016/12/19 14:39:08 - Testing paths of dependencies
/nfs1/BPP/Chang_Lab/opt/libs/bin/blastn
citation: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402
LOG: 2016/12/19 14:39:08 - Using pre-compiled set of predicted genes
LOG: 2016/12/19 14:39:08 - using blastn
LOG: 2016/12/19 14:39:08 - starting BLAST
LOG: 2016/12/19 14:39:15 - BLAST done
LOG: 2016/12/19 14:39:15 - starting matrix building
LOG: 2016/12/19 14:39:16 - matrix built
LOG: 2016/12/19 14:39:16 - all Done

You could also just use an os.system("cp ... ...") command to match the style of the rest of your code. I prefer the copy() function because it takes the same arguments as os.link.

Additionally, I noticed that the provided stx2a nucleotide sequence, although it corresponds to the STEC strain, fails when using tblastn (see the relevant LOG output):

LOG: 2016/12/19 14:31:11 - The following genes had no hits in datasets or are too short, values changed to 0, check names and output: stx2a

I updated the test dataset with the coding sequence of the stx2a gene so that it can properly be translated.

Output:

cat 20161219143111_bsr_matrix.txt
        O157_H7_sakai_all       H10407_all      SSON_046_all    E2348_69_all
IpaH3   0.10    0.05    1.00    0.00
LT      0.00    1.00    0.00    0.00
ST2     0.00    0.93    0.00    0.00
bfpB    0.03    0.00    0.00    1.00
stx2a   0.00    0.00    0.00    0.00
stx2a_revised   1.00    0.00    0.00    0.00

davised commented 7 years ago

I started a run with my own data that failed without any error message from the script. I realized that my input files were named .fna, not .fasta. I added a check that looks for input files in the given folder and exits if none are found, instead of attempting to continue.
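A check of this kind might look roughly like the sketch below. The function name require_fasta_files and the wording of the message are hypothetical, and the glob pattern assumes the script only picks up files ending in .fasta, as described above; the code in the pull request may differ.

```python
import glob
import os
import sys


def require_fasta_files(directory, extension=".fasta"):
    """Return the sorted input files in directory, or exit if there are none.

    Hypothetical helper for illustration, not part of ls_bsr.py.
    """
    files = sorted(glob.glob(os.path.join(directory, "*" + extension)))
    if not files:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("No *%s files found in %s; check the file extensions"
                 % (extension, directory))
    return files
```

With this in place, a folder that holds only .fna files stops the run immediately with a message about filenames, rather than failing later in a way that looks like a problem with the file contents.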

Currently, the script is still running and passed the point where it was dying previously.

To reproduce the error I was getting, pass a directory that contains no .fasta files. I had assumed the problem was the format of my input files' contents, not their filenames.