Closed spleonard1 closed 6 years ago
Can you check the brew installs of other programs, particularly cd-hit? This seems similar to an issue reported by someone else (see https://github.com/ryanmelnyk/PyParanoid/issues/2)
Oh! Yes my cd-hit installation was bad, I reinstalled manually. It still wasn't working, so I tried turning off MP since I figured it was silently failing. In sequential mode I repeatedly got this error:
Sequential mode...
Traceback (most recent call last):
File "/usr/local/bin/BuildGroups.py", line 551, in <module>
main()
File "/usr/local/bin/BuildGroups.py", line 531, in main
cdhit_seqs()
File "/usr/local/bin/BuildGroups.py", line 330, in cdhit_seqs
CDHIT_RUN(f)
File "/usr/local/bin/BuildGroups.py", line 340, in CDHIT_RUN
proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Eventually I got it working by passing the absolute path for each argument in BuildGroups.py. Relative paths kept throwing that error... not sure why. But its working.. so thanks!
Nevermind... still not working. cd-hit seems to run fine, but still no output in all_groups.faa and all_groups.hmm Any other ideas?
Can you double check that all of the programs (hmmbuild, hmmemit, muscle, mcxload, mcxdump, mcl, cd-hit) are accessible in your PATH and run them with no args to verify that they are all installed?
If they are, can you do a test run to generate a new database using ~5 genomes without the --clean and --use_MP flags and with the --verbose output? Then attach the whole output here - your previous error message looks like an issue writing to /dev/null but I want to be sure.
PS thanks for being a beta tester on this.
No problem! Thanks for being so responsive with my issues... I'm happy to beta test and excited to use the tool! I checked, all of those programs are available in my path and seem to be working fine. Here's the output from a test run with 5 genomes and without --use_MP and --clean tags.
| => BuildGroups.py --verbose --cpus 8 snod_only/ snod_only/strainlist.txt sal_debug/
Formatting 5 fasta files...
Making diamond databases for 5 strains...
Running DIAMOND on all 5 strains...
Done!
Getting gene lengths...
Parsing diamond results for 5 strains...
Done!
Running InParanoid on 10 pairs of strains...
Sequential mode...
0 remaining...
Parsing 10 output files.
Done!
[mclIO] writing </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
.......................................
[mclIO] wrote native interchange 10133x10133 matrix with 36164 entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
[mclIO] wrote 10133 tab entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.tab>
[mcxload] tab has 10133 entries
[mclIO] reading </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
.......................................
[mclIO] read native interchange 10133x10133 matrix with 36164 entries
[mcl] pid 12320
ite chaos time hom(avg,lo,hi) m-ie m-ex i-ex fmv
1 3.79 0.01 1.00/0.60/1.75 1.02 1.02 1.02 0
2 5.46 0.03 1.00/0.61/1.65 1.00 1.00 1.02 0
3 7.15 0.03 0.99/0.23/1.48 1.00 1.00 1.02 0
4 7.13 0.02 0.99/0.26/2.28 1.00 0.99 1.01 0
5 2.92 0.01 0.99/0.50/1.29 1.00 0.99 1.01 0
6 2.25 0.02 1.00/0.47/1.00 1.00 0.98 0.99 0
7 1.84 0.02 1.00/0.39/1.00 1.01 0.99 0.98 0
8 1.19 0.02 1.00/0.55/1.00 1.00 1.00 0.97 0
9 0.86 0.03 1.00/0.61/1.00 1.00 1.00 0.97 0
10 0.29 0.03 1.00/0.77/1.00 1.00 1.00 0.97 0
11 0.25 0.03 1.00/0.76/1.00 1.00 1.00 0.97 0
12 0.27 0.03 1.00/0.82/1.00 1.00 1.00 0.97 0
13 0.35 0.01 1.00/0.77/1.00 1.00 1.00 0.97 0
14 0.25 0.01 1.00/0.76/1.00 1.00 1.00 0.97 0
15 0.13 0.03 1.00/0.87/1.00 1.00 1.00 0.97 0
16 0.01 0.03 1.00/0.99/1.00 1.00 1.00 0.97 0
17 0.02 0.01 1.00/1.00/1.00 1.00 1.00 0.97 0
18 0.04 0.01 1.00/1.00/1.00 1.00 1.00 0.97 0
19 0.08 0.01 1.00/0.98/1.00 1.00 1.00 0.97 0
20 0.14 0.01 1.00/0.93/1.00 1.00 1.00 0.97 0
21 0.22 0.01 1.00/0.83/1.00 1.00 1.00 0.97 0
22 0.24 0.01 1.00/0.76/1.00 1.00 1.00 0.97 0
23 0.12 0.01 1.00/0.88/1.00 1.00 1.00 0.97 0
24 0.01 0.01 1.00/0.99/1.00 1.00 1.00 0.97 0
25 0.00 0.02 1.00/1.00/1.00 1.00 1.00 0.97 0
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
[mclIO] writing </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
.......................................
[mclIO] wrote native interchange 10133x2431 matrix with 10133 entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
[mcl] 2431 clusters found
[mcl] output is in /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out
Please cite:
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis,
University of Utrecht, May 2000.
( http://www.library.uu.nl/digiarchief/dip/diss/1895620/full.pdf
or http://micans.org/mcl/lit/svdthesis.pdf.gz)
OR
Stijn van Dongen, A cluster algorithm for graphs. Technical
Report INS-R0010, National Research Institute for Mathematics
and Computer Science in the Netherlands, Amsterdam, May 2000.
( http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
or http://micans.org/mcl/lit/INS-R0010.ps.Z)
[mclIO] reading </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
.......................................
[mclIO] read native interchange 10133x2431 matrix with 10133 entries
Writing fasta files and parsing descriptions...
2432 groups equal to or larger than 0 sequences.
Clustering sequences...
Sequential mode...
Traceback (most recent call last):
File "/usr/local/bin/BuildGroups.py", line 551, in <module>
main()
File "/usr/local/bin/BuildGroups.py", line 531, in main
cdhit_seqs()
File "/usr/local/bin/BuildGroups.py", line 330, in cdhit_seqs
CDHIT_RUN(f)
File "/usr/local/bin/BuildGroups.py", line 340, in CDHIT_RUN
proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
OK, it's the same message as before...So all of the external programs I pipe the stdout and stderr to the bitbucket (/dev/null) to avoid cluttering up the terminal where it's running. I use the os.devnull function which should is platform agnostic but something is going wrong...
Can you open the python interpreter and run: import os os.devnull
Then from your terminal window run the command: echo fasdaadsvbkjl >> /dev/null
| ~/Dropbox/projects/2017_ortho_play/pyparanoid @ Seans-MacBook-Pro (seanleonard)
| => python2
Python 2.7.14 (default, Mar 22 2018, 14:43:05)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.devnull
'/dev/null'
>>> exit()
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid @ Seans-MacBook-Pro (seanleonard)
| => echo fasdaadsvbkjl >> /dev/null
try running the following as a script - i'm pretty perplexed about whats going on
import subprocess, os
cmds = "echo foo" FNULL = open(os.devnull,'w') proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL) proc.wait() FNULL.close() print "echo OK"
cmds = "cd-hit" FNULL = open(os.devnull,'w') proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL) proc.wait() FNULL.close() print "cd-hit OK"
| => python2 test.py
echo OK
Traceback (most recent call last):
File "test.py", line 12, in <module>
proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
errread, errwrite)
File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
ok! It was on my end. I could run cd-hit fine, but "which cd-hit" returned nothing, see this:
| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard)
| => cd-hit
====== CD-HIT version 4.7 (built on Jul 18 2018) ======
Usage: cd-hit [Options]
Options
-i input filename in fasta format, required
-o output filename, required
-c sequence identity threshold, default 0.9
this is the default cd-hit's "global sequence identity" calculated as:
number of identical amino acids in alignment
divided by the full length of the shorter sequence
-G use global sequence identity, default 1
if set to 0, then use local sequence identity, calculated as :
number of identical amino acids in alignment
divided by the length of the alignment
NOTE!!! don't use -G 0 unless you use alignment coverage controls
see options -aL, -AL, -aS, -AS
-b band_width of alignment, default 20
-M memory limit (in MB) for the program, default 800; 0 for unlimitted;
-T number of threads, default 1; with 0, all CPUs will be used
-n word_length, default 5, see user's guide for choosing it
-l length of throw_away_sequences, default 10
-t tolerance for redundance, default 2
-d length of description in .clstr file, default 20
if set to 0, it takes the fasta defline and stops at first space
-s length difference cutoff, default 0.0
if set to 0.9, the shorter sequences need to be
at least 90% length of the representative of the cluster
-S length difference cutoff in amino acid, default 999999
if set to 60, the length difference between the shorter sequences
and the representative of the cluster can not be bigger than 60
-aL alignment coverage for the longer sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AL alignment coverage control for the longer sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-aS alignment coverage for the shorter sequence, default 0.0
if set to 0.9, the alignment must covers 90% of the sequence
-AS alignment coverage control for the shorter sequence, default 99999999
if set to 60, and the length of the sequence is 400,
then the alignment must be >= 340 (400-60) residues
-A minimal alignment coverage control for the both sequences, default 0
alignment must cover >= this value for both sequences
-uL maximum unmatched percentage for the longer sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-uS maximum unmatched percentage for the shorter sequence, default 1.0
if set to 0.1, the unmatched region (excluding leading and tailing gaps)
must not be more than 10% of the sequence
-U maximum unmatched length, default 99999999
if set to 10, the unmatched region (excluding leading and tailing gaps)
must not be more than 10 bases
-B 1 or 0, default 0, by default, sequences are stored in RAM
if set to 1, sequence are stored on hard drive
!! No longer supported !!
-p 1 or 0, default 0
if set to 1, print alignment overlap in .clstr file
-g 1 or 0, default 0
by cd-hit's default algorithm, a sequence is clustered to the first
cluster that meet the threshold (fast cluster). If set to 1, the program
will cluster it into the most similar cluster that meet the threshold
(accurate but slow mode)
but either 1 or 0 won't change the representatives of final clusters
-sc sort clusters by size (number of sequences), default 0, output clusters by decreasing length
if set to 1, output clusters by decreasing size
-sf sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
if set to 1, output sequences by decreasing cluster size
-bak write backup cluster file (1 or 0, default 0)
-h print this help
Questions, bugs, contact Limin Fu at l2fu@ucsd.edu, or Weizhong Li at liwz@sdsc.edu
For updated versions and information, please visit: http://cd-hit.org
cd-hit web server is also available from http://cd-hit.org
If you find cd-hit useful, please kindly cite:
"CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
"CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard)
| => which cd-hit
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard)
I checked my bash profile and like I fool I had used a relative path to the cd-hit directory... so I fixed my bash profile and your test script works fine, now.
Ok cool that makes more sense now - let me know how it goes when you rebuild the database. If you have the initial mcl clustering from your original database saved you can use --mode extract
for BuildGroups.py to skip the DIAMOND searches and clustering.
Thanks! Appreciate you helping me through this. I'm going to close this issue to hide my shame in using a relative path ;)
Trying to pull out orthologs after BuildGroups.py using the following command:
BuildGroups.py seemed to run fine
But the all_groups.faa and all_groups.hmm files are both empty. Any reason the .hmm wouldn't actually be written?