ryanmelnyk / PyParanoid

Rapid and scalable homolog identification for bacterial genomes
MIT License
32 stars 7 forks source link

no hmm output after BuildGroups.py #5

Closed spleonard1 closed 6 years ago

spleonard1 commented 6 years ago

Trying to pull out orthologs after BuildGroups.py using the following command:

Subfolder exists: /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/ortho_align
Subfolder exists: /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/concat
Subfolder exists: /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/hmms
Parsing matrix to identify orthologs...
1471 orthologs found.
Indexing all_groups.hmm...

Error: File format problem in trying to open HMM file /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/all_groups.hmm.
File exists, but appears to be empty?

Extracting 1471 HMM files...0 already found.
    1400 remaining...
    1300 remaining...
    1200 remaining...
    1100 remaining...
    1000 remaining...
    900 remaining...
    800 remaining...
    700 remaining...
    600 remaining...
    500 remaining...
    400 remaining...
    300 remaining...
    200 remaining...
    100 remaining...
    0 remaining...
    Done!
Concatenating 1471 ortholog files...
Aligning 1471 ortholog files...

Error: File existence/permissions problem in trying to open HMM file /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/hmms/group_00026.hmm.
HMM file /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/hmms/group_00026.hmm not found (nor an .h3m bin

BuildGroups.py seemed to run fine

...
Clustering sequences...
Aligning groups...
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/clustered
Building hmms...
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/aligned
Emitting consensus sequences...
Writing multi-hmm file...
Writing multi-fasta consensus file...
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/hmms
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/consensus_seqs
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/m8
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/paranoid_output
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/dmnd_tmp
Cleaning up /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_pyp/faa
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid @ Seans-MacBook-Pro (seanleonard) 

But the all_groups.faa and all_groups.hmm files are both empty. Any reason the .hmm wouldn't actually be written?

ryanmelnyk commented 6 years ago

Can you check the brew installs of other programs, particularly cd-hit? This seems similar to an issue reported by someone else (see https://github.com/ryanmelnyk/PyParanoid/issues/2)

spleonard1 commented 6 years ago

Oh! Yes my cd-hit installation was bad, I reinstalled manually. It still wasn't working, so I tried turning off MP since I figured it was silently failing. In sequential mode I repeatedly got this error:

Sequential mode...
Traceback (most recent call last):
  File "/usr/local/bin/BuildGroups.py", line 551, in <module>
    main()
  File "/usr/local/bin/BuildGroups.py", line 531, in main
    cdhit_seqs()
  File "/usr/local/bin/BuildGroups.py", line 330, in cdhit_seqs
    CDHIT_RUN(f)
  File "/usr/local/bin/BuildGroups.py", line 340, in CDHIT_RUN
    proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Eventually I got it working by passing the absolute path for each argument in BuildGroups.py. Relative paths kept throwing that error... not sure why. But its working.. so thanks!

spleonard1 commented 6 years ago

Nevermind... still not working. cd-hit seems to run fine, but still no output in all_groups.faa and all_groups.hmm Any other ideas?

ryanmelnyk commented 6 years ago

Can you double check that all of the programs (hmmbuild, hmmemit, muscle, mcxload, mcxdump, mcl, cd-hit) are accessible in your PATH and run them with no args to verify that they are all installed?

If they are, can you do a test run to generate a new database using ~5 genomes without the --clean and --use_MP flags and with the --verbose output? Then attach the whole output here - your previous error message looks like an issue writing to /dev/null but I want to be sure.

PS thanks for being a beta tester on this.

spleonard1 commented 6 years ago

No problem! Thanks for being so responsive with my issues... I'm happy to beta test and excited to use the tool! I checked, all of those programs are available in my path and seem to be working fine. Here's the output from a test run with 5 genomes and without --use_MP and --clean tags.

| => BuildGroups.py --verbose --cpus 8 snod_only/ snod_only/strainlist.txt sal_debug/
Formatting 5 fasta files...
Making diamond databases for 5 strains...
Running DIAMOND on all 5 strains...
    Done!
Getting gene lengths...
Parsing diamond results for 5 strains...
    Done!
Running InParanoid on 10 pairs of strains...
Sequential mode...
    0 remaining...
Parsing 10 output files.
    Done!
[mclIO] writing </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
.......................................
[mclIO] wrote native interchange 10133x10133 matrix with 36164 entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
[mclIO] wrote 10133 tab entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.tab>
[mcxload] tab has 10133 entries
[mclIO] reading </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/data.mci>
.......................................
[mclIO] read native interchange 10133x10133 matrix with 36164 entries
[mcl] pid 12320
 ite   chaos  time hom(avg,lo,hi) m-ie m-ex i-ex fmv
  1     3.79  0.01 1.00/0.60/1.75 1.02 1.02 1.02   0
  2     5.46  0.03 1.00/0.61/1.65 1.00 1.00 1.02   0
  3     7.15  0.03 0.99/0.23/1.48 1.00 1.00 1.02   0
  4     7.13  0.02 0.99/0.26/2.28 1.00 0.99 1.01   0
  5     2.92  0.01 0.99/0.50/1.29 1.00 0.99 1.01   0
  6     2.25  0.02 1.00/0.47/1.00 1.00 0.98 0.99   0
  7     1.84  0.02 1.00/0.39/1.00 1.01 0.99 0.98   0
  8     1.19  0.02 1.00/0.55/1.00 1.00 1.00 0.97   0
  9     0.86  0.03 1.00/0.61/1.00 1.00 1.00 0.97   0
 10     0.29  0.03 1.00/0.77/1.00 1.00 1.00 0.97   0
 11     0.25  0.03 1.00/0.76/1.00 1.00 1.00 0.97   0
 12     0.27  0.03 1.00/0.82/1.00 1.00 1.00 0.97   0
 13     0.35  0.01 1.00/0.77/1.00 1.00 1.00 0.97   0
 14     0.25  0.01 1.00/0.76/1.00 1.00 1.00 0.97   0
 15     0.13  0.03 1.00/0.87/1.00 1.00 1.00 0.97   0
 16     0.01  0.03 1.00/0.99/1.00 1.00 1.00 0.97   0
 17     0.02  0.01 1.00/1.00/1.00 1.00 1.00 0.97   0
 18     0.04  0.01 1.00/1.00/1.00 1.00 1.00 0.97   0
 19     0.08  0.01 1.00/0.98/1.00 1.00 1.00 0.97   0
 20     0.14  0.01 1.00/0.93/1.00 1.00 1.00 0.97   0
 21     0.22  0.01 1.00/0.83/1.00 1.00 1.00 0.97   0
 22     0.24  0.01 1.00/0.76/1.00 1.00 1.00 0.97   0
 23     0.12  0.01 1.00/0.88/1.00 1.00 1.00 0.97   0
 24     0.01  0.01 1.00/0.99/1.00 1.00 1.00 0.97   0
 25     0.00  0.02 1.00/1.00/1.00 1.00 1.00 0.97   0
[mcl] jury pruning marks: <100,100,100>, out of 100
[mcl] jury pruning synopsis: <100.0 or really really really good> (cf -scheme, -do log)
[mclIO] writing </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
.......................................
[mclIO] wrote native interchange 10133x2431 matrix with 10133 entries to stream </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
[mcl] 2431 clusters found
[mcl] output is in /Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out

Please cite:
    Stijn van Dongen, Graph Clustering by Flow Simulation.  PhD thesis,
    University of Utrecht, May 2000.
       (  http://www.library.uu.nl/digiarchief/dip/diss/1895620/full.pdf
       or  http://micans.org/mcl/lit/svdthesis.pdf.gz)
OR
    Stijn van Dongen, A cluster algorithm for graphs. Technical
    Report INS-R0010, National Research Institute for Mathematics
    and Computer Science in the Netherlands, Amsterdam, May 2000.
       (  http://www.cwi.nl/ftp/CWIreports/INS/INS-R0010.ps.Z
       or  http://micans.org/mcl/lit/INS-R0010.ps.Z)

[mclIO] reading </Users/seanleonard/Dropbox/projects/2017_ortho_play/pyparanoid/sal_debug/mcl/mcl.out>
.......................................
[mclIO] read native interchange 10133x2431 matrix with 10133 entries
Writing fasta files and parsing descriptions...
2432 groups equal to or larger than 0 sequences.
Clustering sequences...
Sequential mode...
Traceback (most recent call last):
  File "/usr/local/bin/BuildGroups.py", line 551, in <module>
    main()
  File "/usr/local/bin/BuildGroups.py", line 531, in main
    cdhit_seqs()
  File "/usr/local/bin/BuildGroups.py", line 330, in cdhit_seqs
    CDHIT_RUN(f)
  File "/usr/local/bin/BuildGroups.py", line 340, in CDHIT_RUN
    proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
ryanmelnyk commented 6 years ago

OK, it's the same message as before...So all of the external programs I pipe the stdout and stderr to the bitbucket (/dev/null) to avoid cluttering up the terminal where it's running. I use the os.devnull function which should is platform agnostic but something is going wrong...

Can you open the python interpreter and run: import os os.devnull

Then from your terminal window run the command: echo fasdaadsvbkjl >> /dev/null

spleonard1 commented 6 years ago
| ~/Dropbox/projects/2017_ortho_play/pyparanoid @ Seans-MacBook-Pro (seanleonard) 
| => python2
Python 2.7.14 (default, Mar 22 2018, 14:43:05) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.devnull
'/dev/null'
>>> exit()
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid @ Seans-MacBook-Pro (seanleonard) 
| => echo fasdaadsvbkjl >> /dev/null
ryanmelnyk commented 6 years ago

try running the following as a script - i'm pretty perplexed about whats going on

import subprocess, os

cmds = "echo foo" FNULL = open(os.devnull,'w') proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL) proc.wait() FNULL.close() print "echo OK"

cmds = "cd-hit" FNULL = open(os.devnull,'w') proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL) proc.wait() FNULL.close() print "cd-hit OK"

spleonard1 commented 6 years ago
| => python2 test.py 
echo OK
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    proc = subprocess.Popen(cmds.split(),stdout=FNULL,stderr=FNULL)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1025, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
spleonard1 commented 6 years ago

ok! It was on my end. I could run cd-hit fine, but "which cd-hit" returned nothing, see this:

| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard) 
| => cd-hit
        ====== CD-HIT version 4.7 (built on Jul 18 2018) ======

Usage: cd-hit [Options] 

Options

   -i   input filename in fasta format, required
   -o   output filename, required
   -c   sequence identity threshold, default 0.9
    this is the default cd-hit's "global sequence identity" calculated as:
    number of identical amino acids in alignment
    divided by the full length of the shorter sequence
   -G   use global sequence identity, default 1
    if set to 0, then use local sequence identity, calculated as :
    number of identical amino acids in alignment
    divided by the length of the alignment
    NOTE!!! don't use -G 0 unless you use alignment coverage controls
    see options -aL, -AL, -aS, -AS
   -b   band_width of alignment, default 20
   -M   memory limit (in MB) for the program, default 800; 0 for unlimitted;
   -T   number of threads, default 1; with 0, all CPUs will be used
   -n   word_length, default 5, see user's guide for choosing it
   -l   length of throw_away_sequences, default 10
   -t   tolerance for redundance, default 2
   -d   length of description in .clstr file, default 20
    if set to 0, it takes the fasta defline and stops at first space
   -s   length difference cutoff, default 0.0
    if set to 0.9, the shorter sequences need to be
    at least 90% length of the representative of the cluster
   -S   length difference cutoff in amino acid, default 999999
    if set to 60, the length difference between the shorter sequences
    and the representative of the cluster can not be bigger than 60
   -aL  alignment coverage for the longer sequence, default 0.0
    if set to 0.9, the alignment must covers 90% of the sequence
   -AL  alignment coverage control for the longer sequence, default 99999999
    if set to 60, and the length of the sequence is 400,
    then the alignment must be >= 340 (400-60) residues
   -aS  alignment coverage for the shorter sequence, default 0.0
    if set to 0.9, the alignment must covers 90% of the sequence
   -AS  alignment coverage control for the shorter sequence, default 99999999
    if set to 60, and the length of the sequence is 400,
    then the alignment must be >= 340 (400-60) residues
   -A   minimal alignment coverage control for the both sequences, default 0
    alignment must cover >= this value for both sequences 
   -uL  maximum unmatched percentage for the longer sequence, default 1.0
    if set to 0.1, the unmatched region (excluding leading and tailing gaps)
    must not be more than 10% of the sequence
   -uS  maximum unmatched percentage for the shorter sequence, default 1.0
    if set to 0.1, the unmatched region (excluding leading and tailing gaps)
    must not be more than 10% of the sequence
   -U   maximum unmatched length, default 99999999
    if set to 10, the unmatched region (excluding leading and tailing gaps)
    must not be more than 10 bases
   -B   1 or 0, default 0, by default, sequences are stored in RAM
    if set to 1, sequence are stored on hard drive
    !! No longer supported !!
   -p   1 or 0, default 0
    if set to 1, print alignment overlap in .clstr file
   -g   1 or 0, default 0
    by cd-hit's default algorithm, a sequence is clustered to the first 
    cluster that meet the threshold (fast cluster). If set to 1, the program
    will cluster it into the most similar cluster that meet the threshold
    (accurate but slow mode)
    but either 1 or 0 won't change the representatives of final clusters
   -sc  sort clusters by size (number of sequences), default 0, output clusters by decreasing length
    if set to 1, output clusters by decreasing size
   -sf  sort fasta/fastq by cluster size (number of sequences), default 0, no sorting
    if set to 1, output sequences by decreasing cluster size
   -bak write backup cluster file (1 or 0, default 0)
   -h   print this help

   Questions, bugs, contact Limin Fu at l2fu@ucsd.edu, or Weizhong Li at liwz@sdsc.edu
   For updated versions and information, please visit: http://cd-hit.org

   cd-hit web server is also available from http://cd-hit.org

   If you find cd-hit useful, please kindly cite:

   "CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
   "CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard) 
| => which cd-hit
________________________________________________________________________________
| ~/Dropbox/projects/2017_ortho_play/pyparanoid/debug @ Seans-MacBook-Pro (seanleonard) 

I checked my bash profile and like I fool I had used a relative path to the cd-hit directory... so I fixed my bash profile and your test script works fine, now.

ryanmelnyk commented 6 years ago

Ok cool that makes more sense now - let me know how it goes when you rebuild the database. If you have the initial mcl clustering from your original database saved you can use --mode extract for BuildGroups.py to skip the DIAMOND searches and clustering.

spleonard1 commented 6 years ago

Thanks! Appreciate you helping me through this. I'm going to close this issue to hide my shame in using a relative path ;)