todddeluca / reciprocal_smallest_distance

Reciprocal Smallest Distance (RSD) is a pairwise orthology algorithm that uses global sequence alignment and maximum likelihood evolutionary distance between sequences to accurately detect orthologs between genomes.
MIT License
13 stars 6 forks

Problem Running the Example #2

Closed ctross closed 10 years ago

ctross commented 10 years ago

I have installed all dependencies and added their .exe locations to the system path.

When I run RSD on the example data, I get an error. Any ideas as to what is going on?

C:\Users\g\Python27>python Scripts/rsd_search -q C:/Users/g/reciprocal_smallest_distance/examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa --subject-genome=C:/Users/g/reciprocal_smallest_distance/examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa -o Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt
The system cannot find the path specified.
Traceback (most recent call last):
  File "Scripts/rsd_search", line 5, in <module>
    pkg_resources.run_script('reciprocal-smallest-distance==1.1.5', 'rsd_search')
  File "C:\Users\g\Python27\lib\site-packages\pkg_resources.py", line 488, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "C:\Users\g\Python27\lib\site-packages\pkg_resources.py", line 1354, in run_script
    execfile(script_filename, namespace, namespace)
  File "c:\users\g\python27\lib\site-packages\reciprocal_smallest_distance-1.1.5-py2.7.egg\EGG-INFO\scripts\rsd_search", line 182, in <module>
    main()
  File "c:\users\g\python27\lib\site-packages\reciprocal_smallest_distance-1.1.5-py2.7.egg\EGG-INFO\scripts\rsd_search", line 134, in main
    rsd.formatFastaArg(queryFastaPath)
  File "C:\Users\g\Python27\lib\site-packages\reciprocal_smallest_distance-1.1.5-py2.7.egg\rsd\rsd.py", line 672, in formatFastaArg
    formatForBlast(fastaFile)
  File "C:\Users\g\Python27\lib\site-packages\reciprocal_smallest_distance-1.1.5-py2.7.egg\rsd\rsd.py", line 67, in formatForBlast
    subprocess.check_call(cmd, shell=True)
  File "C:\Users\g\Python27\lib\subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'makeblastdb -in C:\Users\g\Python27\tmp99f24fb913ca48358e41af48fe2d6612\Mycoplasma_genitalium.aa -dbtype prot -parse_seqids >/dev/null' returned non-zero exit status 1

ctross commented 10 years ago

I am running Windows 7, 64 bit.

todddeluca commented 10 years ago

This could be a Windows compatibility issue or something else. Are you sure makeblastdb is in your path? Either way, it looks like redirecting to /dev/null via the shell is a cross-platform no-no that could be fixed.
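
The failing command in the traceback ends in `>/dev/null`, a redirect that only a POSIX shell understands. A minimal sketch of one portable alternative follows, assuming Python 3 and a hypothetical helper named `run_quietly` (this is an illustration, not necessarily how RSD itself was changed): let `subprocess` discard the output instead of asking the shell to redirect it.

```python
import subprocess
import sys


def run_quietly(args):
    """Run a command and discard its stdout without relying on a shell.

    `args` is an argument list, e.g.
    ["makeblastdb", "-in", fasta_path, "-dbtype", "prot"].
    subprocess.DEVNULL refers to the platform's null device, so no
    '>/dev/null' shell redirect is needed. Hypothetical helper name.
    """
    # check_call raises CalledProcessError on a non-zero exit status.
    subprocess.check_call(args, stdout=subprocess.DEVNULL)


if __name__ == "__main__":
    # Stand-in command so the sketch runs without BLAST installed.
    run_quietly([sys.executable, "-c", "print('this output is discarded')"])
```

Passing an argument list rather than a single string also sidesteps shell quoting differences between `cmd.exe` and POSIX shells.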

There was a fork of RSD made to run on Azure, so I know it can be done fairly easily. Could you submit a pull request?

ctross commented 10 years ago

Yeah, makeblastdb is in my PATH. It looks like an error is being thrown by makeblastdb when data is fed in from the rsd_search command. If I run makeblastdb directly as:

makeblastdb -in C:\Users\g\Python27\tmp99f24fb913ca48358e41af48fe2d6612\Mycoplasma_genitalium.aa -dbtype prot -pa

It appears to function normally.
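
One way to double-check that the Python process (and not only the interactive console) can see makeblastdb is to look it up on PATH from Python. This is a small sketch assuming Python 3.3 or later, where `shutil.which` is available; the helper name `find_tool` is hypothetical.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("makeblastdb")
    if path:
        print("makeblastdb found at:", path)
    else:
        print("makeblastdb is NOT visible on PATH to this Python process")
```

If the tool is found here but the call from rsd_search still fails, the problem is more likely in how the command line is built than in PATH.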

I don't really know what Azure is, or how to access it. What should the pull request be about exactly?

Thanks for the help on this!

todddeluca commented 10 years ago

I've created a 'windows' branch, in which I removed the /dev/null redirect from various commands. Can you test if this change fixes the problem on your system? You should be able to test by doing something like the following:

Push the windows branch to github:

git push -u origin windows

Clone windows branch:

cd ~/tmp
git clone -b windows https://github.com/todddeluca/reciprocal_smallest_distance

Create a virtualenv:

virtualenv venv

Install the windows branch:

venv/bin/pip install -e reciprocal_smallest_distance/

Test rsd_search:

cd ~/tmp
venv/bin/rsd_search -q reciprocal_smallest_distance/examples/genomes/Mycoplasma_genitalium.aa/Mycoplasma_genitalium.aa --subject-genome=reciprocal_smallest_distance/examples/genomes/Mycobacterium_leprae.aa/Mycobacterium_leprae.aa -o Mycoplasma_genitalium.aa_Mycobacterium_leprae.aa_0.8_1e-5.orthologs.txt

ctross commented 10 years ago

Just got a chance to give this a try. Worked great on the example, and appears to have performed well on my own FASTA files. Thank you so much for the help with troubleshooting!

One quick question. In the output file there are essentially three columns, GeneID1, GeneID2, and a number which indicates the maximum likelihood estimate of "evolutionary distance" between GeneID1 and GeneID2... correct? What are the units of this measure?

Are these amino-acid substitutions per sequence? Substitution rates scaled by sequence length? I don't seem to ever see estimates greater than 1 or 2, with most being a small fraction of 1, which seems a little on the low side if this number is in units of counts, even for a comparison between closely related species. Are these standardized in some way?

Thanks again for the help!

todddeluca commented 10 years ago

Great to hear that it works. I'll merge it into master and bump the version number.

Frankly, I do not know the units of the distance metric. Sorry for that. Here is a reference to the original RSD paper, which might help: http://bioinformatics.oxfordjournals.org/content/19/13/1710.full.pdf+html. If you wish to examine more RSD results, you can query for orthologs using a distance filter at http://roundup.hms.harvard.edu/. There you can build an intuition regarding the frequency of distances in various organisms. To learn more, you might also consider emailing Dennis Wall, one of the authors of the RSD algorithm.
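
To get a feel for the spread of distances in a local result file, a short summary script may help. This is only a sketch: it assumes each non-empty line holds three whitespace-separated fields (two gene IDs and a distance), as described earlier in this thread, and the function name `summarize_distances` is hypothetical.

```python
import statistics


def summarize_distances(lines):
    """Summarize the third column of three-column ortholog lines.

    Assumes each line looks like 'id1 id2 distance', separated by
    whitespace (an assumption based on the description in this thread).
    Returns None if no usable lines are found.
    """
    distances = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        distances.append(float(fields[2]))
    if not distances:
        return None
    return {
        "count": len(distances),
        "min": min(distances),
        "median": statistics.median(distances),
        "max": max(distances),
    }


if __name__ == "__main__":
    # Made-up rows for illustration only; not real RSD output.
    sample = ["geneA geneX 0.25", "geneB geneY 1.5", "geneC geneZ 0.5"]
    print(summarize_distances(sample))
```

With a real file, the same function can be given an open file handle, for example `with open(path) as fh: summarize_distances(fh)`.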

todddeluca commented 10 years ago

Version 1.1.6, which incorporates the Windows fixes, has been pushed to master. It is also available on PyPI.